Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study
Dong Z., Hu Q., Guo Y., Zhang Z., Zhao J.
Proceedings - 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion, QRS-C 2023, pp. 383-392, 2023
Recent studies have shown surprising results in source code learning, which applies deep neural networks (DNNs) to various software engineering tasks. Like other DNN-based domains, source code learning requires massive amounts of high-quality training data for these applications to succeed. In practice, data augmentation, a technique that produces additional training data to boost model training, has been widely adopted in other domains (e.g., computer vision). However, the existing practice of data augmentation in source code learning is limited to simple syntax-preserving methods, such as code refactoring. In this paper, based on the insight that source code can be represented sequentially as text data, we take an early step toward investigating whether data augmentation methods originally designed for text are effective for source code learning. To that end, we focus on code classification tasks and conduct a comprehensive empirical study of eight data augmentation methods across four critical code problems and four DNN architectures. Our results identify the data augmentation methods that produce more accurate models for source code learning and show that these methods remain useful even when they slightly break the syntax of the source code.
doi:10.1109/QRS-C60940.2023.00017
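To make the abstract's core idea concrete, here is a minimal illustrative sketch, not taken from the paper, of how text-oriented augmentations can be applied to source code once it is treated as a plain token sequence. The two operations shown (random token swap and random token deletion) are common text-augmentation primitives; the function names, parameters, and whitespace tokenization are assumptions for illustration only and do not reproduce the authors' specific methods or tooling.

```python
import random

def random_token_swap(tokens, n_swaps=1, rng=None):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_token_deletion(tokens, p=0.1, rng=None):
    """Drop each token independently with probability p (keep at least one token)."""
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

if __name__ == "__main__":
    # Treat a code snippet as text: whitespace tokenization is used purely for illustration.
    code = "def add ( a , b ) : return a + b"
    tokens = code.split()
    rng = random.Random(0)
    print(" ".join(random_token_swap(tokens, n_swaps=2, rng=rng)))
    print(" ".join(random_token_deletion(tokens, p=0.15, rng=rng)))
```

Note that, unlike code refactoring, both operations may slightly break the syntax of the augmented sample, which is exactly the trade-off the study evaluates.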