Parallelization of Data Science Tasks, an Experimental Overview
Castro O., Bruneau P., Sottet J.S., Torregrossa D.
ACM International Conference Proceeding Series, pp. 483-490, 2022
The practice of data science and machine learning often involves training many kinds of models, whether to infer a target variable or to extract structured knowledge from data. Training procedures generally require lengthy and intensive computations, so a natural step for data scientists is to accelerate them, typically through parallelization across multiple CPU cores and GPU devices. In this paper, we focus on Python libraries commonly used by machine learning practitioners, and propose a case-based experimental approach to survey mainstream tools for software acceleration. For each use case, we highlight and quantify the gains obtained in going from the baseline implementation to the optimized version. Finally, we derive a taxonomy of the tools and techniques involved in our experiments and identify common pitfalls, with a view to providing actionable guidelines to data scientists and developers of code optimization tools.