On the suitability of Hugging Face Hub for empirical studies
Ait A., Cánovas Izquierdo J.L., Cabot J.
Empirical Software Engineering, vol. 30, no. 2, art. no. 57, 2025
Context: Empirical studies in software engineering rely mainly on the data available on code-hosting platforms, with GitHub being the most representative. In recent years, however, the emergence of Machine Learning (ML) has led to the development of platforms specifically designed for hosting ML-based projects, with Hugging Face Hub (HFH) as the most popular one. So far, no study has evaluated the potential of HFH as a data source for empirical studies. Objective: We aim to perform an exploratory study of the current state of HFH and its suitability as a source platform for empirical studies. Method: We conduct a qualitative and a quantitative analysis of HFH. The former compares the features of HFH with those of other code-hosting platforms, such as GitHub and GitLab. The latter analyzes the data available in HFH. Results: We propose a feature framework to characterize HFH and report on the current usage of the platform, in terms of both the number and types of projects (and their surrounding community) and the features they rely on most. Conclusions: The results confirm that HFH offers enough features and sufficiently diverse data to be the source of relevant empirical studies on the development, evolution and usage of AI-related projects. The results also prompted a discussion of aspects of HFH that should be considered when performing such empirical studies.
doi:10.1007/s10664-024-10608-8