What is it about?

The recent success of machine learning (ML) has led to explosive growth in systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date and focusing on questions that can advance our understanding of the field and guide investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret, so that they can draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) which technologies practitioners should rely on.

Why is it important?

This analysis helps ground new research and engineering efforts in data science, because it provides a measured distribution of activity and interest across the field.

Read the Original

This page is a summary of: Data Science Through the Looking Glass, ACM SIGMOD Record, July 2022, ACM (Association for Computing Machinery), DOI: 10.1145/3552490.3552496.