Will Epperson
/ PhD Student at CMU

Strategies for Reuse and Sharing among Data Scientists in Software Teams

Will Epperson, April Yi Wang, Robert DeLine, Steven M. Drucker

Data scientists use five distinct strategies to reuse and share analysis code. Personal reuse strategies, such as reusing one's own code, are common, whereas template notebooks are rarer and depend on tool support.

Abstract

Effective sharing and reuse practices have long been hallmarks of proficient software engineering. Yet the exploratory nature of data science presents new challenges and opportunities to support sharing and reuse of analysis code. To better understand current practices, we conducted interviews (N=17) and a survey (N=132) with data scientists at Microsoft, and extracted five commonly used strategies for sharing and reuse of past work: personal analysis reuse, personal utility libraries, team shared analysis code, team shared template notebooks, and team shared libraries. We also identify factors that encourage or discourage data scientists from sharing and reusing. Our participants described obstacles to reuse and sharing, including a lack of incentives to create shared code, difficulties in making data science code modular, and a lack of tool interoperability. We discuss how future tools might help meet these needs.

Citation

Strategies for Reuse and Sharing among Data Scientists in Software Teams
Will Epperson, April Yi Wang, Robert DeLine, Steven M. Drucker
Interviews and a survey with 149 data scientists at Microsoft revealed five distinct strategies for sharing and reusing analysis code along with factors that encourage or discourage reuse.
ICSE '22: ACM International Conference on Software Engineering. Pittsburgh, PA, 2022.