Will Epperson
/ PhD Student at CMU

Diff in the Loop: Supporting Data Comparison in Exploratory Data Analysis

April Yi Wang, Will Epperson, Robert DeLine, Steven M. Drucker

As users iterate on their data during analysis, they can use DITL to compare data snapshots. Every time users successfully execute code, we save a snapshot (A). Users can compare the code using traditional code diffing tools. They can also use DITL to compare data iterations with interactive visualizations, descriptive statistics, and data previews (B). Users can choose among three ways to visualize the differences in each column: the delta view (C), opacity view (D), and parallel view (E).
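To make the idea of comparing data snapshots with descriptive statistics concrete, here is a minimal sketch in plain Python. It is not DITL's implementation: the snapshot format (a dict mapping column names to non-empty lists of numbers), the function names `column_summary` and `diff_snapshots`, and the sample data are all assumptions made for illustration.

```python
from statistics import mean


def column_summary(values):
    """Descriptive statistics for one non-empty numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def diff_snapshots(before, after):
    """Compare two snapshots given as {column_name: [values]} dicts.

    Returns the columns added, the columns removed, and, for each shared
    column whose summary changed, the difference (after - before) in
    every summary statistic.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = {}
    for col in sorted(set(before) & set(after)):
        old = column_summary(before[col])
        new = column_summary(after[col])
        delta = {stat: new[stat] - old[stat] for stat in old}
        if any(value != 0 for value in delta.values()):
            changed[col] = delta
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots: the second drops an outlier row and adds a column.
snapshot_1 = {"age": [23, 35, 120, 41], "income": [40, 52, 61, 58]}
snapshot_2 = {"age": [23, 35, 41], "income": [40, 52, 58], "age_group": [0, 1, 1]}

result = diff_snapshots(snapshot_1, snapshot_2)
print(result["added"])
print(result["changed"]["age"])
```

In this example, removing the outlier shows up as a lower maximum and mean for `age`, which is the kind of effect a data diff is meant to surface after each code execution. A real tool would also need to handle non-numeric columns, missing values, and row-level changes.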

Abstract

Data science is characterized by evolution: since data science is exploratory, results evolve from moment to moment; since it can be collaborative, results evolve as the work changes hands. While existing tools help data scientists track changes in code, they provide less support for understanding the iterative changes that the code produces in the data. We explore the idea of visualizing differences in datasets as a core feature of exploratory data analysis, a concept we call Diff in the Loop (DITL). We evaluated DITL in a user study with 16 professional data scientists and found it helped them understand the implications of their actions when manipulating data. We summarize these findings and discuss how the approach can be generalized to different data science workflows.

Citation

Diff in the Loop: Supporting Data Comparison in Exploratory Data Analysis
April Yi Wang, Will Epperson, Robert DeLine, Steven M. Drucker
Diff in the Loop supports tracking, comparing, and visualizing differences in datasets during iterative data analysis.
CHI '22: ACM CHI Conference on Human Factors in Computing Systems. New Orleans, LA, 2022.