Materials for the paper "'I am no sure, but...': Expert Practices that Enable Effective Code Comprehension in Data Science". This repository contains the user study data for the paper.
(Screenshot of our research results.)
This research studies effective methods that novice data scientists can adopt to enhance their understanding of pre-written data analytical programs. We conducted user studies with five novice data scientists and four expert data scientists. In each study, participants were presented with a pre-written data analytical program and asked to use the think-aloud method to explain their thought processes. Based on their responses, we performed both quantitative and qualitative analyses. The materials in this repository include the pre-written code we asked our participants to analyze.
- `buoy.txt`: the dataset that the program interacts with
- `data_analytical_problem.ipynb`: the pre-written data analytical program
- `saver_func.py`: a helper function that saves the entire DataFrame and visualizes every entry in HTML
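The repository does not spell out the helper's exact signature here; the following is a minimal sketch of what a `save()`-style helper might look like, built on pandas' `to_html`. The function name, parameters, and behavior are assumptions for illustration, not the actual `saver_func.py` implementation:

```python
import pandas as pd

def save(df: pd.DataFrame, path: str = "demo_table.html") -> str:
    """Hypothetical sketch: render every row and column of a DataFrame
    to an HTML file, avoiding pandas' default display truncation."""
    html = df.to_html(max_rows=None, max_cols=None)  # no row/column truncation
    with open(path, "w") as f:
        f.write(html)
    return html

# Example usage with a toy buoy-like DataFrame (columns are made up)
demo = pd.DataFrame({"wind_speed": [4.2, None, 6.1],
                     "wave_height": [0.8, 1.1, None]})
page = save(demo, "demo_table.html")
```

This matches how the study materials describe the helper being used: participants opened the generated HTML file to inspect the full table rather than relying on the abbreviated default pandas output.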
- Clone the repo:

```shell
git clone https://github.com/dstl-lab/Code-Comprehension-User-Study.git
```

- Install the environment using conda:

```shell
conda env create -f environment.yaml
```
| Task | Duration (min) | Window (min of 60) | Description |
|---|---|---|---|
| Intro | 5 | 0 - 5 | Open data_analytical_problem.ipynb and introduce the scenario. |
| Task 0 & Task 1 | 10 | 5 - 15 | Participants spent 10 minutes understanding Task 0 and Task 1. |
| Rating & Follow-up 1 | 5 | 15 - 20 | Participants rated the difficulties and answered follow-up questions for Task 0 and Task 1. |
| Task 2 | 10 | 20 - 30 | Participants spent 10 minutes understanding Task 2. |
| Rating & Follow-up 2 | 5 | 30 - 35 | Participants rated the difficulties and answered follow-up questions for Task 2. |
| Task 3 | 10 | 35 - 45 | Participants spent 10 minutes understanding Task 3. |
| Rating & Follow-up 3 | 5 | 45 - 50 | Participants rated the difficulties and answered follow-up questions for Task 3. |
| Interview | 10 | 50 - 60 | Concluding interview to gather additional comments and feedback. |
Regardless of their level of expertise, participants were presented with the same notebook. Each task in the notebook represents a different stage of the data analysis pipeline: Task 0 represents data cleaning; Task 1 represents missing value assessment; Task 2 represents data imputation; and Task 3 represents evaluating the imputation results.
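The notebook's actual code is not reproduced in this README; the snippet below is only an illustrative sketch of the four pipeline stages on toy data (the column name, sentinel value, and imputation strategy are hypothetical, not the notebook's):

```python
import pandas as pd

# Toy buoy-like data (hypothetical; the real dataset is buoy.txt)
df = pd.DataFrame({"wave_height": [0.8, 1.1, None, 99.0, 1.3]})

# Task 0: data cleaning -- treat a sentinel value (here 99.0) as missing
cleaned = df.mask(df == 99.0)

# Task 1: missing value assessment -- count missing entries per column
missing_counts = cleaned.isna().sum()

# Task 2: data imputation -- fill missing values with the column mean
imputed = cleaned.fillna(cleaned["wave_height"].mean())

# Task 3: evaluate the imputation -- compare a summary statistic before/after
before = cleaned["wave_height"].mean()
after = imputed["wave_height"].mean()
```

Mean imputation preserves the column mean, so `before` and `after` coincide here; the notebook's tasks ask participants to reason about exactly this kind of before/after comparison.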
Participants were asked the following questions during the interview phase of the study (due to time constraints, only a subset of the questions was asked):
- What information are you trying to gather that you couldn’t from just the default Pandas output?
- Which rows and columns would you add to the smaller table in order to not need to refer to the larger table? (pick as many as you see fit)
- What are you thinking about here? What additional information would make this immediate problem easier to solve?
- What did you find easy or difficult about this task?
- What did you find to be the most effective way to understand the code?
- When you looked broadly at the full table from the save() function, how did you know what to look for?
- Let’s look at two of your HTML tables. How did you know what to look at on these tables specifically?
- Do you have any feedback, comments, or questions?
After each task, participants were asked to complete a survey in a Google Form (Template) describing their feelings about the task. We also asked two anonymous data scientists to evaluate participants' responses based on the rubric we provided.
All of the data related to the research can be found in the study_data folder.
- Demographic Information: participants' demographic information
- Self Evaluation Information: participants' self-reports on the tasks they completed
- Assessment on Participants: performance assessments of participants from two other data scientists
Sam Lau - @github_profile - lau@ucsd.edu
Christopher Lum - @github_profile - cslum@ucsd.edu
Guoxuan Xu - @github_profile - g7xu@ucsd.edu
- We would like to thank all nine participants who voluntarily joined our user study, contributing valuable insights that enriched our research.
- We appreciate the contributors of this README template.