Code Comprehension User Study Program

Materials for the paper 'I am no sure, but...': Expert Practices that Enable Effective Code Comprehension in Data Science. This repository includes the user study data for the paper.

About The Research

[Screenshot of our research results]



This research studies effective methods that novice data scientists can adopt to enhance their understanding of pre-written data analytical programs. We conducted user studies with five novice data scientists and four expert data scientists. In each study, participants were presented with a pre-written data analytical program and asked to use the think-aloud method to explain their thought processes. Based on their responses, we performed both quantitative and qualitative analyses. The materials in this repository include the pre-written code we asked our participants to analyze.

(back to top)

User Study Protocol

Files Descriptions

  • buoy.txt: dataset that the program interacts with

  • data_analytical_problem.ipynb: pre-written data analytical program

  • saver_func.py: helper function that downloads the entire dataframe and renders every entry as an HTML table
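For orientation, a helper like the one in saver_func.py might be sketched as follows. This is an assumption-based sketch: the name save() comes from the interview questions below, but the signature, default path, and body are invented here, not taken from the actual saver_func.py.

```python
import pandas as pd

def save(df: pd.DataFrame, path: str = "table.html") -> str:
    """Sketch of a saver_func-style helper (hypothetical signature):
    write every row and column of `df` to an HTML file for browsing."""
    # max_rows/max_cols of None force pandas to render all entries,
    # not just the truncated default view.
    html = df.to_html(max_rows=None, max_cols=None)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```

Participants could then open the written HTML file in a browser to scan the full table, rather than relying on the truncated default Pandas output.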

Set Up

  1. Clone the repo
    git clone https://github.com/dstl-lab/Code-Comprehension-User-Study.git
  2. Install environment using conda
    conda env create -f environment.yaml

Procedure

| Task | Time (min) | Minutes (of 60) | Description |
| --- | --- | --- | --- |
| Intro | 5 | 0 - 5 | Open data_analytical_problem.ipynb and introduce the scenario. |
| Task 0 & Task 1 | 10 | 5 - 15 | Participants spent 10 minutes understanding Task 0 and Task 1. |
| Rating & Follow-up 1 | 5 | 15 - 20 | Participants rated the difficulty and answered follow-up questions for Task 0 and Task 1. |
| Task 2 | 10 | 20 - 30 | Participants spent 10 minutes understanding Task 2. |
| Rating & Follow-up 2 | 5 | 30 - 35 | Participants rated the difficulty and answered follow-up questions for Task 2. |
| Task 3 | 10 | 35 - 45 | Participants spent 10 minutes understanding Task 3. |
| Rating & Follow-up 3 | 5 | 45 - 50 | Participants rated the difficulty and answered follow-up questions for Task 3. |
| Interview | 10 | 50 - 60 | Concluding interview to gather additional comments and feedback. |

Notebook Details

Regardless of their level of expertise, participants were presented with the same notebook. Each task in the notebook represents a different stage of the data analytical pipeline: Task 0 represents data cleaning; Task 1 represents missing value assessment; Task 2 represents data imputation; and Task 3 represents evaluating the imputation results.
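As a toy illustration only, the four pipeline stages (cleaning, missing value assessment, imputation, and evaluation) can be sketched with pandas. The column names and values below are invented for the example and are not taken from buoy.txt or the study notebook.

```python
import pandas as pd

# Invented toy data: a malformed string and two missing entries.
df = pd.DataFrame({"temp": ["12.1", "bad", "13.0", None],
                   "wind": [5.0, 6.5, None, 7.0]})

# Data cleaning: coerce malformed strings to numeric, turning "bad" into NaN.
df["temp"] = pd.to_numeric(df["temp"], errors="coerce")

# Missing value assessment: count NaNs per column.
missing = df.isna().sum()

# Data imputation: fill NaNs with each column's mean.
imputed = df.fillna(df.mean())

# Evaluation: confirm no missing values remain after imputation.
assert imputed.isna().sum().sum() == 0
```

The actual notebook tasks are more involved; this sketch only shows how the stages chain together.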

Interview Questions

Participants were asked the following questions during the interview phase of the study (due to time constraints, only a subset of the questions was asked in each session):

  • What information are you trying to gather that you couldn’t from just the default Pandas output?
  • Which rows and columns would you add to the smaller table in order to not need to refer to the larger table? (pick as many as you see fit)
  • What are you thinking about here? What additional information would make this immediate problem easier to solve?
  • What did you find easy or difficult about this task?
  • What did you find to be the most effective way to understand the code?
  • When you looked broadly at the full table from the save() function, how did you know what to look for?
  • Let’s look at two of your HTML tables. How did you know what to look at on these tables specifically?
  • Do you have any feedback, comments, or questions?

Participant Responses

After each task, participants were asked to complete a Google Forms survey (Template) describing their experience with the task. We also asked two anonymous data scientists to evaluate participants' responses based on the rubric we provided.

Study Data

All of the data related to this research can be found in the study_data folder.

(back to top)

Contact

Sam Lau - @github_profile - lau@ucsd.edu

Christopher Lum - @github_profile - cslum@ucsd.edu

Guoxuan Xu - @github_profile - g7xu@ucsd.edu

Paper Link

(back to top)

Acknowledgments

(back to top)
