This repo contains the code and analysis used in this blog post.
If you are interested in running our complete analysis end-to-end, start by downloading the raw 1M+ notebook dataset from the UCSD Library Digital Collections. Then follow the instructions at the top of notebook/1-raw-notebook-processing.ipynb to clean the data, or download the cleaned data via Google Drive and run the processing yourself. Note that the processing pipeline takes around 20 minutes to run.
If you are only interested in the analysis portion, we have distilled the raw data into a smaller dataset containing the count of pandas API usage in each notebook. This smaller dataset (filtered_token_breakdown.csv) can be downloaded at this link. Once you have downloaded it, place it in the data/ folder and follow the analysis in notebook/2-pandas-usage-analysis.ipynb.
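To give a feel for what the analysis notebook does with the per-notebook usage counts, here is a minimal sketch of the kind of aggregation involved. The column names (`notebook_id`, `api`, `count`) are assumptions for illustration, not the actual schema of filtered_token_breakdown.csv; in practice you would load the CSV from the data/ folder with `pd.read_csv`.

```python
import pandas as pd

# Illustrative stand-in for data/filtered_token_breakdown.csv.
# Column names here are assumed for the sketch, not the real schema.
df = pd.DataFrame(
    {
        "notebook_id": ["nb1", "nb1", "nb2"],
        "api": ["read_csv", "merge", "read_csv"],
        "count": [3, 1, 2],
    }
)

# Total usage of each pandas API across all notebooks, most-used first.
totals = df.groupby("api")["count"].sum().sort_values(ascending=False)
print(totals)
```

With the real CSV in place, the same `groupby`/`sum` pattern yields a ranking of pandas API calls by how often they appear across the notebook corpus.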
If you have any questions or feedback on our blog post or analysis, please send us an email at contact@ponder.io.