Analysis of common pandas functions across 1M+ GitHub notebooks

ponder-org/pandas-API-analysis

pandas-API-analysis

This repo contains the code and analysis used in this blog post.

Running data processing pipeline from raw corpus

To run the complete analysis end-to-end from the raw 1M+ notebook dataset, download the dataset from the UCSD Library Digital Collections and follow the instructions at the top of notebook/1-raw-notebook-processing.ipynb to clean the data. Alternatively, download the already-cleaned data via Google Drive and run the processing step yourself. Note that the processing pipeline takes around 20 minutes to run.

Running only the analysis notebook

If you are only interested in the analysis, we have distilled the corpus into a smaller dataset containing counts of pandas API usage for each notebook. This dataset (filtered_token_breakdown.csv) can be downloaded at this link. Once downloaded, place it in the data/ folder and follow the analysis in notebook/2-pandas-usage-analysis.ipynb.
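Before opening the analysis notebook, it can help to confirm that the CSV is where the notebook expects it. The sketch below is not part of the repo: the helper name `peek_csv` is hypothetical, and since this README does not document the file's columns, it simply reports whatever header it finds. Only the folder and file name are taken from the instructions above.

```python
import csv
from pathlib import Path

# Folder and file name come from the README; everything else is illustrative.
DATA_DIR = Path("data")
CSV_NAME = "filtered_token_breakdown.csv"


def peek_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a CSV file.

    Hypothetical helper: raises FileNotFoundError with a hint if the
    file has not been placed in the expected folder yet.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it and place it in {path.parent}/"
        )
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        # zip stops after n_rows items without reading the rest of the file
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Calling `peek_csv(DATA_DIR / CSV_NAME)` from the repository root should return the column names and the first few rows if the download is in place, and raise a `FileNotFoundError` otherwise.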

Questions?

If you have any questions or feedback on our blog post or analysis, please send us an email at contact@ponder.io.
