diff --git a/README.md b/README.md
index 71d31d3..8bfcb08 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,12 @@
 This folder provides an all-encompassing working structure for empirical papers.
 
-It organizes every step of the process: merging and cleaning (several) data sets, performing analyses (tables, figures, regressions), writing the paper and talks themselves, and submitting it to journals.
+It organizes every step of the process: merging and cleaning (several) datasets, performing analyses (tables, figures, regressions), writing the paper and talks themselves, and submitting it to journals.
 
 **To use it**: follow the setup instructions below.
 
 ## Summary
 
-0. Workflow
+0. The Big Picture
 1. Requirements
 2. Setup
 3. Folders
@@ -17,7 +17,18 @@ It organizes every step of the process: merging and cleaning (several) data sets
 8. Principles
 9. Further Reading
 
-## 0. Workflow
+## 0. The Big Picture
+
+* Researchers are "idea entrepreneurs": we do R&D (coming up with new theories and empirical tests), we do fundraising (applying to grants), we manage teams (coauthors, RAs), we do product development (writing code, testing), we do sales (giving talks and publishing).
+* Empirical research is exactly analogous to software development; and GitHub is the best place for managing that. See Section 5 below.
+
+Pipeline: raw data + code -> analyses -> products
+Pipeline: raw data -> clean data -> assemble -> run analysis -> products
+
+* Each arrow involves an input/output step with associated code.
+* The more general the step, the more upstream it lives, even in a separate "data" folder or coming from Data Basis.
+* We assemble data based on the desired observation level (and perhaps other analyses' metadata). E.g. one at `municipality-year` level, one at `person` level.
+* Each product then has a dependency graph with data and code. This whole dependency graph should be versioned.
 
 ![](extra/workflow.png)
 
@@ -48,7 +59,7 @@ You're good to go.
This repository is now ready for the standard workflow descri
 
 ##### `/code`
 
-- Versioned folder containing code that builds data and performs analyses.
+- Versioned folder containing code that cleans data, assembles datasets, and performs analyses.
 - All output data should be redirected into `/output/data/`, with one data file per observation level.
 - All output logs should be redirected into `/output/logs/`.
 - Other output files should be redirected into `/output/tables/` or `/output/figures/`.
@@ -62,19 +73,18 @@ You're good to go. This repository is now ready for the standard workflow descri
 - Symbolic link to non-versioned folder with input data.
 - Any original data source should be included here in clean and normalized form.
 - Only include cleaned files. Raw external files should be cleaned in each data source specific folder.
-- These data sets will then be manipulated and merged by the files in `/code`.
+- These datasets will then be manipulated and merged by the files in `/code`.
 
 ##### `/output`
 
-- Symbolic link to non-versioned folder with output data.
-- Holds built data sets in `/output/data/`, to be then used in analysis code.
-- Contains all analysis objects generated by files in `/code`.
-- Will then serve as source for the generation of `.tex` files inside `/products/`.
+- Symbolic link to non-versioned folder with output data and other files.
+- Holds built datasets in `/output/data/`, to be then used in analysis code.
+- Holds analyses in `/output/tables/` or `/output/figures/` generated by files in `/code`, to be incorporated into products.
 
 ##### `/tmp`
 
 - Symbolic link to non-versioned folder with temporary files.
-- Contains any temporary file created during the manipulation of input data sets or the analysis routine.
+- Contains any temporary file created during the manipulation of input datasets or the analysis routine.
 
 ##### `/extra`
 
@@ -92,13 +102,13 @@ You're good to go.
This repository is now ready for the standard workflow descri
 ##### `run_paper.py`
 
 - Automates the whole paper construction.
-- Runs everything in a pre-specified order, from beginning (building data sets) to end (compiling `.tex` files).
+- Runs everything in a pre-specified order, from beginning (building datasets) to end (compiling `.tex` files).
 - Keeps clear what should be run when.
 - Also cleans `/output` and `/tmp` folders before running other code.
 
 ##### `/code/get_input.py`
 
-- Erases any file inside `/input` and copies any original data set from outside sources.
+- Erases any file inside `/input` and copies any original dataset from outside sources.
 - Ensures consistency across original data generation and data building for paper.
 
 ## 5. Leveraging on Github capabilities
@@ -153,7 +163,7 @@ Use tags: `review`, `build`, `analysis`, `writing`, `negative replies`.
 * Update it continuously. It will discipline your work.
 - Keep two folders: `/papers`, and `/data`, as shown in the workflow.
 1. Data.
- * Each folder within `/data` is a data set.
+ * Each folder within `/data` is a dataset.
 * Use the same structure for cleaning these datasets (e.g. `/code`, `/input`, `/output`, `/tmp`)
 2. Papers.
 * Each folder within `/papers` is paper.
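Review note on the `run_paper.py` bullets in the README hunk above (clean `/output` and `/tmp`, then run everything in a pre-specified order). The sketch below is one possible shape for that behaviour, not the repository's actual script. The function names `clean_folder` and `run_paper` are hypothetical, and the list of steps is passed in by the caller because the real step list is not shown in this diff. Folders are emptied rather than deleted so that symbolic links created by `setup.sh` stay in place.

```python
import shutil
import subprocess
from pathlib import Path


def clean_folder(folder: Path) -> None:
    """Empty `folder` but keep the folder (or the symlink pointing to it)."""
    if not folder.is_dir():
        return  # nothing to clean yet
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real sub-directory: remove recursively
        else:
            entry.unlink()  # regular file or symlink: remove the entry only


def run_paper(root, steps, folders_to_clean):
    """Clean the given folders, then run each command in order from `root`.

    `steps` is a list of argument lists (hypothetical; supplied by the caller).
    """
    root = Path(root)
    for rel in folders_to_clean:
        clean_folder(root / rel)
    for command in steps:
        # check=True stops the pipeline at the first failing step.
        subprocess.run(command, cwd=root, check=True)
```

A call might look like `run_paper(".", [["stata-mp", "-b", "do", "code/build.do"]], ["output/data", "output/tables", "output/figures", "output/logs", "tmp"])`. The executable name `stata-mp` is an assumption about the local Stata install; only `code/build.do` and the folder names come from this diff.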
diff --git a/code/build.do b/code/build.do
index b774233..2279aba 100644
--- a/code/build.do
+++ b/code/build.do
@@ -30,17 +30,10 @@ log using "output/build.log", replace
 // main
 //----------------------------------------------------------------------------//
 
-// build
- do "code/sub/build_datasets.do"
+// clean input files
+ do "code/sub/build_clean_input_files.do"
 
-// merge
- do "code/sub/build_merge.do"
-
-// covariates
- do "code/sub/build_covariates.do"
-
-// compress and Save
- compress
- save "output/data/data.dta", replace
+// assemble data at
+ do "code/sub/build_assemble_.do"
 
 log close
diff --git a/code/sub/build_assemble_.do b/code/sub/build_assemble_.do
new file mode 100644
index 0000000..25874fd
--- /dev/null
+++ b/code/sub/build_assemble_.do
@@ -0,0 +1,25 @@
+//----------------------------------------------------------------------------//
+//
+// paper:
+//
+// do.file: build_assemble_
+//
+// author(s):
+//
+//----------------------------------------------------------------------------//
+
+//-------------------------//
+// merge tables
+//-------------------------//
+
+//-------------------------//
+// generate columns
+//-------------------------//
+
+//-------------------------//
+// compress and save
+//-------------------------//
+
+compress
+
+save "output/data/data_.dta", replace
diff --git a/code/sub/build_datasets.do b/code/sub/build_clean_input_files.do
similarity index 84%
rename from code/sub/build_datasets.do
rename to code/sub/build_clean_input_files.do
index 55296e3..17ed89f 100644
--- a/code/sub/build_datasets.do
+++ b/code/sub/build_clean_input_files.do
@@ -2,7 +2,7 @@
 //
 // paper:
 //
-// do.file: build_datasets
+// do.file: build_clean_input_files
 //
 // author(s):
 //
diff --git a/code/sub/build_covariates.do b/code/sub/build_covariates.do
deleted file mode 100644
index dd4af34..0000000
--- a/code/sub/build_covariates.do
+++ /dev/null
@@ -1,12 +0,0 @@
-//----------------------------------------------------------------------------//
-//
-// project:
-//
-// do.file: build_covariates
-//
-// author(s):
-//
-//----------------------------------------------------------------------------//
-
-
-
diff --git a/code/sub/build_merge.do b/code/sub/build_merge.do
deleted file mode 100644
index fbafbea..0000000
--- a/code/sub/build_merge.do
+++ /dev/null
@@ -1,9 +0,0 @@
-//----------------------------------------------------------------------------//
-//
-// paper:
-//
-// do.file: build_merge
-//
-// author(s):
-//
-//----------------------------------------------------------------------------//
diff --git a/output/figures/.keep b/output/figures/.keep
deleted file mode 100644
index e69de29..0000000
diff --git a/output/tables/.keep b/output/tables/.keep
deleted file mode 100644
index e69de29..0000000
diff --git a/setup.sh b/setup.sh
index b306372..110749c 100644
--- a/setup.sh
+++ b/setup.sh
@@ -3,7 +3,9 @@
 NON_VERSIONED_PATH=""
 REPO_PATH=""
 
-ln -shF "${NON_VERSIONED_PATH}/input" "${REPO_PATH}/input"
-ln -shF "${NON_VERSIONED_PATH}/output/data" "${REPO_PATH}/output/data"
-ln -shF "${NON_VERSIONED_PATH}/output/logs" "${REPO_PATH}/output/logs"
-ln -shF "${NON_VERSIONED_PATH}/tmp" "${REPO_PATH}/tmp"
+ln -shF "${NON_VERSIONED_PATH}/input" "${REPO_PATH}/input"
+ln -shF "${NON_VERSIONED_PATH}/output/data" "${REPO_PATH}/output/data"
+ln -shF "${NON_VERSIONED_PATH}/output/tables" "${REPO_PATH}/output/tables"
+ln -shF "${NON_VERSIONED_PATH}/output/figures" "${REPO_PATH}/output/figures"
+ln -shF "${NON_VERSIONED_PATH}/output/logs" "${REPO_PATH}/output/logs"
+ln -shF "${NON_VERSIONED_PATH}/tmp" "${REPO_PATH}/tmp"
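Review note on `setup.sh`: the `-h` and `-F` flags appear to be BSD/macOS `ln` options and may behave differently under GNU coreutils, so a portable variant could be useful. The sketch below links the same six folders from Python. The function name `link_folders` is hypothetical. It also departs from the shell script in two labeled ways: it creates missing target folders, and it refuses to replace a real directory instead of removing it.

```python
import os
from pathlib import Path

# The six folders linked by setup.sh after this change.
LINKED_FOLDERS = [
    "input",
    "output/data",
    "output/tables",
    "output/figures",
    "output/logs",
    "tmp",
]


def link_folders(non_versioned_path, repo_path, folders=LINKED_FOLDERS):
    """Point each repo folder at its non-versioned counterpart via a symlink."""
    for rel in folders:
        target = (Path(non_versioned_path) / rel).resolve()
        link = Path(repo_path) / rel
        # Assumption (not in setup.sh): create the target so the link is not dangling.
        target.mkdir(parents=True, exist_ok=True)
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()  # replace an existing link so re-running is safe
        elif link.exists():
            # Assumption: never delete real data; stop and let the user decide.
            raise FileExistsError(f"{link} exists and is not a symlink")
        os.symlink(target, link, target_is_directory=True)
```

Calling `link_folders(NON_VERSIONED_PATH, REPO_PATH)` with the two paths that `setup.sh` leaves blank would recreate the same layout. On Windows, creating symlinks may require elevated privileges or developer mode.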