36 changes: 23 additions & 13 deletions README.md
@@ -1,12 +1,12 @@

This folder provides an all-encompassing working structure for empirical papers.

It organizes every step of the process: merging and cleaning (several) data sets, performing analyses (tables, figures, regressions), writing the paper and talks themselves, and submitting it to journals.
It organizes every step of the process: merging and cleaning (several) datasets, performing analyses (tables, figures, regressions), writing the paper and talks themselves, and submitting it to journals.

**To use it**: follow the setup instructions below.

## Summary
0. Workflow
0. The Big Picture
1. Requirements
2. Setup
3. Folders
@@ -17,7 +17,18 @@ It organizes every step of the process: merging and cleaning (several) data sets
8. Principles
9. Further Reading

## 0. Workflow
## 0. The Big Picture

* Researchers are "idea entrepreneurs": we do R&D (coming up with new theories and empirical tests), fundraising (applying for grants), team management (coauthors, RAs), product development (writing and testing code), and sales (giving talks and publishing).
* Empirical research is closely analogous to software development, and GitHub is the best place to manage it. See Section 5 below.

Pipeline: raw data + code -> analyses -> products
Pipeline: raw data -> clean data -> assemble -> run analysis -> products

* Each arrow is an input/output step with associated code.
* The more general a step is, the further upstream it lives, possibly in a separate `/data` folder or coming from the Data Basis.
* We assemble data based on the desired observation level (and perhaps other analyses' metadata), e.g. one dataset at the `municipality-year` level, another at the `person` level.
* Each product then has a dependency graph of data and code, and this whole graph should be versioned.
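The dependency-graph idea above can be sketched in miniature. The file names below are hypothetical (only `code/sub/build_clean_input_files.do` appears in this repository), and the depth-first topological sort is just one simple way to derive a run order from such a graph:

```python
# Sketch of one product's dependency graph (file names are hypothetical).
# Each target maps to the inputs it depends on; a topological sort of the
# graph gives a valid run order for the pipeline.
pipeline = {
    "input/raw_a.csv": [],
    "output/data/data_municipality_year.dta": [
        "input/raw_a.csv",
        "code/sub/build_clean_input_files.do",
    ],
    "output/tables/table1.tex": [
        "output/data/data_municipality_year.dta",
        "code/analysis.do",
    ],
}

def run_order(graph):
    """Depth-first topological sort: dependencies come before their targets."""
    seen, order = set(), []

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in graph.get(node, []):
            visit(dep)
        order.append(node)

    for node in graph:
        visit(node)
    return order
```

Versioning the code and this graph, rather than the built outputs, is what makes every product reproducible from raw data.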

![](extra/workflow.png)

@@ -48,7 +59,7 @@ You're good to go. This repository is now ready for the standard workflow described below.

##### `/code`

- Versioned folder containing code that builds data and performs analyses.
- Versioned folder containing code that cleans data, assembles datasets, and performs analyses.
- All output data should be redirected into `/output/data/`, with one data file per observation level.
- All output logs should be redirected into `/output/logs/`.
- Other output files should be redirected into `/output/tables/` or `/output/figures/`.
@@ -62,19 +73,18 @@ You're good to go. This repository is now ready for the standard workflow described below.
- Symbolic link to non-versioned folder with input data.
- Any original data source should be included here in clean and normalized form.
- Only include cleaned files. Raw external files should be cleaned in each data source's own folder.
- These data sets will then be manipulated and merged by the files in `/code`.
- These datasets will then be manipulated and merged by the files in `/code`.

##### `/output`

- Symbolic link to non-versioned folder with output data.
- Holds built data sets in `/output/data/`, to be then used in analysis code.
- Contains all analysis objects generated by files in `/code`.
- Will then serve as source for the generation of `.tex` files inside `/products/`.
- Symbolic link to non-versioned folder with output data and other files.
- Holds built datasets in `/output/data/`, which are then used by the analysis code.
- Holds analyses in `/output/tables/` or `/output/figures/` generated by files in `/code`, to be incorporated into products.

##### `/tmp`

- Symbolic link to non-versioned folder with temporary files.
- Contains any temporary file created during the manipulation of input data sets or the analysis routine.
- Contains any temporary file created during the manipulation of input datasets or the analysis routine.

##### `/extra`

@@ -92,13 +102,13 @@ You're good to go. This repository is now ready for the standard workflow described below.
##### `run_paper.py`

- Automates the whole paper construction.
- Runs everything in a pre-specified order, from beginning (building data sets) to end (compiling `.tex` files).
- Runs everything in a pre-specified order, from beginning (building datasets) to end (compiling `.tex` files).
- Makes it clear what should be run, and in what order.
- Also cleans `/output` and `/tmp` folders before running other code.
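A minimal sketch of what `run_paper.py` might contain, under the conventions above. The Stata and LaTeX commands and the `products/paper/paper.tex` path are assumptions, not part of the template:

```python
"""Hypothetical sketch of run_paper.py; the batch commands and the
products/paper/paper.tex path are illustrative assumptions."""
import shutil
import subprocess
from pathlib import Path

def clean(folder):
    """Remove everything inside `folder` except .keep placeholder files."""
    path = Path(folder)
    if not path.exists():
        return
    for child in path.iterdir():
        if child.name == ".keep":
            continue
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

def main():
    # Clean the non-versioned output and temporary folders first.
    for folder in ("output/data", "output/tables", "output/figures",
                   "output/logs", "tmp"):
        clean(folder)
    # Run everything in a pre-specified order, from building datasets
    # to compiling the .tex files.
    steps = [
        ["stata-mp", "-b", "do", "code/build.do"],
        ["latexmk", "-pdf", "products/paper/paper.tex"],
    ]
    for cmd in steps:
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    main()
```

Keeping the run order in one driver script means a fresh clone can rebuild every table and figure with a single command.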

##### `/code/get_input.py`

- Erases any file inside `/input` and copies any original data set from outside sources.
- Erases any file inside `/input` and copies any original dataset from outside sources.
- Ensures consistency between the original data generation and the data build for the paper.
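A hedged sketch of what `code/get_input.py` might look like; the `SOURCES` mapping and folder names are illustrative, not the template's actual paths:

```python
"""Hypothetical sketch of code/get_input.py; the SOURCES mapping and
folder names are illustrative assumptions."""
import shutil
from pathlib import Path

# Map each cleaned data source (outside the repo) to its /input destination.
SOURCES = {
    "data/some_dataset/output": "input/some_dataset",
}

def refresh_input(sources=SOURCES, base=Path(".")):
    """Erase everything inside /input, then re-copy the cleaned originals."""
    input_dir = base / "input"
    if input_dir.exists():
        shutil.rmtree(input_dir)
    input_dir.mkdir(parents=True)
    for src, dst in sources.items():
        shutil.copytree(base / src, base / dst)
```

Erasing before copying guarantees `/input` never drifts from the upstream cleaned datasets.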

## 5. Leveraging GitHub capabilities
@@ -153,7 +163,7 @@ Use tags: `review`, `build`, `analysis`, `writing`, `negative replies`.
* Update it continuously. It will discipline your work.
- Keep two folders: `/papers`, and `/data`, as shown in the workflow.
1. Data.
* Each folder within `/data` is a data set.
* Each folder within `/data` is a dataset.
* Use the same structure for cleaning these datasets (e.g. `/code`, `/input`, `/output`, `/tmp`).
2. Papers.
* Each folder within `/papers` is a paper.
15 changes: 4 additions & 11 deletions code/build.do
@@ -30,17 +30,10 @@ log using "output/build.log", replace
// main
//----------------------------------------------------------------------------//

// build
do "code/sub/build_datasets.do"
// clean input files
do "code/sub/build_clean_input_files.do"

// merge
do "code/sub/build_merge.do"

// covariates
do "code/sub/build_covariates.do"

// compress and Save
compress
save "output/data/data.dta", replace
// assemble data at <observation_level>
do "code/sub/build_assemble_<observation_level>.do"

log close
25 changes: 25 additions & 0 deletions code/sub/build_assemble_<observation_level>.do
@@ -0,0 +1,25 @@
//----------------------------------------------------------------------------//
//
// paper:
//
// do.file: build_assemble_<observation_level>
//
// author(s):
//
//----------------------------------------------------------------------------//

//-------------------------//
// merge tables
//-------------------------//

//-------------------------//
// generate columns
//-------------------------//

//-------------------------//
// compress and save
//-------------------------//

compress

save "output/data/data_<observation_level>.dta", replace
code/sub/{build_datasets.do → build_clean_input_files.do}
@@ -2,7 +2,7 @@
//
// paper:
//
// do.file: build_datasets
// do.file: build_clean_input_files
//
// author(s):
//
12 changes: 0 additions & 12 deletions code/sub/build_covariates.do

This file was deleted.

9 changes: 0 additions & 9 deletions code/sub/build_merge.do

This file was deleted.

Empty file removed output/figures/.keep
Empty file.
Empty file removed output/tables/.keep
Empty file.
10 changes: 6 additions & 4 deletions setup.sh
@@ -3,7 +3,9 @@
NON_VERSIONED_PATH=""
REPO_PATH=""

ln -shF "${NON_VERSIONED_PATH}/input" "${REPO_PATH}/input"
ln -shF "${NON_VERSIONED_PATH}/output/data" "${REPO_PATH}/output/data"
ln -shF "${NON_VERSIONED_PATH}/output/logs" "${REPO_PATH}/output/logs"
ln -shF "${NON_VERSIONED_PATH}/tmp" "${REPO_PATH}/tmp"
ln -shF "${NON_VERSIONED_PATH}/input" "${REPO_PATH}/input"
ln -shF "${NON_VERSIONED_PATH}/output/data" "${REPO_PATH}/output/data"
ln -shF "${NON_VERSIONED_PATH}/output/tables" "${REPO_PATH}/output/tables"
ln -shF "${NON_VERSIONED_PATH}/output/figures" "${REPO_PATH}/output/figures"
ln -shF "${NON_VERSIONED_PATH}/output/logs" "${REPO_PATH}/output/logs"
ln -shF "${NON_VERSIONED_PATH}/tmp" "${REPO_PATH}/tmp"