Corporate Decision-Making and Quantitative Analysis - Assignment I: Audit market concentration in the EU
Adopting the Open Science Workflow and TRR 266 Template for Reproducible Empirical Accounting Research
This repository provides an infrastructure for an open science-oriented empirical project, specifically targeted at the empirical accounting research community. It features a project exploring audit firms’ market shares in terms of the number of public interest entity (PIE) statutory audits for the year 2021 across EU countries. The project showcases a reproducible workflow integrating Python scripts and data analysis, requiring access to the research platform WRDS, which provides access to a variety of different datasets. This assignment is an empirical replication following open science principles. Through this repository, you will be equipped to evaluate the role of accounting in corporate decision-making while learning to gather, prepare, and analyze relevant data using tools and platforms essential for collaborative and reproducible research.
The task involves accessing and retrieving data from the Audit Analytics Database through WRDS, which adds complexity because it requires both understanding the WRDS dataset structure and writing scripts to pull the data. Reproducing a table from a seminal paper requires a deep understanding of the paper’s methodology and close attention to detail to match the results. Additionally, the project output includes documentation of the steps and explicit assumptions made. The paper (and presentation) output files present the findings, compare them with the paper’s key results, and discuss any differences observed.
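To make the target quantity concrete, here is a minimal sketch of how market shares based on the number of audits per country could be computed. The records and firm names are hypothetical placeholders, not Audit Analytics data, and the function name `market_shares` is an illustration rather than part of this repository.

```python
from collections import Counter, defaultdict

# Hypothetical records: one (country, audit_firm) pair per PIE statutory audit.
# Placeholders only, not Audit Analytics data.
audits = [
    ("DE", "Firm A"), ("DE", "Firm A"), ("DE", "Firm B"), ("DE", "Firm C"),
    ("FR", "Firm B"), ("FR", "Firm B"), ("FR", "Firm D"),
]


def market_shares(records):
    """Return {country: {firm: share}} based on the number of audits."""
    counts = defaultdict(Counter)
    for country, firm in records:
        counts[country][firm] += 1
    shares = {}
    for country, firm_counts in counts.items():
        total = sum(firm_counts.values())
        shares[country] = {firm: n / total for firm, n in firm_counts.items()}
    return shares


shares = market_shares(audits)
for country, by_firm in sorted(shares.items()):
    # List the firms of each country from largest to smallest share.
    for firm, share in sorted(by_firm.items(), key=lambda kv: -kv[1]):
        print(f"{country} {firm}: {share:.1%}")
```

The shares within each country sum to one by construction, which is a useful sanity check when comparing a replicated table with the original.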
Even if you are not specifically interested in audit firms’ market shares (who wouldn’t be?) or do not have access to WRDS Databases, the codebase provided in this repository will give you a clear understanding of how to structure a reproducible empirical project. The template and workflow used here are designed to ensure transparency and reproducibility, making it a valuable resource for any empirical accounting research project.
The default branch, only_python, is a stripped-down version of the template containing only the Python workflow. This branch was cloned from the TRR 266 Template for Reproducible Empirical Accounting Research (TREAT) repository, focusing solely on the Python workflow and utilizing the Python libraries listed in the requirements.txt file.
You start by setting up a few tools on your system:
- If you are new to Python, follow the Real Python installation guide, which gives a good overview of how to set up Python on your system.
- You will also need to set up an Integrated Development Environment (IDE) or a code editor. We recommend VS Code; please follow the Getting started with Python in VS Code guide.
- You will also need Quarto, a scientific and technical publishing system used for the documentation of this project. Please follow the Quarto installation guide to install Quarto on your system. I recommend downloading the Quarto extension as well, which streamlines the workflow for this project.
- Finally, you will need `make` installed on your system if you want to use it. It reads instructions from a `Makefile` and automates the execution of the tasks defined there, so that multi-step workflows run in the correct order.
  - For Linux users, this is usually already installed.
  - For macOS users, you can install `make` by running `brew install make` in the terminal.
  - For Windows users, there are a few options to install `make`, and they depend on how you have set up your system. For example, if you have installed the Windows Subsystem for Linux (WSL), you can install `make` by running `sudo apt-get install make` in the terminal. If not, you are probably better off searching for how to install `make` on Windows and following a reliable source.
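Once the tools are installed, a quick way to confirm that they are reachable from the terminal is to look them up on the PATH. This is a minimal sketch and not part of the repository; the executable names are assumptions and may differ on your system (for example `python` instead of `python3` on Windows).

```python
import shutil

# Executable names are assumptions; adjust them to your system
# (for example "python" instead of "python3" on Windows).
REQUIRED_TOOLS = ["python3", "quarto", "make"]


def find_tools(names):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{name}: {path or 'not found'}")
```

A tool reported as not found is either not installed or not on the PATH of the terminal you are using.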
Next, explore the repository to familiarize yourself with its folders and files in them:
- `config`: This directory holds configuration files that are called by the program scripts in the `code` directory. We try to keep the configurations separate from the code to make it easier to adjust the workflow to your needs. In this project, the `pull_data_cfg.yaml` file outlines the variables and settings needed to extract the necessary Transparency Report data from the Audit Analytics database. The `prepare_data_cfg.yaml` file specifies the configurations for preprocessing and cleaning the data before analysis, ensuring consistency and accuracy in the dataset and following the paper’s filtering requirements. The `do_analysis_cfg.yaml` file contains the parameters and settings used for performing the final analysis on the extracted financial data.
- `code`: This directory holds program scripts that are called to pull data from WRDS directly from Python, prepare the data, run the analysis, and create the output file (the replicated output, stored as a pickle file). Using pickle instead of Excel is preferable here because it is a Python-native format that allows faster read and write operations, preserves data types more accurately, and works directly with Python data structures and libraries.
- `data`: A directory where data is stored. It is used to organize and manage all data files involved in the project, ensuring a clear separation between external, pulled, and generated data sources. Go through the sub-directories and the README file that explains their purpose.
- `doc`: This directory contains Quarto files (.qmd) that include text and program instructions for the paper and presentation (not rendered in this project due to the task instructions; however, feel free to use the presentation template and adjust it to your needs). These files are rendered through the Quarto process using Python and the VS Code extension, integrating code, results, and literal text.
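The point about pickle preserving data types can be illustrated with a small round trip. This is a sketch using only the standard library; the row is hypothetical, and CSV stands in for a text-based or spreadsheet export purely for comparison, since it is not what the project scripts write.

```python
import csv
import io
import pickle
from datetime import date

# Hypothetical row, not taken from the project data.
row = {"country": "DE", "n_audits": 42, "report_date": date(2021, 12, 31)}

# Pickle round trip: the values come back with their original Python types.
restored = pickle.loads(pickle.dumps(row))

# CSV round trip for comparison: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(restored["n_audits"]).__name__, type(restored["report_date"]).__name__)
print(type(from_csv["n_audits"]).__name__, type(from_csv["report_date"]).__name__)
```

One caveat applies to any use of pickle: only load pickle files from sources you trust, because unpickling can execute arbitrary code.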
Important
Make use of significantly enhanced LaTeX table formatting for refined and customizable paper output!
Warning
While generating the presentation, you may notice that some sections and subsections might not have the correct beamer formatting applied. This is due to the color coding in the beamer_theme_trr266.sty file, which might need further adjustments. The current output is based on the template provided and further customization may be required to ensure consistency across all slides.
Tip
Download the VSCode Extension for duplicating files. This will streamline your workflow by allowing you to duplicate files directly within Visual Studio Code, rather than manually copying and pasting in Finder (Mac) or File Explorer (Windows). 😉
Tip
Another fairly new tip: you can synchronize vertical or horizontal scrolling across editors in split view in VS Code. To enable it, run the action Toggle Locked Scrolling Across Editors from the Command Palette. It is very useful when you are aligning a config file with the corresponding Python file, for example. 👩‍💻
You also see an output directory but it is empty. Why? Because the output paper and presentation are created locally on your computer.
Assuming that you have WRDS access, Python, VS Code, Quarto and make installed, this should be relatively straightforward. Refer to the setup instructions in the section above.
Important
- To access the Transparency Report data needed for this project, use the Audit Analytics Database available through the WRDS (Wharton Research Data Services) platform. WRDS acts as a gateway, offering tools for data extraction and analysis, and consolidates multiple data sources for academic and corporate research.
- To access the Audit Analytics Database through WRDS, complete this form if you are not yet registered for WRDS. Ensure that you create an account with your institutional (university) login. If you are from Humboldt-Universität zu Berlin, contact the University Library to get your account request approved. After setting up two-factor authentication (2FA) and accepting the terms of use, you will be ready to use the WRDS databases.
- Unfortunately, WRDS does not typically provide direct access to historical snapshots of databases. The data available through WRDS is usually the most current version (latest update in September 2024). To access a specific historical version like the 2022 version, contact the data vendor through WRDS support directly to inquire about the possibility of accessing historical snapshots.
- Click on the `Use this template` button on the top right of the repository and choose `Create a new repository`. Give the repository a name and a description, and choose whether it should be public or private. Click on `Create repository`.
- You can now clone the repository to your local machine. Open the repository in VS Code and open a new terminal.
- It is advisable to create a virtual environment for the project:
python3 -m venv venv # Create a virtual environment in the `venv` directory
source venv/bin/activate # Activate the virtual environment on Linux and macOS
# venv\Scripts\activate.bat # If you are using Windows - Command Prompt
# venv\Scripts\Activate.ps1 # If you are using Windows - PowerShell and have allowed script execution
You can deactivate the virtual environment by running deactivate.
- With an active virtual environment, you can install the required packages by running `pip install -r requirements.txt` in the terminal. This will install the required packages for the project in the virtual environment.
- Copy the file _secrets.env to secrets.env in the project main directory. Edit it by adding your WRDS credentials.
Note
Not seeing the password while you type is standard behavior for security reasons. When prompted, type your password even though it won’t be displayed and press Enter. When WRDS prompts you to create a .pgpass file, it is asking whether you want to store your login credentials for easier future access. Answer ‘y’ to create the file now and follow the instructions, or ‘n’ if you prefer to enter your password each time or create the file manually later.
Tip
I have included an intermediate check step using the code/python/test_wrds_connection.py file to ensure that WRDS access is secure and functional before running the main program script. Run it first to ensure the connection to WRDS has been successful.
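Before testing the connection, it can also help to confirm that secrets.env actually contains the entries you expect. The following is a minimal sketch of such a check and is not what code/python/test_wrds_connection.py does; the key names `WRDS_USERNAME` and `WRDS_PASSWORD` are assumptions, so use the names that appear in _secrets.env.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Hypothetical content; the real key names are defined in _secrets.env.
example = """
# WRDS credentials
WRDS_USERNAME=jane_doe
WRDS_PASSWORD="not-a-real-password"
"""

secrets = parse_env(example)
missing = [k for k in ("WRDS_USERNAME", "WRDS_PASSWORD") if not secrets.get(k)]
print("missing keys:", missing or "none")
```

To check the real file, pass its contents instead of the example string, for instance with `parse_env(open("secrets.env").read())`.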
- Run `make all` in the terminal. I use the Makefile Tools extension in VS Code to run the Makefile and create the necessary output files in the `output` directory. I highly recommend using the Makefile! Otherwise, you can run the following commands in the terminal:
python code/python/pull_wrds_data.py
python code/python/prepare_data.py
python code/python/do_analysis.py
quarto render doc/paper.qmd
mv doc/paper.pdf output
rm -f doc/paper.ttt doc/paper.fff
quarto render doc/presentation.qmd
mv doc/presentation.pdf output
rm -f doc/presentation.ttt doc/presentation.fff
- Eventually, you will be greeted with the two files in the `output` directory: "paper.pdf" (and "presentation.pdf"). You have successfully used an open science resource and reproduced the analysis. Congratulations! 🥳
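If make is not available, the manual commands can also be chained from Python so that the run stops at the first failing step. This is a minimal sketch under that assumption and not a description of what the Makefile does; `run_pipeline` and its `dry_run` flag are hypothetical names, and only the first four manual commands are included.

```python
import shlex
import subprocess

# The first four manual commands from the list above, in the same order.
STEPS = [
    "python code/python/pull_wrds_data.py",
    "python code/python/prepare_data.py",
    "python code/python/do_analysis.py",
    "quarto render doc/paper.qmd",
]


def run_pipeline(steps, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    executed = []
    for step in steps:
        args = shlex.split(step)
        print("$", step)
        if not dry_run:
            # check=True raises CalledProcessError if the step fails.
            subprocess.run(args, check=True)
        executed.append(args)
    return executed


planned = run_pipeline(STEPS, dry_run=True)
```

Calling `run_pipeline(STEPS, dry_run=False)` from the project root executes the steps. Unlike make, this sketch reruns every step each time instead of skipping steps whose outputs are already up to date.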
This code base, adapted from TREAT, should give you an overview of how the template is used for my specific project and how to structure a reproducible empirical project. To start a new reproducible project on audit firms’ market shares for PIE statutory audits based on this repo, follow these steps:
- Clone the repository by clicking “Use this Template” at the top of the file list on GitHub.
- Remove any files that you don’t need for your specific project.
- Over time, you can fork this repository and customize it to develop a personalized template that fits your workflow and preferences.
Tip
In case you need to work with variables other than those used in this project, I recommend the Manuals and Overviews - Excel Dictionaries, which show the data structures and definitions as presented by Audit Analytics.
This project uses the template of the collaborative research center TRR 266 Accounting for Transparency, which is centered on workflows that are typical in the accounting and finance domain.
The repository is licensed under the MIT license. I would like to give the following credit:
This repository was built based on the ['treat' template for reproducible research](https://github.com/trr266/treat).
💡 If you’re new to collaborative workflows for scientific computing, here are some helpful texts:
- Christensen, Freese and Miguel (2019): Transparent and Reproducible Social Science Research, Chapter 11: https://www.ucpress.edu/book/9780520296954/transparent-and-reproducible-social-science-research
- Gentzkow and Shapiro (2014): Code and data for the social sciences: a practitioner’s guide, https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf
- Wilson, Bryan, Cranston, Kitzes, Nederbragt and Teal (2017): Good enough practices in scientific computing, PLOS Computational Biology 13(6): 1-20, https://doi.org/10.1371/journal.pcbi.1005510