- Clone this GitHub repo
- Download, unzip, and put `USCensus1990.data.txt` in the main directory of this repo: https://archive.ics.uci.edu/dataset/116/us+census+data+1990
- Install Docker and deploy the databases
  - Make sure that Docker is installed on your system: https://docs.docker.com/engine/install/
  - Deploy the databases by running `docker compose up -d`
  - Verify that the database containers are running with `docker ps`
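The repo ships its own `docker-compose.yml`, which is what `docker compose up -d` reads. As a rough illustration only (the service names, image tags, and placeholder password below are assumptions, not the repo's actual configuration), a minimal compose file for a MongoDB and PostgreSQL pair looks like:

```yaml
# Hypothetical sketch -- the repo's docker-compose.yml is authoritative.
services:
  mongo:
    image: mongo:7          # assumed version tag
    ports:
      - "27017:27017"       # expose MongoDB's default port
  postgres:
    image: postgres:16      # assumed version tag
    environment:
      POSTGRES_PASSWORD: example   # placeholder credential
    ports:
      - "5432:5432"         # expose PostgreSQL's default port
```

`docker compose up -d` starts both services in the background; `docker compose down` stops and removes them.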
- Make a Python virtual environment (recommended) with `python -m venv`
- Install the dependencies of this repo with `pip install -r requirements.txt`
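After installing the requirements, a quick sanity check is to confirm the expected packages import cleanly. The package names below (`pymongo`, `psycopg2`, `streamlit`) are guesses at what `requirements.txt` contains, not a confirmed list:

```python
# check_env.py -- sanity-check that likely dependencies are importable.
# The package list is an assumption; adjust it to match requirements.txt.
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(["pymongo", "psycopg2", "streamlit"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages found.")
```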
- Populate the databases with the sample data
  - Run `ingestion_test.py` and `census_ingest_mongo.py` to load in the Census dataset
  - Run `ecommerce_ingest.py` and `ecommerce_ingest_postgres.py` to load the ecommerce dataset into each database
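`census_ingest_mongo.py` is the script that actually performs the MongoDB load. As a rough sketch of the idea only, assuming `USCensus1990.data.txt` is a comma-separated file with a header row and assuming a local MongoDB on the default port, ingestion amounts to turning each row into a document and bulk-inserting in batches:

```python
# Sketch of a CSV-to-MongoDB ingest; see census_ingest_mongo.py for the
# repo's actual implementation. Connection details are assumptions.
import csv
from itertools import islice

def rows_to_docs(lines):
    """Yield one dict per CSV data row (the header row supplies the keys)."""
    for row in csv.DictReader(lines):
        yield dict(row)

def batched(iterable, size):
    """Yield lists of up to `size` items, suitable for bulk insert_many calls."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def main():
    """Load the census file into MongoDB. Requires pymongo and the
    running container from the Docker step; not called automatically."""
    from pymongo import MongoClient
    coll = MongoClient("mongodb://localhost:27017")["census"]["people"]
    with open("USCensus1990.data.txt", newline="") as f:
        for batch in batched(rows_to_docs(f), 10_000):
            coll.insert_many(batch)
```

Batching keeps memory bounded on a multi-million-row file instead of materializing every document at once.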
- Run the streamlit application with `streamlit run app.py`
  - streamlit will list the URLs where you can reach the app
  - The default is `localhost:8501`, and the app is also served on your local subnet
- Populate the test data to see the differences in database performance via `python populate_query_logs.py`
- Evaluate results!
This repo also includes a series of individual performance tests that you can run. For example, to see the improvement from indexing a SQL database, run `python indexing_test.py`. Some of my results are in the results.md document.
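`indexing_test.py` runs the real experiment against the repo's databases. The general shape of such a test, sketched here with the stdlib's sqlite3 rather than the repo's setup, is to time the same selective query before and after adding an index:

```python
# Minimal index-speedup demo using stdlib sqlite3 (a standalone sketch,
# not the repo's indexing_test.py).
import sqlite3
import time

def build_db(n_rows=200_000):
    """Create an in-memory table of n_rows users with cycling ages 0..89."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        ((i, i % 90) for i in range(n_rows)),
    )
    return conn

def timed_lookup(conn):
    """Count users with age = 42 and report how long the query took."""
    start = time.perf_counter()
    count = conn.execute(
        "SELECT COUNT(*) FROM users WHERE age = 42"
    ).fetchone()[0]
    return count, time.perf_counter() - start

if __name__ == "__main__":
    conn = build_db()
    _, scan_t = timed_lookup(conn)                  # full table scan
    conn.execute("CREATE INDEX idx_age ON users (age)")
    _, index_t = timed_lookup(conn)                 # index lookup
    print(f"scan: {scan_t:.4f}s  indexed: {index_t:.4f}s")
```

The indexed query answers from the `idx_age` B-tree instead of scanning every row, which is the effect the repo's indexing test measures at larger scale.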
The presentation for this workshop is available as a PowerPoint in this repo; it gives a general motivation for understanding your data flow and database selection. Database selection is an active choice that needs to be made by the discerning data scientist, and understanding the actual mechanics of data storage, transport, and processing is a key piece of education that is often missed in data science programs.
- A good overview of database systems (and the source of some of the images): https://cs186berkeley.net/notes/note17/