Gain understanding and practical experience of how to read data from database tables into a Pandas dataframe and write it back. Typical use cases:
- Test assignment during job application for DA/DE roles
- Data transformation where it’s hard or impossible to use SQL
- Migrating data between different databases
Ensure you have the following software installed and accounts set up on your laptop:
- git
- Python
- Docker
- DBeaver
- Visual Studio Code (dark theme and extensions: Python, Jupyter, Rainbow CSV)
- Account in Snowflake
- Clone the repo and access it using VS Code
- Generate data
- Load data to PostgreSQL
- Read data from PostgreSQL to Pandas dataframe
- Modify data in Pandas dataframe and write it back to PostgreSQL
- Speed up export to PostgreSQL
- Read data from Snowflake to Pandas dataframe
- Transfer data from PostgreSQL to Snowflake
- Project is downloaded from GitHub
- Project is opened in VS Code
- Open the folder where you want to locate the project repo
- Run the terminal inside the folder and clone the project repo:

  ```
  git clone https://github.com/surfalytics/data-projects.git
  ```

- Open VS Code and navigate to the cloned project folder
- Double-check that you have installed the Python, Jupyter, and Rainbow CSV extensions for VS Code
- Generated `sales_data.csv` file in the `./docker/source-data/` folder
- In the `docker` folder, create a `source-data` subfolder
- Run the `generate_data.py` file (a hedged sketch of such a script follows this list)
- Ensure the `csv` file is created
- Open it in VS Code to observe it and see how the Rainbow CSV extension works
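For orientation, here is a minimal sketch of what such a generator could look like. This is a hypothetical stand-in, not the repo's actual `generate_data.py`: the real script produces the full Superstore-style column set, while this version generates only a handful of columns to show the idea.

```python
# Hypothetical sketch of a sales-data generator (the repo's generate_data.py
# may differ). It builds N random Superstore-like rows and writes them to CSV.
import numpy as np
import pandas as pd

N = 3_000_000
rng = np.random.default_rng(42)

order_dates = pd.to_datetime("2020-01-01") + pd.to_timedelta(
    rng.integers(0, 3 * 365, N), unit="D"
)

df = pd.DataFrame({
    "row_id": np.arange(1, N + 1),
    "order_id": [f"ORD-{i:07d}" for i in range(1, N + 1)],
    "order_date": order_dates.date,
    # ship 1-14 days after the order, so a shipping duration is derivable later
    "ship_date": (order_dates + pd.to_timedelta(rng.integers(1, 15, N), unit="D")).date,
    "ship_mode": rng.choice(
        ["Standard Class", "Second Class", "First Class", "Same Day"], N
    ),
    "sales": rng.uniform(1, 1000, N).round(2),
    "quantity": rng.integers(1, 10, N),
})

df.to_csv("./docker/source-data/sales_data.csv", index=False)
```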
- Generated `csv` file is exported to the PostgreSQL database
- You accessed the database using DBeaver and checked that the row count is 3 million and the data is correct
- Launch the Docker Desktop application
- Open a terminal in VS Code and navigate to the `docker` directory
- Run the following command to build the Docker image:

  ```
  docker build -t postgres-image .
  ```

- Check that the `postgres-image` image is created:

  ```
  docker image list
  ```

- Run the following command to run a container:

  ```
  docker run --name postgres-cont -d -p 5433:5432 postgres-image
  ```

- Check that the `postgres-cont` container is created:

  ```
  docker ps
  ```
- Open the DBeaver application
- Create a new connection to the newly created PostgreSQL database in Docker
- Click New Connection > Select PostgreSQL and fill in the settings:
  - Host: `localhost`
  - Port: `5433`
  - Database: `sales`
  - Username: `postgres`
  - Password: `1`
- Click `Test Connection`. In case of success, you should see `Connected`. If so, click `Finish`.
- In DBeaver, in the newly created connection, ensure you see the `sales` database, the `migration` schema, and the `sales_data` table inside it.
- Open the SQL Editor and run the following query:

  ```
  SELECT COUNT(*) FROM migration.sales_data;
  ```

  You should see 3,000,000 as a result.
- Run the following query to observe the imported data:

  ```
  SELECT * FROM migration.sales_data LIMIT 100;
  ```

  This is just to ensure the data is imported and looks as expected.
- Data from PostgreSQL is imported into a Pandas dataframe and printed to the screen.
- In VS Code, open a terminal
- Create a virtual environment for the project:

  ```
  python -m venv project_venv
  ```

- Activate the environment (on macOS/Linux, run `source project_venv/bin/activate` instead):

  ```
  .\project_venv\Scripts\activate
  ```

- Install the required Python packages:

  ```
  pip install pandas sqlalchemy psycopg2-binary "snowflake-connector-python[pandas]"
  ```

- Navigate to the `migration.ipynb` file. Ensure it's properly displayed.
- Set the kernel to use the newly created `project_venv` environment
- Run the first cell (a hedged sketch of what such a cell might contain follows this list)
- Ensure you can see the displayed Pandas dataframe with the data from PostgreSQL
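The notebook ships with the actual code, but as a rough guide, the first cell could look something like this minimal sketch. It assumes the Docker setup above (port `5433`, database `sales`, user `postgres`, password `1`); the real cell may differ:

```python
# Minimal sketch: read the PostgreSQL table into a Pandas dataframe.
# Connection details match the Docker container started earlier.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://postgres:1@localhost:5433/sales")

# Pull the whole table into memory and display the first rows
df = pd.read_sql("SELECT * FROM migration.sales_data", engine)
df.head()
```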
- The new Pandas dataframe is created and contains the new calculated `shipping_duration` column
- You printed the new dataframe
- You wrote the data from the new dataframe back to PostgreSQL
- Run the second cell (a hedged sketch of it follows this list)
- Ensure you see the printed dataframe and that it contains the new `shipping_duration` column
- Uncomment the last line in the cell and run the cell again
- Track how long it took to execute the code in the cell. Later, we will try to speed it up.
- In DBeaver, refresh the tables and check if the new `updated_sales_data` table appeared.
- Check the number of rows and print the first 100 rows to check if the new `shipping_duration` column is there
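As a rough guide, the second cell might look like the following sketch (hypothetical; the notebook's actual code may differ). It derives `shipping_duration` in days and writes the result back with `to_sql`:

```python
# Minimal sketch: add a calculated column and write the dataframe back.
updated_df = df.copy()
updated_df["shipping_duration"] = (
    pd.to_datetime(updated_df["ship_date"]) - pd.to_datetime(updated_df["order_date"])
).dt.days

print(updated_df.head())

# Last line, uncommented on the second run. Plain to_sql issues batched
# INSERTs, which is slow for 3 million rows -- hence the next step.
# updated_df.to_sql("updated_sales_data", engine, schema="migration", index=False)
```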
- It takes noticeably less time to upload the data from the Pandas dataframe to PostgreSQL
- In DBeaver, create the schema for the updated table by running the following SQL:

  ```
  CREATE TABLE migration.fast_updated_sales_data (
      row_id INT PRIMARY KEY,
      order_id VARCHAR(255),
      order_date DATE,
      ship_date DATE,
      ship_mode VARCHAR(50),
      customer_id VARCHAR(50),
      customer_name VARCHAR(255),
      segment VARCHAR(50),
      country VARCHAR(50),
      city VARCHAR(50),
      state VARCHAR(50),
      postal_code VARCHAR(20),
      region VARCHAR(50),
      product_id VARCHAR(255),
      category VARCHAR(50),
      sub_category VARCHAR(50),
      product_name VARCHAR(255),
      sales NUMERIC(10, 2),
      quantity INT,
      discount NUMERIC(5, 2),
      profit NUMERIC(10, 2),
      shipping_duration INT
  );
  ```

- Run the third cell in the `migration.ipynb` file (a hedged sketch of the technique follows this list)
- Track the execution time and compare it to the time in the previous step.
- In DBeaver, check if the `fast_updated_sales_data` table was populated.
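One common way to get this speed-up, and roughly what the third cell could be doing (this is an assumption, not a transcript of the notebook), is to stream the dataframe through PostgreSQL's `COPY` command instead of INSERT statements:

```python
# Hedged sketch: bulk-load the dataframe via COPY using the underlying
# psycopg2 connection. COPY parses a CSV stream server-side, which is far
# faster than millions of INSERTs. Assumes the dataframe column order
# matches the fast_updated_sales_data table definition.
import io

buffer = io.StringIO()
updated_df.to_csv(buffer, index=False, header=False)
buffer.seek(0)

raw_conn = engine.raw_connection()  # unwrap the psycopg2 connection
try:
    with raw_conn.cursor() as cur:
        cur.copy_expert(
            "COPY migration.fast_updated_sales_data FROM STDIN WITH (FORMAT csv)",
            buffer,
        )
    raw_conn.commit()
finally:
    raw_conn.close()
```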
- Data from Snowflake is imported into a Pandas dataframe and printed to the screen
- Log in to your Snowflake account
- In VS Code, open the `migration.ipynb` file and go to the fourth cell
- Update the connection credentials using your account data
- Run the fourth cell (a hedged sketch of it follows this list). If everything was done correctly, you should see the table data from Snowflake.
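For orientation, the fourth cell could resemble the sketch below. The credentials are placeholders, and `SOME_TABLE` is a hypothetical table name; substitute a table you can actually read. `fetch_pandas_all()` requires the `snowflake-connector-python[pandas]` extra installed earlier:

```python
# Hedged sketch: read a Snowflake table into a Pandas dataframe.
import snowflake.connector

snowflake_conn = snowflake.connector.connect(
    user="snowflake_user",          # placeholders -- use your account data
    password="snowflake_password",
    account="snowflake_account",
    warehouse="snowflake_warehouse",
    database="snowflake_database",
    schema="snowflake_schema",
)

cur = snowflake_conn.cursor()
cur.execute("SELECT * FROM SOME_TABLE LIMIT 100")  # hypothetical table name
sf_df = cur.fetch_pandas_all()
print(sf_df.head())
```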
- Data from PostgreSQL is uploaded to Snowflake
- In Snowflake, create your personal database and schema:

  ```
  CREATE DATABASE IF NOT EXISTS <your_name>_MIGRATION;
  CREATE SCHEMA IF NOT EXISTS <your_name>_MIGRATION.RAW;
  ```

- In VS Code, go to the last, fifth, cell and update the Snowflake connection credentials with your account details:

  ```
  snowflake_conn = snowflake.connector.connect(
      user='snowflake_user',
      password='snowflake_password',
      account='snowflake_account',
      warehouse='snowflake_warehouse',
      database='<your_name>_MIGRATION',  # Put your db name here
      schema='RAW')
  ```
- Run the fifth cell (a hedged sketch of it follows this list)
- In Snowflake, check that the new table with the data was added to your database.
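The fifth cell presumably bulk-loads the dataframe with `write_pandas` from the connector's pandas tools; the sketch below shows that approach, assuming a target table named `SALES_DATA` (the actual notebook and table name may differ):

```python
# Hedged sketch: upload the dataframe to Snowflake in bulk.
from snowflake.connector.pandas_tools import write_pandas

# Snowflake folds unquoted identifiers to uppercase, so upper-case the
# dataframe columns before loading to avoid case-mismatch surprises.
updated_df.columns = [c.upper() for c in updated_df.columns]

success, n_chunks, n_rows, _ = write_pandas(
    snowflake_conn,
    updated_df,
    table_name="SALES_DATA",     # hypothetical target table name
    auto_create_table=True,      # create the table if it doesn't exist yet
)
print(success, n_rows)
```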
- See more detailed instructions with screenshots in this article - https://python.plainenglish.io/bridging-databases-with-python-from-database-tables-to-pandas-and-back-again-1ae5ad4e2e1f