Hello and welcome to the repository for my Data Engineering project during my fellowship at Insight Data Science in New York 2018.
- Reduce staffing costs by providing a centralized monitoring dashboard to view multiple patient's ECG signals at once.
- Index and store raw physiological signals for downstream analysis.
Hospitals collect and process a lot of signals and are now moving towards storing them. My project aims to process and display ECG signals in near real time from patients. These ECG signals are processed every minute to analyze the beat-to-beat heart rate (HR) of each patient in the hospital. These signals are also indexed and stored for use as inputs when developing machine learning models to predict adverse physiological events to improve patient care.
My pipeline loads in ECG timeseries data from an S3 bucket which contains separate files for each patient. The files are ingested line by line simulating sampling of the ECG signals by my kafka brokers. The brokers produce messages into a topic which is subscribed to by two spark streaming clusters. The first spark streaming cluster has a 2-second mini-batch interval which groups the signals from each patient and saves them to my PostgreSQL database. The second cluster has a 60-second mini-batch interval to measure the number of beats over the batch period to calculate heart rate per minute over time and save them to my database. Botht the time series ECG samples and the calculated heart rate's are displayed on my dash front end.
- To setup clusters run setup/setup_cluster.sh.
- To setup database follow setup/db_setup.txt.
- To setup the airflow job follow setup/run_airflow.txt
- To start the pipeline follow the instructions in src/README.md
- Unittests can be run using the run_unittests.sh file in the ./test folder. Results are stored in ./test/results.txt.
- four m4.large EC2 instances for Kafka producers
- four m4.large EC2 instances for Spark Streaming (2s mini-batch) which writes to postgres database
- four m4.large EC2 instances for Spark Streaming (60s mini-batch) which calculates heart rate over 60s period
- one m4.large EC2 instance for TimescaleDB/PostgreSQL database
- one t2.micro EC2 instance for Dash app

