This repository contains the complete data management pipeline for the DCDM Group 2 coursework. The project integrates high-throughput knockout mouse phenotyping data from the International Mouse Phenotyping Consortium (IMPC) with human disease ontologies.
The solution consists of three main components:
- ETL Pipeline: A series of R scripts to clean, normalise, and categorise raw phenotypic data.
- Relational Database: A 3NF-normalised MySQL schema connecting Genes, Procedures, Parameters, Analyses, and Diseases.
- Interactive Dashboard: An R Shiny application for exploring genotype-phenotype associations, performing dimensionality reduction (PCA), and visualising gene clusters.
DCDM_Group2/
├── config/ # Configuration files for file paths
│ └── paths.yaml
├── database2/ # Database artifacts and SQL dumps
│ └── DUMP # MySQL database dump file
├── outputs/ # Generated CSVs for SQL import and intermediate data
├── scripts/ # Source code for Data Cleaning, SQL Prep, and Shiny App
│ ├── IMPC_APP.R
│ ├── R_script_for _SQL_import.Rmd # Final preparation of CSVs for SQL LOAD DATA
│ ├── STEP1_rawdata_establish.Rmd # Initial raw data loading
│ ├── STEP2_cleaning_Disease_information.Rmd # Cleaning disease ontology data
│ ├── STEP2_cleaning_IMPC_Procedure.Rmd # Cleaning procedure metadata
│ ├── STEP2_cleaning_parameter_description.Rmd# Text mining parameter descriptions
│ └── STEP3_parameter_groupings.Rmd # Categorising parameters (e.g., "Metabolism")
├── SMOKE_TEST.txt # System integrity test file
└── .gitignore # Git ignore configuration
- R & RStudio: (Version 4.0+)
- MySQL Server: (Version 8.0+)
- R Packages:
tidyverse(Data manipulation)shiny(Web framework)DT(Interactive tables)plotly(Interactive charts)viridis(Colour scales)readr,yaml(File IO)
The scripts in the scripts/ directory should be run in the following order to reproduce the dataset:
- Raw Data Setup: Run
STEP1_rawdata_establish.Rmdto load the source files. - Cleaning Modules: Run the three
STEP2scripts to clean Procedures, Disease info, and Parameter descriptions. - Categorisation: Run
STEP3_parameter_groupings.Rmdto generate the logical groupings for the phenotypes. - SQL Preparation: Run
R_script_for _SQL_import.Rmd. This generates the final normalised CSV files (e.g.,Genes_for_SQL.csv,Analyses_for_SQL.csv) in theoutputs/folder.
- Create the database
DCDM_IMPC_Group2in MySQL. - Execute the table creation SQL commands (found inside
R_script_for _SQL_import.Rmdor the SQL dump). - Import the CSV files generated in the previous step using
LOAD DATA INFILEcommands.- Note: Ensure
local_infileis enabled if loading from client.
- Note: Ensure
To explore the data:
- Open
scripts/app.Rin RStudio. - Ensure the input CSVs (
IMPC_cleaned3.csv,IMPC_parameter_groupings.csv) are accessible to the script. - Click "Run App".
-
Gene View: Univariate jitter plots showing significance scores (
$-\log_{10} P$ ) across phenotype categories. - Phenotype View: Comparative horizontal bar charts ranking genes by effect size for specific parameters.
- Clustering: Hierarchical clustering heatmaps identifying pleiotropic gene signatures.
- PCA: Principal Component Analysis to visualise gene separation based on phenotypic profiles.
DCDM Group 2