This project develops logistic regression models in both Python and R to predict passenger survival on the Titanic. Each implementation is containerized using Docker to ensure reproducibility and portability. The instructions below explain how to download the data and run both containers step by step.
```
titanic-disaster/
│
├── data/                    # CSV files (download manually from Kaggle)
├── src/
│   ├── code/                # Python implementation
│   │   └── main.py
│   └── r/                   # R implementation
│       ├── main.R
│       └── install_packages.R   # R dependencies
│
├── Dockerfile               # Python container configuration
├── Dockerfile_R             # R container configuration
├── requirements.txt         # Python dependencies
└── README.md                # Project documentation
```
Download the dataset from the official Kaggle Titanic competition page:
URL: https://www.kaggle.com/competitions/titanic
The following three files are required:
- train.csv
- test.csv
- gender_submission.csv
- Visit the Kaggle link above and log in.
- Click the Data tab and select Download All.
- Extract the ZIP file.
- Move the three CSV files into your local project directory under titanic-disaster/data/.
The final structure should look like this:
```
titanic-disaster/
└── data/
    ├── train.csv
    ├── test.csv
    └── gender_submission.csv
```
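As an alternative to the manual download, the steps above can be scripted with the official kaggle Python package. This is an optional sketch, not part of this repository; it assumes the package is installed (pip install kaggle), a Kaggle API token is configured, and you have accepted the competition rules on Kaggle.

```python
# Optional sketch: download and extract the Titanic competition files with the
# Kaggle API (assumes `pip install kaggle` and a configured ~/.kaggle/kaggle.json).
import zipfile
from pathlib import Path

from kaggle.api.kaggle_api_extended import KaggleApi

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

api = KaggleApi()
api.authenticate()  # reads the API token from ~/.kaggle or environment variables
api.competition_download_files("titanic", path=str(data_dir))

# The archive name follows the competition slug (assumed: titanic.zip).
with zipfile.ZipFile(data_dir / "titanic.zip") as archive:
    archive.extractall(data_dir)  # yields train.csv, test.csv, gender_submission.csv
```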
Run the following commands from the project root directory:

```
docker build -t titanic-app .
docker run --rm -it titanic-app
```

The container will:
- Load and clean the training data
- Display data summaries and missing value statistics
- Train a logistic regression model
- Output model coefficients and intercepts
- Display training and test accuracy
All progress and results are printed directly in the terminal.
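For orientation, the sketch below shows the general shape of such a pipeline with pandas and scikit-learn. It is illustrative only; the feature selection and imputation choices here are assumptions and may differ from the actual src/code/main.py.

```python
# Illustrative sketch of a Titanic logistic regression pipeline
# (assumes pandas and scikit-learn; the real src/code/main.py may differ).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/train.csv")
print(df.isna().sum())  # missing-value statistics

# Basic cleaning: impute Age with the median, encode Sex numerically.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X, y = df[features], df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Coefficients:", dict(zip(features, model.coef_[0])))
print("Intercept:", model.intercept_[0])
print("Training accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))
```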
From the project root directory:
```
docker build -t titanic-r -f Dockerfile_R .
docker run --rm -it -v "%cd%/data:/app/data" titanic-r
```

Note: %cd% expands only in the Windows Command Prompt. On macOS / Linux, replace %cd% with $(pwd).
The R container will:
- Load and clean the Titanic dataset
- Display missing value summaries and cleaned data overview
- Train a logistic regression model using `glm`
- Display model coefficients and training accuracy
- Output test prediction summaries and test accuracy
- The `data` directory is excluded from version control. You must download the CSV files manually before running the containers.
- Both containers access the same dataset through the mounted `/app/data` directory.
- Docker caching ensures efficient rebuilds, as dependencies are reinstalled only when package files change.
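Because both containers expect the CSV files to be present, a quick pre-flight check can save a failed run. The helper below is hypothetical and not part of the repository; it only verifies that the expected files exist under data/.

```python
# Hypothetical pre-flight check (not part of this repository): verify that the
# three Kaggle CSV files are present in data/ before building or running the containers.
import sys
from pathlib import Path

REQUIRED = ["train.csv", "test.csv", "gender_submission.csv"]

missing = [name for name in REQUIRED if not (Path("data") / name).is_file()]
if missing:
    sys.exit("Missing files in data/: " + ", ".join(missing) + " (download them from Kaggle first)")
print("All Titanic CSV files are present in data/.")
```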