This project demonstrates the use of Machine Learning Pipelines in Python using scikit-learn to predict survival on the Titanic dataset. The pipeline handles all preprocessing steps and applies a classifier in a clean, reproducible way.
├── Machine pipeline.ipynb # Main notebook with model building pipeline
├── predict using pipeline.ipynb # Notebook to use trained pipeline for predictions
├── pipe.pkl # Trained pipeline saved as a pickle file
└── README.md # Project documentation
- Complete ML pipeline including preprocessing and model training
- Handling missing values and categorical encoding
- Pipeline serialization using joblib
- Inference using the saved pipeline
- Simple and extendable structure
Install dependencies using:
pip install -r requirements.txt
You'll need:
- scikit-learn
- pandas
- numpy
- joblib
- matplotlib (optional, for visualizations)
The dataset used is the classic Titanic dataset.
It includes features such as Pclass, Sex, Age, Fare, and survival labels (Survived).
The pipeline includes the following steps:
- Imputation: Filling missing values (e.g., Age, Embarked).
- Encoding: Converting categorical variables (Sex, Embarked) using OneHotEncoder.
- Feature Scaling: StandardScaler for numeric features.
- Feature Selection: (Optional) using SelectKBest.
- Classification: Using RandomForestClassifier.
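The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the column grouping, the imputation strategies, `f_classif` as the scoring function, `k=5`, and the forest settings are all assumptions, and the `toy` frame holds a few made-up rows that only mimic the Titanic column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only (not real Titanic records).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "Sex": ["female", "male", "female", "male", "male", "female", "male", "female"],
    "Age": [29.0, 22.0, np.nan, 35.0, 54.0, 4.0, np.nan, 38.0],
    "Fare": [71.3, 7.25, 13.0, 8.05, 51.9, 16.7, 26.0, 80.0],
    "Parch": [0, 0, 1, 0, 0, 1, 0, 0],
    "Embarked": ["C", "S", "S", np.nan, "S", "S", "Q", "C"],
})
y = pd.Series([1, 0, 1, 0, 0, 1, 0, 1], name="Survived")

# Assumed split of columns into numeric and categorical groups.
numeric_features = ["Pclass", "Age", "Fare", "Parch"]
categorical_features = ["Sex", "Embarked"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_features),
    ("cat", categorical_steps, categorical_features),
])

# Full pipeline: preprocessing, optional feature selection, classifier.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

pipe.fit(toy, y)
preds = pipe.predict(toy)
print(preds.shape)
```

Because every step lives inside one `Pipeline` object, the same imputation, encoding, and scaling fitted on the training data are reapplied automatically at prediction time.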
- Train the model: Open Machine pipeline.ipynb and run all cells. This notebook creates the pipeline, trains it, and saves it to pipe.pkl.
- Predict using the saved model: Open predict using pipeline.ipynb to load the trained model and make predictions on new or test data.
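The save-then-load round trip behind these two steps might look like the sketch below. It uses a tiny stand-in pipeline and made-up numeric data rather than the notebook's actual pipeline, and it writes to a temporary directory so it does not overwrite the real pipe.pkl.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up two-column data (think Age and Fare); not real Titanic records.
X = np.array([
    [22.0, 7.25],
    [38.0, 71.3],
    [26.0, 7.9],
    [35.0, 53.1],
    [54.0, 51.9],
    [2.0, 21.1],
])
y = np.array([0, 1, 1, 1, 0, 0])

# Stand-in pipeline; the notebook's pipeline has more steps.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe.pkl")
    joblib.dump(pipe, path)        # serialize the fitted pipeline
    restored = joblib.load(path)   # load it back, as the prediction notebook does

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```

A pickled pipeline is best loaded with the same scikit-learn version that produced it, and only from a source you trust, since unpickling can execute arbitrary code.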
import joblib
import pandas as pd

pipe = joblib.load("pipe.pkl")

new_data = pd.DataFrame([{
    "Pclass": 3,
    "Sex": "male",
    "Age": 22,
    "Parch": 0,
    "Embarked": "S"
}])

prediction = pipe.predict(new_data)
print("Survived" if prediction[0] == 1 else "Did not survive")

- Kaggle for the dataset.
- scikit-learn for the pipeline and modeling tools.