A beginner-friendly AI/ML project that predicts whether two users are likely to meet offline based on their social interaction patterns.
This project uses machine learning to analyze social connection patterns and predict if two users will meet in person. It's designed to be educational and accessible for beginners in AI/ML.
- Synthetic Dataset Generation: Creates realistic social interaction data with 5000+ user pairs
- Multiple ML Models: Implements both Logistic Regression and Random Forest classifiers
- Comprehensive Evaluation: Provides accuracy, precision, recall, and ROC-AUC metrics
- Interactive Predictions: Function to predict new user pairs with confidence scores
- Rich Visualizations: Histograms, confusion matrices, and data distribution plots
- Model Persistence: Saves trained models for future use
Prerequisites:
- Python 3.7 or higher
- pip package manager

Installation:
- Clone or download this project
- Install required packages:

    pip install -r requirements.txt

Simply run the main script:

    py social_connection_predictor.py

Run the beautiful Streamlit web app:

    streamlit run streamlit_app.py

The web app will open automatically in your browser at http://localhost:8501
Note: The commands above use py, the Python launcher for Windows, which is handy if you have multiple Python versions installed. On macOS or Linux, use python3 (or python) instead.

The main script will:
- Generate a synthetic dataset
- Train both models
- Compare performance
- Create visualizations
- Save the best model
- Demonstrate predictions on example data

The web app will:
- Provide interactive sliders for input
- Show real-time predictions
- Display beautiful visualizations
- Include confidence gauges
- Show model performance metrics
- Offer sample predictions
The synthetic dataset includes the following features for each user pair:
| Feature | Type | Description | Range |
|---|---|---|---|
| chat_freq | int | Number of chats per week | 0-20 |
| response_time | float | Average reply time in minutes | 1-1440 |
| events_attended | int | Number of common events attended | 0-10 |
| similarity_index | float | Similarity score between users | 0-1 |
| location_proximity | float | Distance between users in km | 0.1-100 |
| met_offline | int | Target: Did they meet offline? | 0 or 1 |
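
The table above can be turned into a small generator. The following is a minimal sketch, not the project's actual generate_synthetic_dataset(): the function name make_synthetic_pairs, the scoring weights, the noise level, and the 0.5 threshold are all assumptions chosen for illustration. Only the column names and ranges come from the table.

```python
import numpy as np
import pandas as pd


def make_synthetic_pairs(n_samples=5000, seed=42):
    """Hypothetical stand-in for generate_synthetic_dataset()."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "chat_freq": rng.integers(0, 21, n_samples),             # 0-20 chats per week
        "response_time": rng.uniform(1, 1440, n_samples),        # minutes
        "events_attended": rng.integers(0, 11, n_samples),       # 0-10 common events
        "similarity_index": rng.uniform(0, 1, n_samples),        # 0-1
        "location_proximity": rng.uniform(0.1, 100, n_samples),  # km
    })
    # Assumed scoring rule: more chats, faster replies, more shared events,
    # higher similarity, and shorter distance all raise the score.
    score = (
        0.30 * df["chat_freq"] / 20
        + 0.15 * (1 - df["response_time"] / 1440)
        + 0.25 * df["events_attended"] / 10
        + 0.15 * df["similarity_index"]
        + 0.15 * (1 - df["location_proximity"] / 100)
    )
    # Gaussian noise keeps the label from being a deterministic function of the features.
    noise = rng.normal(0, 0.05, n_samples)
    df["met_offline"] = ((score + noise) > 0.5).astype(int)
    return df


pairs = make_synthetic_pairs()
print(pairs.head())
print("Share of pairs that met offline:", pairs["met_offline"].mean())
```

Because the weights sum to 1 and every feature is normalised to the 0-1 range, the score is centred near 0.5, so the assumed threshold yields roughly balanced classes.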

Logistic Regression:
- Simple, interpretable linear model
- Good baseline for binary classification
- Fast training and prediction

Random Forest:
- Ensemble method using multiple decision trees
- Often performs better on complex patterns
- Provides feature importance insights
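
To make the comparison between the two models concrete, here is a rough sketch that trains both on stand-in data and reports the four metrics this project lists. It assumes scikit-learn is installed; the data comes from make_classification rather than the project's dataset, the hyperparameters are illustrative, and picking the winner by ROC-AUC is an assumption about how "best" might be defined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 5 numeric features and a binary target (not the project's data).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4,
                           n_redundant=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    proba = model.predict_proba(X_test_s)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }
    print(name, results[name])

# Assumed selection rule: highest ROC-AUC on the test split.
best_name = max(results, key=lambda n: results[n]["roc_auc"])
print("Best by ROC-AUC:", best_name)

# Random Forest exposes per-feature importances that sum to 1.
importances = models["Random Forest"].feature_importances_
for i, imp in enumerate(importances):
    print(f"feature_{i}: {imp:.3f}")
```

Scaling matters for Logistic Regression but has little effect on Random Forest; the sketch scales for both so the same inputs feed each model.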
After running the script, you'll get:
- social_connection_model.pkl - The best performing trained model
- social_connection_scaler.pkl - Feature scaler for preprocessing
- feature_distributions.png - Data visualization plots
- confusion_matrices.png - Model performance matrices
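
For readers curious how the two .pkl files might be produced, the sketch below saves a fitted model and scaler with joblib and checks that they survive a round trip. It is an illustration under assumptions: the toy data and the temporary output directory are made up here, and the project's own saving code may differ. Only the two file names come from the list above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data with 5 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

# Write to a temporary directory so the sketch does not overwrite real outputs.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "social_connection_model.pkl")
scaler_path = os.path.join(out_dir, "social_connection_scaler.pkl")
joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

# Reload both objects and confirm they reproduce the original predictions.
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)
same = np.array_equal(
    model.predict(scaler.transform(X)),
    loaded_model.predict(loaded_scaler.transform(X)),
)
print("Round trip preserved predictions:", same)
```

Saving the scaler alongside the model matters because new inputs must be transformed with the same statistics that were used during training.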
You can use the trained model to predict new user pairs:

    import joblib
    import numpy as np

    # Load the saved model and scaler
    model = joblib.load('social_connection_model.pkl')
    scaler = joblib.load('social_connection_scaler.pkl')

    # Example: Predict for a new user pair
    def predict_user_pair(chat_freq, response_time, events_attended,
                          similarity_index, location_proximity):
        # Feature order must match the order used during training
        features = np.array([[chat_freq, response_time, events_attended,
                              similarity_index, location_proximity]])
        features_scaled = scaler.transform(features)
        prediction = model.predict(features_scaled)[0]
        # Probability of the positive class (met_offline == 1)
        probability = model.predict_proba(features_scaled)[0][1]
        return prediction, probability

    # Example usage
    pred, prob = predict_user_pair(
        chat_freq=10,           # 10 chats per week
        response_time=30,       # 30 minutes average response
        events_attended=3,      # 3 common events
        similarity_index=0.8,   # High similarity
        location_proximity=5    # 5 km apart
    )
    print(f"Will they meet offline? {'Yes' if pred == 1 else 'No'}")
    print(f"Probability of meeting offline: {prob:.3f}")

This project teaches:
- Data Generation: Creating realistic synthetic datasets
- Data Preprocessing: Scaling and splitting data
- Model Training: Implementing multiple ML algorithms
- Model Evaluation: Understanding different performance metrics
- Model Comparison: Choosing the best performing model
- Model Persistence: Saving and loading trained models
- Visualization: Creating informative plots and charts
- Prediction Pipeline: Building end-to-end prediction systems
You can easily modify the project:
- Dataset Size: Change n_samples in generate_synthetic_dataset()
- Feature Weights: Adjust the scoring logic in dataset generation
- Model Parameters: Modify hyperparameters in model training
- Visualizations: Add more plots or change styling
- New Features: Add additional social interaction features
social_connection_predictor.py
├── generate_synthetic_dataset() # Creates realistic data
├── preprocess_data() # Scales and splits data
├── train_logistic_regression() # Trains LR model
├── train_random_forest() # Trains RF model
├── create_visualizations() # Generates plots
├── predict_new_user_pair() # Makes predictions
└── main() # Orchestrates everything
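
As an illustration of the "scales and splits data" step in the outline above, here is one plausible shape for preprocess_data(). The signature, the 80/20 split, stratification, and the return values are assumptions for this sketch and may not match the project's function; the column name met_offline comes from the dataset description.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_data(df, target="met_offline", test_size=0.2, seed=42):
    """Hypothetical version: split first, then scale using training statistics."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    scaler = StandardScaler()
    # Fit on the training split only so test data does not leak into the scaler.
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Tiny demo frame with two of the project's features and the target column.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "chat_freq": rng.integers(0, 21, 100),
    "similarity_index": rng.uniform(0, 1, 100),
    "met_offline": np.tile([0, 1], 50),
})
X_tr, X_te, y_tr, y_te, fitted_scaler = preprocess_data(demo)
print(X_tr.shape, X_te.shape)
```

Returning the fitted scaler lets the caller save it and reuse it at prediction time.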
To extend this project, consider:
- Real Data: Replace synthetic data with real social media data
- More Models: Try SVM, Neural Networks, or XGBoost
- Feature Engineering: Create new features from existing ones
- Cross-Validation: Implement k-fold cross-validation
- Hyperparameter Tuning: Use scikit-learn's GridSearchCV or RandomizedSearchCV
- Web Interface: Build a Flask/Django web app
- API: Create a REST API for predictions
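
Two of the ideas above, cross-validation and hyperparameter tuning, need only a few lines with scikit-learn. The sketch below is a starting point under assumptions: it uses stand-in data from make_classification, and the parameter grid and fold counts are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Stand-in data with 5 features (not the project's dataset).
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=1, random_state=0)

# 5-fold cross-validation: one ROC-AUC score per fold.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=5, scoring="roc_auc")
print("Mean ROC-AUC over 5 folds:", cv_scores.mean())

# Grid search: try every combination in the grid, scoring each with 3-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", search.best_score_)
```

For larger grids, RandomizedSearchCV samples a fixed number of combinations instead of trying them all, which usually cuts the run time.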
Feel free to fork this project and submit pull requests for improvements!
This project is open source and available under the MIT License.
Happy Learning! 🚀
This project is designed to be educational and beginner-friendly. The synthetic data is created for demonstration purposes only.