From 3118d4a89f56bc12f31c746076d395613fc93a0d Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 11 Oct 2025 04:26:11 +0000 Subject: [PATCH 1/3] Initial plan From 9b38140e64abf52b72081a86b93fbf30aaf7bc89 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 11 Oct 2025 04:36:55 +0000 Subject: [PATCH 2/3] Add templates folder with 9 ML project templates and update README Co-authored-by: macanderson <542881+macanderson@users.noreply.github.com> --- README.md | 73 +++- templates/README.md | 179 +++++++++ templates/anomaly_detection.ipynb | 382 ++++++++++++++++++ templates/clustering_models.ipynb | 364 +++++++++++++++++ templates/computer_vision_models.ipynb | 489 +++++++++++++++++++++++ templates/language_model.ipynb | 361 +++++++++++++++++ templates/machine_learning_preset.ipynb | 291 ++++++++++++++ templates/neural_network_model.ipynb | 360 +++++++++++++++++ templates/reinforcement_learning.ipynb | 421 +++++++++++++++++++ templates/sentiment_analysis_model.ipynb | 393 ++++++++++++++++++ templates/time_series_analysis.ipynb | 477 ++++++++++++++++++++++ 11 files changed, 3789 insertions(+), 1 deletion(-) create mode 100644 templates/README.md create mode 100644 templates/anomaly_detection.ipynb create mode 100644 templates/clustering_models.ipynb create mode 100644 templates/computer_vision_models.ipynb create mode 100644 templates/language_model.ipynb create mode 100644 templates/machine_learning_preset.ipynb create mode 100644 templates/neural_network_model.ipynb create mode 100644 templates/reinforcement_learning.ipynb create mode 100644 templates/sentiment_analysis_model.ipynb create mode 100644 templates/time_series_analysis.ipynb diff --git a/README.md b/README.md index af0b574..d7dfe2e 100644 --- a/README.md +++ b/README.md @@ -49,6 +49,75 @@ Master feature scaling for better model performance: - Common scaling mistakes - **Key Techniques:** 
StandardScaler, MinMaxScaler, RobustScaler, proper pipeline usage +## 🎨 ML Project Templates + +The `templates/` directory contains starter templates for various machine learning projects. These templates provide ready-to-use Jupyter notebooks with complete workflows for different ML tasks: + +### [1. Machine Learning Preset](templates/machine_learning_preset.ipynb) +General-purpose ML template for classification tasks: +- Data loading and exploration +- Preprocessing and feature engineering +- Model training and comparison +- Hyperparameter tuning +- Model evaluation and saving + +### [2. Neural Network Model](templates/neural_network_model.ipynb) +Deep learning template using TensorFlow/Keras: +- Neural network architecture design +- Training with callbacks (early stopping, learning rate reduction) +- Batch normalization and dropout +- Performance visualization + +### [3. Language Model](templates/language_model.ipynb) +NLP template for language modeling: +- Text preprocessing and tokenization +- LSTM-based language model +- Text generation +- Word embeddings + +### [4. Sentiment Analysis Model](templates/sentiment_analysis_model.ipynb) +Template for sentiment classification: +- Text cleaning and preprocessing +- TF-IDF feature extraction +- Traditional ML and deep learning approaches +- Model comparison and evaluation + +### [5. Clustering Models](templates/clustering_models.ipynb) +Unsupervised learning template: +- Multiple clustering algorithms (K-Means, DBSCAN, Hierarchical) +- Optimal cluster determination (elbow method, silhouette score) +- Cluster visualization +- PCA for dimensionality reduction + +### [6. Reinforcement Learning](templates/reinforcement_learning.ipynb) +RL template with Q-Learning and DQN: +- Environment setup +- Q-Learning implementation +- Deep Q-Network (DQN) +- Training and evaluation + +### [7. 
Anomaly Detection](templates/anomaly_detection.ipynb) +Template for outlier detection: +- Multiple anomaly detection algorithms +- Isolation Forest, One-Class SVM, LOF +- Anomaly score visualization +- Model comparison + +### [8. Time Series Analysis](templates/time_series_analysis.ipynb) +Time series forecasting template: +- Time series decomposition (trend, seasonality, residuals) +- Stationarity testing +- ARIMA modeling +- LSTM for time series +- Forecast visualization + +### [9. Computer Vision Models](templates/computer_vision_models.ipynb) +Image classification template: +- CNN architecture from scratch +- Data augmentation +- Transfer learning with pre-trained models +- Model evaluation and visualization + ## 🚀 Getting Started ### Prerequisites @@ -73,7 +142,9 @@ pip install -r requirements.txt jupyter notebook ``` -4. Open any notebook from the `notebooks/` directory and start learning! +4. Explore the notebooks: + - **Debugging Notebooks**: Open any notebook from the `notebooks/` directory to learn about debugging ML problems + - **Project Templates**: Browse the `templates/` directory for ready-to-use project starter templates ## 📋 Requirements diff --git a/templates/README.md b/templates/README.md new file mode 100644 index 0000000..8d346d3 --- /dev/null +++ b/templates/README.md @@ -0,0 +1,179 @@ +# ML Project Templates + +This directory contains ready-to-use Jupyter notebook templates for various machine learning projects. Each template provides a complete workflow from data loading to model evaluation, designed to help you quickly start your ML projects. + +## 📋 Available Templates + +### 1. 
Machine Learning Preset (`machine_learning_preset.ipynb`) +**Use Case:** General classification tasks, scikit-learn projects + +**What's Included:** +- Data loading and exploration +- Train-test splitting +- Feature scaling +- Multiple model comparison (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM) +- Hyperparameter tuning with GridSearchCV +- Model evaluation and visualization + +**Best For:** Tabular data, classification problems, beginners learning ML + +--- + +### 2. Neural Network Model (`neural_network_model.ipynb`) +**Use Case:** Deep learning classification tasks + +**What's Included:** +- Feedforward neural network with TensorFlow/Keras +- Batch normalization and dropout +- Training with callbacks (early stopping, learning rate reduction) +- Training history visualization +- Model saving and loading + +**Best For:** Complex classification problems, deep learning enthusiasts + +--- + +### 3. Language Model (`language_model.ipynb`) +**Use Case:** Text generation, language modeling + +**What's Included:** +- Text tokenization and preprocessing +- LSTM-based language model +- Word embeddings +- Text generation function +- Training and evaluation + +**Best For:** NLP projects, text generation tasks + +--- + +### 4. Sentiment Analysis Model (`sentiment_analysis_model.ipynb`) +**Use Case:** Text classification, sentiment analysis + +**What's Included:** +- Text preprocessing and cleaning +- TF-IDF feature extraction +- Traditional ML models (Logistic Regression, Naive Bayes, SVM) +- LSTM-based deep learning approach +- Model comparison and evaluation + +**Best For:** Product reviews, social media sentiment, customer feedback analysis + +--- + +### 5. 
Clustering Models (`clustering_models.ipynb`) +**Use Case:** Unsupervised learning, customer segmentation + +**What's Included:** +- Multiple clustering algorithms (K-Means, DBSCAN, Hierarchical, Gaussian Mixture) +- Elbow method for optimal K +- Silhouette score analysis +- Cluster visualization +- PCA for dimensionality reduction + +**Best For:** Customer segmentation, pattern discovery, exploratory data analysis + +--- + +### 6. Reinforcement Learning (`reinforcement_learning.ipynb`) +**Use Case:** Sequential decision making, game AI + +**What's Included:** +- Simple grid world environment +- Q-Learning implementation +- Deep Q-Network (DQN) +- Training and evaluation +- Performance visualization + +**Best For:** Game AI, robotics, optimization problems + +--- + +### 7. Anomaly Detection (`anomaly_detection.ipynb`) +**Use Case:** Fraud detection, outlier identification + +**What's Included:** +- Multiple anomaly detection algorithms (Isolation Forest, One-Class SVM, LOF, Elliptic Envelope) +- Anomaly score visualization +- Model comparison +- Confusion matrix analysis + +**Best For:** Fraud detection, network intrusion detection, quality control + +--- + +### 8. Time Series Analysis (`time_series_analysis.ipynb`) +**Use Case:** Forecasting, trend analysis + +**What's Included:** +- Time series decomposition +- Stationarity testing +- ARIMA modeling +- LSTM for time series +- Forecast visualization and evaluation + +**Best For:** Stock price prediction, demand forecasting, weather prediction + +--- + +### 9. Computer Vision Models (`computer_vision_models.ipynb`) +**Use Case:** Image classification, object recognition + +**What's Included:** +- CNN architecture from scratch +- Data augmentation +- Transfer learning setup (VGG16, ResNet50, MobileNetV2) +- Model training and evaluation +- Prediction visualization + +**Best For:** Image classification, object detection, visual recognition tasks + +--- + +## 🚀 Quick Start + +1. 
**Choose a template** that matches your project needs +2. **Copy the template** to your working directory +3. **Customize** the data loading section for your dataset +4. **Run the cells** to build and train your model +5. **Adapt** the architecture and parameters to your specific requirements + +## 💡 Tips for Using Templates + +- **Start Simple**: Begin with the basic template structure before adding complexity +- **Customize**: Modify hyperparameters, architectures, and preprocessing steps for your data +- **Experiment**: Try different models and configurations to find what works best +- **Save Your Work**: Use the model saving sections to preserve your trained models +- **Iterate**: Use these templates as starting points and improve based on your results + +## 📚 Integration with Debugging Notebooks + +These templates work great in conjunction with the debugging notebooks in the `notebooks/` directory: + +- Build your model using a **template** +- Debug issues using the **debugging notebooks** +- Apply fixes back to your **template-based project** + +## 🤝 Contributing + +Have a new template idea? Found improvements? Contributions are welcome! + +1. Create a new template following the existing structure +2. Document the use case and included features +3. Submit a pull request + +## 📝 Template Structure + +Each template follows a consistent structure: + +1. **Import Libraries** - All necessary packages +2. **Load Data** - Example data loading with placeholders for your data +3. **Preprocessing** - Data cleaning and preparation +4. **Model Building** - Architecture definition +5. **Training** - Model training with best practices +6. **Evaluation** - Performance metrics and visualization +7. **Saving** - Model persistence (optional) + +--- + +**Ready to start your ML project? 
Choose a template and begin building!** 🚀 diff --git a/templates/anomaly_detection.ipynb b/templates/anomaly_detection.ipynb new file mode 100644 index 0000000..7fbf547 --- /dev/null +++ b/templates/anomaly_detection.ipynb @@ -0,0 +1,382 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Anomaly Detection Template\n", + "\n", + "This template provides a starting point for anomaly detection projects.\n", + "\n", + "## Features:\n", + "- Data preprocessing for anomaly detection\n", + "- Multiple anomaly detection algorithms\n", + "- Model evaluation and comparison\n", + "- Visualization of anomalies\n", + "- Threshold tuning" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Anomaly detection algorithms\n", + "from sklearn.ensemble import IsolationForest\n", + "from sklearn.svm import OneClassSVM\n", + "from sklearn.neighbors import LocalOutlierFactor\n", + "from sklearn.covariance import EllipticEnvelope\n", + "\n", + "# Preprocessing and evaluation\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "\n", + "# Set visualization style\n", + "sns.set_style('whitegrid')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your data\n", + "# df = pd.read_csv('your_data.csv')\n", + "\n", + "# Example: Create sample data with anomalies\n", + "from sklearn.datasets import make_blobs\n", + "\n", + "# Normal data\n", + "X_normal, _ = make_blobs(n_samples=300, centers=1, n_features=2, \n", + " cluster_std=1.0, random_state=42)\n", + "\n", + "# Anomalies\n", + "X_anomalies = np.random.uniform(low=-8, high=8, size=(20, 2))\n", + "\n", + "# Combine\n", + "X = np.vstack([X_normal, X_anomalies])\n", + "y_true = np.hstack([np.zeros(len(X_normal)), np.ones(len(X_anomalies))]) # 0=normal, 1=anomaly\n", + "\n", + "df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])\n", + "df['label'] = y_true\n", + "\n", + "print(f\"Dataset shape: {df.shape}\")\n", + "print(f\"\\nAnomaly ratio: {y_true.mean():.2%}\")\n", + "print(\"\\nFirst few rows:\")\n", + "print(df.head())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize data\n", + "plt.figure(figsize=(10, 6))\n", + "normal_points = df[df['label'] == 0]\n", + "anomaly_points = df[df['label'] == 1]\n", + "\n", + "plt.scatter(normal_points['feature_1'], normal_points['feature_2'], \n", + " c='blue', alpha=0.6, label='Normal', s=50)\n", + "plt.scatter(anomaly_points['feature_1'], anomaly_points['feature_2'], \n", + " c='red', alpha=0.8, label='Anomaly', s=100, marker='x')\n", + "plt.xlabel('Feature 1')\n", + "plt.ylabel('Feature 2')\n", + "plt.title('Data Distribution with Anomalies')\n", + "plt.legend()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
 Data Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare features\n",
    "X_data = df[['feature_1', 'feature_2']].values\n",
    "\n",
    "# Scale features\n",
    "scaler = StandardScaler()\n",
    "X_scaled = scaler.fit_transform(X_data)\n",
    "\n",
    "print(f\"Scaled data shape: {X_scaled.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Apply Anomaly Detection Algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize anomaly detection models\n",
    "contamination = 0.1  # Expected proportion of anomalies\n",
    "\n",
    "models = {\n",
    "    'Isolation Forest': IsolationForest(contamination=contamination, random_state=42),\n",
    "    'One-Class SVM': OneClassSVM(nu=contamination, kernel='rbf', gamma='auto'),\n",
    "    'Local Outlier Factor': LocalOutlierFactor(contamination=contamination),\n",
    "    'Elliptic Envelope': EllipticEnvelope(contamination=contamination, random_state=42)\n",
    "}\n",
    "\n",
    "# Train and predict\n",
    "predictions = {}\n",
    "for name, model in models.items():\n",
    "    # All four estimators support in-sample fit_predict\n",
    "    # (with the default novelty=False, LOF is designed for exactly this use)\n",
    "    y_pred = model.fit_predict(X_scaled)\n",
    "\n",
    "    # Convert predictions: -1 (anomaly) to 1, 1 (normal) to 0\n",
    "    y_pred = np.where(y_pred == -1, 1, 0)\n",
    "    predictions[name] = y_pred\n",
    "\n",
    "    print(f\"{name}: Detected {y_pred.sum()} anomalies\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.
Evaluate Models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate each model\n", + "from sklearn.metrics import precision_score, recall_score, f1_score\n", + "\n", + "print(\"Model Performance Comparison:\")\n", + "print(\"=\" * 80)\n", + "\n", + "results = []\n", + "for name, y_pred in predictions.items():\n", + " precision = precision_score(y_true, y_pred)\n", + " recall = recall_score(y_true, y_pred)\n", + " f1 = f1_score(y_true, y_pred)\n", + " \n", + " results.append({\n", + " 'Model': name,\n", + " 'Precision': precision,\n", + " 'Recall': recall,\n", + " 'F1-Score': f1\n", + " })\n", + " \n", + " print(f\"\\n{name}:\")\n", + " print(f\" Precision: {precision:.4f}\")\n", + " print(f\" Recall: {recall:.4f}\")\n", + " print(f\" F1-Score: {f1:.4f}\")\n", + "\n", + "results_df = pd.DataFrame(results)\n", + "print(\"\\n\" + \"=\" * 80)\n", + "print(results_df.to_string(index=False))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Visualize Detection Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize all model predictions\n", + "fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", + "axes = axes.ravel()\n", + "\n", + "for idx, (name, y_pred) in enumerate(predictions.items()):\n", + " # Separate normal and anomaly predictions\n", + " normal_mask = y_pred == 0\n", + " anomaly_mask = y_pred == 1\n", + " \n", + " # Plot\n", + " axes[idx].scatter(X_data[normal_mask, 0], X_data[normal_mask, 1], \n", + " c='blue', alpha=0.6, label='Normal', s=50)\n", + " axes[idx].scatter(X_data[anomaly_mask, 0], X_data[anomaly_mask, 1], \n", + " c='red', alpha=0.8, label='Detected Anomaly', s=100, marker='x')\n", + " \n", + " axes[idx].set_xlabel('Feature 1')\n", + " axes[idx].set_ylabel('Feature 2')\n", + " axes[idx].set_title(f'{name}')\n", + " axes[idx].legend()\n", + " axes[idx].grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Confusion Matrix Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot confusion matrices for all models\n", + "fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", + "axes = axes.ravel()\n", + "\n", + "for idx, (name, y_pred) in enumerate(predictions.items()):\n", + " cm = confusion_matrix(y_true, y_pred)\n", + " sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],\n", + " xticklabels=['Normal', 'Anomaly'],\n", + " yticklabels=['Normal', 'Anomaly'])\n", + " axes[idx].set_xlabel('Predicted')\n", + " axes[idx].set_ylabel('Actual')\n", + " axes[idx].set_title(f'{name} - Confusion Matrix')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. 
 Anomaly Scores and Thresholding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get anomaly scores from Isolation Forest\n",
    "iso_forest = IsolationForest(contamination=contamination, random_state=42)\n",
    "iso_forest.fit(X_scaled)\n",
    "anomaly_scores = -iso_forest.score_samples(X_scaled)  # Negative for easier interpretation\n",
    "\n",
    "# Turn scores into labels: flag the highest-scoring `contamination` fraction\n",
    "threshold = np.quantile(anomaly_scores, 1 - contamination)\n",
    "y_pred_threshold = (anomaly_scores > threshold).astype(int)\n",
    "print(f\"Threshold: {threshold:.3f} -> {y_pred_threshold.sum()} points flagged\")\n",
    "\n",
    "# Plot anomaly score distribution\n",
    "plt.figure(figsize=(15, 5))\n",
    "\n",
    "# Score distribution\n",
    "plt.subplot(1, 2, 1)\n",
    "plt.hist(anomaly_scores[y_true == 0], bins=50, alpha=0.7, label='Normal', color='blue')\n",
    "plt.hist(anomaly_scores[y_true == 1], bins=50, alpha=0.7, label='Anomaly', color='red')\n",
    "plt.axvline(threshold, color='black', linestyle='--', label='Threshold')\n",
    "plt.xlabel('Anomaly Score')\n",
    "plt.ylabel('Frequency')\n",
    "plt.title('Anomaly Score Distribution')\n",
    "plt.legend()\n",
    "plt.grid(True)\n",
    "\n",
    "# Scatter plot with scores\n",
    "plt.subplot(1, 2, 2)\n",
    "scatter = plt.scatter(X_data[:, 0], X_data[:, 1], c=anomaly_scores,\n",
    "                      cmap='RdYlBu_r', s=50, alpha=0.6)\n",
    "plt.colorbar(scatter, label='Anomaly Score')\n",
    "plt.xlabel('Feature 1')\n",
    "plt.ylabel('Feature 2')\n",
    "plt.title('Data Points Colored by Anomaly Score')\n",
    "plt.grid(True)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9.
Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save best model\n", + "# import pickle\n", + "# with open('anomaly_detector.pkl', 'wb') as f:\n", + "# pickle.dump(iso_forest, f)\n", + "# with open('scaler.pkl', 'wb') as f:\n", + "# pickle.dump(scaler, f)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/clustering_models.ipynb b/templates/clustering_models.ipynb new file mode 100644 index 0000000..30e45ee --- /dev/null +++ b/templates/clustering_models.ipynb @@ -0,0 +1,364 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Clustering Models Template\n", + "\n", + "This template provides a starting point for unsupervised clustering analysis.\n", + "\n", + "## Features:\n", + "- Data preparation for clustering\n", + "- Multiple clustering algorithms\n", + "- Optimal cluster number determination\n", + "- Cluster visualization\n", + "- Evaluation metrics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "from mpl_toolkits.mplot3d import Axes3D\n", + "\n", + "# Clustering algorithms\n", + "from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, MeanShift\n", + "from sklearn.mixture import GaussianMixture\n", + "\n", + "# Preprocessing and evaluation\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.decomposition import PCA\n", + "from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "\n", + "# Set visualization style\n", + "sns.set_style('whitegrid')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load and Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your data\n", + "# df = pd.read_csv('your_data.csv')\n", + "\n", + "# Example: Create sample data\n", + "from sklearn.datasets import make_blobs\n", + "X, y_true = make_blobs(n_samples=500, centers=4, n_features=2, \n", + " cluster_std=0.8, random_state=42)\n", + "\n", + "df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])\n", + "\n", + "print(f\"Dataset shape: {df.shape}\")\n", + "print(\"\\nFirst few rows:\")\n", + "print(df.head())\n", + "\n", + "# Visualize raw data\n", + "plt.figure(figsize=(10, 6))\n", + "plt.scatter(df['feature_1'], df['feature_2'], alpha=0.6)\n", + "plt.xlabel('Feature 1')\n", + "plt.ylabel('Feature 2')\n", + "plt.title('Raw Data Distribution')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Data Preprocessing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check for missing values\n", + "print(\"Missing values:\")\n", + "print(df.isnull().sum())\n", + "\n", + "# Scale features\n", + "scaler = StandardScaler()\n", + "X_scaled = scaler.fit_transform(df)\n", + "\n", + "print(f\"\\nScaled data shape: {X_scaled.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Determine Optimal Number of Clusters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Elbow method\n", + "inertias = []\n", + "silhouette_scores = []\n", + "K_range = range(2, 11)\n", + "\n", + "for k in K_range:\n", + " kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)\n", + " kmeans.fit(X_scaled)\n", + " inertias.append(kmeans.inertia_)\n", + " silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))\n", + "\n", + "# Plot elbow curve\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "ax1.plot(K_range, inertias, 'bo-')\n", + "ax1.set_xlabel('Number of Clusters (k)')\n", + "ax1.set_ylabel('Inertia')\n", + "ax1.set_title('Elbow Method')\n", + "ax1.grid(True)\n", + "\n", + "ax2.plot(K_range, silhouette_scores, 'ro-')\n", + "ax2.set_xlabel('Number of Clusters (k)')\n", + "ax2.set_ylabel('Silhouette Score')\n", + "ax2.set_title('Silhouette Score vs K')\n", + "ax2.grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
 Apply Clustering Algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Choose optimal k based on elbow method\n",
    "optimal_k = 4\n",
    "\n",
    "# Initialize clustering algorithms\n",
    "clustering_algorithms = {\n",
    "    'K-Means': KMeans(n_clusters=optimal_k, random_state=42, n_init=10),\n",
    "    'Hierarchical': AgglomerativeClustering(n_clusters=optimal_k),\n",
    "    'DBSCAN': DBSCAN(eps=0.5, min_samples=5),\n",
    "    'Gaussian Mixture': GaussianMixture(n_components=optimal_k, random_state=42)\n",
    "}\n",
    "\n",
    "# Apply each algorithm (all four estimators support fit_predict directly)\n",
    "results = {}\n",
    "for name, algorithm in clustering_algorithms.items():\n",
    "    labels = algorithm.fit_predict(X_scaled)\n",
    "\n",
    "    results[name] = labels\n",
    "\n",
    "    # Calculate metrics (if more than 1 cluster)\n",
    "    if len(np.unique(labels)) > 1:\n",
    "        silhouette = silhouette_score(X_scaled, labels)\n",
    "        print(f\"{name}:\")\n",
    "        print(f\"  Silhouette Score: {silhouette:.4f}\")\n",
    "        print(f\"  Number of clusters: {len(np.unique(labels))}\")\n",
    "        print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.
Visualize Clustering Results" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize all clustering results\n", + "fig, axes = plt.subplots(2, 2, figsize=(15, 12))\n", + "axes = axes.ravel()\n", + "\n", + "for idx, (name, labels) in enumerate(results.items()):\n", + " axes[idx].scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', alpha=0.6)\n", + " axes[idx].set_title(f'{name} Clustering')\n", + " axes[idx].set_xlabel('Feature 1 (scaled)')\n", + " axes[idx].set_ylabel('Feature 2 (scaled)')\n", + " \n", + " # Add cluster centers for K-Means\n", + " if name == 'K-Means':\n", + " centers = clustering_algorithms[name].cluster_centers_\n", + " axes[idx].scatter(centers[:, 0], centers[:, 1], c='red', s=200, \n", + " marker='X', edgecolors='black', label='Centroids')\n", + " axes[idx].legend()\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Detailed Analysis of Best Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use K-Means as example (change to your preferred algorithm)\n", + "best_model = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)\n", + "cluster_labels = best_model.fit_predict(X_scaled)\n", + "\n", + "# Add cluster labels to dataframe\n", + "df['cluster'] = cluster_labels\n", + "\n", + "# Cluster statistics\n", + "print(\"Cluster Statistics:\")\n", + "print(df.groupby('cluster').mean())\n", + "\n", + "print(\"\\nCluster Sizes:\")\n", + "print(df['cluster'].value_counts().sort_index())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize cluster characteristics\n", + "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "# Cluster distribution\n", + "cluster_counts = df['cluster'].value_counts().sort_index()\n", + "axes[0].bar(cluster_counts.index, cluster_counts.values)\n", + "axes[0].set_xlabel('Cluster')\n", + "axes[0].set_ylabel('Count')\n", + "axes[0].set_title('Cluster Size Distribution')\n", + "\n", + "# Box plot for each feature by cluster\n", + "df_melted = df.melt(id_vars='cluster', var_name='feature', value_name='value')\n", + "sns.boxplot(data=df_melted, x='cluster', y='value', hue='feature', ax=axes[1])\n", + "axes[1].set_title('Feature Distribution by Cluster')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. 
Dimensionality Reduction for Visualization (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# If you have high-dimensional data, use PCA for visualization\n", + "pca = PCA(n_components=2)\n", + "X_pca = pca.fit_transform(X_scaled)\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)\n", + "plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')\n", + "plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')\n", + "plt.title('Clusters in PCA Space')\n", + "plt.colorbar(scatter, label='Cluster')\n", + "plt.show()\n", + "\n", + "print(f\"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Save Results (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save model\n", + "# import pickle\n", + "# with open('clustering_model.pkl', 'wb') as f:\n", + "# pickle.dump(best_model, f)\n", + "\n", + "# Save clustered data\n", + "# df.to_csv('clustered_data.csv', index=False)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/computer_vision_models.ipynb b/templates/computer_vision_models.ipynb new file mode 100644 index 0000000..f81b667 --- /dev/null +++ b/templates/computer_vision_models.ipynb @@ -0,0 +1,489 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Computer Vision Models 
Template\n", + "\n", + "This template provides a starting point for computer vision projects.\n", + "\n", + "## Features:\n", + "- Image data loading and preprocessing\n", + "- CNN architecture design\n", + "- Transfer learning with pre-trained models\n", + "- Data augmentation\n", + "- Model evaluation and visualization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "import os\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Image processing\n", + "from PIL import Image\n", + "\n", + "# TensorFlow/Keras\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers, models\n", + "from tensorflow.keras.preprocessing.image import ImageDataGenerator\n", + "from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2\n", + "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau\n", + "\n", + "# Metrics\n", + "from sklearn.metrics import classification_report, confusion_matrix\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")\n", + "print(f\"GPU available: {tf.config.list_physical_devices('GPU')}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and Explore Image Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Example: Load MNIST dataset (replace with your own image data)\n",
+ "# For custom data: use ImageDataGenerator with flow_from_directory\n",
+ "\n",
+ "(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()\n",
+ "\n",
+ "# Add a channel dimension: (28, 28) -> (28, 28, 1); MNIST is grayscale, not RGB\n",
+ "X_train = np.expand_dims(X_train, -1)\n",
+ "X_test = np.expand_dims(X_test, -1)\n",
+ "\n",
+ "# Normalize pixel values to [0, 1]\n",
+ "X_train = X_train.astype('float32') / 255.0\n",
+ "X_test = X_test.astype('float32') / 255.0\n",
+ "\n",
+ "# One-hot encode labels\n",
+ "num_classes = 10\n",
+ "y_train_cat = keras.utils.to_categorical(y_train, num_classes)\n",
+ "y_test_cat = keras.utils.to_categorical(y_test, num_classes)\n",
+ "\n",
+ "print(f\"Training data shape: {X_train.shape}\")\n",
+ "print(f\"Test data shape: {X_test.shape}\")\n",
+ "print(f\"Number of classes: {num_classes}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Visualize sample images\n",
+ "fig, axes = plt.subplots(2, 5, figsize=(12, 6))\n",
+ "axes = axes.ravel()\n",
+ "\n",
+ "for i in range(10):\n",
+ "    axes[i].imshow(X_train[i].squeeze(), cmap='gray')\n",
+ "    axes[i].set_title(f'Label: {y_train[i]}')\n",
+ "    axes[i].axis('off')\n",
+ "\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 3. 
Data Augmentation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create data generators with augmentation\n", + "train_datagen = ImageDataGenerator(\n", + " rotation_range=10,\n", + " width_shift_range=0.1,\n", + " height_shift_range=0.1,\n", + " zoom_range=0.1,\n", + " validation_split=0.2\n", + ")\n", + "\n", + "test_datagen = ImageDataGenerator()\n", + "\n", + "# Fit on training data\n", + "train_datagen.fit(X_train)\n", + "\n", + "print(\"Data augmentation configured\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize augmented images\n", + "sample_image = X_train[0:1]\n", + "fig, axes = plt.subplots(2, 5, figsize=(12, 6))\n", + "axes = axes.ravel()\n", + "\n", + "for i, ax in enumerate(axes):\n", + " augmented = next(train_datagen.flow(sample_image, batch_size=1))[0]\n", + " ax.imshow(augmented.squeeze(), cmap='gray')\n", + " ax.set_title(f'Augmented {i+1}')\n", + " ax.axis('off')\n", + "\n", + "plt.suptitle('Data Augmentation Examples')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Build CNN Model from Scratch" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_cnn_model(input_shape, num_classes):\n", + " \"\"\"Create a CNN model from scratch.\"\"\"\n", + " model = models.Sequential([\n", + " # First convolutional block\n", + " layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),\n", + " layers.BatchNormalization(),\n", + " layers.MaxPooling2D((2, 2)),\n", + " layers.Dropout(0.25),\n", + " \n", + " # Second convolutional block\n", + " layers.Conv2D(64, (3, 3), activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.MaxPooling2D((2, 2)),\n", + " layers.Dropout(0.25),\n", + " \n", + " # Third convolutional block\n", + " layers.Conv2D(128, (3, 3), activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.MaxPooling2D((2, 2)),\n", + " layers.Dropout(0.25),\n", + " \n", + " # Flatten and dense layers\n", + " layers.Flatten(),\n", + " layers.Dense(128, activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.Dropout(0.5),\n", + " layers.Dense(num_classes, activation='softmax')\n", + " ])\n", + " \n", + " return model\n", + "\n", + "# Create model\n", + "input_shape = X_train.shape[1:]\n", + "cnn_model = create_cnn_model(input_shape, num_classes)\n", + "\n", + "# Display model architecture\n", + "cnn_model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Compile and Train CNN Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compile model\n", + "cnn_model.compile(\n", + " optimizer=keras.optimizers.Adam(learning_rate=0.001),\n", + " loss='categorical_crossentropy',\n", + " metrics=['accuracy']\n", + ")\n", + "\n", + "# Define callbacks\n", + "early_stopping = EarlyStopping(\n", + " monitor='val_loss',\n", + " patience=5,\n", + " restore_best_weights=True\n", + ")\n", + "\n", + "reduce_lr = ReduceLROnPlateau(\n", + " monitor='val_loss',\n", + " factor=0.5,\n", + " patience=3,\n", + " min_lr=1e-7\n", + ")\n", + "\n", + "callbacks = [early_stopping, reduce_lr]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Train model\n", + "history = cnn_model.fit(\n", + " X_train, y_train_cat,\n", + " validation_split=0.2,\n", + " epochs=20,\n", + " batch_size=128,\n", + " callbacks=callbacks,\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Transfer Learning with Pre-trained Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Note: For grayscale images, convert to RGB for pre-trained models\n", + "# X_train_rgb = np.repeat(X_train, 3, axis=-1)\n", + "# X_test_rgb = np.repeat(X_test, 3, axis=-1)\n", + "\n", + "# For this example, we'll show the structure\n", + "# Uncomment and modify for actual use\n", + "\n", + "# def create_transfer_model(base_model_name='MobileNetV2', num_classes=10):\n", + "# \"\"\"Create transfer learning model.\"\"\"\n", + "# # Load pre-trained model\n", + "# if base_model_name == 'MobileNetV2':\n", + "# base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=(224, 224, 3))\n", + "# elif base_model_name == 'ResNet50':\n", + "# base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))\n", + "# \n", + "# # Freeze base model\n", + "# base_model.trainable = False\n", + "# \n", + "# # Add custom layers\n", + "# model = models.Sequential([\n", + "# base_model,\n", + "# layers.GlobalAveragePooling2D(),\n", + "# layers.Dense(256, activation='relu'),\n", + "# layers.Dropout(0.5),\n", + "# layers.Dense(num_classes, activation='softmax')\n", + "# ])\n", + "# \n", + "# return model\n", + "\n", + "# transfer_model = create_transfer_model()\n", + "# transfer_model.summary()\n", + "\n", + "print(\"Transfer learning example structure shown above (commented out)\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Visualize Training History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot training history\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "# Loss\n", + "ax1.plot(history.history['loss'], label='Training Loss')\n", + "ax1.plot(history.history['val_loss'], label='Validation Loss')\n", + "ax1.set_xlabel('Epoch')\n", + "ax1.set_ylabel('Loss')\n", + "ax1.set_title('Model Loss')\n", + "ax1.legend()\n", + "ax1.grid(True)\n", + "\n", + "# Accuracy\n", + "ax2.plot(history.history['accuracy'], label='Training Accuracy')\n", + "ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')\n", + "ax2.set_xlabel('Epoch')\n", + "ax2.set_ylabel('Accuracy')\n", + "ax2.set_title('Model Accuracy')\n", + "ax2.legend()\n", + "ax2.grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Evaluate Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate on test set\n", + "test_loss, test_accuracy = cnn_model.evaluate(X_test, y_test_cat, verbose=0)\n", + "print(f\"Test Loss: {test_loss:.4f}\")\n", + "print(f\"Test Accuracy: {test_accuracy:.4f}\")\n", + "\n", + "# Make predictions\n", + "y_pred = cnn_model.predict(X_test)\n", + "y_pred_classes = np.argmax(y_pred, axis=1)\n", + "\n", + "# Classification report\n", + "print(\"\\nClassification Report:\")\n", + "print(classification_report(y_test, y_pred_classes))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Confusion matrix\n", + "plt.figure(figsize=(10, 8))\n", + "cm = confusion_matrix(y_test, y_pred_classes)\n", + "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n", + "plt.xlabel('Predicted')\n", + "plt.ylabel('Actual')\n", + "plt.title('Confusion Matrix')\n", + "plt.show()" + ] + }, + { + 
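"cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Per-Class Accuracy (Optional)\n",
+ "\n",
+ "A small add-on sketch: the confusion matrix above also yields per-class accuracy (its diagonal divided by each row sum). Assumes `cm` from the previous cell."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Per-class accuracy = correct predictions per class / samples per class\n",
+ "per_class_acc = cm.diagonal() / cm.sum(axis=1)\n",
+ "for cls, acc in enumerate(per_class_acc):\n",
+ "    print(f\"Class {cls}: {acc:.4f}\")"
+ ]
+ },
+ {
+ 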
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Visualize Predictions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize predictions\n", + "fig, axes = plt.subplots(2, 5, figsize=(15, 7))\n", + "axes = axes.ravel()\n", + "\n", + "for i in range(10):\n", + " idx = np.random.randint(0, len(X_test))\n", + " axes[i].imshow(X_test[idx].squeeze(), cmap='gray')\n", + " axes[i].set_title(f'True: {y_test[idx]}, Pred: {y_pred_classes[idx]}')\n", + " axes[i].axis('off')\n", + " \n", + " # Color code correct/incorrect\n", + " if y_test[idx] == y_pred_classes[idx]:\n", + " axes[i].spines['bottom'].set_color('green')\n", + " axes[i].spines['top'].set_color('green')\n", + " axes[i].spines['left'].set_color('green')\n", + " axes[i].spines['right'].set_color('green')\n", + " else:\n", + " axes[i].spines['bottom'].set_color('red')\n", + " axes[i].spines['top'].set_color('red')\n", + " axes[i].spines['left'].set_color('red')\n", + " axes[i].spines['right'].set_color('red')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10. 
Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save model\n", + "# cnn_model.save('computer_vision_model.h5')\n", + "# cnn_model.save('computer_vision_model') # SavedModel format\n", + "\n", + "# Load model\n", + "# loaded_model = keras.models.load_model('computer_vision_model.h5')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/language_model.ipynb b/templates/language_model.ipynb new file mode 100644 index 0000000..d288fc4 --- /dev/null +++ b/templates/language_model.ipynb @@ -0,0 +1,361 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Language Model Template\n", + "\n", + "This template provides a starting point for building language models using transformers and NLP techniques.\n", + "\n", + "## Features:\n", + "- Text preprocessing and tokenization\n", + "- Word embeddings\n", + "- Language model architecture\n", + "- Text generation\n", + "- Model evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# NLP libraries\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras.preprocessing.text import Tokenizer\n", + "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load and Prepare Text Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Sample text corpus\n", + "# In practice, load your text data from files\n", + "corpus = [\n", + " \"Natural language processing is fascinating.\",\n", + " \"Machine learning models can understand text.\",\n", + " \"Deep learning revolutionized NLP.\",\n", + " \"Transformers are state-of-the-art models.\",\n", + " \"Language models predict the next word.\",\n", + " \"Text generation requires large datasets.\",\n", + " \"Neural networks learn patterns in data.\",\n", + " \"Pre-trained models save training time.\",\n", + " \"Fine-tuning adapts models to specific tasks.\",\n", + " \"Natural language understanding is challenging.\"\n", + "]\n", + "\n", + "# For real projects:\n", + "# with open('your_text_file.txt', 'r') as f:\n", + "# corpus = f.readlines()\n", + "\n", + "print(f\"Corpus size: {len(corpus)} sentences\")\n", + "print(\"\\nSample sentences:\")\n", + "for i, sentence in enumerate(corpus[:3]):\n", + " print(f\"{i+1}. 
{sentence}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Text Preprocessing and Tokenization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize tokenizer\n", + "tokenizer = Tokenizer(oov_token='')\n", + "tokenizer.fit_on_texts(corpus)\n", + "\n", + "# Get vocabulary size\n", + "vocab_size = len(tokenizer.word_index) + 1\n", + "print(f\"Vocabulary size: {vocab_size}\")\n", + "\n", + "# Convert text to sequences\n", + "sequences = tokenizer.texts_to_sequences(corpus)\n", + "\n", + "print(\"\\nExample sequence:\")\n", + "print(f\"Original: {corpus[0]}\")\n", + "print(f\"Tokenized: {sequences[0]}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create input sequences for language modeling\n", + "input_sequences = []\n", + "for sequence in sequences:\n", + " for i in range(1, len(sequence)):\n", + " n_gram_sequence = sequence[:i+1]\n", + " input_sequences.append(n_gram_sequence)\n", + "\n", + "# Pad sequences\n", + "max_sequence_len = max([len(seq) for seq in input_sequences])\n", + "input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')\n", + "\n", + "print(f\"Total training sequences: {len(input_sequences)}\")\n", + "print(f\"Max sequence length: {max_sequence_len}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split into input and labels\n", + "X = input_sequences[:, :-1]\n", + "y = input_sequences[:, -1]\n", + "\n", + "# One-hot encode labels\n", + "y = keras.utils.to_categorical(y, num_classes=vocab_size)\n", + "\n", + "print(f\"Input shape: {X.shape}\")\n", + "print(f\"Output shape: {y.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Build Language Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_language_model(vocab_size, max_sequence_len, embedding_dim=100):\n", + " \"\"\"Create a simple LSTM-based language model.\"\"\"\n", + " model = keras.Sequential([\n", + " layers.Embedding(vocab_size, embedding_dim, input_length=max_sequence_len-1),\n", + " layers.LSTM(150, return_sequences=True),\n", + " layers.Dropout(0.2),\n", + " layers.LSTM(100),\n", + " layers.Dense(vocab_size, activation='softmax')\n", + " ])\n", + " \n", + " return model\n", + "\n", + "# Create model\n", + "model = create_language_model(vocab_size, max_sequence_len, embedding_dim=100)\n", + "\n", + "# Display model architecture\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Compile and Train Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compile model\n", + "model.compile(\n", + " optimizer='adam',\n", + " loss='categorical_crossentropy',\n", + " metrics=['accuracy']\n", + ")\n", + "\n", + "print(\"Model compiled successfully!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Train model\n", + "history = model.fit(\n", + " X, y,\n", + " epochs=100,\n", + " batch_size=32,\n", + " validation_split=0.2,\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Visualize Training History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot training history\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "# Loss\n", + "ax1.plot(history.history['loss'], label='Training Loss')\n", + "ax1.plot(history.history['val_loss'], label='Validation Loss')\n", + "ax1.set_xlabel('Epoch')\n", + "ax1.set_ylabel('Loss')\n", + "ax1.set_title('Model Loss')\n", + "ax1.legend()\n", + "ax1.grid(True)\n", + "\n", + "# Accuracy\n", + "ax2.plot(history.history['accuracy'], label='Training Accuracy')\n", + "ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')\n", + "ax2.set_xlabel('Epoch')\n", + "ax2.set_ylabel('Accuracy')\n", + "ax2.set_title('Model Accuracy')\n", + "ax2.legend()\n", + "ax2.grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. Text Generation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def generate_text(seed_text, next_words, model, max_sequence_len):\n", + " \"\"\"Generate text using the trained model.\"\"\"\n", + " for _ in range(next_words):\n", + " # Tokenize seed text\n", + " token_list = tokenizer.texts_to_sequences([seed_text])[0]\n", + " \n", + " # Pad sequence\n", + " token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')\n", + " \n", + " # Predict next word\n", + " predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)\n", + " \n", + " # Find word for predicted token\n", + " output_word = \"\"\n", + " for word, index in tokenizer.word_index.items():\n", + " if index == predicted:\n", + " output_word = word\n", + " break\n", + " \n", + " # Append to seed text\n", + " seed_text += \" \" + output_word\n", + " \n", + " return seed_text\n", + "\n", + "# Generate text\n", + "seed_texts = [\"Natural language\", 
\"Machine learning\", \"Deep learning\"]\n", + "\n", + "print(\"Generated text:\")\n", + "for seed in seed_texts:\n", + " generated = generate_text(seed, 5, model, max_sequence_len)\n", + " print(f\"\\nSeed: '{seed}'\")\n", + " print(f\"Generated: '{generated}'\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save model\n", + "# model.save('language_model.h5')\n", + "\n", + "# Save tokenizer\n", + "# import pickle\n", + "# with open('tokenizer.pkl', 'wb') as f:\n", + "# pickle.dump(tokenizer, f)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/machine_learning_preset.ipynb b/templates/machine_learning_preset.ipynb new file mode 100644 index 0000000..d27e620 --- /dev/null +++ b/templates/machine_learning_preset.ipynb @@ -0,0 +1,291 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Machine Learning Preset Template\n", + "\n", + "This template provides a starting point for general machine learning projects.\n", + "\n", + "## Features:\n", + "- Data loading and exploration\n", + "- Data preprocessing\n", + "- Model training and evaluation\n", + "- Hyperparameter tuning\n", + "- Model comparison" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Machine Learning\n", + "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n", + "from sklearn.preprocessing import StandardScaler, LabelEncoder\n", + "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", + "\n", + "# Models\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", + "from sklearn.svm import SVC\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your data\n", + "# df = pd.read_csv('your_data.csv')\n", + "\n", + "# Example: Create sample data\n", + "from sklearn.datasets import make_classification\n", + "X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, \n", + " n_redundant=5, random_state=42)\n", + "df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])\n", + "df['target'] = y\n", + "\n", + "# Display basic information\n", + "print(\"Dataset shape:\", df.shape)\n", + "print(\"\\nFirst few rows:\")\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data statistics\n", + "df.describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check for missing values\n", + "print(\"Missing values:\")\n", + "df.isnull().sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Data Preprocessing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Separate features and target\n", + "X = df.drop('target', axis=1)\n", + "y = df['target']\n", + "\n", + "# Split data\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "\n", + "# Scale features\n", + "scaler = StandardScaler()\n", + "X_train_scaled = scaler.fit_transform(X_train)\n", + "X_test_scaled = scaler.transform(X_test)\n", + "\n", + "print(f\"Training set size: {X_train.shape}\")\n", + "print(f\"Test set size: {X_test.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Model Training and Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize models\n", + "models = {\n", + " 'Logistic Regression': LogisticRegression(random_state=42),\n", + " 'Decision Tree': DecisionTreeClassifier(random_state=42),\n", + " 'Random Forest': RandomForestClassifier(random_state=42),\n", + " 'Gradient Boosting': GradientBoostingClassifier(random_state=42),\n", + " 'SVM': SVC(random_state=42)\n", + "}\n", + "\n", + "# Train and evaluate each model\n", + "results = {}\n", + "for name, model in models.items():\n", + " # Train\n", + " model.fit(X_train_scaled, y_train)\n", + " \n", + " # Predict\n", + " y_pred = model.predict(X_test_scaled)\n", + " \n", + " # Evaluate\n", + " accuracy = accuracy_score(y_test, y_pred)\n", + " results[name] = accuracy\n", + " \n", + " print(f\"{name}: {accuracy:.4f}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize model comparison\n", + "plt.figure(figsize=(10, 6))\n", + "plt.bar(results.keys(), results.values())\n", + "plt.xlabel('Model')\n", + "plt.ylabel('Accuracy')\n", + "plt.title('Model Comparison')\n", + "plt.xticks(rotation=45)\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Hyperparameter Tuning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Tune Random Forest\n", + "param_grid = {\n", + " 'n_estimators': [50, 100, 200],\n", + " 'max_depth': [None, 10, 20, 30],\n", + " 'min_samples_split': [2, 5, 10]\n", + "}\n", + "\n", + "grid_search = GridSearchCV(RandomForestClassifier(random_state=42), \n", + " param_grid, cv=5, n_jobs=-1)\n", + "grid_search.fit(X_train_scaled, y_train)\n", + "\n", + "print(\"Best parameters:\", grid_search.best_params_)\n", + "print(\"Best score:\", grid_search.best_score_)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Final Model Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Use best model\n", + "best_model = grid_search.best_estimator_\n", + "y_pred_final = best_model.predict(X_test_scaled)\n", + "\n", + "print(\"Classification Report:\")\n", + "print(classification_report(y_test, y_pred_final))\n", + "\n", + "# Confusion Matrix\n", + "plt.figure(figsize=(8, 6))\n", + "cm = confusion_matrix(y_test, y_pred_final)\n", + "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n", + "plt.xlabel('Predicted')\n", + "plt.ylabel('Actual')\n", + "plt.title('Confusion Matrix')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import pickle\n", + "# with open('best_model.pkl', 'wb') as f:\n", + "# pickle.dump(best_model, f)\n", + "# with open('scaler.pkl', 'wb') as f:\n", + "# pickle.dump(scaler, f)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/neural_network_model.ipynb b/templates/neural_network_model.ipynb new file mode 100644 index 0000000..bf01723 --- /dev/null +++ b/templates/neural_network_model.ipynb @@ -0,0 +1,360 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Neural Network Model Template\n", + "\n", + "This template provides a starting point for building neural networks using TensorFlow/Keras.\n", + "\n", + "## Features:\n", + "- Data preparation for neural networks\n", + "- Model architecture design\n", + "- Training with callbacks\n", + "- Performance evaluation\n", + "- Visualization of training history" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# TensorFlow/Keras\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers, models\n", + "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau\n", + "\n", + "# Scikit-learn utilities\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.metrics import classification_report, confusion_matrix\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and Prepare Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your data\n", + "# df = pd.read_csv('your_data.csv')\n", + "\n", + "# Example: Create sample data\n", + "from sklearn.datasets import make_classification\n", + "X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, \n", + " n_classes=3, n_clusters_per_class=2, random_state=42)\n", + "\n", + "print(f\"Dataset shape: {X.shape}\")\n", + "print(f\"Number of classes: {len(np.unique(y))}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)\n", + "\n", + "# Scale features\n", + "scaler = StandardScaler()\n", + "X_train_scaled = scaler.fit_transform(X_train)\n", + "X_val_scaled = scaler.transform(X_val)\n", + "X_test_scaled = scaler.transform(X_test)\n", + "\n", + "print(f\"Training set: {X_train_scaled.shape}\")\n", + "print(f\"Validation set: {X_val_scaled.shape}\")\n", + "print(f\"Test set: {X_test_scaled.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Build Neural Network Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_model(input_dim, num_classes):\n", + " \"\"\"Create a feedforward neural network.\"\"\"\n", + " model = models.Sequential([\n", + " layers.Input(shape=(input_dim,)),\n", + " \n", + " # Hidden layers\n", + " layers.Dense(128, activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.Dropout(0.3),\n", + " \n", + " layers.Dense(64, activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.Dropout(0.3),\n", + " \n", + " layers.Dense(32, activation='relu'),\n", + " layers.BatchNormalization(),\n", + " layers.Dropout(0.2),\n", + " \n", + " # Output layer\n", + " layers.Dense(num_classes, activation='softmax')\n", + " ])\n", + " \n", + " return model\n", + "\n", + "# Create model\n", + "model = create_model(input_dim=X_train_scaled.shape[1], num_classes=len(np.unique(y)))\n", + "\n", + "# Display model architecture\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Compile Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Compile model\n", + "model.compile(\n", + " optimizer=keras.optimizers.Adam(learning_rate=0.001),\n", + " loss='sparse_categorical_crossentropy',\n", + " metrics=['accuracy']\n", + ")\n", + "\n", + "print(\"Model compiled successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Setup Callbacks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Define callbacks\n", + "early_stopping = EarlyStopping(\n", + " monitor='val_loss',\n", + " patience=10,\n", + " restore_best_weights=True,\n", + " verbose=1\n", + ")\n", + "\n", + "reduce_lr = ReduceLROnPlateau(\n", + " monitor='val_loss',\n", + " factor=0.5,\n", + " patience=5,\n", + " min_lr=1e-7,\n", + " verbose=1\n", + ")\n", + "\n", + "# model_checkpoint = ModelCheckpoint(\n", + "# 'best_model.h5',\n", + "# monitor='val_accuracy',\n", + "# save_best_only=True,\n", + "# verbose=1\n", + "# )\n", + "\n", + "callbacks = [early_stopping, reduce_lr]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Train Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Train model\n", + "history = model.fit(\n", + " X_train_scaled, y_train,\n", + " validation_data=(X_val_scaled, y_val),\n", + " epochs=100,\n", + " batch_size=32,\n", + " callbacks=callbacks,\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Visualize Training History" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot training history\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "# Loss\n", + "ax1.plot(history.history['loss'], label='Training Loss')\n", + "ax1.plot(history.history['val_loss'], label='Validation Loss')\n", + "ax1.set_xlabel('Epoch')\n", + "ax1.set_ylabel('Loss')\n", + "ax1.set_title('Model Loss')\n", + "ax1.legend()\n", + "ax1.grid(True)\n", + "\n", + "# Accuracy\n", + "ax2.plot(history.history['accuracy'], label='Training Accuracy')\n", + "ax2.plot(history.history['val_accuracy'], label='Validation Accuracy')\n", + "ax2.set_xlabel('Epoch')\n", + "ax2.set_ylabel('Accuracy')\n", + "ax2.set_title('Model Accuracy')\n", + "ax2.legend()\n", + "ax2.grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Evaluate Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate on test set\n", + "test_loss, test_accuracy = model.evaluate(X_test_scaled, y_test, verbose=0)\n", + "print(f\"Test Loss: {test_loss:.4f}\")\n", + "print(f\"Test Accuracy: {test_accuracy:.4f}\")\n", + "\n", + "# Predictions\n", + "y_pred = model.predict(X_test_scaled)\n", + "y_pred_classes = np.argmax(y_pred, axis=1)\n", + "\n", + "# Classification report\n", + "print(\"\\nClassification Report:\")\n", + "print(classification_report(y_test, y_pred_classes))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Confusion matrix\n", + "plt.figure(figsize=(10, 8))\n", + "cm = confusion_matrix(y_test, y_pred_classes)\n", + "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n", + "plt.xlabel('Predicted')\n", + "plt.ylabel('Actual')\n", + "plt.title('Confusion Matrix')\n", + "plt.show()" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save model\n", + "# model.save('neural_network_model.h5')\n", + "# model.save('neural_network_model') # SavedModel format\n", + "\n", + "# Load model\n", + "# loaded_model = keras.models.load_model('neural_network_model.h5')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/reinforcement_learning.ipynb b/templates/reinforcement_learning.ipynb new file mode 100644 index 0000000..2e367d5 --- /dev/null +++ b/templates/reinforcement_learning.ipynb @@ -0,0 +1,421 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reinforcement Learning Template\n", + "\n", + "This template provides a starting point for reinforcement learning projects.\n", + "\n", + "## Features:\n", + "- Environment setup\n", + "- Q-Learning implementation\n", + "- Deep Q-Network (DQN)\n", + "- Training and evaluation\n", + "- Performance visualization" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "from collections import deque\n", + "import random\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Deep Learning\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "tf.random.set_seed(42)\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Define Simple Environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class SimpleGridWorld:\n", + " \"\"\"Simple grid world environment for demonstration.\"\"\"\n", + " \n", + " def __init__(self, size=5):\n", + " self.size = size\n", + " self.reset()\n", + " \n", + " def reset(self):\n", + " \"\"\"Reset environment to initial state.\"\"\"\n", + " self.agent_pos = [0, 0]\n", + " self.goal_pos = [self.size-1, self.size-1]\n", + " return self._get_state()\n", + " \n", + " def _get_state(self):\n", + " \"\"\"Get current state.\"\"\"\n", + " return tuple(self.agent_pos)\n", + " \n", + " def step(self, action):\n", + " \"\"\"Take action and return next state, reward, done.\"\"\"\n", + " # Actions: 0=up, 1=down, 2=left, 3=right\n", + " if action == 0 and self.agent_pos[0] > 0:\n", + " self.agent_pos[0] -= 1\n", + " elif action == 1 and self.agent_pos[0] < self.size - 1:\n", + " self.agent_pos[0] += 1\n", + " elif action == 2 and self.agent_pos[1] > 0:\n", + " self.agent_pos[1] -= 1\n", + " elif action == 3 and self.agent_pos[1] < self.size - 1:\n", + " self.agent_pos[1] += 1\n", + " \n", + " # Check if goal reached\n", 
+ " done = (self.agent_pos == self.goal_pos)\n", + " reward = 1.0 if done else -0.01\n", + " \n", + " return self._get_state(), reward, done\n", + " \n", + " def get_state_space(self):\n", + " \"\"\"Get size of state space.\"\"\"\n", + " return self.size * self.size\n", + " \n", + " def get_action_space(self):\n", + " \"\"\"Get size of action space.\"\"\"\n", + " return 4\n", + "\n", + "# Create environment\n", + "env = SimpleGridWorld(size=5)\n", + "print(f\"State space size: {env.get_state_space()}\")\n", + "print(f\"Action space size: {env.get_action_space()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Q-Learning Agent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class QLearningAgent:\n", + " \"\"\"Q-Learning agent.\"\"\"\n", + " \n", + " def __init__(self, state_space, action_space, learning_rate=0.1, \n", + " discount_factor=0.95, epsilon=1.0, epsilon_decay=0.995, \n", + " epsilon_min=0.01):\n", + " self.state_space = state_space\n", + " self.action_space = action_space\n", + " self.lr = learning_rate\n", + " self.gamma = discount_factor\n", + " self.epsilon = epsilon\n", + " self.epsilon_decay = epsilon_decay\n", + " self.epsilon_min = epsilon_min\n", + " # Side length of the square grid (state_space = grid_size**2)\n", + " self.grid_size = int(np.sqrt(state_space))\n", + " \n", + " # Initialize Q-table\n", + " self.q_table = np.zeros((state_space, action_space))\n", + " \n", + " def get_action(self, state):\n", + " \"\"\"Select action using epsilon-greedy policy.\"\"\"\n", + " if np.random.random() < self.epsilon:\n", + " return np.random.randint(self.action_space)\n", + " else:\n", + " state_idx = state[0] * self.grid_size + state[1] # Flatten (row, col) state to index\n", + " return np.argmax(self.q_table[state_idx])\n", + " \n", + " def update(self, state, action, reward, next_state, done):\n", + " \"\"\"Update Q-table.\"\"\"\n", + " state_idx = state[0] * self.grid_size + state[1]\n", + " next_state_idx = next_state[0] * self.grid_size + next_state[1]\n", + " \n", + " # Q-learning update\n", + " if done:\n", + " 
target = reward\n", + " else:\n", + " target = reward + self.gamma * np.max(self.q_table[next_state_idx])\n", + " \n", + " self.q_table[state_idx, action] += self.lr * (target - self.q_table[state_idx, action])\n", + " \n", + " def decay_epsilon(self):\n", + " \"\"\"Decay exploration rate.\"\"\"\n", + " self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)\n", + "\n", + "# Create agent\n", + "agent = QLearningAgent(env.get_state_space(), env.get_action_space())\n", + "print(\"Q-Learning agent created\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Train Q-Learning Agent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Training parameters\n", + "num_episodes = 1000\n", + "max_steps = 100\n", + "\n", + "# Training\n", + "rewards_history = []\n", + "epsilon_history = []\n", + "\n", + "for episode in range(num_episodes):\n", + " state = env.reset()\n", + " total_reward = 0\n", + " \n", + " for step in range(max_steps):\n", + " # Select and perform action\n", + " action = agent.get_action(state)\n", + " next_state, reward, done = env.step(action)\n", + " \n", + " # Update agent\n", + " agent.update(state, action, reward, next_state, done)\n", + " \n", + " total_reward += reward\n", + " state = next_state\n", + " \n", + " if done:\n", + " break\n", + " \n", + " # Decay epsilon\n", + " agent.decay_epsilon()\n", + " \n", + " # Record metrics\n", + " rewards_history.append(total_reward)\n", + " epsilon_history.append(agent.epsilon)\n", + " \n", + " if (episode + 1) % 100 == 0:\n", + " avg_reward = np.mean(rewards_history[-100:])\n", + " print(f\"Episode {episode+1}/{num_episodes}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}\")\n", + "\n", + "print(\"Training completed!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. 
Visualize Training Progress" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot training metrics\n", + "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "# Rewards\n", + "window_size = 50\n", + "moving_avg = pd.Series(rewards_history).rolling(window=window_size).mean()\n", + "ax1.plot(rewards_history, alpha=0.3, label='Episode Reward')\n", + "ax1.plot(moving_avg, label=f'{window_size}-Episode Moving Average')\n", + "ax1.set_xlabel('Episode')\n", + "ax1.set_ylabel('Total Reward')\n", + "ax1.set_title('Training Rewards')\n", + "ax1.legend()\n", + "ax1.grid(True)\n", + "\n", + "# Epsilon decay\n", + "ax2.plot(epsilon_history)\n", + "ax2.set_xlabel('Episode')\n", + "ax2.set_ylabel('Epsilon')\n", + "ax2.set_title('Exploration Rate Decay')\n", + "ax2.grid(True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Deep Q-Network (DQN) Agent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class DQNAgent:\n", + " \"\"\"Deep Q-Network agent.\"\"\"\n", + " \n", + " def __init__(self, state_size, action_size, learning_rate=0.001):\n", + " self.state_size = state_size\n", + " self.action_size = action_size\n", + " self.memory = deque(maxlen=2000)\n", + " self.gamma = 0.95\n", + " self.epsilon = 1.0\n", + " self.epsilon_decay = 0.995\n", + " self.epsilon_min = 0.01\n", + " self.learning_rate = learning_rate\n", + " self.model = self._build_model()\n", + " \n", + " def _build_model(self):\n", + " \"\"\"Build neural network model.\"\"\"\n", + " model = keras.Sequential([\n", + " layers.Dense(24, input_dim=self.state_size, activation='relu'),\n", + " layers.Dense(24, activation='relu'),\n", + " layers.Dense(self.action_size, activation='linear')\n", + " ])\n", + " model.compile(loss='mse', 
optimizer=keras.optimizers.Adam(learning_rate=self.learning_rate))\n", + " return model\n", + " \n", + " def remember(self, state, action, reward, next_state, done):\n", + " \"\"\"Store experience in replay memory.\"\"\"\n", + " self.memory.append((state, action, reward, next_state, done))\n", + " \n", + " def act(self, state):\n", + " \"\"\"Select action using epsilon-greedy policy.\"\"\"\n", + " if np.random.random() <= self.epsilon:\n", + " return random.randrange(self.action_size)\n", + " act_values = self.model.predict(state, verbose=0)\n", + " return np.argmax(act_values[0])\n", + " \n", + " def replay(self, batch_size):\n", + " \"\"\"Train on batch from replay memory.\"\"\"\n", + " if len(self.memory) < batch_size:\n", + " return\n", + " \n", + " minibatch = random.sample(self.memory, batch_size)\n", + " for state, action, reward, next_state, done in minibatch:\n", + " target = reward\n", + " if not done:\n", + " target = reward + self.gamma * np.amax(self.model.predict(next_state, verbose=0)[0])\n", + " \n", + " target_f = self.model.predict(state, verbose=0)\n", + " target_f[0][action] = target\n", + " self.model.fit(state, target_f, epochs=1, verbose=0)\n", + " \n", + " if self.epsilon > self.epsilon_min:\n", + " self.epsilon *= self.epsilon_decay\n", + "\n", + "print(\"DQN agent class defined\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Evaluate Agent" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def evaluate_agent(agent, env, num_episodes=10):\n", + " \"\"\"Evaluate trained agent.\"\"\"\n", + " total_rewards = []\n", + " \n", + " for episode in range(num_episodes):\n", + " state = env.reset()\n", + " total_reward = 0\n", + " done = False\n", + " steps = 0\n", + " \n", + " while not done and steps < 100:\n", + " action = agent.get_action(state)\n", + " next_state, reward, done = env.step(action)\n", + " total_reward += reward\n", + " state = next_state\n", + " steps += 1\n", + " \n", + " total_rewards.append(total_reward)\n", + " print(f\"Episode {episode+1}: Reward = {total_reward:.2f}, Steps = {steps}\")\n", + " \n", + " print(f\"\\nAverage Reward: {np.mean(total_rewards):.2f}\")\n", + " print(f\"Std Reward: {np.std(total_rewards):.2f}\")\n", + "\n", + "# Evaluate Q-Learning agent\n", + "print(\"Evaluating Q-Learning Agent:\")\n", + "evaluate_agent(agent, env, num_episodes=10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. 
Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save Q-table\n", + "# np.save('q_table.npy', agent.q_table)\n", + "\n", + "# Save DQN model\n", + "# dqn_agent.model.save('dqn_model.h5')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/sentiment_analysis_model.ipynb b/templates/sentiment_analysis_model.ipynb new file mode 100644 index 0000000..99c7822 --- /dev/null +++ b/templates/sentiment_analysis_model.ipynb @@ -0,0 +1,393 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Sentiment Analysis Model Template\n", + "\n", + "This template provides a starting point for building sentiment analysis models.\n", + "\n", + "## Features:\n", + "- Text preprocessing for sentiment analysis\n", + "- Feature extraction (TF-IDF, word embeddings)\n", + "- Model training and evaluation\n", + "- Sentiment prediction\n", + "- Performance metrics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "import re\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# NLP and ML\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.naive_bayes import MultinomialNB\n", + "from sklearn.svm import LinearSVC\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score\n", + "\n", + "# Deep Learning (optional)\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras.preprocessing.text import Tokenizer\n", + "from tensorflow.keras.preprocessing.sequence import pad_sequences\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Load and Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Create sample sentiment data\n", + "# In practice, load your data: df = pd.read_csv('sentiment_data.csv')\n", + "\n", + "sample_data = [\n", + " (\"This product is amazing! I love it.\", \"positive\"),\n", + " (\"Terrible experience, would not recommend.\", \"negative\"),\n", + " (\"Great quality and fast shipping.\", \"positive\"),\n", + " (\"Waste of money. 
Very disappointed.\", \"negative\"),\n", + " (\"Excellent customer service!\", \"positive\"),\n", + " (\"Poor quality, broke after one use.\", \"negative\"),\n", + " (\"Best purchase I've made this year.\", \"positive\"),\n", + " (\"Not worth the price at all.\", \"negative\"),\n", + " (\"Highly satisfied with this product.\", \"positive\"),\n", + " (\"Complete waste of time and money.\", \"negative\")\n", + "]\n", + "\n", + "df = pd.DataFrame(sample_data, columns=['text', 'sentiment'])\n", + "\n", + "print(f\"Dataset shape: {df.shape}\")\n", + "print(\"\\nFirst few rows:\")\n", + "print(df.head())\n", + "\n", + "# Check class distribution\n", + "print(\"\\nClass distribution:\")\n", + "print(df['sentiment'].value_counts())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Text Preprocessing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def preprocess_text(text):\n", + " \"\"\"Clean and preprocess text.\"\"\"\n", + " # Convert to lowercase\n", + " text = text.lower()\n", + " \n", + " # Remove special characters and digits\n", + " text = re.sub(r'[^a-zA-Z\\s]', '', text)\n", + " \n", + " # Remove extra whitespace\n", + " text = ' '.join(text.split())\n", + " \n", + " return text\n", + "\n", + "# Apply preprocessing\n", + "df['cleaned_text'] = df['text'].apply(preprocess_text)\n", + "\n", + "print(\"Example preprocessing:\")\n", + "print(f\"Original: {df['text'].iloc[0]}\")\n", + "print(f\"Cleaned: {df['cleaned_text'].iloc[0]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Feature Extraction - TF-IDF Approach" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Encode labels\n", + "df['label'] = df['sentiment'].map({'negative': 0, 'positive': 1})\n", + "\n", + "# Split data\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + " df['cleaned_text'], df['label'], test_size=0.2, random_state=42\n", + ")\n", + "\n", + "# TF-IDF vectorization\n", + "tfidf = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))\n", + "X_train_tfidf = tfidf.fit_transform(X_train)\n", + "X_test_tfidf = tfidf.transform(X_test)\n", + "\n", + "print(f\"TF-IDF feature shape: {X_train_tfidf.shape}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Train Traditional ML Models" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize models\n", + "models = {\n", + " 'Logistic Regression': LogisticRegression(random_state=42),\n", + " 'Naive Bayes': MultinomialNB(),\n", + " 'Linear SVM': LinearSVC(random_state=42),\n", + " 'Random Forest': RandomForestClassifier(random_state=42)\n", + "}\n", + "\n", + "# Train and evaluate\n", + "results = {}\n", + "for name, model in models.items():\n", + " # Train\n", + " model.fit(X_train_tfidf, y_train)\n", + " \n", + " # Predict\n", + " y_pred = model.predict(X_test_tfidf)\n", + " \n", + " # Evaluate\n", + " accuracy = accuracy_score(y_test, y_pred)\n", + " results[name] = accuracy\n", + " \n", + " print(f\"{name}: {accuracy:.4f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. 
Deep Learning Approach (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Tokenization for deep learning\n", + "tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')\n", + "tokenizer.fit_on_texts(X_train)\n", + "\n", + "# Convert to sequences\n", + "X_train_seq = tokenizer.texts_to_sequences(X_train)\n", + "X_test_seq = tokenizer.texts_to_sequences(X_test)\n", + "\n", + "# Pad sequences\n", + "max_length = 100\n", + "X_train_padded = pad_sequences(X_train_seq, maxlen=max_length, padding='post')\n", + "X_test_padded = pad_sequences(X_test_seq, maxlen=max_length, padding='post')\n", + "\n", + "print(f\"Padded sequence shape: {X_train_padded.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Build LSTM model\n", + "def create_lstm_model(vocab_size, embedding_dim=128, max_length=100):\n", + " model = keras.Sequential([\n", + " layers.Embedding(vocab_size, embedding_dim, input_length=max_length),\n", + " layers.LSTM(64, return_sequences=True),\n", + " layers.Dropout(0.3),\n", + " layers.LSTM(32),\n", + " layers.Dropout(0.3),\n", + " layers.Dense(16, activation='relu'),\n", + " layers.Dense(1, activation='sigmoid')\n", + " ])\n", + " return model\n", + "\n", + "# Create and compile model\n", + "vocab_size = len(tokenizer.word_index) + 1\n", + "dl_model = create_lstm_model(vocab_size, max_length=max_length)\n", + "dl_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])\n", + "\n", + "dl_model.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Train deep learning model\n", + "history = dl_model.fit(\n", + " X_train_padded, y_train,\n", + " epochs=20,\n", + " batch_size=32,\n", + " validation_split=0.2,\n", + " verbose=1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
Model Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate deep learning model\n", + "y_pred_dl = (dl_model.predict(X_test_padded) > 0.5).astype(int)\n", + "dl_accuracy = accuracy_score(y_test, y_pred_dl)\n", + "\n", + "print(f\"Deep Learning Model Accuracy: {dl_accuracy:.4f}\")\n", + "print(\"\\nClassification Report:\")\n", + "print(classification_report(y_test, y_pred_dl, target_names=['Negative', 'Positive']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Sentiment Prediction Function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def predict_sentiment(text, model, use_deep_learning=False):\n", + " \"\"\"Predict sentiment of input text.\"\"\"\n", + " # Preprocess\n", + " cleaned = preprocess_text(text)\n", + " \n", + " if use_deep_learning:\n", + " # Deep learning prediction\n", + " sequence = tokenizer.texts_to_sequences([cleaned])\n", + " padded = pad_sequences(sequence, maxlen=max_length, padding='post')\n", + " prediction = model.predict(padded, verbose=0)[0][0]\n", + " sentiment = 'Positive' if prediction > 0.5 else 'Negative'\n", + " confidence = prediction if prediction > 0.5 else 1 - prediction\n", + " else:\n", + " # Traditional ML prediction\n", + " features = tfidf.transform([cleaned])\n", + " prediction = model.predict(features)[0]\n", + " sentiment = 'Positive' if prediction == 1 else 'Negative'\n", + " if hasattr(model, 'predict_proba'):\n", + " confidence = model.predict_proba(features)[0][prediction]\n", + " else:\n", + " confidence = 1.0 # e.g. LinearSVC exposes no predict_proba\n", + " \n", + " return sentiment, confidence\n", + "\n", + "# Test predictions\n", + "test_texts = [\n", + " \"This is absolutely wonderful!\",\n", + " \"I hate this product.\",\n", + " \"Not bad, quite good actually.\"\n", + "]\n", + "\n", + "print(\"Sentiment Predictions:\")\n", + "for text in test_texts:\n", + " sentiment, conf = predict_sentiment(text, dl_model, 
use_deep_learning=True)\n", + " print(f\"\\nText: '{text}'\")\n", + " print(f\"Sentiment: {sentiment} (Confidence: {conf:.2f})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Save Model (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save deep learning model\n", + "# dl_model.save('sentiment_model.h5')\n", + "\n", + "# Save traditional ML model and vectorizer\n", + "# import pickle\n", + "# with open('sentiment_lr_model.pkl', 'wb') as f:\n", + "# pickle.dump(models['Logistic Regression'], f)\n", + "# with open('tfidf_vectorizer.pkl', 'wb') as f:\n", + "# pickle.dump(tfidf, f)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/templates/time_series_analysis.ipynb b/templates/time_series_analysis.ipynb new file mode 100644 index 0000000..26a813d --- /dev/null +++ b/templates/time_series_analysis.ipynb @@ -0,0 +1,477 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Time Series Analysis Template\n", + "\n", + "This template provides a starting point for time series analysis and forecasting.\n", + "\n", + "## Features:\n", + "- Time series data preprocessing\n", + "- Trend and seasonality analysis\n", + "- Forecasting models (ARIMA, Prophet, LSTM)\n", + "- Model evaluation\n", + "- Visualization of predictions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. 
Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Data manipulation\n", + "import numpy as np\n", + "import pandas as pd\n", + "from datetime import datetime, timedelta\n", + "\n", + "# Visualization\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns\n", + "\n", + "# Time series analysis\n", + "from statsmodels.tsa.seasonal import seasonal_decompose\n", + "from statsmodels.tsa.stattools import adfuller\n", + "from statsmodels.graphics.tsaplots import plot_acf, plot_pacf\n", + "\n", + "# Forecasting models\n", + "from statsmodels.tsa.arima.model import ARIMA\n", + "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", + "\n", + "# Deep Learning (optional)\n", + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "from tensorflow.keras import layers\n", + "from sklearn.preprocessing import MinMaxScaler\n", + "\n", + "# Settings\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "np.random.seed(42)\n", + "\n", + "# Set visualization style\n", + "sns.set_style('whitegrid')\n", + "\n", + "print(f\"TensorFlow version: {tf.__version__}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and Explore Time Series Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load your data\n", + "# df = pd.read_csv('your_timeseries_data.csv', parse_dates=['date'], index_col='date')\n", + "\n", + "# Example: Create sample time series data\n", + "np.random.seed(42)\n", + "date_range = pd.date_range(start='2020-01-01', end='2023-12-31', freq='D')\n", + "n = len(date_range)\n", + "\n", + "# Create trend\n", + "trend = np.linspace(100, 200, n)\n", + "\n", + "# Create seasonality\n", + "seasonality = 20 * np.sin(2 * np.pi * np.arange(n) / 365.25)\n", + "\n", + "# Create noise\n", + "noise = np.random.normal(0, 5, n)\n", + "\n", + "# Combine components\n", + "values = trend + seasonality + noise\n", + "\n", + "df = pd.DataFrame({'date': date_range, 'value': values})\n", + "df.set_index('date', inplace=True)\n", + "\n", + "print(f\"Dataset shape: {df.shape}\")\n", + "print(f\"Date range: {df.index.min()} to {df.index.max()}\")\n", + "print(\"\\nFirst few rows:\")\n", + "print(df.head())\n", + "print(\"\\nBasic statistics:\")\n", + "print(df.describe())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot time series\n", + "plt.figure(figsize=(15, 6))\n", + "plt.plot(df.index, df['value'], linewidth=1)\n", + "plt.xlabel('Date')\n", + "plt.ylabel('Value')\n", + "plt.title('Time Series Data')\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. 
Time Series Decomposition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Decompose time series\n", + "decomposition = seasonal_decompose(df['value'], model='additive', period=365)\n", + "\n", + "# Plot components\n", + "fig, axes = plt.subplots(4, 1, figsize=(15, 12))\n", + "\n", + "decomposition.observed.plot(ax=axes[0], title='Original')\n", + "axes[0].set_ylabel('Observed')\n", + "\n", + "decomposition.trend.plot(ax=axes[1], title='Trend')\n", + "axes[1].set_ylabel('Trend')\n", + "\n", + "decomposition.seasonal.plot(ax=axes[2], title='Seasonality')\n", + "axes[2].set_ylabel('Seasonal')\n", + "\n", + "decomposition.resid.plot(ax=axes[3], title='Residuals')\n", + "axes[3].set_ylabel('Residual')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Stationarity Test" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def test_stationarity(timeseries, title='Time Series'):\n", + " \"\"\"Perform Augmented Dickey-Fuller test.\"\"\"\n", + " # Perform Augmented Dickey-Fuller test\n", + " result = adfuller(timeseries.dropna())\n", + " \n", + " print(f'ADF Test for {title}:')\n", + " print(f' ADF Statistic: {result[0]:.6f}')\n", + " print(f' p-value: {result[1]:.6f}')\n", + " print(f' Critical Values:')\n", + " for key, value in result[4].items():\n", + " print(f' {key}: {value:.3f}')\n", + " \n", + " if result[1] <= 0.05:\n", + " print(f' Result: Series is stationary (reject H0)')\n", + " else:\n", + " print(f' Result: Series is non-stationary (fail to reject H0)')\n", + " print()\n", + "\n", + "# Test original series\n", + "test_stationarity(df['value'], 'Original Series')\n", + "\n", + "# Test differenced series\n", + "df['value_diff'] = df['value'].diff()\n", + "test_stationarity(df['value_diff'], 'Differenced Series')" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "## 5. ACF and PACF Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Plot ACF and PACF\n", + "fig, axes = plt.subplots(1, 2, figsize=(15, 5))\n", + "\n", + "plot_acf(df['value'].dropna(), lags=40, ax=axes[0])\n", + "axes[0].set_title('Autocorrelation Function (ACF)')\n", + "\n", + "plot_pacf(df['value'].dropna(), lags=40, ax=axes[1])\n", + "axes[1].set_title('Partial Autocorrelation Function (PACF)')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Train-Test Split" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data\n", + "train_size = int(len(df) * 0.8)\n", + "train = df['value'][:train_size]\n", + "test = df['value'][train_size:]\n", + "\n", + "print(f\"Training set size: {len(train)}\")\n", + "print(f\"Test set size: {len(test)}\")\n", + "\n", + "# Visualize split\n", + "plt.figure(figsize=(15, 6))\n", + "plt.plot(train.index, train, label='Train', linewidth=1)\n", + "plt.plot(test.index, test, label='Test', linewidth=1)\n", + "plt.xlabel('Date')\n", + "plt.ylabel('Value')\n", + "plt.title('Train-Test Split')\n", + "plt.legend()\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
ARIMA Model"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fit ARIMA model\n",
+ "# Order (p, d, q): AR order, differencing order, MA order (guided by the ACF/PACF plots above)\n",
+ "arima_model = ARIMA(train, order=(5, 1, 2))\n",
+ "arima_result = arima_model.fit()\n",
+ "\n",
+ "print(\"ARIMA Model Summary:\")\n",
+ "print(arima_result.summary())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Make predictions\n",
+ "arima_predictions = arima_result.forecast(steps=len(test))\n",
+ "\n",
+ "# Calculate metrics\n",
+ "arima_mse = mean_squared_error(test, arima_predictions)\n",
+ "arima_rmse = np.sqrt(arima_mse)\n",
+ "arima_mae = mean_absolute_error(test, arima_predictions)\n",
+ "\n",
+ "print(\"ARIMA Model Performance:\")\n",
+ "print(f\" MSE: {arima_mse:.4f}\")\n",
+ "print(f\" RMSE: {arima_rmse:.4f}\")\n",
+ "print(f\" MAE: {arima_mae:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 8. 
LSTM Model for Time Series"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def create_sequences(data, seq_length):\n",
+ "    \"\"\"Create sequences for LSTM.\"\"\"\n",
+ "    X, y = [], []\n",
+ "    for i in range(len(data) - seq_length):\n",
+ "        X.append(data[i:i+seq_length])\n",
+ "        y.append(data[i+seq_length])\n",
+ "    return np.array(X), np.array(y)\n",
+ "\n",
+ "# Prepare data for LSTM\n",
+ "scaler = MinMaxScaler()\n",
+ "train_scaled = scaler.fit_transform(train.values.reshape(-1, 1))\n",
+ "test_scaled = scaler.transform(test.values.reshape(-1, 1))\n",
+ "\n",
+ "# Create sequences\n",
+ "seq_length = 30\n",
+ "X_train, y_train = create_sequences(train_scaled, seq_length)\n",
+ "X_test, y_test = create_sequences(test_scaled, seq_length)\n",
+ "\n",
+ "# Reshape for LSTM\n",
+ "X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))\n",
+ "X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))\n",
+ "\n",
+ "print(f\"X_train shape: {X_train.shape}\")\n",
+ "print(f\"X_test shape: {X_test.shape}\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Build LSTM model\n",
+ "lstm_model = keras.Sequential([\n",
+ "    layers.LSTM(50, activation='relu', return_sequences=True, input_shape=(seq_length, 1)),\n",
+ "    layers.Dropout(0.2),\n",
+ "    layers.LSTM(50, activation='relu'),\n",
+ "    layers.Dropout(0.2),\n",
+ "    layers.Dense(1)\n",
+ "])\n",
+ "\n",
+ "lstm_model.compile(optimizer='adam', loss='mse')\n",
+ "lstm_model.summary()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Train LSTM model\n",
+ "history = lstm_model.fit(\n",
+ "    X_train, y_train,\n",
+ "    epochs=50,\n",
+ "    batch_size=32,\n",
+ "    validation_split=0.2,\n",
+ "    verbose=1\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ 
"source": [
+ "# Make predictions with LSTM\n",
+ "lstm_predictions_scaled = lstm_model.predict(X_test)\n",
+ "lstm_predictions = scaler.inverse_transform(lstm_predictions_scaled)\n",
+ "y_test_original = scaler.inverse_transform(y_test.reshape(-1, 1))\n",
+ "\n",
+ "# Calculate metrics\n",
+ "lstm_mse = mean_squared_error(y_test_original, lstm_predictions)\n",
+ "lstm_rmse = np.sqrt(lstm_mse)\n",
+ "lstm_mae = mean_absolute_error(y_test_original, lstm_predictions)\n",
+ "\n",
+ "print(\"LSTM Model Performance:\")\n",
+ "print(f\" MSE: {lstm_mse:.4f}\")\n",
+ "print(f\" RMSE: {lstm_rmse:.4f}\")\n",
+ "print(f\" MAE: {lstm_mae:.4f}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 9. Visualize Predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Plot predictions\n",
+ "plt.figure(figsize=(15, 6))\n",
+ "\n",
+ "plt.plot(train.index, train, label='Train', linewidth=1)\n",
+ "plt.plot(test.index, test, label='Actual', linewidth=1)\n",
+ "plt.plot(test.index, arima_predictions, label='ARIMA Predictions', linewidth=1, linestyle='--')\n",
+ "\n",
+ "# Plot LSTM predictions (adjust index for sequence length)\n",
+ "lstm_pred_index = test.index[seq_length:]\n",
+ "plt.plot(lstm_pred_index, lstm_predictions, label='LSTM Predictions', linewidth=1, linestyle='--')\n",
+ "\n",
+ "plt.xlabel('Date')\n",
+ "plt.ylabel('Value')\n",
+ "plt.title('Time Series Forecasting Comparison')\n",
+ "plt.legend()\n",
+ "plt.grid(True)\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 10. 
Save Models (Optional)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save ARIMA model\n", + "# arima_result.save('arima_model.pkl')\n", + "\n", + "# Save LSTM model\n", + "# lstm_model.save('lstm_model.h5')\n", + "\n", + "# Save scaler\n", + "# import pickle\n", + "# with open('scaler.pkl', 'wb') as f:\n", + "# pickle.dump(scaler, f)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 2bc785c0a57fb8b3cdb4a9d9513246d327b23472 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Sat, 11 Oct 2025 04:38:07 +0000 Subject: [PATCH 3/3] Update QUICK_REFERENCE.md to mention templates folder Co-authored-by: macanderson <542881+macanderson@users.noreply.github.com> --- QUICK_REFERENCE.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/QUICK_REFERENCE.md b/QUICK_REFERENCE.md index 2df3a88..2b38ae5 100644 --- a/QUICK_REFERENCE.md +++ b/QUICK_REFERENCE.md @@ -1,5 +1,7 @@ # ML Debugging Notebooks - Quick Reference Guide +> **New!** Looking to start a new ML project? Check out the [templates/](templates/) directory for ready-to-use project templates covering various ML tasks. + ## 🎯 Quick Problem Finder **My model has high training accuracy but low test accuracy** → [Notebook 1: Overfitting/Underfitting](notebooks/1_overfitting_underfitting.ipynb) @@ -144,6 +146,22 @@ If you encounter issues: 3. Open an issue on GitHub 4. Search Stack Overflow with specific error messages +## 🎨 ML Project Templates + +Ready to start a new ML project? 
The `templates/` directory contains starter Jupyter notebooks for nine common ML tasks:
+
+- **Machine Learning Preset** - General classification tasks
+- **Neural Network Model** - Deep learning with TensorFlow/Keras
+- **Language Model** - Text generation and NLP
+- **Sentiment Analysis** - Text classification
+- **Clustering Models** - Unsupervised learning
+- **Reinforcement Learning** - Q-Learning and DQN
+- **Anomaly Detection** - Outlier identification
+- **Time Series Analysis** - Forecasting with ARIMA and LSTM
+- **Computer Vision** - Image classification with CNNs
+
+See [templates/README.md](templates/README.md) for detailed information about each template.
+
---

**Remember:** Debugging is a skill that improves with practice. Don't get discouraged! 🚀