# Prompt-Classification-ML

This README provides a comprehensive overview of the Prompt-Classification-ML project.

## Table of Contents

- Description
- Features
- Project Structure
- Installation
- Usage
- License
- Contributing
- Acknowledgements
- Screenshots/Examples
## Description

This repository hosts a Machine Learning project dedicated to the classification of textual prompts. The project aims to accurately categorize various prompts using advanced ML techniques, including hierarchical modeling. It covers the entire machine learning pipeline, from data collection and preprocessing to model training, evaluation, and deployment of a simple web application for real-time inference.
The core of the project involves:
- Data Preparation: Handling and expanding prompt datasets.
- Model Building: Experimenting with different machine learning algorithms and architectures, notably exploring hierarchical classification approaches.
- Model Persistence: Saving trained models and preprocessors for efficient use.
- Application Development: Providing a user-friendly interface to classify new prompts.
The repository includes Jupyter notebooks for experimental development, Python scripts for core functionalities, and documentation files detailing the project's methodology and results.
## Features

- Prompt Classification: Core functionality to classify input text prompts into predefined categories.
- Hierarchical Model Building: Implementation and exploration of hierarchical machine learning models for improved classification accuracy and structure.
- Data Preprocessing: Robust techniques for cleaning, tokenizing, and vectorizing textual data (e.g., using TF-IDF).
- Model Persistence: Pre-trained vectorizers and classification models are saved (`.pkl` files) for quick loading and inference.
- Data Augmentation/Generation: Includes datasets that appear to be generated or expanded (`expanded_prompts.csv`, `generated_prompts_01.csv`) to enhance training diversity.
- Interactive Web Application: A lightweight web application (`app.py`) for users to input prompts and receive real-time classification results.
- Jupyter Notebooks: Comprehensive notebooks documenting the data exploration, model training, and evaluation processes.
- Project Documentation: Detailed reports (`.docx`, `.pdf`) outlining the project's approach, findings, and technical aspects.
## Project Structure

The repository is organized to clearly separate data, code, and documentation:

```
Prompt-Classification-ML/
├── .ipynb_checkpoints/                      # Jupyter Notebook checkpoints (temporary files)
├── data/                                    # Datasets used for training and testing
│   └── expanded_prompts.csv
├── docx/                                    # Project reports and documentation
│   ├── 23BQ1A4261_ML_Report.docx.docx
│   └── 23BQ1A4261_prompt-classification report.pdf
├── 23BQ1A4261_Prompt-Classification.ipynb   # Main Jupyter notebook for project development
├── 23BQ1A4261_Prompt-Classification.py      # Python script version of the main project
├── app.py                                   # Script for the web application (likely Streamlit/Flask)
├── cluster_encoder.pkl                      # Pickled object, possibly a trained cluster encoder
├── vectorizer.pkl                           # Pickled object, typically a TF-IDF vectorizer or similar
├── README.md                                # This README file
└── __pycache__/                             # Python bytecode cache
```
**Key Files and Directories:**

- `data/`: Contains the primary dataset, `expanded_prompts.csv`, which is central to model training.
- `docx/`: Holds the official project reports, offering deep insights into the methodologies and results.
- `23BQ1A4261_Prompt-Classification.ipynb`: The main Jupyter notebook where data loading, preprocessing, model training, and evaluation are performed.
- `app.py`: The entry point for the web application, allowing users to classify prompts via a user interface.
- `vectorizer.pkl`: A serialized object (e.g., `TfidfVectorizer`) used to transform text into numerical features, crucial for consistency between training and inference.
- `cluster_encoder.pkl`: A serialized encoder, potentially related to clustering or embedding for the hierarchical model.
## Installation

To set up the project locally, follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/harsha4261/Prompt-Classification-ML.git
   cd Prompt-Classification-ML
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv venv
   # On Windows
   .\venv\Scripts\activate
   # On macOS/Linux
   source venv/bin/activate
   ```

3. Install dependencies. This project likely uses standard Python libraries for machine learning and NLP. Since a `requirements.txt` is not explicitly provided, you might need to install them manually:

   ```bash
   pip install pandas numpy scikit-learn nltk
   # If app.py is a Streamlit app
   pip install streamlit
   # If app.py is a Flask app
   # pip install flask
   ```

   Note: You might need to install NLTK data for specific tokenizers or resources if used in the notebooks:

   ```python
   import nltk
   nltk.download('punkt')  # Example, install as needed
   ```
## Usage

### Running the Notebooks

The main development and experimentation happen in the Jupyter notebooks.

1. Start Jupyter Notebook:

   ```bash
   jupyter notebook
   ```

2. Open `23BQ1A4261_Prompt-Classification.ipynb` in your browser.
3. Execute the cells sequentially to:
   - Load and preprocess data from `data/expanded_prompts.csv`.
   - Train the text vectorizer (`vectorizer.pkl`).
   - Train various classification models, including the hierarchical model.
   - Evaluate model performance.
   - Save the trained models, `vectorizer.pkl`, and `cluster_encoder.pkl`.
4. Other notebooks such as `enhanced_model_building-checkpoint.ipynb` and `hierarchical_model_building-checkpoint.ipynb` (from `.ipynb_checkpoints/`) can provide more details on specific model development phases.
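The hierarchical step can be sketched as a two-stage pipeline: a coarse model routes a prompt to a top-level cluster, then a per-cluster model assigns the fine-grained category. The dataset, labels, and choice of `LogisticRegression` below are all illustrative assumptions, not the project's actual model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset: (prompt, coarse label, fine label).
data = [
    ("write a python function", "code", "generate"),
    ("implement a sorting function in python", "code", "generate"),
    ("fix the bug in this python loop", "code", "debug"),
    ("debug this crashing python script", "code", "debug"),
    ("write a short poem about autumn", "writing", "creative"),
    ("compose a creative story about dragons", "writing", "creative"),
    ("summarize this news article", "writing", "summarize"),
    ("give a brief summary of the article", "writing", "summarize"),
]
prompts, coarse, fine = zip(*data)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(prompts)

# Stage 1: a coarse classifier trained on all prompts.
coarse_clf = LogisticRegression(max_iter=1000).fit(X, coarse)

# Stage 2: one fine-grained classifier per coarse cluster.
fine_clfs = {}
for label in set(coarse):
    rows = [i for i, y in enumerate(coarse) if y == label]
    fine_clfs[label] = LogisticRegression(max_iter=1000).fit(
        X[rows], [fine[i] for i in rows]
    )

def classify(prompt):
    """Route a prompt through both stages and return (coarse, fine) labels."""
    x = vectorizer.transform([prompt])
    top = coarse_clf.predict(x)[0]
    return top, fine_clfs[top].predict(x)[0]
```

Splitting the decision this way lets each fine-grained model specialize on a smaller, more homogeneous slice of the label space.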
### Running the Web Application

The `app.py` script provides a user interface for classifying prompts.

1. Ensure you have run the Jupyter notebook at least once to generate `vectorizer.pkl` and `cluster_encoder.pkl` (and any other model `.pkl` files), as `app.py` relies on these pre-trained components.
2. Run the application.

   If `app.py` is a Streamlit application:

   ```bash
   streamlit run app.py
   ```

   If `app.py` is a Flask application (common, but requires `FLASK_APP=app.py`):

   ```bash
   export FLASK_APP=app.py   # For macOS/Linux
   # set FLASK_APP=app.py    # For Windows CMD
   flask run
   ```

3. Open your web browser and navigate to the address provided by the command line (e.g., `http://localhost:8501` for Streamlit or `http://127.0.0.1:5000` for Flask).
4. Enter a prompt in the input field and observe the classification result.
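Since the internals of `app.py` are not documented here, the following is only a hypothetical sketch of how such an app is often structured, assuming a Streamlit front end and assuming `cluster_encoder.pkl` holds the final classifier (both guesses). The prediction logic is kept in a plain function so it can be reused and tested outside the UI:

```python
import pickle

def classify(prompt, vectorizer, model):
    """Vectorize one prompt with the fitted vectorizer and predict its label."""
    features = vectorizer.transform([prompt])
    return model.predict(features)[0]

def main():
    # Streamlit UI; in a real app.py, main() would be called at module top
    # level, since `streamlit run` executes the script from top to bottom.
    import streamlit as st

    with open("vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open("cluster_encoder.pkl", "rb") as f:  # assumption: the classifier
        model = pickle.load(f)

    prompt = st.text_input("Enter a prompt to classify")
    if prompt:
        st.write("Predicted category:", classify(prompt, vectorizer, model))
```

Loading the pickles once at startup and funneling every request through the same `classify` function keeps inference consistent with training.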
## License

No explicit license file was found in the repository, so the project is currently unlicensed. Users should contact the repository owner for permissions regarding use, distribution, or modification.
## Contributing

Contributions are welcome! If you have suggestions for improvements, bug fixes, or new features, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature/YourFeatureName` or `bugfix/FixDescription`).
3. Make your changes.
4. Commit your changes (`git commit -m 'Add new feature'`).
5. Push to the branch (`git push origin feature/YourFeatureName`).
6. Open a Pull Request to the `main` branch of the original repository.

Please ensure your code adheres to good practices and includes comments where necessary.
## Acknowledgements

- The creator of this repository, Harsha, for developing this project.
- The open-source community for providing the tools and libraries (e.g., `scikit-learn`, `pandas`, `streamlit`) that make projects like this possible.