This repository contains a system for classifying log messages using a combination of clustering, regex-based rules, and machine learning models. The project processes and categorizes log data from various sources into meaningful labels such as "HTTP Status," "Security Alert," "Critical Error," and "Workflow Error."
Regular Expression (Regex):
- Targets straightforward and predictable log message patterns.
- Ideal for rule-based classification of common events like "User Action" or "System Notification" using predefined patterns.
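As a rough illustration, the regex stage can be a small lookup over predefined patterns (the patterns and labels below are hypothetical examples, not the project's actual rules):

```python
import re

# Hypothetical rules mapping regex patterns to labels; the real rules
# live in the training notebook and may differ.
REGEX_RULES = [
    (r"\buser\s+\w+\s+logged\s+(in|out)\b", "User Action"),
    (r"\bbackup\s+(started|completed)\b", "System Notification"),
]

def classify_with_regex(log_message):
    """Return the label of the first matching rule, or None so the
    caller can fall through to the ML/LLM stages."""
    for pattern, label in REGEX_RULES:
        if re.search(pattern, log_message, re.IGNORECASE):
            return label
    return None
```

Returning None (rather than a default label) lets the later stages pick up any message the rules do not cover.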
Sentence Transformer + ML Classification Models:
- Handles intricate patterns with adequate training data.
- Leverages embeddings from Sentence Transformers to capture semantic meaning, with several classification algorithms for robust decision-making.
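A minimal sketch of this stage, with the embedder and classifier passed in as parameters (in the real pipeline these would be a SentenceTransformer model and one of the classifiers saved under clf_models/; the confidence threshold is an assumption):

```python
def classify_with_model(log_message, embedder, clf, threshold=0.5):
    """Embed the message, classify it, and return None when the
    classifier's confidence is too low to trust, deferring to the LLM.

    In the real pipeline the arguments would be loaded roughly like
    (model name assumed for illustration):
        from sentence_transformers import SentenceTransformer
        import joblib
        embedder = SentenceTransformer("all-MiniLM-L6-v2")
        clf = joblib.load("clf_models/random_forest.joblib")
    """
    embedding = embedder.encode([log_message])      # (1, dim) vector
    probs = list(clf.predict_proba(embedding)[0])   # per-class probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None                                 # low confidence: fall back
    return clf.classes_[best]
```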
LLM (Large Language Models):
- Addresses complex patterns when labeled data is scarce.
- Serves as a flexible fallback or supplementary method for rare or ambiguous log classifications.
- DeepSeek R1 LLM: Provides an additional layer of validation for complex or rare log patterns where labeled data is limited.
- ML Models: Ensures consistency by comparing predictions from the Random Forest, Naive Bayes, and Logistic Regression models, all of which demonstrated high accuracy during training.
This multi-model verification strengthens the reliability of the classification system, especially for edge cases identified in the notebook's clustering and regex phases.
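One way to express this verification is a majority vote across the three classifiers, deferring to the LLM when they disagree (a minimal sketch; the actual logic in the notebook may differ):

```python
from collections import Counter

def verified_label(log_message, classifiers, llm_fallback):
    """classifiers: callables (e.g. wrapping the Random Forest, Naive
    Bayes, and Logistic Regression models) mapping a message to a label.
    Returns the majority label, or the LLM's answer when there is no
    clear consensus among the models."""
    votes = Counter(clf(log_message) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    if count > len(classifiers) // 2:  # strict majority agrees
        return label
    return llm_fallback(log_message)   # ambiguous: validate with the LLM
```

This keeps the expensive LLM call on the rare-disagreement path only.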
training/:
- Contains the Jupyter Notebook (log_classification_training.ipynb) for training models using SentenceTransformer and multiple classifiers (Random Forest, Naive Bayes, Logistic Regression).
- Includes code for regex-based classification as part of the training pipeline.
clf_models/:
- Stores the saved models, including the trained Random Forest (random_forest.joblib), Naive Bayes (naive_bayes.joblib), and Logistic Regression (logistic_regression.joblib) models, along with their SentenceTransformer embeddings.
files/:
- Contains test CSV input files (test.csv, test2.csv, test3.csv).
Root Directory:
- Contains the server script (server.py) implemented with FastAPI for serving model predictions.
- Includes locustfile.py for load testing the system using Locust to simulate traffic and evaluate performance.
Install Dependencies: Ensure Python is installed on your system. Set up a virtual environment and install the required libraries by running the following commands:
python -m venv venv
source venv/bin/activate   (on Windows: venv\Scripts\activate)
pip install -r requirements.txt
Run the FastAPI Server: To start the server, use the following command:
uvicorn server:app --reload
Once the server is running, access the API at:
http://127.0.0.1:8000
Run Load Testing with Locust: To simulate traffic and test the system's performance, use Locust by running:
locust -f locustfile.py
Access the Locust web interface at http://127.0.0.1:8089 to configure and start the load test.
Upload a CSV file containing logs to the FastAPI endpoint for classification (Postman can be used for testing). Ensure the file has the following columns:
source, log_message
The output will be a CSV file with an additional column target_label, which represents the classified label for each log entry.
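The end-to-end transformation applied by the endpoint can be sketched in plain Python (the helper below is illustrative; server.py's actual implementation may differ, e.g. by using pandas):

```python
import csv
import io

def add_target_labels(csv_text, classify_fn):
    """Read rows with 'source' and 'log_message' columns and append
    a 'target_label' column produced by classify_fn."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row, target_label=classify_fn(row["log_message"]))
            for row in reader]
    out = io.StringIO()
    writer = csv.DictWriter(out,
                            fieldnames=["source", "log_message", "target_label"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```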