This project implements an NLP pipeline designed to extract actionable tasks from unstructured text and categorize them. The system identifies sentences containing actionable instructions and extracts key details such as the responsible person and deadlines. It then organizes the tasks into meaningful groups using a heuristic-based approach, with options for additional topic modeling.
- Task Extraction: Identify actionable sentences using imperative detection and task indicator phrases.
- Entity Extraction: Extract the responsible person from task sentences when available.
- Deadline Extraction: Capture deadline information (e.g., "by 5 pm today", "by next Monday") using robust regular expressions.
- Task Categorization: Classify the extracted tasks into predefined categories (e.g., Shopping, Cleaning, Communication, Review, Work, Errand) using keyword matching and, optionally, LDA for dynamic topic clustering.
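
As a rough illustration of the keyword-matching side of categorization, the sketch below picks the category whose keyword list overlaps most with the words of a task. The keyword lists, the `categorize` function name, and the `Uncategorized` fallback are assumptions made for this example, not the project's actual lists or API.

```python
import re

# Hypothetical keyword lists; the project's real lists are not shown in this README.
CATEGORY_KEYWORDS = {
    "Shopping": {"buy", "purchase", "order", "groceries"},
    "Cleaning": {"clean", "wash", "vacuum", "tidy"},
    "Communication": {"call", "email", "message", "reply"},
    "Review": {"review", "check", "proofread"},
    "Work": {"report", "presentation", "meeting", "submit"},
    "Errand": {"pick", "drop", "deliver", "collect"},
}


def categorize(task: str, default: str = "Uncategorized") -> str:
    """Return the category with the most keyword hits, or `default` if none match."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    best, best_hits = default, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        # Strictly greater: on a tie, the category listed first wins.
        if hits > best_hits:
            best, best_hits = category, hits
    return best
```

Exact word matching like this misses inflected forms ("cleaning", "emailed"); stemming or lemmatizing the words first would be one way to widen coverage.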
- Text Cleaning: Remove punctuation and unwanted characters while preserving the original sentence for extraction.
- Tokenization: Split the text into sentences and words.
- POS Tagging: Use part-of-speech tagging to help identify imperative sentences and other key elements.
- Imperative Detection: Determine if a sentence is a command by checking if it begins with a base-form verb.
- Task Indicators: Look for phrases such as "has to", "should", "must", "needs to", and "ought to" to signal actionable tasks.
- Deadline Extraction: Apply regex patterns to extract deadline phrases from task sentences.
- Person Extraction: Use regex to capture the responsible person's name when it appears immediately before task indicators.
- Keyword-Based Categorization: Assign tasks to categories based on predefined keyword lists.
- Optional LDA Topic Modeling: Use LDA (Latent Dirichlet Allocation) to cluster tasks dynamically for additional insights.
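
The indicator, person, and deadline steps above can be sketched with the standard `re` module alone. This is a minimal illustration under stated assumptions: the function name `extract_task`, the returned dictionary keys, and the specific deadline patterns are hypothetical, and imperative detection is left out because it depends on a POS tagger.

```python
import re
from typing import Optional

# Indicator phrases taken from the list above.
TASK_INDICATORS = ["has to", "should", "must", "needs to", "ought to"]
_INDICATOR = "|".join(re.escape(p) for p in TASK_INDICATORS)
INDICATOR_RE = re.compile(rf"\b(?:{_INDICATOR})\b", re.IGNORECASE)

# A capitalized name (one or two words) immediately before an indicator.
PERSON_RE = re.compile(rf"\b([A-Z][a-z]+(?: [A-Z][a-z]+)?)\s+(?:{_INDICATOR})\b")

# Illustrative deadline patterns only; the project's actual regexes may differ.
DEADLINE_RE = re.compile(
    r"\bby\s+(?:"
    r"\d{1,2}(?::\d{2})?\s*(?:am|pm)(?:\s+(?:today|tomorrow))?"
    r"|(?:next\s+)?(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
    r"|today|tomorrow|tonight"
    r")\b",
    re.IGNORECASE,
)


def extract_task(sentence: str) -> Optional[dict]:
    """Return task details if the sentence contains a task indicator, else None."""
    if not INDICATOR_RE.search(sentence):
        return None
    person = PERSON_RE.search(sentence)
    deadline = DEADLINE_RE.search(sentence)
    return {
        "task": sentence,
        "person": person.group(1) if person else None,
        "deadline": deadline.group(0) if deadline else None,
    }
```

Because the person pattern only looks for a capitalized word before an indicator, a sentence-initial word such as "Everyone" would also be captured; filtering such cases is one reason a proper NER step may be preferable.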
The pipeline produces a structured list of tasks that includes:
- Task Sentence: The original sentence describing the task.
- Person: The extracted responsible individual (if detected).
- Deadline: The extracted deadline (if detected).
- Category: The assigned category based on keyword matching or topic modeling.
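
One possible shape for a single entry in that structured list is a small dataclass. The `Task` class and its field names below are assumptions chosen to mirror the four fields above; the project may represent tasks differently (for example, as plain dictionaries).

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Task:
    """Hypothetical record for one extracted task."""
    sentence: str                      # original sentence describing the task
    person: Optional[str] = None       # responsible individual, if detected
    deadline: Optional[str] = None     # deadline phrase, if detected
    category: str = "Uncategorized"    # assigned category


task = Task(
    sentence="John has to submit the report by next Monday.",
    person="John",
    deadline="by next Monday",
    category="Work",
)

# Convert to a plain dict, e.g. for display in a table.
record = asdict(task)
```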
- Install Dependencies:
  Run the following command in your project directory:
      pip install -r requirements.txt
- Run the Application:
  For a Streamlit interface, use:
      streamlit run app.py
  Alternatively, run the main script directly:
      python app.py
project/
├── app.py # Main application file (Streamlit interface)
├── categorization.py # For task categorization
├── preprocessing.py # For text cleaning, tokenization, and POS tagging
├── task_identification.py # For task extraction (imperative detection, deadlines, etc.)
├── utils.py # Utility functions
├── README.md
└── requirements.txt
- Heuristic Extraction: Fine-tuning was required to balance precision and recall.
- Deadline Parsing: Crafting regex for varied deadline formats proved challenging.
- Categorization: Combining keyword matching with LDA needed iterative adjustments.
- Integrate advanced NER for improved entity extraction.
- Enhance deadline parsing with more sophisticated techniques.
- Add further visualizations and interactive user feedback options.