Extracting and Categorizing Tasks from Unstructured Text

Overview

This project implements an NLP pipeline designed to extract actionable tasks from unstructured text and categorize them. The system identifies sentences containing actionable instructions and extracts key details such as the responsible person and deadlines. It then organizes the tasks into meaningful groups using a heuristic-based approach, with options for additional topic modeling.

Objectives

Task Extraction: Identify actionable sentences using imperative detection and task indicator phrases.
Entity Extraction: Extract the responsible person from task sentences when available.
Deadline Extraction: Capture deadline information (e.g., "by 5 pm today", "by next Monday") using robust regular expressions.
Task Categorization: Classify the extracted tasks into predefined categories (e.g., Shopping, Cleaning, Communication, Review, Work, Errand) using keyword matching and, optionally, LDA for dynamic topic clustering.

Components

Preprocessing

Text Cleaning: Remove punctuation and unwanted characters while preserving the original sentence for extraction.
Tokenization: Split the text into sentences and words.
POS Tagging: Use part-of-speech tagging to help identify imperative sentences and other key elements.
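
The cleaning and tokenization steps above can be sketched with the standard library alone. This is an illustrative sketch, not the code in preprocessing.py: the function names (`split_sentences`, `clean_sentence`, `preprocess`) are hypothetical, the regex sentence splitter is a stand-in for a proper tokenizer, and POS tagging is left out because it needs a tagger from an NLP library.

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean_sentence(sentence):
    """Lowercase the sentence and replace punctuation with spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", sentence.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def preprocess(text):
    """Return one record per sentence, keeping the original beside the cleaned form."""
    records = []
    for sentence in split_sentences(text):
        cleaned = clean_sentence(sentence)
        records.append(
            {"original": sentence, "cleaned": cleaned, "tokens": cleaned.split()}
        )
    return records


records = preprocess("Rahul has to clean the room by 5 pm today. Buy milk!")
```

Keeping the original sentence in each record matters because the person and deadline patterns depend on capitalization and punctuation that cleaning removes. A splitter this simple will break on abbreviations such as "p.m.", which is one reason to prefer a library tokenizer in practice.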

Task Identification

Imperative Detection: Determine if a sentence is a command by checking if it begins with a base-form verb.
Task Indicators: Look for phrases such as "has to", "should", "must", "needs to", and "ought to" to signal actionable tasks.
Deadline Extraction: Apply comprehensive regex patterns to extract deadlines from task sentences.
Person Extraction: Use regex to capture the responsible person's name when it appears immediately before task indicators.
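
The four checks above could fit together roughly as follows. This is a sketch under stated assumptions, not the code in task_identification.py: `BASE_VERBS` is a tiny hypothetical word list standing in for POS-based imperative detection, the deadline and person patterns are illustrative and far narrower than a real set would be, and all function names are made up for this example.

```python
import re

# Indicator phrases listed in this README.
TASK_INDICATORS = ("has to", "should", "must", "needs to", "ought to")
_IND = "|".join(TASK_INDICATORS)
INDICATOR_RE = re.compile(rf"\b(?:{_IND})\b", re.IGNORECASE)

# Assumption: a capitalized word directly before an indicator is the person.
PERSON_RE = re.compile(rf"\b([A-Z][a-z]+)\s+(?:{_IND})\b")

# Hypothetical stand-in for POS tagging: a few base-form verbs.
BASE_VERBS = {"buy", "clean", "send", "review", "call", "submit", "email", "finish"}

# Illustrative deadline pattern: "by" + clock time, weekday, or relative day.
DEADLINE_RE = re.compile(
    r"\bby\s+(?:"
    r"\d{1,2}(?::\d{2})?\s*(?:am|pm)(?:\s+(?:today|tomorrow))?"
    r"|(?:next\s+)?(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)"
    r"|today|tomorrow|tonight"
    r")\b",
    re.IGNORECASE,
)


def is_imperative(sentence):
    """Treat a sentence as a command if its first word is a known base verb."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return bool(words) and words[0].lower() in BASE_VERBS


def extract_task(sentence):
    """Return task details, or None if the sentence is not actionable."""
    if not (is_imperative(sentence) or INDICATOR_RE.search(sentence)):
        return None
    person = PERSON_RE.search(sentence)
    deadline = DEADLINE_RE.search(sentence)
    return {
        "sentence": sentence,
        "person": person.group(1) if person else None,
        "deadline": deadline.group(0) if deadline else None,
    }


task = extract_task("Rahul has to clean the room by 5 pm today.")
command = extract_task("Send the report by next Monday.")
```

Note that the person pattern is run on the original sentence, since it relies on capitalization. It will also pick up pronouns such as "You" in "You should call", so a real implementation would need a filter or a named-entity step.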

Task Categorization

Keyword-Based Categorization: Assign tasks to categories based on predefined keyword lists.
Optional LDA Topic Modeling: Use Latent Dirichlet Allocation (LDA) to cluster tasks dynamically for additional insights.
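
Keyword-based categorization can be as simple as counting keyword hits per category. The sketch below assumes hypothetical keyword lists (the project's actual lists are not shown in this README) and a hypothetical "Other" fallback for sentences that match nothing. It does not cover the optional LDA step.

```python
import re

# Hypothetical keyword lists for the categories named in this README.
CATEGORY_KEYWORDS = {
    "Shopping": {"buy", "purchase", "order", "groceries"},
    "Cleaning": {"clean", "wash", "tidy", "vacuum"},
    "Communication": {"call", "email", "send", "message"},
    "Review": {"review", "check", "proofread"},
    "Work": {"report", "meeting", "presentation", "submit"},
    "Errand": {"pick", "drop", "deliver", "collect"},
}


def categorize(sentence, default="Other"):
    """Return the category with the most keyword hits, or `default` if none match."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best, best_hits = default, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        # Strict comparison: on a tie, the category defined first wins.
        if hits > best_hits:
            best, best_hits = category, hits
    return best


category = categorize("Rahul has to clean the room by 5 pm today.")
```

Ties are resolved by dictionary order here, so "Send the report" lands in Communication rather than Work. That is an arbitrary choice for the sketch; weighting keywords or allowing multiple labels are reasonable alternatives.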

Output

The pipeline produces a structured list of tasks that includes:

Task Sentence: The original sentence describing the task.
Person: The extracted responsible individual (if detected).
Deadline: The extracted deadline (if detected).
Category: The assigned category based on keyword matching or topic modeling.
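
One way to represent this structure is a small dataclass with optional fields for the details that may not be detected. The `Task` class and its field names are assumptions made for illustration; the project may use plain dictionaries or a DataFrame instead.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Task:
    """One extracted task; fields are None when nothing was detected."""
    sentence: str
    person: Optional[str] = None
    deadline: Optional[str] = None
    category: Optional[str] = None


tasks = [
    Task(
        sentence="Rahul has to clean the room by 5 pm today.",
        person="Rahul",
        deadline="by 5 pm today",
        category="Cleaning",
    ),
    Task(sentence="Buy milk.", category="Shopping"),
]

# Serialize the list for display or export.
output = json.dumps([asdict(t) for t in tasks], indent=2)
print(output)
```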

How to Run

Install Dependencies:
Run the following command in your project directory:
```
pip install -r requirements.txt
```
Run the Application: For a Streamlit interface, use:
```
streamlit run app.py
```
Alternatively, run the main script directly:
```
python app.py
```

Directory Structure

project/
├── app.py                     # Main application file (Streamlit interface)
├── categorization.py          # For task categorization 
├── preprocessing.py           # For text cleaning, tokenization, and POS tagging
├── task_identification.py     # For task extraction (imperative detection, deadlines, etc.)
├── utils.py                   # Utility functions 
├── README.md                 
└── requirements.txt

Challenges and Insights

Heuristic Extraction: The extraction rules needed repeated tuning to balance precision and recall.
Deadline Parsing: Crafting regex for varied deadline formats proved challenging.
Categorization: Combining keyword matching with LDA needed iterative adjustments.

Future Improvements

Integrate advanced NER for improved entity extraction.
Enhance deadline parsing with more sophisticated techniques.
Add further visualizations and interactive user feedback options.
