Automated Analysis Toolkit for Technician Commitment Documents
A comprehensive Python-based natural language processing (NLP) toolkit for extracting, categorizing, and analyzing action items from institutional commitment documents. Developed for analysis of UK universities' Technician Commitment action plans and progress reports.
This toolkit accompanies the paper:
"Synergies and Gaps in Support for Technical Career Development in UK Higher Education and Research: A Semi-Quantitative Analysis of ‘Technician Commitment’ Action Plans and Progress Reports using Natural Language Processing"
Dr. Samuel J Jackson, August 2025
TCDOC_EXTRACT provides a complete workflow for:
- 📄 Extracting action items from PDF and Google Doc formats
- 🏷️ Categorizing actions using weighted keyword matching
- 📊 Analyzing institutional progress through RAG (Red/Amber/Green) assessments
- 🔗 Tracking action evolution across multiple time periods
- 📈 Visualizing institutional commitment patterns and trends
- Google Gemini 2.5 language model for intelligent action extraction
- Handles diverse document formats (PDF, Google Docs, Word)
- Preserves semantic completeness of actions
- 26 predefined thematic categories
- Weighted keyword lexicon with 500+ domain-specific terms
- Sub-category classification for nuanced analysis
- NLTK-based lemmatization and POS tagging
- Three-tier action matching strategy (TF-IDF + AI semantic analysis)
- Trajectory classification (Continued, Related, Stopped, New)
- Cross-period relationship detection (Identical, Extended, Narrowed)
- 7 derived institutional metrics
- Checkpoint-based resumability for large datasets
- Automatic rate limiting and quota management
- Batch processing optimization (10x API cost reduction)
- Comprehensive error handling
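The within-period TF-IDF matching idea can be sketched with scikit-learn. This is an illustrative reimplementation, not the actual logic of rag_alignment_workflow_unpaired8.py; the function name and threshold value are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_actions(plan_actions, report_actions, threshold=0.5):
    """Pair each progress-report action with its closest action-plan
    action by TF-IDF cosine similarity (illustrative sketch only)."""
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(plan_actions + report_actions)
    plan_vecs = tfidf[:len(plan_actions)]
    report_vecs = tfidf[len(plan_actions):]
    sims = cosine_similarity(report_vecs, plan_vecs)
    matches = []
    for i, row in enumerate(sims):
        j = int(row.argmax())
        # below-threshold pairs are left unmatched (None)
        best = j if row[j] >= threshold else None
        matches.append((i, best, float(row[j])))
    return matches
```

Pairs that fall below the threshold are handed on to the AI semantic-analysis tier rather than being forced into a weak lexical match.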
# Clone repository
git clone https://github.com/Sammjjj/TCDOC_EXTRACT.git
cd TCDOC_EXTRACT
# Install dependencies
pip install -r requirements.txt
# Download NLTK data
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet stopwords
See INSTALL.md for detailed setup including Google Cloud configuration.
python document_extract2.py
Output: actions_output2.csv - Extracted action items
python predefined_categories6.py
Input: actions_output2.csv
Output: categorized_actions_output.csv - Actions with category assignments
# Extract RAG data
python rag_extract_simple12.py
# Align actions across periods
python rag_alignment_workflow_unpaired8.py
# Generate visualizations
python create_graphical_summary_multi_institution.py
Output: Matched actions with trajectory analysis and PNG visualizations
| Document | Description |
|---|---|
| INSTALL.md | Complete installation guide with Google Cloud setup |
| API_REFERENCE.md | Detailed technical documentation for all modules |
| Methods.txt | Methodological background and analytical approach |
| LICENSE | Apache 2.0 license terms |
┌─────────────────────────────────────────────────────────────┐
│ DATA ACQUISITION │
│ │
│ Google Drive ──→ document_extract2.py │
│ rag_extract_simple12.py │
└─────────────────────────────┬───────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ CATEGORIZATION │
│ │
│ predefined_categories6.py ──→ 26 Categories │
│ rule_based_class.py ──→ Sub-categories │
└─────────────────────────────┬───────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ LONGITUDINAL ANALYSIS │
│ │
│ rag_alignment_workflow_unpaired8.py: │
│ • Within-period matching (TF-IDF) │
│ • Cross-period matching (Gemini AI) │
│ • Trajectory classification │
│ • Derived metrics │
└─────────────────────────────┬───────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ VISUALIZATION & REPORTING │
│ │
│ create_graphical_summary_multi_institution.py │
│ create_comprehensive_summary_multi_institution.py │
└─────────────────────────────────────────────────────────────┘
| Script | Purpose | Input | Output |
|---|---|---|---|
| document_extract2.py | Extract actions from documents | Google Drive folder | actions_output2.csv |
| predefined_categories6.py | Categorize actions | Action CSV | categorized_actions.csv |
| evaluate_multi_col.py | Evaluate categorization performance | Predictions + ground truth | Performance metrics |
| Script | Purpose | Input | Output |
|---|---|---|---|
| subclus_calc_weight.py | TF-IDF keyword analysis | Categorized actions | keyword_analysis.txt |
| rule_based_class.py | Assign sub-categories | Categorized actions | Actions with sub-categories |
| Script | Purpose | Input | Output |
|---|---|---|---|
| rag_extract_simple12.py | Extract RAG assessments | RAG documents | rag_actions_extraction.csv |
| rag_alignment_workflow_unpaired8.py | Match actions across periods | RAG extraction | rag_actions_extraction_MATCHED.csv |
| Script | Purpose | Input | Output |
|---|---|---|---|
| create_graphical_summary_multi_institution.py | Generate category flow diagrams | Matched actions | PNG visualizations |
| create_comprehensive_summary_multi_institution.py | Create text-based summary tables | Matched actions | Text report |
| Script | Purpose | Input | Output |
|---|---|---|---|
| enumerate_resources.py | Find resource mentions | Actions + resource list | Resource counts |
Defines the 26-category taxonomy with weighted keyword lists.
Structure:
{
"Category Name": {
"keyword phrase": weight,
"another phrase": weight
}
}
Categories:
- Apprenticeships, Internships, Placements
- Career Frameworks & Role Definition
- Career Pathways & Progression
- Data & Workforce Analysis
- EDI (Equality, Diversity & Inclusion)
- External Collaboration & Partnerships
- Funding for Technicians
- Mentorship & Support
- Monitoring & Evaluation of TC
- Networking and Presenting
- Ongoing Visibility & Communication
- Professional Registration & Accreditation
- Recognition & Awards
- Recruitment & Onboarding
- Representation in Institutional Governance
- Technician Leadership
- Technician Voice & Feedback
- Training & Skills Development
- ...and 8 more
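The weighted-keyword matching behind predefined_categories6.py can be illustrated with a tiny lexicon in the same shape as keyword_categories_4.json. The phrases, weights, and threshold below are invented for the example, and the real script additionally applies NLTK lemmatization and POS tagging before matching.

```python
import json
import re

def categorize(action, keyword_categories, threshold=1.0):
    """Score an action against each category's weighted keyword list and
    return every category whose summed weights reach the threshold
    (simplified sketch of the weighted-keyword approach)."""
    text = action.lower()
    scores = {}
    for category, keywords in keyword_categories.items():
        score = sum(weight for phrase, weight in keywords.items()
                    if re.search(r"\b" + re.escape(phrase) + r"\b", text))
        if score >= threshold:
            scores[category] = score
    # fall through to the Uncategorised bucket if nothing scores
    return scores or {"Uncategorised": 0.0}

# hypothetical mini-lexicon; the real file holds 500+ weighted terms
lexicon = {
    "Professional Registration & Accreditation": {
        "professional registration": 2.0, "registered technician": 1.5},
    "Recognition & Awards": {"award": 1.5, "celebrate": 1.0},
}
```

In the full pipeline the lexicon would be loaded with `json.load` from keyword_categories_4.json rather than defined inline.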
OAuth2 credentials for Google Drive API access. Not included - you must create this via Google Cloud Console.
See INSTALL.md for setup instructions.
Document Name,Extracted Action,Training & Skills Development,Career Pathways & Progression,Recognition & Awards,Uncategorised
Bristol,Provide access to professional development workshops,1,0,0,0
Oxford,Implement a technician career progression framework,0,1,0,0
Cambridge,Establish an annual technician excellence award,0,0,1,0
LineID,Institution,Period,Action,RAG value,Category,Match AP vs RAG_AP,Match RAG_AP vs AP2,Trajectory
1,Bristol,1,Support professional registration,G,Professional Registration,2 (Identical; 0.98),15 (Extended; 0.85),Continued
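The match columns pack three pieces of information (matched line ID, relationship type, similarity score) into one cell. A small helper, with the format inferred from the sample row above, pulls them apart for downstream analysis:

```python
import re

# e.g. "2 (Identical; 0.98)" -> line id, relationship, similarity
MATCH_RE = re.compile(r"(\d+)\s*\((\w+);\s*([\d.]+)\)")

def parse_match(cell):
    """Split a match cell into (line id, relationship, score);
    returns None for empty or unmatched cells."""
    m = MATCH_RE.match(cell)
    if not m:
        return None
    return int(m.group(1)), m.group(2), float(m.group(3))
```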
- Growth Rate: 45.2% (Period 1 → Period 3)
- Continuation Rate: 78.3% (actions sustained)
- Trajectory Diversity: 0.65 (moderate variation)
- Longevity Score: 62.1% (Period 1 → Period 3)
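As an illustration of how such metrics fall out of the matched CSV, a continuation rate can be computed in a few lines of pandas. The column names follow the sample row shown above, and the exact formula here (share of Period-1 actions whose trajectory is "Continued") is an assumption, not the script's definition.

```python
import pandas as pd

def continuation_rate(df):
    """Percentage of Period-1 actions marked 'Continued'
    (illustrative; column names taken from the sample output)."""
    period1 = df[df["Period"] == 1]
    return 100 * (period1["Trajectory"] == "Continued").mean()

# toy data in the shape of rag_actions_extraction_MATCHED.csv
df = pd.DataFrame({
    "Period": [1, 1, 1, 2],
    "Trajectory": ["Continued", "Continued", "Stopped", "New"],
})
```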
Organize documents in Google Drive with consistent naming:
Root Folder/
├── Bristol-2021-ActionPlan.pdf
├── Bristol-2022-RAG.pdf
├── Oxford-2021-ActionPlan.docx
├── Oxford-2022-RAG.docx
└── ...
Naming Convention:
{Institution}-{Year}-{DocType}.{ext} or {Year}-{Institution}-{DocType}.{ext}
- DocType: ActionPlan, RAG, AP1, RAG_AP1, etc.
For categorization scripts, the input CSV should have:
- Action column: Text of each action item
- Optionally: Institution, Document Name, etc.
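A quick way to fail fast on a malformed input is to check for the Action column before processing; the helper below is a hypothetical convenience, not part of the toolkit.

```python
import pandas as pd
from io import StringIO

REQUIRED = "Action"

def load_actions(path_or_buffer):
    """Load an input CSV and raise immediately if the Action column
    is missing (illustrative validation helper)."""
    df = pd.read_csv(path_or_buffer)
    if REQUIRED not in df.columns:
        raise ValueError(
            f"Expected a '{REQUIRED}' column, found: {list(df.columns)}")
    return df

# in-memory stand-in for a real input file
sample = StringIO("Institution,Action\nBristol,Provide CPD workshops\n")
df = load_actions(sample)
```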
Based on ground truth evaluation (n=341 actions):
| Metric | Average |
|---|---|
| Sensitivity (Recall) | 0.82 |
| Specificity | 0.95 |
See evaluate_multi_col.py output for per-category metrics.
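For reference, sensitivity and specificity over a single category's binary indicator columns reduce to a few lines. This is a generic sketch of the metric definitions, not the evaluate_multi_col.py code.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (recall) and specificity from parallel 0/1 lists:
    TP/(TP+FN) and TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)
```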
| Task | Time (approx) |
|---|---|
| Extract 100 actions | 15 min |
| Categorize 3,410 actions | 3 sec |
| Full RAG workflow | 2-3 hours |
| Generate visualizations | 30 sec/institution |
Note: RAG workflow time depends on API quotas and dataset size.
If rag_alignment_workflow_unpaired8.py is interrupted:
python rag_alignment_workflow_unpaired8.py
# Prompts: "Resume from checkpoint? (y/n)"
Workflow resumes from last saved state.
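The checkpoint pattern itself is simple: persist the index of the last completed item so an interrupted run can pick up where it left off. The sketch below shows the pattern in miniature; the file name and checkpoint format are invented, not those of the actual workflow.

```python
import json
import os

CHECKPOINT = "checkpoint_demo.json"  # illustrative name only

def process_all(items, process_one):
    """Resume-safe loop: record progress after every item so a crash
    loses at most the item in flight (sketch of the pattern)."""
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_index"]
    results = []
    for i in range(start, len(items)):
        results.append(process_one(items[i]))
        with open(CHECKPOINT, "w") as f:
            json.dump({"next_index": i + 1}, f)
    os.remove(CHECKPOINT)  # clean finish: nothing left to resume
    return results
```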
To add/modify categories:
- Edit keyword_categories_4.json - add the category with weighted keywords
- Re-run predefined_categories6.py
Example:
{
"My New Category": {
"specific keyword": 2.0,
"related term": 1.5,
"common phrase": 1.0
}
}
The RAG workflow automatically processes all institutions in the input CSV. Results are institution-specific.
Outputs are CSV (comma-separated values) for easy import into:
- Excel / Google Sheets
- R / Python (pandas)
- Statistical software (SPSS, Stata)
- Database systems
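For example, tallying actions per category from the categorization output takes two lines of pandas. The snippet uses an in-memory copy of the sample rows shown earlier in place of the real file.

```python
import pandas as pd
from io import StringIO

# fragment in the shape of categorized_actions_output.csv
csv = StringIO(
    "Document Name,Extracted Action,Training & Skills Development,"
    "Career Pathways & Progression,Recognition & Awards,Uncategorised\n"
    "Bristol,Provide access to professional development workshops,1,0,0,0\n"
    "Oxford,Implement a technician career progression framework,0,1,0,0\n"
)
df = pd.read_csv(csv)
counts = df.iloc[:, 2:].sum()  # one indicator column per category
```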
Solution: Download credentials.json from Google Cloud Console. See INSTALL.md.
Solution:
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet stopwords
Solution: Wait 1 minute for quota reset. Script has automatic rate limiting.
Solution: Check your CSV has the expected column name. Edit ACTION_COLUMN_NAME in script if needed.
For detailed logging, add to script:
import logging
logging.basicConfig(level=logging.DEBUG)
This is a research project accompanying a published paper. While primarily for reproducibility, improvements are welcome:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
Focus areas for contribution:
- Additional document format support
- Performance optimizations
- Multilingual support
- Alternative categorization approaches
If you use this toolkit in your research, please cite:
@article{jackson2025technician,
  title={Synergies and Gaps in Support for Technical Career Development in UK
         Higher Education and Research: A Semi-Quantitative Analysis of
         'Technician Commitment' Action Plans and Progress Reports using
         Natural Language Processing},
  author={Jackson, Samuel J},
  year={2025},
  month={August}
}
Code Repository:
https://github.com/Sammjjj/TCDOC_EXTRACT
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Technician Commitment: Data source for university action plans
- Google Cloud Platform: Gemini AI and Drive API
- NLTK Project: Natural language processing tools
- scikit-learn: Machine learning utilities
- UK Research Community: Feedback and validation
- Technician Commitment Website
- TALENT Commission Report
- Google Gemini Documentation
- NLTK Documentation
Author: Dr. Samuel J Jackson
For questions about:
- Methodology: See Methods.txt
- Installation: See INSTALL.md
- API Usage: See API_REFERENCE.md
- Research: Contact via GitHub issues
- v1.0 (August 2025): Initial public release
- Complete analysis pipeline
- 26-category taxonomy
- RAG trajectory analysis
- Multi-institution support
- Checkpoint resumability
✅ Stable - Published research toolkit
This codebase is the finalized version accompanying the published paper. It is maintained for reproducibility and community use.
Last Updated: December 2025