This project provides a framework for text classification using:
- Extract-Combine Processing: A functional MapReduce-like structure.
- Naïve Bayes Classifier: A statistical model for categorizing text.
- Word Frequency Analysis: A module to count occurrences of words in documents.
Key Features:
- Supports parallel processing using extract-combine.
- Implements a trainable classifier for document categorization.
- Provides a modular structure, so users can plug in their own dataset.
Intended Audience:
- Anyone studying functional programming
- Researchers working on text classification
- Developers needing a modular SML framework for document analysis
Extract-Combine Processing:
- Implements MapReduce-style parallelism.
- Extracts word frequencies and combines them efficiently.
- Useful for large-scale text processing (word counting, tokenization).
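
The extract-then-combine idea can be illustrated with a minimal sequential sketch. It is not the project's `ExtractCombine` structure: the names `insertWith`, `extractCombine`, and `wordsOf` are hypothetical, and plain lists with an association list stand in for whatever sequence and dictionary types `extractcombine.sml` and `dict.sml` actually use.

```sml
(* Hypothetical sketch; not the API of extractcombine.sml. *)

(* Insert (k, v) into an association list, merging values with
   `combine` when the key is already present. *)
fun insertWith combine ((k : string, v), acc) =
  case acc of
    [] => [(k, v)]
  | (k', v') :: rest =>
      if k = k' then (k, combine (v', v)) :: rest
      else (k', v') :: insertWith combine ((k, v), rest)

(* Run `extract` on every document, then merge all resulting
   (key, value) pairs by key using `combine`. *)
fun extractCombine extract combine docs =
  List.foldl (insertWith combine) [] (List.concat (List.map extract docs))

(* Usage: word counting. Each word is extracted as (word, 1),
   and counts for the same word are combined with +. *)
fun wordsOf (doc : string) : (string * int) list =
  List.map (fn w => (w, 1)) (String.tokens Char.isSpace doc)

val counts = extractCombine wordsOf (op +) ["a b a", "b c"]
(* counts = [("a", 2), ("b", 2), ("c", 1)] *)
```

Because `combine` is only ever applied to two values for the same key, an associative and commutative `combine` (such as `+`) would let the extract step run on documents independently, which is what makes the pattern a candidate for parallel execution.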
Naïve Bayes Classifier:
- Uses Bayes' Theorem to categorize documents.
- Learns from labeled training data and predicts document categories.
- Can be extended for spam detection and sentiment analysis.
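
As a rough illustration of how a Naïve Bayes classifier scores a document, the sketch below picks the category that maximizes log P(category) plus the sum of log P(word | category). It is not the code in `classify.sml`: the names `example`, `score`, `classify`, and `trainingSet` are hypothetical, and add-one (Laplace) smoothing is an assumption of this sketch, not something the project is known to use.

```sml
(* Hypothetical sketch; not the implementation in classify.sml.
   Add-one (Laplace) smoothing is an assumption made here. *)

type example = string * string list  (* (category, words of the document) *)

(* Number of occurrences of w in ws. *)
fun count (w : string) (ws : string list) : int =
  List.length (List.filter (fn x => x = w) ws)

(* Remove duplicates, keeping first occurrences. *)
fun dedup [] = []
  | dedup (x :: xs) = x :: dedup (List.filter (fn y => y <> x) xs)

(* log P(cat) + sum over words of log P(word | cat). *)
fun score (train : example list) (doc : string list) (cat : string) : real =
  let
    val inCat = List.filter (fn (c, _) => c = cat) train
    val catWords = List.concat (List.map #2 inCat)
    val vocabSize = List.length (dedup (List.concat (List.map #2 train)))
    val prior = Math.ln (real (List.length inCat) / real (List.length train))
    fun logLikelihood w =
      Math.ln (real (count w catWords + 1)
               / real (List.length catWords + vocabSize))
  in
    List.foldl (fn (w, acc) => acc + logLikelihood w) prior doc
  end

(* Return the category with the highest score. *)
fun classify (train : example list) (doc : string list) : string =
  let
    val cats = dedup (List.map #1 train)
    fun better (c, (bestCat, bestScore)) =
      let val s = score train doc c
      in if s > bestScore then (c, s) else (bestCat, bestScore) end
  in
    #1 (List.foldl better ("", Real.negInf) cats)
  end

(* Usage with a tiny, made-up training set. *)
val trainingSet : example list =
  [("spam", ["buy", "cheap", "pills"]),
   ("spam", ["cheap", "offer"]),
   ("ham",  ["meeting", "at", "noon"]),
   ("ham",  ["lunch", "at", "noon"])]

val predicted = classify trainingSet ["cheap", "pills"]
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word unseen in one category from forcing that category's probability to zero.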
Word Frequency Analysis:
- Reads documents and counts word occurrences.
- Can be used as a preprocessing step for classifiers.
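
A minimal sketch of this kind of counting, using only the Standard ML Basis Library, might look like the following. The names `tokenize`, `bump`, `wordFreq`, and `wordFreqFile` are hypothetical; the `WordFreq` structure in `wordfreq.sml` may tokenize differently and store counts in another data structure.

```sml
(* Hypothetical sketch; WordFreq in wordfreq.sml may differ. *)

(* Split on any non-letter character and lowercase each word. *)
fun tokenize (text : string) : string list =
  List.map (String.map Char.toLower)
           (String.tokens (fn c => not (Char.isAlpha c)) text)

(* Increment the count for w in an association list,
   appending (w, 1) if w has not been seen yet. *)
fun bump (w : string, []) = [(w, 1)]
  | bump (w, (w', n) :: rest) =
      if w = w' then (w', n + 1) :: rest
      else (w', n) :: bump (w, rest)

(* Count word occurrences in a string. *)
fun wordFreq (text : string) : (string * int) list =
  List.foldl bump [] (tokenize text)

(* Read a whole file and count its words. *)
fun wordFreqFile (path : string) : (string * int) list =
  let
    val ins = TextIO.openIn path
    val text = TextIO.inputAll ins
    val () = TextIO.closeIn ins
  in
    wordFreq text
  end

val freqs = wordFreq "The cat saw the dog. The end!"
(* freqs = [("the", 3), ("cat", 1), ("saw", 1), ("dog", 1), ("end", 1)] *)
```

The resulting (word, count) pairs are the kind of feature a classifier can consume directly, which is why this step is useful as preprocessing.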
/text-classification-framework
│── /src/
│ ├── classify.sml # Naïve Bayes classifier implementation
│ ├── dict.sml # Dictionary structure for classifier
│ ├── extractcombine.sml # Extract-Combine framework
│ ├── wordfreq.sml # Word frequency counter
│ ├── sources-ec.cm # Compilation Manager file for Extract-Combine
│ ├── sources-classify-seq.cm # Compilation Manager file for Classifier
│── README.md # Documentation
│── .gitignore
│── LICENSE
│── Makefile
- classify.sml → Implements the Naïve Bayes classification model.
- dict.sml → A dictionary structure to store word frequencies.
- extractcombine.sml → Defines a functional Extract-Combine (MapReduce) framework.
- wordfreq.sml → Extracts and counts word frequencies from text data.
The following files were provided by Professor Dan Licata and are included for completeness:
- sequtils.sig
- mapreduce.sig
- mapreduce.sml
- sequtils.sml
- filemr.sml
- extractcombine.sig
- classify.sig
- testclassify.sml
- testclassify-seq.sml
These files serve as a foundation for the classification framework but were not authored by Marouan El-Asery.
git clone https://github.com/Melasery/text-classification-framework.git
cd text-classification-framework

To compile and load the framework:

CM.make "sources-ec.cm"; (* Extract-Combine Processing *)
CM.make "sources-classify-seq.cm"; (* Naïve Bayes Classifier *)

Example session at the SML/NJ prompt:

- CM.make "sources-ec.cm";
- WordFreq.test();

This project is licensed under the MIT License; see the LICENSE file for details.