- Python
- NLTK 3.5
- Scikit-learn 0.24.2
- License: MIT
In an era of information overload, organizing news efficiently is crucial. This project automates the classification of Spanish-language news articles using Natural Language Processing (NLP) and Machine Learning. By implementing text preprocessing, vectorization, and classification models, the system categorizes articles into six key areas: Culture, Sports, Economy, Spain, International, and Society. This approach could be applied in media platforms, content recommendation systems, or editorial workflows to streamline content organization and improve user experience.
- Advanced text preprocessing: tokenization, stopword removal, and stemming
- Text vectorization using TF-IDF
- Implementation of two classification models: Logistic Regression and Random Forest
- Detailed model evaluation with precision, recall, and F1-score metrics
- Results visualization through confusion matrices
Classifying news articles automatically is more than a technical challenge: it directly affects user experience and editorial efficiency. Accurate classification allows media platforms to:
- Recommend more relevant content to users, enhancing their experience.
- Streamline internal workflows by automatically tagging and categorizing articles.
- Gain insights into content distribution across different categories over time.
This project bridges NLP techniques and practical content strategy, showing how language technology can enhance real-world communication.
The dataset used in this project was obtained through the free version of NewsAPI and is hosted on GitHub.
- Data source: NewsAPI (free version)
- Dataset location: GitHub Repository
- Python 3.7+
- Pandas: for data manipulation
- NLTK: for natural language processing
- Scikit-learn: for modeling and evaluation
- Seaborn and Matplotlib: for results visualization
git clone https://github.com/BeaEsparcia/Spanish_News_Classification.git
cd Spanish_News_Classification
If you have Jupyter Notebook installed, open the notebook with: jupyter notebook Spanish_News_Classification.ipynb
Alternatively, you can run it directly in Google Colab:
- Conversion to lowercase
- Punctuation removal
- Tokenization
- Stopword removal for Spanish
- Stemming using SnowballStemmer for Spanish
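The five preprocessing steps above can be sketched as a single function. This is a minimal illustration rather than the notebook's actual code: the `preprocess` function, the tiny `SPANISH_STOPWORDS` set, and the whitespace tokenizer are assumptions made so the example runs with the standard library alone. Stemming is left as an optional callable so that a real stemmer can be plugged in.

```python
import string

# Assumption: a tiny illustrative stopword set. The project itself relies on
# NLTK's Spanish stopword list, which is much longer.
SPANISH_STOPWORDS = {
    "el", "la", "los", "las", "de", "del", "en", "y", "a", "que",
    "un", "una", "por", "con", "para", "se", "su", "al",
}

# string.punctuation is ASCII-only, so the Spanish opening marks and
# guillemets are added explicitly.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "¿¡«»")


def preprocess(text, stopwords=SPANISH_STOPWORDS, stem=None):
    """Lowercase, strip punctuation, tokenize, drop stopwords, optionally stem."""
    text = text.lower()                                  # 1. lowercase
    text = text.translate(PUNCT_TABLE)                   # 2. punctuation removal
    tokens = text.split()                                # 3. whitespace tokenization
    tokens = [t for t in tokens if t not in stopwords]   # 4. stopword removal
    if stem is not None:                                 # 5. optional stemming
        tokens = [stem(t) for t in tokens]
    return tokens


example = preprocess("¡El Gobierno aprueba los nuevos presupuestos!")
print(example)
```

To add the stemming step with NLTK, pass the stemmer's method as the `stem` argument, for example `preprocess(text, stem=SnowballStemmer("spanish").stem)` after `from nltk.stem import SnowballStemmer`.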
- Using the TF-IDF Vectorizer to convert text into numerical features
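To make the TF-IDF weighting concrete, here is a hand-rolled sketch of the calculation. It is not the project's code, which uses scikit-learn's `TfidfVectorizer`; the `tfidf` function and the toy token lists are hypothetical. The formula follows the smoothed IDF and L2 normalization that scikit-learn applies by default.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalize so that document length does not dominate the weights.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy, made-up token lists for illustration only.
docs = [["gol", "liga", "gol"], ["liga", "economia"]]
vectors = tfidf(docs)
```

In the first document, "gol" receives a higher weight than "liga" because it occurs twice there and appears in no other document, whereas "liga" is shared by both.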
- Logistic Regression: an efficient linear model for classification problems.
- Random Forest Classifier: an ensemble model that trains many decision trees on random subsets of the data and features, then combines their votes to reduce overfitting.
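The sketch below shows one way the vectorizer and the two classifiers could be combined with scikit-learn. The toy headlines, the two-label setup, and the hyperparameters (`max_iter=1000`, `n_estimators=100`, `random_state=0`) are assumptions for illustration and are not taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up headlines for illustration only (not the project's dataset).
texts = [
    "el equipo gana la liga con un gol en el último minuto",
    "el delantero marca dos goles en la final de copa",
    "el banco central sube los tipos de interés",
    "la inflación frena el crecimiento de la economía",
]
labels = ["Sports", "Sports", "Economy", "Economy"]

# Each pipeline vectorizes the raw text with TF-IDF, then fits a classifier.
models = {
    "logistic_regression": make_pipeline(
        TfidfVectorizer(),
        LogisticRegression(max_iter=1000),
    ),
    "random_forest": make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    # Predict the category of one unseen headline.
    predictions[name] = list(model.predict(["el portero para un penalti en la liga"]))
    print(name, predictions[name])
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is learned only from the training data passed to `fit`, which helps avoid leaking test-set information during evaluation.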
The models were evaluated using the following metrics:
- Accuracy: Measures the percentage of correct predictions out of the total predictions.
- Precision: Measures the proportion of true positives among all predicted positives.
- Recall: Measures the proportion of true positives among all actual positives.
- F1-Score: The harmonic mean of precision and recall, which is high only when both are high.
- Confusion Matrix: Provides a detailed view of true positives, false positives, true negatives, and false negatives for each category.
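As a worked example of these definitions, the following sketch computes accuracy and per-class precision, recall, and F1 from a small set of made-up labels. The `per_class_metrics` helper and the sample predictions are hypothetical and exist only to show the arithmetic.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} from paired label lists."""
    # counts[(true, predicted)] holds the cells of the confusion matrix.
    counts = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        tp = counts[(label, label)]
        # False positives: other classes predicted as this label.
        fp = sum(counts[(other, label)] for other in labels if other != label)
        # False negatives: this label predicted as some other class.
        fn = sum(counts[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels for illustration only.
y_true = ["Sports", "Sports", "Economy", "Culture"]
y_pred = ["Sports", "Economy", "Economy", "Culture"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_metrics(y_true, y_pred)
```

Here one Sports article is misclassified as Economy, so Sports keeps perfect precision but loses recall, while Economy keeps perfect recall but loses precision. In practice, `classification_report` and `confusion_matrix` from `sklearn.metrics` provide the same figures without custom code.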
The news classification model showed acceptable performance, with room for improvement. Some potential areas for future development include:
- Expanding the dataset with more news sources
- Experimenting with more advanced models such as neural networks
- Conducting a deeper analysis of the key features for classification
This project allowed me to deepen my understanding of text preprocessing for Spanish-language data, which presents unique challenges compared to English datasets. Working with noisy real-world data gave me a clearer perspective on the importance of data quality and iterative model tuning. Additionally, the project strengthened my ability to evaluate models beyond accuracy, using more nuanced metrics like precision, recall, and F1-score.
Finally, this project reaffirmed my passion for NLP applied to real-world content management problems, bridging technology, language, and user experience. These are exactly the kinds of challenges I want to work on in my career.
Contributions are welcome. Please open an issue first to discuss any changes you'd like to make.
This project is licensed under the MIT License. See the LICENSE file for more details.
Bea Esparcia - esparcia.beatriz@gmail.com - www.linkedin.com/in/beaesparcia

