A scalable microservices-based system for ingesting, processing, and retrieving news articles with advanced search capabilities, trending analysis, and LLM-powered features.
The system consists of several microservices:
- News Retrieval Service (Java/Spring Boot)
  - Core service handling news article retrieval and search
  - Provides RESTful APIs for querying news articles
  - Integrates with PostgreSQL for data storage
  - Uses Redis for caching and trending analysis
  - Kafka for event streaming
- LLM Service (Python/FastAPI)
  - Processes natural language queries
  - Generates article summaries
  - Uses Google's Gemini model for NLP tasks
- Ingest Service (Python)
  - Handles data ingestion from JSON files
  - Processes and validates news article data
  - Loads data into PostgreSQL database
- Trending Analysis Service
  - Processes user events via Kafka
  - Updates trending scores in Redis
  - Provides geospatial-aware trending articles
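The Ingest Service's validation step could look roughly like the sketch below. It mirrors the NOT NULL and CHECK constraints of the `news_articles` schema shown later in this document; the function name `validate_article` and the error messages are illustrative assumptions, not the service's actual code.

```python
from datetime import datetime

# Columns declared NOT NULL in the news_articles schema (id and created_at excluded).
REQUIRED_FIELDS = ("title", "url", "publication_date", "source_name")


def _out_of_range(value, low, high):
    """True when value is present but not a number within [low, high]."""
    if value is None:
        return False
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return True
    return not (low <= value <= high)


def validate_article(article: dict) -> list[str]:
    """Hypothetical validator: returns a list of problems, empty if the record is valid."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not article.get(field):
            errors.append(f"missing required field: {field}")
    if _out_of_range(article.get("relevance_score"), 0, 1):
        errors.append("relevance_score must be between 0 and 1")
    if _out_of_range(article.get("latitude"), -90, 90):
        errors.append("latitude must be between -90 and 90")
    if _out_of_range(article.get("longitude"), -180, 180):
        errors.append("longitude must be between -180 and 180")
    date = article.get("publication_date")
    if date:
        try:
            # On Python 3.10 fromisoformat() rejects a trailing "Z"; use "+00:00".
            datetime.fromisoformat(date)
        except (TypeError, ValueError):
            errors.append("publication_date is not an ISO 8601 timestamp")
    return errors
```

Returning a list of problems rather than raising on the first one lets an ingest run report every issue in a record at once.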
- Natural language query processing
- Support for multiple search intents:
  - Category-based search
  - Source-based filtering
  - Geospatial search (nearby articles)
  - Relevance score filtering
  - Full-text search
- LLM-powered query understanding
- User event tracking
- Geospatial trending analysis
- Time-decay based scoring
- Redis-backed caching
- Kafka-powered event processing
- Automatic article summarization
- Category classification
- Relevance scoring
- Geospatial tagging
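As an illustration of the time-decay based scoring listed above, the sketch below uses exponential decay with a half-life. The event weights, the one-hour half-life, and the name `decayed_score` are assumptions chosen for the example; the project's actual formula is not documented here.

```python
# Hypothetical per-event weights; the real service may weigh events differently.
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "share": 3.0}


def decayed_score(events, now, half_life_s=3600.0):
    """Sum event weights, halving each event's contribution every half_life_s seconds.

    events: iterable of (event_type, unix_timestamp) pairs.
    now: current time as a unix timestamp.
    """
    total = 0.0
    for event_type, timestamp in events:
        age = max(0.0, now - timestamp)  # clamp events stamped in the future
        weight = EVENT_WEIGHTS.get(event_type, 1.0)
        total += weight * 0.5 ** (age / half_life_s)
    return total
```

With these assumed weights, a share that just happened counts 3.0, while a view from one hour ago counts 0.5.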
- Backend Services:
  - Java 21
  - Spring Boot
  - Python 3.10
  - FastAPI
- Databases:
  - PostgreSQL with PostGIS extension
  - Redis for caching and trending analysis
- Message Queue:
  - Apache Kafka
- AI/ML:
  - Google Gemini for NLP tasks
sequenceDiagram
participant C as Client
participant NS as News Service
participant LLM as LLM Service
participant DB as PostgreSQL
participant R as Redis
participant K as Kafka
C->>NS: GET /api/v1/news/search?query=text
NS->>LLM: POST /process-query
LLM-->>NS: Intent & Entities
NS->>DB: Query Articles
DB-->>NS: Raw Articles
NS->>LLM: POST /summarize
LLM-->>NS: Article Summaries
NS->>K: Publish User Event
NS-->>C: Enriched Articles
Note over K,R: Async Processing
K->>R: Update Trending Scores
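The synchronous part of the flow above can be sketched as a single function. The four collaborators are passed in as plain callables so the ordering is visible without real HTTP, database, or Kafka clients; the function name, the event payload shape, and the `summary` key are assumptions made for this example.

```python
def search_flow(query, process_query, query_articles, summarize, publish_event):
    """Hypothetical orchestration of one search request in the News Service."""
    intent = process_query(query)       # NS -> LLM: POST /process-query
    articles = query_articles(intent)   # NS -> DB: query articles
    summaries = summarize(articles)     # NS -> LLM: POST /summarize
    # NS -> Kafka: publish a user event; trending scores are updated asynchronously.
    publish_event({"type": "search", "query": query})
    # NS -> Client: articles enriched with their summaries.
    return [dict(article, summary=summary)
            for article, summary in zip(articles, summaries)]
```

A call with stub collaborators might look like this:

```python
published = []
result = search_flow(
    "tech news",
    process_query=lambda q: {"intent": "category", "entities": ["technology"]},
    query_articles=lambda intent: [{"title": "A"}, {"title": "B"}],
    summarize=lambda articles: ["summary of " + a["title"] for a in articles],
    publish_event=published.append,
)
```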
- Docker and Docker Compose
- Java 21
- Python 3.10
- Maven
- PostgreSQL 15+
- Redis 6+
- Apache Kafka
- Clone the repository:
  git clone <repository-url>
  cd news-data-retrieval-system-
- Configure environment variables:
  - Create .env files in each service directory
  - Set required environment variables (see below)
- Start the services:
  docker-compose up -d

News Retrieval Service:
SPRING_DATASOURCE_URL=jdbc:postgresql://postgres:5432/newsdb
SPRING_DATASOURCE_USERNAME=news
SPRING_DATASOURCE_PASSWORD=secret
LLM_SERVICE_URL=http://llm-service:8000
SPRING_DATA_REDIS_HOST=redis
SPRING_KAFKA_BOOTSTRAP_SERVERS=kafka:29092

LLM Service:
GOOGLE_API_KEY=your-gemini-api-key

Ingest Service:
DB_NAME=newsdb
DB_USER=news
DB_PASSWORD=secret
DB_HOST=postgres
DB_PORT=5432

GET /api/v1/news/search?query={query}
- Processes natural language queries
- Returns relevant articles with summaries
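Because the search endpoint takes free-form natural language in the `query` parameter, the text needs URL encoding. A minimal sketch using only the standard library follows; the base URL `http://localhost:8080` is an assumption (the document does not state the service's host or port), and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust to wherever the News Retrieval Service is exposed.
BASE_URL = "http://localhost:8080"


def build_search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the search URL, percent-encoding the natural language query."""
    return f"{base_url}/api/v1/news/search?{urlencode({'query': query})}"
```

For example, `build_search_url("tech news")` yields `http://localhost:8080/api/v1/news/search?query=tech+news`.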

GET /api/v1/trending?lat={latitude}&lon={longitude}&radius={radiusKm}&limit={limit}
- Returns trending articles based on location
- Supports radius-based search (default: 100 km)
- Optional limit parameter (default: 5)
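To make the radius and limit semantics concrete, the sketch below filters in-memory articles by great-circle distance and returns the highest-scoring ones. This is only an illustration: the actual service presumably delegates the distance work to PostGIS or Redis, and the `score` key and function names are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trending_nearby(articles, lat, lon, radius_km=100.0, limit=5):
    """Keep articles within radius_km of (lat, lon), highest trending score first."""
    nearby = [
        a for a in articles
        if haversine_km(lat, lon, a["latitude"], a["longitude"]) <= radius_km
    ]
    nearby.sort(key=lambda a: a["score"], reverse=True)
    return nearby[:limit]
```

The defaults match the documented ones: a 100 km radius and at most 5 results.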

POST /api/v1/events
- Records user interactions with articles
- Supports various event types
- Rate-limited for protection
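The document does not say how the events endpoint is rate-limited. One common technique is a token bucket, sketched below under that assumption; the class name, capacity, and refill rate are illustrative and not taken from the project.

```python
import time


class TokenBucket:
    """Token-bucket limiter: allows bursts up to capacity, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a handler, a request for which `allow()` returns `False` would typically be rejected with HTTP 429 (Too Many Requests).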

POST /process-query
- Analyzes natural language queries
- Extracts intents and entities

POST /summarize/
- Generates article summaries
- Uses Gemini model

CREATE TABLE news_articles (
id UUID PRIMARY KEY,
title TEXT NOT NULL,
description TEXT,
url TEXT NOT NULL,
publication_date TIMESTAMPTZ NOT NULL,
source_name TEXT NOT NULL,
category TEXT[],
relevance_score REAL CHECK (relevance_score >= 0 AND relevance_score <= 1),
latitude DOUBLE PRECISION CHECK (latitude BETWEEN -90 AND 90),
longitude DOUBLE PRECISION CHECK (longitude BETWEEN -180 AND 180),
geom GEOGRAPHY(Point,4326),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

cd news-retrieval-system
mvn clean install

cd llm-service
pip install -r requirements.txt

cd ingest
pip install -r requirements.txt

- Structured logging implemented across all services
- Log levels configurable via properties files
- Centralized logging recommended for production
- All services expose health endpoints
- Docker health checks configured
- Regular monitoring recommended
- Redis caching for frequent queries
- PostgreSQL indexes for common query patterns
- Kafka for asynchronous event processing
- Rate limiting on critical endpoints