A Retrieval-Augmented Generation (RAG) system for intelligent Fantasy Premier League analysis using Neo4j knowledge graphs, semantic embeddings, and large language models.
FPL Assistant is an intelligent conversational system that answers Fantasy Premier League questions using a retrieval-augmented generation (RAG) approach. Instead of hallucinating answers, the system:
- Classifies intent – understands what the user is asking (e.g., player stats, comparisons)
- Extracts entities – identifies relevant players, teams, gameweeks, and statistics from the query
- Retrieves context – uses multiple retrieval strategies (deterministic Cypher queries, semantic vector search, or hybrid) to fetch facts from a Neo4j knowledge graph
- Generates answers – passes retrieved context to a language model (DeepSeek, Llama, or Gemma) to synthesize natural, conversational responses
The system supports queries like:
- "Compare Salah and Haaland's total points"
- "How many goals did Salah score against Wolves?"
- "Which team did Salah score the least against in the 2022-23 season?"
Asked about FPL data, standalone LLMs are prone to hallucination (making up stats). RAG mitigates this by grounding responses in actual data from the knowledge graph, improving accuracy and factuality.
| Feature | Description |
|---|---|
| Multi-Model LLM Support | DeepSeek (default), Llama, or Gemma for answer generation |
| Dual Embedding Models | All-MiniLM-L6-v2 (fast, small) and All-MPNet-Base-V2 (high-quality) |
| Four Retrieval Strategies | Baseline Cypher, Embeddings (Vector), Hybrid, and LLM-generated Cypher |
| Fuzzy Entity Matching | Robust player/team name recognition despite typos and abbreviations |
| Comprehensive FPL Schema | Covers players, teams, positions, gameweeks, seasons, and detailed performance stats |
| Interactive Web UI | Streamlit-based interface with debug mode and real-time configuration |
| Evaluation Framework | 30 test prompts × 18 permutations = 540 experiments to benchmark retrieval + LLM performance |
| Two Seasons of Data | 2021-22 and 2022-23 FPL data with player performance across all gameweeks |
```
                   User Query
                        ↓
┌──────────────────────────────────────────────────┐
│         PREPROCESSING & UNDERSTANDING            │
│  • Intent Classification (LLM or Rule-based)     │
│  • Entity Extraction (NER + Fuzzy Matching)      │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│         RETRIEVAL LAYER (Multi-Strategy)         │
│  ┌──────────────────┐   ┌───────────────────┐    │
│  │ Cypher Baseline  │   │ Vector Embeddings │    │
│  │ (Deterministic)  │   │    (Semantic)     │    │
│  └────────┬─────────┘   └─────────┬─────────┘    │
│           └────────────┬──────────┘              │
│                        │                         │
│              Neo4j Knowledge Graph               │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│                 CONTEXT ASSEMBLY                 │
│  • Combine & deduplicate results                 │
│  • Format for LLM consumption                    │
└──────────────────────────────────────────────────┘
                        ↓
┌──────────────────────────────────────────────────┐
│              LLM ANSWER GENERATION               │
│  • DeepSeek / Llama / Gemma                      │
│  • Grounded in retrieved facts                   │
│  • Suggest follow-up questions                   │
└──────────────────────────────────────────────────┘
                        ↓
             Natural, Factual Answer
```
- Python 3.12
- Neo4j (Desktop or Docker)
- 4 GB+ RAM (for embedding models + FAISS indexes)
- Internet connection (for LLM API calls)
```
# Create and activate virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root:
```
# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=<your_password>

# LLM API Keys
DEEPSEEK_API_KEY=<your_deepseek_key>
DEEPSEEK_API_URL=https://api.deepseek.com/chat/completions
DEEPSEEK_MODEL=deepseek-chat
HF_TOKEN=<your_huggingface_token>

# Embedding Models
MODEL_A_NAME=sentence-transformers/all-MiniLM-L6-v2
MODEL_B_NAME=sentence-transformers/all-mpnet-base-v2

# FAISS Index Paths
FAISS_INDEX_A_PATH=./embeddings_out/faiss_index_modelA.index
FAISS_INDEX_B_PATH=./embeddings_out/faiss_index_modelB.index
MAPPING_A_PATH=./embeddings_out/idx_to_embedding_id_modelA.json
MAPPING_B_PATH=./embeddings_out/idx_to_embedding_id_modelB.json

# Output Directory
OUTPUT_DIR=./embeddings_out
```

```
# Start Neo4j Desktop and create/launch a local database

# Run the knowledge graph creation script
python .\scripts\create_kg.py
```

This populates Neo4j with:
- Seasons: 2021-22, 2022-23
- Players: ~600 per season
- Teams: 20 Premier League teams
- Fixtures: 380 per season (38 gameweeks × 10 fixtures per gameweek)
- Performance Data: Goals, assists, clean sheets, total points, etc.
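
For orientation, a minimal sketch of what a loading step like `create_kg.py` could look like is shown below. Node labels and relationships follow the schema documented later in this README; the CSV column names are assumptions.

```python
# Minimal sketch of a KG-loading step (not the project's exact implementation).
# Column names such as "season", "GW", "player_name", "position" are assumptions.
import os
import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI", "bolt://localhost:7687"),
    auth=(os.getenv("NEO4J_USERNAME", "neo4j"), os.getenv("NEO4J_PASSWORD")),
)

df = pd.read_csv("scripts/fpl_two_seasons.csv")

MERGE_ROW = """
MERGE (s:Season {season_name: $season})
MERGE (gw:Gameweek {season: $season, GW_number: $gw})
MERGE (s)-[:HAS_GW]->(gw)
MERGE (p:Player {player_name: $player})
MERGE (pos:Position {name: $position})
MERGE (p)-[:PLAYS_AS]->(pos)
"""

with driver.session() as session:
    for row in df.itertuples():
        # MERGE keeps the load idempotent: rerunning the script does not duplicate nodes
        session.run(
            MERGE_ROW,
            season=row.season, gw=row.GW,
            player=row.player_name, position=row.position,
        )
driver.close()
```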
Download FAISS indexes and mappings from: Google Drive Link
Place files in embeddings_out/:
```
embeddings_out/
├── faiss_index_modelA.index
├── faiss_index_modelB.index
├── idx_to_embedding_id_modelA.json
└── idx_to_embedding_id_modelB.json
```
```
python .\scripts\generate_embeddings.py
```

This script:
- Fetches all player performance records from Neo4j
- Generates text descriptions (e.g., "Haaland: 13 goals | assists: 1 | total_points: 10 | Position: FWD")
- Encodes them using both embedding models
- Creates FAISS indexes for fast similarity search
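
A minimal sketch of this pipeline, assuming `sentence-transformers` and `faiss-cpu`; file names follow the `.env` configuration above, and the description list is illustrative:

```python
# Sketch of the embedding/index step (descriptions would come from Neo4j in practice).
import json
import faiss
from sentence_transformers import SentenceTransformer

descriptions = [
    "Haaland: 13 goals | assists: 1 | total_points: 10 | Position: FWD",
    # ... one text description per player-performance record fetched from Neo4j
]
embedding_ids = list(range(len(descriptions)))  # placeholder for real embedding node ids

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # Model A
vectors = model.encode(descriptions, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(vectors)

faiss.write_index(index, "embeddings_out/faiss_index_modelA.index")
with open("embeddings_out/idx_to_embedding_id_modelA.json", "w") as f:
    # Map FAISS row positions back to embedding ids for later lookups in Neo4j
    json.dump(dict(enumerate(embedding_ids)), f)
```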
```
streamlit run main.py
```

Open http://localhost:8501 in your browser.
```
fpl-assistant/
├── main.py                       # Streamlit web UI entry point
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── STARTING.md                   # Minimal setup guide
├── schema.md                     # Data schema documentation
├── checklist.md                  # Project development phases
├── .env.template                 # Environment variables template
│
├── config/                       # Configuration & lookup tables
│   ├── settings.py               # Model options, defaults
│   ├── template_library.py       # 35 Cypher query templates
│   ├── team_name_variants.py     # Team abbreviation → full name
│   ├── stat_variants.py          # Statistic name aliases
│   ├── styles.py                 # Streamlit CSS styling
│   └── FPLTrivia.md              # Possible user queries
│
├── modules/                      # Core application logic
│   ├── preprocessing.py          # Intent classification + entity extraction
│   ├── cypher_retriever.py       # Baseline retrieval via Cypher
│   ├── vector_retriever.py       # Semantic retrieval via embeddings
│   ├── db_manager.py             # Neo4j connection pool
│   ├── llm_engine.py             # LLM API calls (DeepSeek, Llama, Gemma)
│   ├── llm_helper.py             # Intent classification & Cypher generation with LLM
│   └── tests_llm_engine.py       # llm_engine.py customized for performance testing
│
├── scripts/                      # Data processing & setup
│   ├── create_kg.py              # Populate Neo4j from CSV
│   ├── generate_embeddings.py    # Create FAISS indexes
│   ├── fpl_two_seasons.csv       # Raw FPL data (2 seasons)
│   └── config.txt                # Neo4j connection configuration
│
├── embeddings_out/               # Pre-computed embeddings
│   ├── faiss_index_modelA.index  # Fast index for model A
│   ├── faiss_index_modelB.index  # Fast index for model B
│   ├── idx_to_embedding_id_modelA.json
│   └── idx_to_embedding_id_modelB.json
│
├── experiments/                  # Evaluation framework
│   ├── run_experiments.py        # Execute all experiments
│   ├── tests.json                # 30 test prompts
│   ├── results.json              # Experimental results (540 trials)
│   ├── validate_tests.json       # Ground truth answers
│   ├── cost_modify.py            # Calculate LLM costs
│   ├── viz.py                    # Visualize results using plots
│   └── plots/                    # Generated charts
│
└── logo.png                      # App logo
```
`modules/preprocessing.py`: Converts raw user text into structured data.
Key Functions:
- `extract_entities(query: str) → dict` – Extracts players, teams, positions, gameweeks, seasons, and stats
Features:
- spaCy NER for organization recognition (team names)
- Fuzzy matching to handle typos and partial names
- Regex patterns for gameweeks (e.g., "GW10"), positions, seasons
- Database lookups for robustness
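
For illustration, here is a sketch of the fuzzy-matching idea, assuming Python's built-in `difflib`; the project may use a different matcher, and the real player list comes from Neo4j rather than a hard-coded list:

```python
# Sketch of fuzzy player-name matching (library choice and player list are assumptions).
from difflib import get_close_matches

KNOWN_PLAYERS = ["Erling Haaland", "Mohamed Salah", "Harry Kane"]  # fetched from Neo4j in practice

def fuzzy_player(token: str, cutoff: float = 0.6) -> str | None:
    """Return the best-matching canonical player name, or None if nothing is close enough."""
    matches = get_close_matches(token, KNOWN_PLAYERS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_player("Mohammed Salah"))  # -> "Mohamed Salah" (handles the common misspelling)
```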
Example:

```
query = "How many goals did Haaland score in GW5 2022-23?"
entities = extract_entities(query)
# Output:
# {
#   "players": ["Erling Haaland"],
#   "gameweeks": [5],
#   "seasons": ["2022-23"],
#   "statistics": []
# }
```

`modules/cypher_retriever.py`: Executes templated Cypher queries against Neo4j.
Key Functions:
- `retrieve_data_via_cypher(intent, entities, limit) → dict` – Executes a Cypher template selected by intent
Template Examples:
- `PLAYER_STATS_GW_SEASON` – Get a player's stats in a specific gameweek
- `COMPARE_PLAYERS_BY_TOTAL_POINTS` – Compare two players' total points
- `PLAYER_CAREER_STATS_TOTALS` – Career aggregates
- `TOP_PLAYERS_BY_POSITION` – Rank players by position
- `TEAM_FIXTURE_SCHEDULE` – Get a team's upcoming/past fixtures
Features:
- Parameter injection safety (parameterized + template rendering)
- Missing parameter detection with fallbacks
- JSON-friendly output
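
A sketch of how a template such as `PLAYER_STATS_GW_SEASON` might be filled and executed. The template text and parameter names are illustrative, as is the assumption that the `Neo4jGraph` helper accepts a parameter dict alongside the query string:

```python
# Sketch of templated Cypher retrieval; the traversal matches the schema tables below.
PLAYER_STATS_GW_SEASON = """
MATCH (s:Season {season_name: $season})-[:HAS_GW]->(gw:Gameweek {GW_number: $gameweek})
MATCH (gw)-[:HAS_FIXTURE]->(f:Fixture)<-[r:PLAYED_IN]-(p:Player {player_name: $player})
RETURN p.player_name AS player, r.goals_scored AS goals,
       r.assists AS assists, r.total_points AS total_points
"""

def retrieve_player_stats(db, entities: dict) -> list[dict]:
    """Fill template parameters from extracted entities and run the query."""
    params = {
        "player": entities["players"][0],
        "gameweek": entities["gameweeks"][0],
        "season": entities["seasons"][0],
    }
    # Parameterized execution avoids injecting raw user text into the query string
    return db.execute_query(PLAYER_STATS_GW_SEASON, params)
```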
`modules/vector_retriever.py`: Finds players and fixtures using semantic similarity over embeddings.
Key Functions:
- `vector_search(entities, top_k, model_choice) → dict` – Performs FAISS similarity search
- `get_models_and_indexes()` – Cached loading of models + FAISS indexes
Process:
- Build query text from entities (e.g., "Players: Haaland | Positions: FWD")
- Encode query using SentenceTransformer
- Query FAISS index for top-k similar embeddings
- Fetch source nodes from Neo4j
Embedding Models:
- Model A: `all-MiniLM-L6-v2` (22M params, fast)
- Model B: `all-mpnet-base-v2` (109M params, high-quality)
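
A minimal sketch of the search flow under the configuration above; the mapping-file structure and the return shape are assumptions:

```python
# Sketch of semantic retrieval with FAISS (file paths follow the .env configuration).
import json
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index("embeddings_out/faiss_index_modelA.index")
with open("embeddings_out/idx_to_embedding_id_modelA.json") as f:
    idx_to_id = json.load(f)  # FAISS row position -> embedding id (keys are strings in JSON)

def vector_search(entities: dict, top_k: int = 5) -> list[dict]:
    # 1. Build query text from extracted entities
    query_text = f"Players: {', '.join(entities.get('players', []))}"
    # 2. Encode and search the FAISS index
    vec = model.encode([query_text], normalize_embeddings=True)
    scores, indices = index.search(vec, top_k)
    # 3. Map FAISS positions back to embedding ids (then fetch source nodes from Neo4j)
    return [
        {"embedding_id": idx_to_id[str(i)], "score": float(s)}
        for i, s in zip(indices[0], scores[0])
    ]
```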
`modules/db_manager.py`: Singleton pattern for safe, pooled Neo4j access.

```
db = Neo4jGraph()  # Singleton
results = db.execute_query("MATCH (p:Player) RETURN p LIMIT 5")
```

`modules/llm_engine.py`: Interfaces with multiple LLM providers.
Supported Models:
- DeepSeek (default, most cost-effective)
- Llama via Hugging Face Inference API
- Gemma via Hugging Face Inference API
Functions:
- `deepseek_generate_answer(query, context) → str`
- `llama_generate_answer(query, context) → str`
- `gemma_generate_answer(query, context) → str`
System Prompt:
```
You are an elite Fantasy Premier League analyst.
Answer the user's question using ONLY the data provided.
Do NOT guess or hallucinate.
Keep output concise and actionable.
Suggest a follow-up question at the end.
```
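
A sketch of what a grounded DeepSeek call can look like, using the OpenAI-compatible chat-completions format and the environment variables above; error handling and the exact prompt are simplified:

```python
# Sketch of a DeepSeek call for grounded answer generation (not the project's exact code).
import os
import requests

def deepseek_generate_answer(query: str, context: str) -> str:
    system_prompt = (
        "You are an elite Fantasy Premier League analyst. "
        "Answer the user's question using ONLY the data provided."
    )
    response = requests.post(
        os.getenv("DEEPSEEK_API_URL", "https://api.deepseek.com/chat/completions"),
        headers={"Authorization": f"Bearer {os.getenv('DEEPSEEK_API_KEY')}"},
        json={
            "model": os.getenv("DEEPSEEK_MODEL", "deepseek-chat"),
            "messages": [
                {"role": "system", "content": system_prompt},
                # Retrieved facts are passed in-context so the answer stays grounded
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```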
`modules/llm_helper.py`: High-level LLM utilities for classifying user intent and generating Cypher queries.
Functions:
- `classify_with_deepseek(query, options) → list` – Maps the query to up to 3 Cypher templates
- `local_intent_classify(query)` – Rule-based fallback (in `config/template_library.py`)
- `create_query_with_deepseek(query: str, schema) → cypher query` – Generates a Cypher query consistent with the KG schema
| Node | Properties | Purpose |
|---|---|---|
| Season | `season_name` | Either 2021-22 or 2022-23 |
| Gameweek | `season`, `GW_number` | 38 gameweeks per season |
| Fixture | `season`, `fixture_number`, `kickoff_time` | Individual matches |
| Team | `name` | 20 Premier League clubs per season |
| Player | `player_name`, `player_element` | Individual players |
| Position | `name` | FWD, MID, DEF, GK |
| Embedding | `model`, `text`, `source_label` | Vector embeddings of player descriptions |
- `(Season)-[:HAS_GW]->(Gameweek)`
- `(Gameweek)-[:HAS_FIXTURE]->(Fixture)`
- `(Fixture)-[:HAS_HOME_TEAM]->(Team)`
- `(Fixture)-[:HAS_AWAY_TEAM]->(Team)`
- `(Player)-[:PLAYS_AS]->(Position)`
- `(Player)-[:PLAYED_IN]->(Fixture)`
The `PLAYED_IN` relationship stores per-fixture performance properties:

```
minutes, goals_scored, assists, total_points, bonus,
clean_sheets, goals_conceded, own_goals, yellow_cards,
red_cards, saves, penalties_saved, penalties_missed,
bps, influence, creativity, threat, ict_index, form
```
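
As a worked example of traversing this schema, the sketch below totals a player's goals against a given opponent. It assumes the `Neo4jGraph` helper accepts a parameter dict and that team names are already normalized to the canonical form stored in the graph:

```python
# Sketch: total goals by a player against a specific opponent.
# Since there is no Player->Team relation, opponent fixtures are found via the
# HAS_HOME_TEAM / HAS_AWAY_TEAM edges on the fixtures the player appeared in.
from modules.db_manager import Neo4jGraph  # assumed import path

GOALS_VS_TEAM = """
MATCH (p:Player {player_name: $player})-[r:PLAYED_IN]->(f:Fixture)
MATCH (f)-[:HAS_HOME_TEAM|HAS_AWAY_TEAM]->(t:Team {name: $team})
RETURN sum(r.goals_scored) AS goals_vs_team
"""

db = Neo4jGraph()
result = db.execute_query(GOALS_VS_TEAM, {"player": "Mohamed Salah", "team": "Wolves"})
```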
The system supports four retrieval strategies, configurable from the UI sidebar:
When to use: High-precision factual queries (stats, comparisons)
Process:
- Classify intent (e.g., "PLAYER_STATS_GW_SEASON")
- Map entities to template parameters
- Execute Cypher query
- Return structured results
Pros: Deterministic, precise, low latency
Cons: Requires exact entity matching; limited flexibility
When to use: Exploratory queries, fuzzy matching (e.g., "players similar to Salah")
Process:
- Encode query + entities as a text vector
- Query FAISS index for top-k similar embeddings
- Fetch corresponding nodes from Neo4j
- Return ranked results
Pros: Robust to phrasing differences, discovers similar items
Cons: Less precise; slower than Cypher; of limited practical use on the highly structured FPL knowledge graph
When to use: Balance precision + recall
Process:
- Run Cypher retrieval
- Run embedding search in parallel
- Combine results, deduplicate, rank by relevance
Pros: High recall + precision
Cons: Slower (executes both retrievers)
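
A sketch of the combine-and-deduplicate step; the record keys and the ranking rule are assumptions:

```python
# Sketch of hybrid result merging: union Cypher and vector results, deduplicate by a
# composite key, and rank by a relevance score (exact matches from Cypher rank first).
def merge_results(cypher_rows: list[dict], vector_rows: list[dict]) -> list[dict]:
    merged: dict[tuple, dict] = {}
    for row in cypher_rows:
        key = (row.get("player"), row.get("season"), row.get("gameweek"))
        merged[key] = {**row, "source": "cypher", "score": 1.0}
    for row in vector_rows:
        key = (row.get("player"), row.get("season"), row.get("gameweek"))
        if key not in merged:  # deduplicate: keep the exact Cypher hit if both retrieved it
            merged[key] = {**row, "source": "vector"}
    return sorted(merged.values(), key=lambda r: r.get("score", 0.0), reverse=True)
```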
When to use: Complex, unconventional questions
Process:
- Prompt DeepSeek to generate Cypher queries
- Execute against Neo4j
- Return results
Pros: Flexibility for novel queries
Cons: Can generate invalid Cypher; slower; executing LLM-generated queries unchecked is a significant security risk
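
A sketch of this mode with a simple read-only guard. The prompt wording, the guard, and reusing `deepseek_generate_answer` for query generation are illustrative, not the project's exact implementation:

```python
# Sketch of LLM-generated Cypher with a naive read-only check before execution.
# Assumes deepseek_generate_answer from llm_engine (see the sketch above).
FORBIDDEN = ("CREATE", "MERGE", "DELETE", "SET", "REMOVE", "DROP", "DETACH")

def answer_with_generated_cypher(db, query: str, schema: str) -> list[dict]:
    prompt = (
        f"Given this Neo4j schema:\n{schema}\n\n"
        f"Write a single read-only Cypher query answering: {query}\n"
        "Return only the Cypher, no explanation."
    )
    cypher = deepseek_generate_answer(prompt, context="").strip().strip("`")
    # Reject anything that could mutate the graph before executing it
    if any(word in cypher.upper() for word in FORBIDDEN):
        raise ValueError("Generated Cypher rejected: write operations are not allowed")
    return db.execute_query(cypher)
```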
| Model | Provider | Speed | Quality | Cost |
|---|---|---|---|---|
| DeepSeek | DeepSeek API | ⚡ Fast | ★★★★ | 💰 Cheap |
| Llama 2 70B | Hugging Face | 🐢 Slow | ★★★★★ | 💰💰 |
| Gemma 7B | Hugging Face | ⚡ Fast | ★★★ | 💰 Cheap |
The project includes a comprehensive evaluation suite with:
- 30 test prompts across different query types
- 18 permutations per prompt (different retrieval modes × LLM models × embedding models)
- 540 total experiments to benchmark performance
```
python -m experiments.run_experiments
```

Outputs:
- `experiments/results.json` – Detailed results for each trial
- `experiments/plots/` – Generated visualizations
- Summary metrics: latency, token usage, cost analysis
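
A sketch of the evaluation loop. The exact permutation grid that yields 18 runs per prompt is an assumption (the real suite also varies the embedding model), `run_fpl_assistant` is a hypothetical entry point, and the structure of `tests.json` is assumed to be a list of prompt strings:

```python
# Sketch of the benchmark loop: every prompt is run under every configuration,
# recording the answer and wall-clock latency.
import itertools, json, time

with open("experiments/tests.json") as f:
    prompts = json.load(f)  # assumed: a list of 30 prompt strings

MODES = ["baseline_cypher", "vector", "hybrid", "llm_cypher"]
LLMS = ["deepseek", "llama", "gemma"]

results = []
for prompt, mode, llm in itertools.product(prompts, MODES, LLMS):
    start = time.perf_counter()
    answer = run_fpl_assistant(prompt, retrieval_mode=mode, llm=llm)  # hypothetical
    results.append({
        "prompt": prompt, "mode": mode, "llm": llm,
        "latency_s": round(time.perf_counter() - start, 2),
        "answer": answer,
    })

with open("experiments/results.json", "w") as f:
    json.dump(results, f, indent=2)
```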
- Player Performance – Stats, comparisons, rankings
- Team Analysis – Fixture difficulty, form
- Transfer Advice – Form
- Comparisons – Head-to-head player metrics
- Recommendations – Top performers by position
```
┌───────────────────────────────────────────────────────┐
│                FPL Graph-RAG Assistant                │
│      Ask questions about Fantasy Premier League       │
├──────────────────────┬────────────────────────────────┤
│ SIDEBAR              │ MAIN CONTENT                   │
│ ├─ LLM Choice        │ • Chat input                   │
│ ├─ Retrieval         │ • Answer                       │
│ ├─ Embedding model   │ • Graph visualization          │
│ ├─ Top-K             │ • Raw retrieval context        │
│ └─ PL Logo           │ • Debug & Transparency         │
│                      │ • Chat history                 │
└──────────────────────┴────────────────────────────────┘
```
| Color | RGB | Use |
|---|---|---|
| Cyan | `rgb(4, 245, 255)` | Highlights, accents |
| Pink | `rgb(233, 0, 82)` | Warnings, important |
| Green | `rgb(0, 255, 133)` | Success, positive |
| Purple | `rgb(56, 0, 60)` | Background, muted |
| White | `rgb(255, 255, 255)` | Text, primary |
This project was developed for Milestone 3 of the Advanced Computer Lab course on AI tools (CSEN 903) at the German University in Cairo.
- Graph Architecture – Neo4j schema design, Cypher templates
- Preprocessing – Intent classification, fuzzy entity extraction
- Retrieval Layer – Cypher baseline, FAISS embeddings, hybrid approach
- LLM Integration – Multi-model support, prompt engineering
- Evaluation – Comprehensive benchmark suite with 540 trials
- Add `value`, `selected_by`, `transfer_balance` for actual FPL transfer recommendations
- Add KG relations between Players & Teams
- Optimize FAISS indexes for faster search
- Add more LLM providers (Claude, GPT-4)
- Cache common queries for faster responses
- Implement streaming responses for long answers
- Add user feedback loop for fine-tuning (PPO)