Wikipedia vs Grokipedia Quality Control System

OriginTrail Global Hackathon 2025

An AI-powered content comparison and trust annotation system that analyzes articles from Wikipedia and Grokipedia, detects discrepancies, and publishes Community Notes to the OriginTrail Decentralized Knowledge Graph (DKG).

🚀 Features

Automated Content Fetching: Scrapes 50+ topics from Wikipedia and Grokipedia
Vector Embeddings: Uses Sentence-Transformers for semantic analysis
AI-Powered Analysis: Leverages Cerebras AI with 8-key load balancing
Discrepancy Detection: Identifies length, keyword, and structural differences
Community Notes: Generates neutral, evidence-based fact-checking notes
DKG Publishing: Publishes trust annotations to the OriginTrail DKG (NeuroWeb testnet)
Web Dashboard: Clean, responsive UI with real-time progress tracking

🛠️ Tech Stack

Backend

Python 3.9+ with Flask
Pinecone - Vector database (free tier)
Sentence-Transformers - Local embeddings (all-MiniLM-L6-v2)
Cerebras Cloud SDK - AI analysis with load balancing
OriginTrail DKG - Decentralized knowledge graph
BeautifulSoup4 - Web scraping
Wikipedia API - Content fetching

Frontend

HTML5/CSS3/JavaScript
Bootstrap 5 - Responsive design
Vanilla JS - No frameworks needed

📦 Installation

Prerequisites

Python 3.9+ with pip
Node.js 16+ with npm
Pinecone API key (free tier)
Cerebras API keys (provided)
OriginTrail DKG Node (local or remote)
Wallet with testnet tokens (NEURO + TRAC)

Setup Steps

Clone the repository

git clone <repository-url>
cd hackathon-project

Install Python dependencies

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Install Node.js dependencies (DKG SDK)

npm install

Configure environment variables

cp .env.example .env

Edit .env and add your credentials:

# Pinecone
PINECONE_API_KEY=your_pinecone_api_key

# DKG Edge Node (uses official OriginTrail SDK)
DKG_SERVICE_URL=http://localhost:3000
DKG_ENDPOINT=http://localhost:8900
DKG_PUBLIC_KEY=0xYourPublicAddress
DKG_PRIVATE_KEY=0xYourPrivateKey

# Flask
FLASK_DEBUG=True
FLASK_PORT=5000

Start DKG Edge Node service (separate terminal)

npm start

Start Python Flask app (separate terminal)

python app.py

Access the dashboard

http://localhost:5000

Quick Test

Test DKG integration:

node test-dkg.js

🎯 Usage

Starting a Scan

Open the dashboard at http://localhost:5000
Click the "🚀 Start Scanning" button
Watch real-time progress as topics are analyzed
View results in the table once complete

Viewing Comparisons

Click "View" on any completed topic
Review:
- Similarity score (color-coded)
- AI analysis from Cerebras
- Detected discrepancies
- Community Note
- Side-by-side content comparison

Publishing to DKG

On a comparison page, click "Publish to DKG"
Wait for confirmation
The UAL (Universal Asset Locator) is displayed once publishing succeeds

📊 API Endpoints

GET `/`

Renders the main dashboard

GET `/comparison/<topic_name>`

Renders detailed comparison page for a topic

GET `/api/topics`

Returns list of all topics with status

[
  {
    "name": "Artificial Intelligence",
    "similarity": 0.85,
    "discrepancies": 2,
    "status": "completed",
    "ai_analysis_available": true
  }
]

POST `/api/scan`

Starts background scanning process

{
  "status": "scanning",
  "job_id": "scan_001"
}

GET `/api/scan-status`

Returns current scan progress

{
  "status": "processing",
  "progress": 45,
  "current_topic": "Quantum Computing"
}
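A client could poll this endpoint until the scan finishes. The following is a minimal sketch, not the project's own client: the names `fetch_status_http` and `poll_scan_status` are hypothetical, and the terminal status values (`"completed"`, `"failed"`) are an assumption, since only `"processing"` is shown above.

```python
import json
import time
from typing import Callable, Dict, List, Sequence
from urllib.request import urlopen


def fetch_status_http(base_url: str = "http://localhost:5000") -> Dict:
    """Fetch /api/scan-status from a running dashboard (the Flask app must be up)."""
    with urlopen(f"{base_url}/api/scan-status") as response:
        return json.load(response)


def poll_scan_status(
    fetch_status: Callable[[], Dict],
    interval_seconds: float = 2.0,
    max_polls: int = 300,
    done_states: Sequence[str] = ("completed", "failed"),  # assumed terminal states
) -> List[Dict]:
    """Poll until a terminal status appears or max_polls is reached.

    Returns every status payload seen, in order.
    """
    seen: List[Dict] = []
    for _ in range(max_polls):
        payload = fetch_status()
        seen.append(payload)
        if payload.get("status") in done_states:
            break
        time.sleep(interval_seconds)
    return seen
```

With the app running, `poll_scan_status(fetch_status_http)` would return the sequence of progress payloads. Passing the fetcher as a parameter keeps the loop testable without a live server.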

GET `/api/topic/<topic_name>`

Returns detailed analysis for a specific topic

{
  "similarity_score": 0.82,
  "discrepancies": [...],
  "ai_analysis": "...",
  "community_note": "...",
  "ual": "did:dkg:..."
}

POST `/api/publish-dkg`

Manually publishes a topic to the DKG

{
  "topic": "Artificial Intelligence",
  "discrepancies": [...],
  "similarity_score": 0.82,
  "ai_analysis": "..."
}

🔗 DKG Integration

The system publishes Community Notes as JSON-LD Knowledge Assets to the OriginTrail Decentralized Knowledge Graph:

Format: ActivityStreams JSON-LD with provenance
Blockchain: NeuroWeb Testnet (Chain ID: 20430)
Local Node: Connects to http://localhost:8900
UAL: Returns Universal Asset Locator for each published note
Mock Mode: Works without a DKG node (generates mock UALs)

See DKG_SETUP.md for detailed setup instructions.
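To illustrate what an ActivityStreams JSON-LD note might look like before it is handed to the DKG service, here is a hedged sketch. The function name `build_community_note` and the choice of fields are assumptions; the actual schema used in `backend/dkg_publisher.py` may differ. Only standard ActivityStreams 2.0 properties are used.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def build_community_note(
    topic: str,
    note_text: str,
    similarity_score: float,
    source_urls: List[str],
    published: Optional[datetime] = None,
) -> Dict:
    """Build an ActivityStreams 2.0 Note describing one comparison (hypothetical layout)."""
    published = published or datetime.now(timezone.utc)
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": f"Community Note: {topic}",
        "summary": f"Similarity score: {similarity_score:.2f}",
        "content": note_text,
        "published": published.isoformat(),
        # Provenance: who produced the note and which articles it compares.
        "attributedTo": {
            "type": "Application",
            "name": "Wikipedia vs Grokipedia Quality Control System",
        },
        "url": list(source_urls),
    }


if __name__ == "__main__":
    note = build_community_note(
        "Artificial Intelligence",
        "Both articles agree on the core definition.",
        0.82,
        ["https://en.wikipedia.org/wiki/Artificial_intelligence"],
    )
    print(json.dumps(note, indent=2))
```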

🔑 Key Features Explained

API Key Load Balancing

The system rotates through 8 Cerebras API keys using a round-robin algorithm to avoid rate limits:

# Automatic rotation on each request
key_rotator.get_next_key()
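A round-robin rotator of this kind can be sketched in a few lines. This is an illustrative version, not the code in `data/api_keys.py`: the `KeyRotator` class and the placeholder keys are assumptions, and only the `get_next_key()` method name comes from the snippet above.

```python
import itertools
import threading
from typing import Sequence


class KeyRotator:
    """Hand out API keys in round-robin order; safe to call from several threads."""

    def __init__(self, keys: Sequence[str]):
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(list(keys))
        self._lock = threading.Lock()

    def get_next_key(self) -> str:
        # The lock keeps two threads from advancing the iterator at the same time.
        with self._lock:
            return next(self._cycle)


# Placeholder keys for illustration only; real keys would come from configuration.
key_rotator = KeyRotator([f"key-{i}" for i in range(1, 9)])
```

Each call returns the next key and wraps around after the eighth, so consecutive requests spread evenly across the keys.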

Vector Similarity

Uses cosine similarity on 384-dimensional embeddings:

0.8-1.0: High similarity (green)
0.6-0.8: Moderate similarity (yellow)
0.0-0.6: Low similarity (red)
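As a rough illustration of the scoring, the sketch below computes cosine similarity in plain Python and maps the score to the color bands listed above. The function names are hypothetical, and treating the boundaries as inclusive on the upper band (0.8 counts as green, 0.6 as yellow) is an assumption, since the ranges above overlap at those values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def similarity_band(score: float) -> str:
    """Map a similarity score to the dashboard color band (boundary handling assumed)."""
    if score >= 0.8:
        return "green"
    if score >= 0.6:
        return "yellow"
    return "red"
```

In the real pipeline the inputs would be the 384-dimensional embeddings of the two articles; short vectors work the same way for experimentation.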

Discrepancy Detection

Three types of discrepancies:

Length: >30% difference in content length
Keyword: <50% overlap in top keywords (TF-IDF)
Structural: Significant differences in formatting
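The length and keyword checks could look roughly like the sketch below. Several details are assumptions rather than the logic in `backend/comparison.py`: the length difference is measured relative to the longer text, keywords are ranked by raw term frequency as a simple stand-in for TF-IDF, overlap is measured against the smaller keyword set, and the structural check is omitted. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Set

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}


def top_keywords(text: str, k: int = 10) -> Set[str]:
    """Return the k most frequent non-stopword terms (term frequency, not TF-IDF)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return {word for word, _ in counts.most_common(k)}


def detect_discrepancies(
    text_a: str,
    text_b: str,
    length_threshold: float = 0.30,
    overlap_threshold: float = 0.50,
) -> List[str]:
    """Flag 'length' and 'keyword' discrepancies between two texts."""
    found: List[str] = []

    # Length: relative difference against the longer text.
    longer = max(len(text_a), len(text_b))
    if longer and abs(len(text_a) - len(text_b)) / longer > length_threshold:
        found.append("length")

    # Keyword: share of the smaller keyword set that also appears in the other.
    kw_a, kw_b = top_keywords(text_a), top_keywords(text_b)
    smaller = min(len(kw_a), len(kw_b))
    if smaller and len(kw_a & kw_b) / smaller < overlap_threshold:
        found.append("keyword")

    return found
```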

Graceful Error Handling

Failed topics are skipped so they do not block the scan
Cerebras failures fall back to automatic analysis
DKG publishing errors are logged but do not crash the app
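The skip-and-fall-back behavior described above can be sketched as a scan loop. This is an illustration under assumptions: `scan_topics` and its callable parameters are hypothetical names, and the real app runs the scan as a background job.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("scan")


def scan_topics(
    topics: Iterable[str],
    fetch_pair: Callable[[str], Tuple[str, str]],
    ai_analyze: Callable[[str, str], str],
    auto_analyze: Callable[[str, str], str],
) -> List[Dict]:
    """Analyze each topic; skip fetch failures and fall back when AI analysis fails."""
    results: List[Dict] = []
    for topic in topics:
        try:
            wiki_text, grok_text = fetch_pair(topic)
        except Exception:
            # A failed fetch is logged and skipped so the scan keeps going.
            logger.exception("Skipping %s: fetch failed", topic)
            continue

        try:
            analysis = ai_analyze(wiki_text, grok_text)
        except Exception:
            # AI failure: use the automatic (non-AI) analysis instead.
            logger.warning("AI analysis failed for %s; using automatic analysis", topic)
            analysis = auto_analyze(wiki_text, grok_text)

        results.append({"name": topic, "analysis": analysis, "status": "completed"})
    return results
```

Passing the fetcher and analyzers in as callables keeps each failure path easy to exercise with stubs.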

📁 Project Structure

hackathon-project/
├── app.py                      # Flask application
├── config.py                   # Configuration
├── requirements.txt            # Dependencies
├── backend/
│   ├── scraper.py             # Content fetching
│   ├── embeddings.py          # Vector embeddings
│   ├── comparison.py          # Discrepancy detection
│   ├── cerebras_analyzer.py   # AI analysis
│   └── dkg_publisher.py       # DKG publishing
├── data/
│   ├── topics.json            # 54 topics
│   └── api_keys.py            # API key rotation
├── static/
│   ├── css/style.css          # Custom styles
│   └── js/script.js           # Utilities
├── templates/
│   ├── base.html              # Base template
│   ├── index.html             # Dashboard
│   └── comparison.html        # Comparison view
└── .env                        # Environment variables

🐛 Troubleshooting

Pinecone Connection Issues

Verify the API key in .env
Check that the index name matches wikipedia-grokipedia
Ensure the free tier limits are not exceeded

Cerebras API Errors

The system automatically rotates through all 8 keys
Check the logs for specific error messages
The system falls back to automatic analysis if all keys fail

Wikipedia Fetch Failures

Some topics may resolve to disambiguation pages
The system automatically tries the first option
Check the logs for specific failures

Grokipedia Scraping Issues

Update the URL format in backend/scraper.py
Adjust the CSS selectors to match the site structure
Mock data is used if the site is unavailable

📝 License

MIT License - OriginTrail Global Hackathon 2025

🙏 Acknowledgments

OriginTrail - DKG infrastructure
Cerebras - AI analysis platform
Pinecone - Vector database
Wikipedia - Baseline content source
