# WebGuru

A sophisticated web application that combines web scraping, AI-powered content summarization, and intelligent question-answering capabilities. Built with Flask, this tool provides comprehensive website analysis through both a web interface and a command-line interface.
## Features

- Web Scraping: Iteratively scrape websites up to configurable depth levels
- AI-Powered Summarization: Generate intelligent summaries using Google Gemini AI
- Question & Answer System: Ask questions about scraped content using BERT-based QA models
- Website Structure Visualization: Create interactive network graphs showing website link structures
- Content Embedding: Advanced semantic search using sentence transformers
- Web Application: User-friendly Flask-based web interface
- Command Line Tool: Standalone Python script for batch processing
- Semantic Search: Find relevant content using vector embeddings (see the sketch after this list)
- Confidence Scoring: Get reliability scores for AI-generated answers
- Export Functionality: Download summaries and visualizations
- Session Management: Persistent user sessions for web interface
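The semantic search step works by embedding both the question and the scraped text chunks, then ranking chunks by cosine similarity. A minimal sketch with sentence-transformers; the model name and function shape are illustrative assumptions, not WebGuru's exact code:

```python
# A sketch of embedding-based retrieval with sentence-transformers; the model
# name is an illustrative choice, not necessarily the one WebGuru ships with.
from typing import List

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def find_relevant_chunks(question: str, chunks: List[str], top_k: int = 3) -> List[str]:
    """Rank scraped text chunks by cosine similarity to the question."""
    question_emb = model.encode(question, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(question_emb, chunk_embs)[0]  # shape: (len(chunks),)
    top = scores.argsort(descending=True)[:top_k]
    return [chunks[int(i)] for i in top]
```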
## Tech Stack

- Backend: Flask (Python web framework)
- AI Models:
  - Google Gemini 1.5 Pro for content summarization (see the sketch after this list)
  - BERT-based QA pipeline for question answering
  - Sentence Transformers for semantic embeddings
- Data Processing: BeautifulSoup4, NetworkX, Matplotlib
- Web Scraping: Requests library with error handling
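For the summarization step, the google-generativeai SDK is driven roughly as follows; the prompt wording and placeholder text are illustrative, not WebGuru's actual prompt:

```python
# Roughly how Gemini 1.5 Pro is called for summarization via the
# google-generativeai SDK; the prompt text here is illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY_HERE")
model = genai.GenerativeModel("gemini-1.5-pro")

scraped_text = "…text collected by the scraper…"  # placeholder
response = model.generate_content(
    "Summarize the following website content:\n\n" + scraped_text
)
print(response.text)
```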
## Project Structure

```
WebGuru/
├── app.py                    # Flask web application
├── final.py                  # Command-line interface
├── templates/
│   └── index.html            # Web interface template
├── static/
│   ├── website_structure.png # Generated visualizations
│   └── website_summary.txt   # Generated summaries
└── requirements.txt          # Python dependencies
```
## Prerequisites

- Python 3.7 or higher
- pip package manager
- Google Gemini API key
## Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd WebGuru
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure API keys:

   - Obtain a Google Gemini API key from Google AI Studio
   - Update the API key in `app.py` and `final.py`:

     ```python
     genai.configure(api_key='YOUR_API_KEY_HERE')
     ```

4. Run the application:

   ```bash
   # Web interface
   python app.py

   # Command line interface
   python final.py
   ```
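Hard-coding the key works for local testing, but the Security Considerations below recommend environment variables for production. A minimal sketch of that approach; the variable name `GEMINI_API_KEY` is an illustrative choice:

```python
# Load the Gemini key from an environment variable instead of hard-coding it.
# The variable name GEMINI_API_KEY is an assumption, not part of WebGuru.
import os

import google.generativeai as genai

api_key = os.environ.get("GEMINI_API_KEY")  # e.g. export GEMINI_API_KEY=...
if not api_key:
    raise RuntimeError("Set the GEMINI_API_KEY environment variable first")
genai.configure(api_key=api_key)
```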
## Usage

### Web Interface

1. Start the Flask server:

   ```bash
   python app.py
   ```

2. Open your browser and navigate to `http://localhost:5000`

3. Enter a URL to analyze and click "Generate Summary"

4. View results:
   - AI-generated summary
   - Website structure visualization
   - Ask questions about the content

5. Download results using the download button
### Command Line Interface

1. Run the script:

   ```bash
   python final.py
   ```

2. Enter the target URL when prompted

3. Wait for processing:
   - Website scraping
   - Content analysis
   - Summary generation

4. Ask questions interactively about the content

5. Type 'exit' to quit the program
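The interactive session follows the usual prompt-until-exit pattern; `answer_question()` below is a hypothetical stand-in for the project's QA call, not its actual API:

```python
# The prompt-until-exit loop final.py runs after summarization;
# answer_question() is a hypothetical placeholder for the real QA call.
def answer_question(question):
    # Placeholder: the real tool runs the BERT QA pipeline over scraped content.
    return "example answer", 0.0

while True:
    question = input("Ask a question (or type 'exit' to quit): ").strip()
    if question.lower() == "exit":
        break
    answer, confidence = answer_question(question)
    print(f"Answer: {answer} (confidence: {confidence:.2f})")
```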
## Configuration

### Scraping Depth

Modify the `max_depth` parameter in the `scrape_website()` function:

```python
# Default depth is 1; increase for deeper analysis
scrape_website(start_url, max_depth=2)
```

### QA Models

The application supports multiple QA models. Uncomment your preferred model in `final.py`:
```python
# Current model
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

# Alternative models
# qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
# qa_pipeline = pipeline("question-answering", model="deepset/roberta-large-squad2")
```
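Whichever model is selected, the Hugging Face pipeline returns a confidence score alongside the answer, which is the source of the reliability scores mentioned under Features. The question and context strings below are illustrative:

```python
# Querying the QA pipeline and reading the confidence score; the strings
# passed in are illustrative, not WebGuru's actual inputs.
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)
result = qa_pipeline(
    question="What does WebGuru do?",
    context="WebGuru scrapes websites, summarizes them, and answers questions.",
)
print(result["answer"], result["score"])  # score is the model's confidence
```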
### Output Paths

Customize file output locations:

```python
# Visualization output
visualize_links(start_url, links, "custom_path.png")

# Summary output
save_summary_to_file(summary, "custom_summary.txt")
```

## Output Files

- `website_structure.png`: Network graph visualization of website links
- `website_summary.txt`: AI-generated content summary (downloadable)
- Web interface: files are saved in the `static/` directory
- Command line: files are saved in the current working directory
## API Endpoints

- `GET /`: Main page with forms
- `POST /`: Process URL submission or questions
- `GET /download_summary`: Download generated summary as a text file

Form fields:

- URL form: `url` - Target website URL
- Question form: `question` - User question about content
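For reference, the endpoints can be exercised from a script with the requests library; this sketch assumes the server is running locally on the default port:

```python
# Driving the three endpoints above from a script; assumes `python app.py`
# is already running on localhost:5000.
import requests

BASE = "http://localhost:5000"

# Submit a URL for analysis (the `url` form field)
resp = requests.post(BASE + "/", data={"url": "https://example.com"})
print(resp.status_code)

# Ask a question about the scraped content (the `question` form field)
resp = requests.post(BASE + "/", data={"question": "What is this site about?"})
print(resp.status_code)

# Download the generated summary
summary = requests.get(BASE + "/download_summary")
with open("website_summary.txt", "wb") as f:
    f.write(summary.content)
```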
## Security Considerations

- API Key Management: Store API keys in environment variables for production
- Rate Limiting: Implement rate limiting for web scraping
- Input Validation: Validate URLs and user inputs
- Session Security: Flask secret key is randomly generated
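A minimal sketch of two of these points, the randomly generated session secret and basic URL validation; the validation rule here is an illustrative baseline, not WebGuru's actual check:

```python
# Random per-process session secret plus a minimal URL check; the
# is_valid_url() rule is an illustrative baseline, not WebGuru's own.
import os
from urllib.parse import urlparse

from flask import Flask

app = Flask(__name__)
app.secret_key = os.urandom(24)  # randomly generated secret for sessions

def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```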
## Error Handling

The application includes comprehensive error handling for:

- Network request failures
- Invalid URLs
- API rate limits
- File I/O operations
- Model inference errors
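In practice, handling the first three cases means wrapping network calls in the requests exception hierarchy. A sketch of the pattern, though WebGuru's actual handlers may be structured differently:

```python
# Defensive fetch: timeouts plus the requests exception hierarchy cover
# network failures, invalid URLs, and HTTP errors such as rate limits.
from typing import Optional

import requests

def fetch_page(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return page HTML, or None if the request fails for any reason."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()  # surfaces HTTP errors (404s, 429 rate limits, ...)
        return resp.text
    except requests.exceptions.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return None
```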
## Performance

- Efficient queue-based scraping algorithm (see the sketch below)
- Vectorized similarity calculations
- Configurable scraping depth
- Session-based caching
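A sketch of what a queue-based, depth-limited crawl typically looks like; the real `scrape_website()` may differ in details such as link filtering and content extraction:

```python
# Breadth-first, depth-limited crawl using a deque as the work queue;
# an illustrative version of the queue-based algorithm described above.
from collections import deque
from typing import Dict
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_website(start_url: str, max_depth: int = 1) -> Dict[str, str]:
    """Crawl breadth-first up to max_depth, returning {url: page text}."""
    pages: Dict[str, str] = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.exceptions.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)
        if depth < max_depth:
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```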
## Future Enhancements

- Implement Redis caching for embeddings
- Add async processing for large websites
- Database storage for persistent content
- CDN integration for static files
## Testing

- Test with various website types (blogs, e-commerce, corporate)
- Verify error handling with invalid URLs
- Test question-answering with different content types
- Validate file downloads and visualizations
```bash
# Run tests (when implemented)
python -m pytest tests/
```
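Until a suite exists, a minimal starting point using Flask's test client; it assumes `app.py` exposes the Flask instance as `app`:

```python
# tests/test_app.py - a minimal first test; assumes app.py exposes the
# Flask instance as `app` (adjust the import to match the actual module).
from app import app

def test_index_returns_form():
    """GET / should render the main page successfully."""
    client = app.test_client()
    resp = client.get("/")
    assert resp.status_code == 200
```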
## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
### Code Style

- Follow PEP 8 guidelines
- Add docstrings to all functions
- Include type hints where appropriate
- Maintain consistent error handling
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Google Gemini AI for advanced content summarization
- Hugging Face for pre-trained QA models
- Flask for the web framework
- BeautifulSoup4 for web scraping capabilities
- NetworkX for graph visualization
## Support

For issues, questions, or contributions:
- Create an issue on GitHub
- Contact the development team
- Check the documentation for common solutions
## Version History

- v1.0.0: Initial release with web and CLI interfaces
  - Core functionality: scraping, summarization, Q&A
  - Basic visualization and export features
Note: This application requires an active internet connection and valid API keys to function properly. Ensure compliance with website terms of service and respect robots.txt files when scraping.