🚀 View the Live Digest: xmkxabc.github.io/insightarxiv/
In an era of information overload, researchers face the daily challenge of navigating a deluge of new papers. InsightArxiv is designed to solve this core pain point. It's not just a data scraper; it's a fully automated, intelligent academic insight engine.
Our target users are efficiency-driven AI researchers, engineers, graduate students, and tech decision-makers. InsightArxiv offers them a revolutionary way to digest and comprehend the latest scientific breakthroughs:
- From Reading to Insight: Deconstructing long, complex papers into multi-dimensional, structured knowledge crystals.
- From Filtering to Discernment: Leveraging AI for in-depth analysis and critical commentary to help users quickly identify the most innovative and relevant research.
- From Passive to Proactive: Allowing users to fully customize their fields of interest, bringing the most valuable information directly to them.
InsightArxiv's unique value proposition lies in combining cutting-edge AI with a deep understanding of the research workflow. We aim to free researchers from the tedious task of literature screening, empowering them to focus on what truly matters: innovation and critical thinking.
- 🤖 End-to-End Automation: The entire pipeline, from fetching the latest arXiv papers to AI-powered analysis and daily report generation, is fully automated and kicked off with a single `run.sh` command.
- 🧠 Multi-Dimensional AI Analysis: Utilizes Google Gemini to deconstruct each paper, generating structured insights including a TL;DR, Research Motivation, Methodology, Experimental Results, Core Conclusions, Keywords, and an exclusive AI-generated commentary.
- 📚 Intelligent Categorization & Navigation: Automatically categorizes papers by subject and sorts them according to user preferences. The generated report features a dynamic two-level Table of Contents (TOC) and seamless internal links for a superior reading experience.
- 🔄 Resource-Aware Polling: A smart, built-in mechanism that rotates through multiple models and API keys. When a resource's free quota is exhausted, the system seamlessly switches to the next available one, maximizing cost-efficiency and ensuring high availability.
- ⚡️ High-Performance & Robust: Built on `asyncio` for high-concurrency processing, significantly boosting efficiency. The system is designed with robust error handling for network fluctuations, API errors, and data inconsistencies to ensure stable operation.
- 🎨 Template-Driven & Extensible: The final report's appearance is driven by `template.md`, completely separating content from presentation. This allows users to easily customize the report's style. The architecture is clean, modular, and easy to extend.
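The resource-aware polling can be pictured as a priority walk over (model, key) pairs. The sketch below is illustrative only; the names `call_with_rotation` and `QuotaExhausted` are hypothetical, not the project's actual API:

```python
import itertools


class QuotaExhausted(Exception):
    """Raised when a (model, key) pair has no free quota left (hypothetical)."""


def call_with_rotation(prompt, models, api_keys, call_fn):
    # Try every (model, key) combination in priority order; on quota
    # exhaustion, fall through to the next pair instead of failing.
    for model, key in itertools.product(models, api_keys):
        try:
            return call_fn(prompt, model=model, api_key=key)
        except QuotaExhausted:
            continue  # this pair is out of quota; try the next one
    raise RuntimeError("all model/key pairs exhausted")
```

Because `itertools.product` iterates keys within each model first, all keys for the highest-priority model are tried before falling back to the next model.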
InsightArxiv operates on a well-architected, modular data processing pipeline:
1. [CRAWL] `daily_arxiv/` (Scrapy)
   - A sophisticated Scrapy spider fetches the latest papers from arXiv, configured via the `CATEGORIES` environment variable.
   - Features intelligent deduplication, filtering of cross-lists, and rich metadata extraction.
   - Output: `data/<date>.jsonl`
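Cross-list filtering might look something like the sketch below, assuming each scraped record lists its primary category first (an assumption about the spider's output, not a confirmed detail of `daily_arxiv/`):

```python
def is_cross_list(paper: dict, wanted: set) -> bool:
    # arXiv lists a paper's primary category first; here a "cross-list"
    # is a paper whose primary category is outside the crawled set.
    primary = paper["categories"][0]
    return primary not in wanted


papers = [
    {"id": "2506.00001", "categories": ["cs.CV", "cs.LG"]},
    {"id": "2506.00002", "categories": ["math.OC", "cs.LG"]},  # cross-listed into cs.LG
]
wanted = {"cs.CV", "cs.AI", "cs.LG"}
kept = [p for p in papers if not is_cross_list(p, wanted)]
```

The second record is dropped because its primary category (`math.OC`) is not among the configured categories, even though it is cross-listed into one.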
2. [ENHANCE] `ai/` (LangChain + Gemini)
   - Reads the raw data and processes it with high concurrency using `asyncio`.
   - Leverages Pydantic models defined in `ai/structure.py` to instruct Gemini to return structured, multi-dimensional analysis.
   - The core `enhance.py` script manages complex model/key rotation, rate limiting, and retry logic.
   - Output: `data/<date>_AI_enhanced_<lang>.jsonl`
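The structured analysis could be modeled with Pydantic along these lines. The field names below are illustrative guesses based on the dimensions listed above; the real schema lives in `ai/structure.py`:

```python
from pydantic import BaseModel, Field


class PaperAnalysis(BaseModel):
    # One record per paper, mirroring the analysis dimensions above.
    tldr: str = Field(description="One-sentence summary of the paper")
    motivation: str = Field(description="Research motivation")
    method: str = Field(description="Methodology")
    results: str = Field(description="Experimental results")
    conclusion: str = Field(description="Core conclusions")
    keywords: list = Field(default_factory=list)
    ai_comments: str = Field(description="AI-generated commentary")
```

A LangChain structured-output chain can then parse Gemini's JSON reply into this schema, e.g. via `PydanticOutputParser(pydantic_object=PaperAnalysis)`.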
3. [GENERATE] `to_md/` (Python)
   - A powerful report generation engine consumes the AI-enhanced data.
   - Renders the structured data into a beautiful, readable Markdown report based on `template.md`.
   - Intelligently generates a categorized TOC sorted by user preference and convenient in-page navigation.
   - Output: `data/<date>.md`
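Sorting the TOC by the user's category order can be done with a simple rank map. This is a hypothetical sketch, not the project's actual `to_md/` code:

```python
def make_toc(papers_by_category: dict, priority: list) -> str:
    # Rank categories by the user's CATEGORIES order; unknown ones go last.
    rank = {cat: i for i, cat in enumerate(priority)}
    lines = []
    for cat in sorted(papers_by_category, key=lambda c: rank.get(c, len(rank))):
        # Top level: category heading link; second level: paper links.
        lines.append(f"- [{cat}](#{cat.lower().replace('.', '')})")
        for title in papers_by_category[cat]:
            anchor = title.lower().replace(" ", "-")
            lines.append(f"  - [{title}](#{anchor})")
    return "\n".join(lines)
```

With `priority=["cs.CV", "cs.LG"]`, a `cs.CV` section is emitted before `cs.LG` regardless of dictionary order.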
4. [PUBLISH] `update_readme.py`
   - Reads the daily generated Markdown report and dynamically updates the root `README.md` to publish the latest content.
Clone this repository to your local machine:

```bash
git clone https://github.com/xmkxabc/insightarxiv.git
cd insightarxiv
```

Make sure you have Python 3.10+ installed, along with uv (or pip) for package management.

It is recommended to use uv (or pip) to install the project dependencies:

```bash
# Using uv (recommended)
uv pip install -r requirements.txt

# Or using pip
pip install -r requirements.txt
```

Create a `.env` file in the project's root directory. This is crucial for the project to run.
```env
# Required: Your Google API keys, separated by commas. The script will poll them in order.
GOOGLE_API_KEYS=your_google_api_key_1,your_google_api_key_2

# Required: The Gemini models you want to use, in order of priority.
# The system will automatically switch to the next one if a quota is exceeded.
MODEL_PRIORITY_LIST=gemini-1.5-flash,gemini-1.5-pro

# Required: The arXiv categories you want to fetch and prioritize, separated by commas.
# The report will sort categories based on this order.
CATEGORIES=cs.CV,cs.AI,cs.LG,cs.CL,cs.RO,stat.ML
```

Execute the `run.sh` script to start the entire automated workflow:
```bash
bash run.sh
```

Once the script finishes, the latest AI-enhanced arXiv report will be automatically updated in this README.md file.
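Inside the pipeline, the comma-separated variables are presumably split into Python lists. A minimal sketch using `os.environ` directly (the `setdefault` calls stand in for loading `.env`, e.g. with `python-dotenv`; the variable names match the config above):

```python
import os

# Stand-in for load_dotenv(): seed the variables if they are not already set.
os.environ.setdefault("GOOGLE_API_KEYS", "key_1,key_2")
os.environ.setdefault("MODEL_PRIORITY_LIST", "gemini-1.5-flash,gemini-1.5-pro")

# Split the comma-separated values into ordered lists for polling.
api_keys = os.environ["GOOGLE_API_KEYS"].split(",")
models = os.environ["MODEL_PRIORITY_LIST"].split(",")
```

The order of both lists matters: it defines the priority in which models and keys are tried.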
We warmly welcome contributions of all forms! Whether it's reporting a bug, suggesting a new feature, or improving the code through a Pull Request, your help is invaluable to the community.
- Found an issue? Please create an Issue.
- Want to add a new feature? Fork the repository and submit a Pull Request.
This project is open-sourced under the MIT License.
**June 2025**

| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
|  |  |  |  |  |  | 1 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| 30 |  |  |  |  |  |  |
**May 2025**

| Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|
|  |  |  | 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 | 31 |  |
2025
June
- 2025-06-30
- 2025-06-29
- 2025-06-28
- 2025-06-27
- 2025-06-26
- 2025-06-25
- 2025-06-24
- 2025-06-23
- 2025-06-22
- 2025-06-21
- 2025-06-20
- 2025-06-19
- 2025-06-18
- 2025-06-17
- 2025-06-16
- 2025-06-15
- 2025-06-14
- 2025-06-13
- 2025-06-12
- 2025-06-11
- 2025-06-10
- 2025-06-09
- 2025-06-08
- 2025-06-07
- 2025-06-06
- 2025-06-05
- 2025-06-04
- 2025-06-03
- 2025-06-02
- 2025-06-01
May
- 2025-05-31
- 2025-05-30
- 2025-05-29
- 2025-05-28
- 2025-05-27
- 2025-05-26
- 2025-05-25
- 2025-05-24
- 2025-05-23
- 2025-05-22
- 2025-05-21
- 2025-05-20
- 2025-05-19
- 2025-05-18
- 2025-05-17
- 2025-05-16
- 2025-05-15
- 2025-05-14
- 2025-05-13
- 2025-05-12
- 2025-05-11
- 2025-05-10
- 2025-05-09
- 2025-05-08
- 2025-05-07
- 2025-05-06
- 2025-05-05
- 2025-05-04
- 2025-05-03
- 2025-05-02
- 2025-05-01
April
- 2025-04-30
- 2025-04-29
- 2025-04-28
- 2025-04-27
- 2025-04-26
- 2025-04-25
- 2025-04-24
- 2025-04-23
- 2025-04-22
- 2025-04-21
- 2025-04-20
- 2025-04-19
- 2025-04-18
- 2025-04-17
- 2025-04-16
- 2025-04-15
- 2025-04-14
- 2025-04-13
- 2025-04-12
- 2025-04-11
- 2025-04-10
- 2025-04-09
- 2025-04-08
- 2025-04-07
- 2025-04-06
- 2025-04-05
- 2025-04-04
- 2025-04-03
- 2025-04-02
- 2025-04-01
This page is automatically updated by a GitHub Action.