A sophisticated academic paper citation compliance checking system that automatically analyzes citations and references in academic documents, identifying mismatches, missing citations, and format inconsistencies to improve paper quality and academic standards.
Developed by: Agent4S Project Team, TaShan Interdisciplinary Innovation Association, University of Chinese Academy of Sciences
Website: tashan.ac.cn
- Supported Formats: Word documents (.docx, .doc) and PDF files
- File Size Limit: Up to 10MB per document
- Smart Parsing: Automatic identification of document structure, extracting main content and reference sections
The system recognizes citations in academic papers and matches them with reference lists. Here's what formats are currently supported:
| Citation Format | Support Level | Examples | Notes |
|---|---|---|---|
| Author-Year (Chinese) | ✅ Full Support | 张三（2024）; 李四 等（2020） | Complete citation-reference matching and validation |
| Author-Year (English) | ✅ Full Support | Smith (2020); Smith & Jones (2019); Smith et al. (2018) | Complete citation-reference matching and validation |
| GB/T 7714-2015 著者-出版年制 (author-date) | ✅ Full Support | Same as author-year formats above | This is the primary format this tool is designed for |
| Numeric Sequential | ⚠️ Partial Support | [1], [2], [15]; [1-3] (range) | Can extract and identify, but does not perform citation-reference matching validation |
| GB/T 7714-2015 顺序编码制 (sequential numbering) | ⚠️ Partial Support | Same as numeric sequential | Can extract and identify only |
| IEEE (numeric) | ⚠️ Partial Support | [1], [2] (bracket style only) | Can extract bracket-style numbers; superscript numbers (e.g., text¹) are not supported |
| APA | ⚠️ Partial Support | Basic author-year only | Only supports basic author-year format; page numbers and advanced features not supported |
| MLA | ❌ Not Supported | - | Planned but not implemented |
| Chicago | ❌ Not Supported | - | Planned but not implemented |
Best Results: This tool works best with papers using author-year citation formats (GB/T 7714-2015 著者-出版年制 / author-date, or similar styles). For papers using numeric citation systems, the tool can identify citations but cannot perform comprehensive matching analysis.
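For illustration, English author-year citations like those in the table can be recognized with a regular expression. The pattern below is a minimal sketch, far simpler than the tool's actual (AI-assisted) extraction logic:

```python
import re

# Hypothetical pattern for basic English author-year citations;
# the real extractor handles many more variations.
AUTHOR_YEAR = re.compile(
    r"([A-Z][A-Za-z-]+"           # first author surname
    r"(?:\s*&\s*[A-Z][A-Za-z-]+"  # optional "& Second"
    r"|\s+et al\.)?)"             # or "et al."
    r"\s*\((\d{4})\)"             # year in parentheses
)

def extract_author_year(text):
    """Return (author, year) pairs found in running text."""
    return AUTHOR_YEAR.findall(text)

sample = "Smith (2020) extended Smith & Jones (2019) and Smith et al. (2018)."
print(extract_author_year(sample))
# [('Smith', '2020'), ('Smith & Jones', '2019'), ('Smith et al. ', '2018')]
```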
- Bidirectional Mapping: Precise matching between in-text citations and reference list
- Context Analysis: Understanding of citation usage in document context
- Tolerance for Variations: Correct matching even with slight formatting differences
- Note: Full matching analysis is available for author-year format citations only
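The bidirectional mapping described above can be sketched with plain set operations. The `(author, year)` keys and sample data below are illustrative only; the real matcher tolerates formatting variations rather than requiring exact key equality:

```python
# Illustrative citation and reference data keyed by (author, year).
citations = [("Smith", "2020"), ("Jones", "2019"), ("Brown", "2021")]
references = {
    ("Smith", "2020"): "Smith, J. (2020). ...",
    ("Jones", "2019"): "Jones, A. (2019). ...",
    ("White", "2017"): "White, K. (2017). ...",
}

matched = [c for c in citations if c in references]
missing_refs = [c for c in citations if c not in references]  # cited, but no entry
uncited = [r for r in references if r not in set(citations)]  # entry, never cited

print(matched)       # [('Smith', '2020'), ('Jones', '2019')]
print(missing_refs)  # [('Brown', '2021')]
print(uncited)       # [('White', '2017')]
```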
- Year Validation: Detection of citation year inconsistencies with reference years
- Format Standardization: Consistent citation formatting across documents
- Quality Assurance: Identification of uncited references and unreferenced citations
- Match Statistics: Citation count statistics and match success rates
- Correction Suggestions: Year inconsistency corrections and format standardization recommendations
- Formatted Citations: Standardized citations according to academic standards
- Intelligent Formatting: AI model-optimized citation formats
- Error Tolerance: Handling of non-standard formats with automatic correction
- Context Understanding: Analysis of citation correctness in context
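As a concrete example of the year validation above, a citation whose year disagrees with its reference entry can be flagged as follows (function name and data structures are hypothetical, not the project's API):

```python
def year_mismatches(citations, reference_years):
    """Flag citations whose cited year differs from the reference entry's year."""
    issues = []
    for author, cited_year in citations:
        ref_year = reference_years.get(author)
        if ref_year is not None and ref_year != cited_year:
            issues.append({"author": author, "cited": cited_year, "reference": ref_year})
    return issues

print(year_mismatches(
    [("Smith", "2020"), ("Jones", "2019")],
    {"Smith": "2021", "Jones": "2019"},
))
# [{'author': 'Smith', 'cited': '2020', 'reference': '2021'}]
```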
- Framework: FastAPI (Python)
- Document Processing: python-docx, PyMuPDF
- AI Services: DashScope, LangChain, OpenAI integration
- Web Interface: HTML/CSS/JavaScript frontend
- API Architecture: RESTful API design with CORS support
- Python 3.8 or higher
- pip package manager
- Internet connection (required for AI-enhanced features; basic citation matching works offline)
1. Clone the repository:

   ```bash
   git clone https://github.com/TashanGKD/TaShan-PaperChecker.git
   cd TaShan-PaperChecker
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables (optional but recommended for AI features). Create a `.env` file in the project root with your API keys:

   ```
   DASHSCOPE_API_KEY=your_dashscope_api_key
   OPENAI_API_KEY=your_openai_api_key
   ```

   Note: The system works without AI API keys, but advanced features such as AI-powered citation extraction and relevance checking will be limited. Basic citation matching for author-year formats will still function.

5. Configure the application (optional). You can modify the default settings in `config/config.py` or add the following options to your `.env` file:

   ```
   SERVER_HOST=0.0.0.0
   SERVER_PORT=8002
   SERVER_RELOAD=true
   TEMP_DIR=temp_uploads
   MAX_UPLOAD_SIZE=10485760  # 10MB in bytes
   API_PREFIX=/api
   ```

6. The application automatically creates the required directories on startup. These directories (`temp_uploads`, `reports_md`, `logs`, `pdf_cache`) are listed in `.gitignore` and are not tracked by Git.
The application can be configured through the `config/config.py` file:

- `server_host`: Host address for the API server (default: "0.0.0.0")
- `server_port`: Port number for the API server (default: 8002)
- `max_upload_size`: Maximum file upload size in bytes (default: 10MB)
- `temp_dir`: Directory for temporary file storage (default: "temp_uploads")
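These options could be read from the environment roughly as sketched below, using the documented defaults; the project's actual `config/config.py` may differ:

```python
def load_settings(env):
    """Read the documented options from an environment-style mapping."""
    return {
        "server_host": env.get("SERVER_HOST", "0.0.0.0"),
        "server_port": int(env.get("SERVER_PORT", "8002")),
        "max_upload_size": int(env.get("MAX_UPLOAD_SIZE", "10485760")),  # 10MB
        "temp_dir": env.get("TEMP_DIR", "temp_uploads"),
    }

print(load_settings({}))                       # all defaults
print(load_settings({"SERVER_PORT": "9000"}))  # one override
```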
```bash
python run_server.py
```

The API server will start on http://localhost:8002 by default.

For production deployment, use uvicorn with multiple workers:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8002 --workers 4
```

- `GET /` - Root endpoint showing API information
- `GET /api/health` - Check service health status
- `POST /api/upload-only` - Upload a document file without processing
- `GET /api/list-all-files` - List all uploaded files in the temp_uploads directory
- `DELETE /api/file?file_path={path}` - Delete a specific file by path
- `POST /api/full-report` - Generate a complete citation compliance report by uploading a file
- `POST /api/full-report-from-path` - Generate a report from a file path, with an optional author format parameter
- `POST /api/extract-citations` - Extract citations from a document (form data input)
- `POST /api/extract-citations-json` - Extract citations from a document (JSON input)
- `POST /api/relevance-check` - Perform a citation relevance check against target content
- `/frontend` - Access the web-based user interface for uploading documents and viewing analysis results
```bash
# Generate a full citation compliance report
curl -X POST "http://localhost:8002/api/full-report" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@path/to/your/document.docx"

# Upload a document without processing
curl -X POST "http://localhost:8002/api/upload-only" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@path/to/your/document.docx"

# List all uploaded files
curl -X GET "http://localhost:8002/api/list-all-files"

# Extract citations from an uploaded file
curl -X POST "http://localhost:8002/api/extract-citations" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx"

# Run a citation relevance check (task_type "文章整体" = whole article)
curl -X POST "http://localhost:8002/api/relevance-check" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx" \
  -d "target_content=Machine learning techniques in NLP" \
  -d "task_type=文章整体" \
  -d "use_full_content=false"

# Generate a report from a file path
curl -X POST "http://localhost:8002/api/full-report-from-path" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx" \
  -d "author_format=full"
```

```python
import requests

# Upload and analyze a document
with open('document.docx', 'rb') as f:
    response = requests.post(
        'http://localhost:8002/api/full-report',
        files={'file': f}
    )
result = response.json()
print(result)

# Upload a file without processing
with open('document.docx', 'rb') as f:
    response = requests.post(
        'http://localhost:8002/api/upload-only',
        files={'file': f}
    )
upload_result = response.json()
print(upload_result)

# List all uploaded files
response = requests.get('http://localhost:8002/api/list-all-files')
files_list = response.json()
print(files_list)

# Extract citations from a file
response = requests.post(
    'http://localhost:8002/api/extract-citations',
    data={'file_path': 'temp_uploads/document.docx'}
)
citations = response.json()
print(citations)
```

```javascript
// Upload and analyze a document
const formData = new FormData();
const fileInput = document.querySelector('#file-input');
formData.append('file', fileInput.files[0]);

fetch('http://localhost:8002/api/full-report', {
  method: 'POST',
  body: formData
})
  .then(response => response.json())
  .then(data => console.log(data));

// List all uploaded files
fetch('http://localhost:8002/api/list-all-files')
  .then(response => response.json())
  .then(data => console.log(data.files));
```

PaperChecker follows a modular architecture with clear separation of concerns:
- Extractor Layer: Handles document parsing and content extraction for various formats (Word, PDF)
- Checker Layer: Performs citation analysis, validation, and compliance checking
- Processor Layer: Orchestrates the end-to-end analysis workflow
- AI Services: Integrates with LLM providers for intelligent document analysis
- Report Generator: Creates comprehensive compliance reports
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   User Client   │────▶│  FastAPI Server  │────▶│   AI Services   │
│                 │     │                  │     │  (DashScope,    │
│  (Browser/App)  │     │  • API Routes    │     │   OpenAI, etc.) │
└─────────────────┘     │  • Request/Resp  │     └─────────────────┘
                        │  • Validation    │
                        └──────────────────┘
                                 │
                        ┌──────────────────┐
                        │   Core Modules   │
                        │  • Extractor     │
                        │  • Checker       │
                        │  • Processor     │
                        │  • Reports       │
                        └──────────────────┘
                                 │
                        ┌──────────────────┐
                        │    Utilities     │
                        │  • File Handler  │
                        │  • Format Utils  │
                        │  • Cache Manager │
                        └──────────────────┘
```
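The layered flow above might be orchestrated roughly as follows. Class and method names are illustrative stand-ins, not the project's actual API:

```python
class Extractor:
    """Stand-in for the Extractor layer: parse a document into parts."""
    def extract(self, path):
        return {"body": "...", "references": ["..."]}

class Checker:
    """Stand-in for the Checker layer: validate citations against references."""
    def check(self, doc):
        return {"matched": 10, "unmatched": 2}

class Processor:
    """Processor layer: orchestrates extract -> check -> report."""
    def __init__(self):
        self.extractor = Extractor()
        self.checker = Checker()

    def run(self, path):
        doc = self.extractor.extract(path)
        result = self.checker.check(doc)
        return {"file": path, **result}

print(Processor().run("temp_uploads/document.docx"))
# {'file': 'temp_uploads/document.docx', 'matched': 10, 'unmatched': 2}
```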
```
PaperChecker/
├── api/                     # API route definitions
├── app/                     # Main application entry point
│   └── main.py              # FastAPI application
├── config/                  # Configuration files
│   └── config.py            # Settings and configuration
├── core/                    # Core processing modules
│   ├── ai/                  # AI-related utilities
│   ├── ai_services/         # AI service integrations
│   ├── checker/             # Citation checking logic
│   ├── extractor/           # Document extraction logic
│   ├── polish/              # Text polishing and enhancement
│   ├── processors/          # Document processing logic
│   └── reports/             # Report generation logic
├── front/                   # Frontend web interface
├── models/                  # Data models and schemas
├── temp_uploads/            # Temporary file storage
├── pdf_cache/               # Cached PDF processing results
├── reports_md/              # Generated report files
├── pids/                    # Process ID files
├── logs/                    # Application logs
├── tests/                   # Test suite
├── utils/                   # Utility functions
├── run_server.py            # Server startup script
├── requirements.txt         # Python dependencies
├── AI_CODING_GUIDELINES.md  # Development guidelines
├── DEPLOYMENT_README.md     # Deployment instructions
├── design.md                # System design documentation
└── README.md                # This file
```
- FastAPI: Modern, fast web framework with async support
- Pydantic: Data validation and settings management
- python-docx: Word document processing
- PyMuPDF: PDF processing capabilities
- LangChain: Framework for developing applications with LLMs
- Tenacity: Retry mechanism for robust operations
- Semantic Scholar API: Academic paper metadata retrieval
- Crossref API: Reference validation and enrichment
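For example, a reference string can be checked against the public Crossref REST API through a bibliographic query. Only the URL construction is shown here; an actual request needs network access (and, per Crossref etiquette, a contact address in the User-Agent):

```python
import urllib.parse

def crossref_query_url(reference, rows=1):
    """Build a Crossref /works query URL for a free-text reference string."""
    params = urllib.parse.urlencode({
        "query.bibliographic": reference,
        "rows": rows,
    })
    return "https://api.crossref.org/works?" + params

url = crossref_query_url("Smith J. Deep parsing of citations. 2020.")
print(url)
```

The JSON response's `message.items` list can then be compared against the paper's reference entry for validation and enrichment.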
Run the test suite:

```bash
pytest tests/
```

We welcome contributions to PaperChecker! Here's how you can contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Run tests to ensure everything works (`pytest tests/`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read our AI Coding Guidelines for best practices on development:
- Each new feature must include corresponding tests
- Follow the "small steps, quick iterations" development approach
- Reduce coupling between modules and increase reusability
- Prioritize using existing code over creating duplicate functionality
- Maintain clear documentation for all public interfaces
- Follow PEP 8 style guide for Python code
- Write clear, descriptive commit messages
- Include docstrings for all public functions and classes
- Add type hints where appropriate
When reporting issues, please include:
- Clear description of the problem
- Steps to reproduce the issue
- Expected vs actual behavior
- Environment details (OS, Python version, etc.)
This project is licensed under the MIT License.
If you encounter any issues or bugs, please open an issue on GitHub with:
- A clear description of the problem
- Steps to reproduce the issue
- Expected vs actual behavior
- Your environment details (OS, Python version, etc.)
For support, you can:
- Open an issue on GitHub
- Check the documentation in this README
- Look at the test examples in the `tests/examples/` directory
This project is developed and maintained by the Agent4S Project Team of the TaShan Interdisciplinary Innovation Association (他山学科交叉创新协会) at the University of Chinese Academy of Sciences (中国科学院大学).
- Association: TaShan Interdisciplinary Innovation Association
- Website: tashan.ac.cn
- Project: Agent4S - AI-powered Academic Tools
- Research Paper: Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models (arXiv:2506.23692)
- Built with FastAPI for high-performance API development
- Uses advanced AI models for intelligent document analysis
- Inspired by the need for better academic writing tools
If this project helps you or your organization, consider supporting it:
- Star this repository
- Share it with others who might benefit
- Contribute code, documentation, or ideas
- Sponsor the maintainers through GitHub Sponsors or other channels
For questions, suggestions, or support, feel free to:
- Open an issue on GitHub
- Email us at: tashanxkjc@163.com
- Visit our website: tashan.ac.cn
WeChat Official Account (微信公众号)
Scan the QR code above to follow our WeChat Official Account for updates and news.
Douyin (ๆ้ณ)
Search "他山学科交叉创新协会" (TaShan Interdisciplinary Innovation Association) on Douyin to find our Agent4S course videos and tutorials.
Learn More About Agent4S
Read our comprehensive survey paper: Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models
