Boyuan-Zheng/TaShan-PaperChecker-main

PaperChecker - Citation Compliance Checker

PaperChecker is a citation compliance checker for academic papers. It automatically analyzes the in-text citations and reference list of a document and flags mismatches, missing citations, and format inconsistencies.

Developed by: Agent4S Project Team, TaShan Interdisciplinary Innovation Association, University of Chinese Academy of Sciences
Website: tashan.ac.cn

🚀 Features

Document Processing

  • Supported Formats: Word documents (.docx, .doc) and PDF files
  • File Size Limit: Up to 10MB per document
  • Smart Parsing: Automatic identification of document structure, extracting main content and reference sections
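
The format and size limits above can be checked on the client before uploading. The following is a minimal sketch, not the project's own validation code: the function name `check_upload` and the constants are hypothetical, and the limits are simply the figures quoted in the list above.

```python
from pathlib import Path
from typing import List

# Assumed limits, copied from the feature list above (10 MB; .docx, .doc, .pdf).
MAX_UPLOAD_SIZE = 10 * 1024 * 1024
ALLOWED_SUFFIXES = {".docx", ".doc", ".pdf"}


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_UPLOAD_SIZE:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_UPLOAD_SIZE}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a client report every issue at once.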

Citation Recognition

The system recognizes citations in academic papers and matches them against the reference list. The following formats are currently supported:

| Citation Format | Support Level | Examples | Notes |
|---|---|---|---|
| Author-Year (Chinese) | ✅ Full Support | 张三（2024）, 李四 等（2020） | Complete citation-reference matching and validation |
| Author-Year (English) | ✅ Full Support | Smith (2020), Smith & Jones (2019), Smith et al. (2018) | Complete citation-reference matching and validation |
| GB/T 7714-2015 著者-出版年制 (author-year system) | ✅ Full Support | Same as author-year formats above | This is the primary format this tool is designed for |
| Numeric Sequential | ⚠️ Partial Support | [1], [2], [15], [1-3] (range) | Can extract and identify, but does not perform citation-reference matching validation |
| GB/T 7714-2015 顺序编码制 (sequential numbering system) | ⚠️ Partial Support | Same as numeric sequential | Can extract and identify only |
| IEEE (numeric) | ⚠️ Partial Support | [1], [2] (bracket style only) | Can extract bracket-style numbers; superscript numbers (e.g., text¹) are not supported |
| APA | ⚠️ Partial Support | Basic author-year only | Only supports basic author-year format; page numbers and advanced features not supported |
| MLA | ❌ Not Supported | - | Planned but not implemented |
| Chicago | ❌ Not Supported | - | Planned but not implemented |

Best Results: This tool works best with papers using an author-year citation format (GB/T 7714-2015 著者-出版年制, the author-year system, or similar styles). For papers using numeric citation systems, the tool can identify citations but cannot perform comprehensive matching analysis.
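
To make the difference between the two citation families concrete, here is a rough sketch of how author-year and bracketed numeric citations could be recognized with regular expressions. The patterns and the function `find_citations` are illustrative assumptions, not the project's actual extraction rules. In particular, the Chinese pattern will absorb any Chinese characters that directly precede the author name in running text.

```python
import re

# Illustrative patterns only; the project's real extraction rules may differ.
# Author-year: "Smith (2020)", "Smith & Jones (2019)", "Smith et al. (2018)",
# and Chinese forms with full-width parentheses such as "张三（2024）".
AUTHOR_YEAR = re.compile(
    r"(?P<authors>[A-Z][A-Za-z\-]+(?:\s+(?:&|and)\s+[A-Z][A-Za-z\-]+|\s+et al\.)?"
    r"|[\u4e00-\u9fff]{2,4}(?:\s*等)?)"
    r"\s*[(（](?P<year>\d{4})[)）]"
)
# Numeric: "[1]", "[15]", and ranges such as "[1-3]".
NUMERIC = re.compile(r"\[(?P<start>\d+)(?:-(?P<end>\d+))?\]")


def find_citations(text):
    """Return (author_year, numeric) citation lists found in text."""
    author_year = [
        (m.group("authors"), int(m.group("year")))
        for m in AUTHOR_YEAR.finditer(text)
    ]
    numeric = []
    for m in NUMERIC.finditer(text):
        start = int(m.group("start"))
        end = int(m.group("end") or start)  # a single number is a range of one
        numeric.extend(range(start, end + 1))
    return author_year, numeric
```

An author-year hit carries a name and a year that can be compared against a reference entry. A numeric hit carries only a position in the list, which is why extracting it is easier than validating it.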

Intelligent Matching (for Author-Year Format)

  • Bidirectional Mapping: Precise matching between in-text citations and reference list
  • Context Analysis: Understanding of citation usage in document context
  • Tolerance for Variations: Correct matching even with slight formatting differences
  • Note: Full matching analysis is available for author-year format citations only
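
As a rough illustration of bidirectional mapping with tolerance for formatting variations, the sketch below reduces each citation and reference to a normalized (first author, year) key and then maps in both directions. It assumes citations and references have already been parsed into `(authors, year)` pairs. The helpers `citation_key` and `match_bidirectional` are hypothetical names, and the project's real matcher is likely more elaborate.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (author string, year); an assumed intermediate form

# Everything after the first author is ignored: "& Jones", "et al.", "等", commas.
_AFTER_FIRST_AUTHOR = re.compile(r"\s+(?:&|and\b|et al\.?)|\s*等|[,，]")


def citation_key(authors: str, year: int) -> Tuple[str, int]:
    """Normalize an author string to (lower-cased first author, year)."""
    first = _AFTER_FIRST_AUTHOR.split(authors.strip())[0]
    return (first.strip().lower(), year)


def match_bidirectional(citations: List[Entry], references: List[Entry]):
    """Return (citation index -> reference indices, unreferenced citations, uncited references)."""
    ref_index: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for i, (authors, year) in enumerate(references):
        ref_index[citation_key(authors, year)].append(i)

    cite_to_refs: Dict[int, List[int]] = {}
    cited = set()
    for j, (authors, year) in enumerate(citations):
        hits = ref_index.get(citation_key(authors, year), [])
        cite_to_refs[j] = hits
        cited.update(hits)

    unreferenced = [j for j, hits in cite_to_refs.items() if not hits]
    uncited = [i for i in range(len(references)) if i not in cited]
    return cite_to_refs, unreferenced, uncited
```

Keying on the first author and year is a deliberately loose choice. It tolerates differences such as "Smith et al." versus a full author list, at the cost of ambiguity when one author has several works in the same year.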

Automated Verification & Correction

  • Year Validation: Detection of citation year inconsistencies with reference years
  • Format Standardization: Consistent citation formatting across documents
  • Quality Assurance: Identification of uncited references and unreferenced citations
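
Year validation can be pictured as follows: when a citation's first author appears in the reference list but only under different years, the citation is flagged and the known years are offered as a suggestion. This is a simplified sketch under that assumption. The names `first_author` and `find_year_mismatches` are hypothetical and the output shape is invented for illustration.

```python
def first_author(authors: str) -> str:
    """Crude first-author normalization for English author strings."""
    cleaned = authors.replace("et al.", "")
    return cleaned.split("&")[0].split(",")[0].strip().lower()


def find_year_mismatches(citations, references):
    """Flag citations whose first author is listed in the references only under other years."""
    years_by_author = {}
    for authors, year in references:
        years_by_author.setdefault(first_author(authors), set()).add(year)

    issues = []
    for authors, year in citations:
        known = years_by_author.get(first_author(authors))
        if known and year not in known:
            issues.append({
                "citation": f"{authors} ({year})",
                "suggested_years": sorted(known),
            })
    return issues
```

A citation whose author is absent from the reference list altogether is not reported here, since that is an unreferenced citation rather than a year inconsistency.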

Comprehensive Reporting

  • Match Statistics: Citation count statistics and match success rates
  • Correction Suggestions: Year inconsistency corrections and format standardization recommendations
  • Formatted Citations: Standardized citations according to academic standards

AI-Powered Optimization

  • Intelligent Formatting: AI model-optimized citation formats
  • Error Tolerance: Handling of non-standard formats with automatic correction
  • Context Understanding: Analysis of citation correctness in context

🛠️ Technical Stack

  • Framework: FastAPI (Python)
  • Document Processing: python-docx, PyMuPDF
  • AI Services: DashScope, LangChain, OpenAI integration
  • Web Interface: HTML/CSS/JavaScript frontend
  • API Architecture: RESTful API design with CORS support

📋 Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Internet connection (required for AI-enhanced features; basic citation matching works offline)

🚀 Installation

  1. Clone the repository:

    git clone https://github.com/TashanGKD/TaShan-PaperChecker.git
    cd TaShan-PaperChecker
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (optional but recommended for AI features): Create a .env file in the project root with your API keys:

    DASHSCOPE_API_KEY=your_dashscope_api_key
    OPENAI_API_KEY=your_openai_api_key

    Note: The system can work without AI API keys, but some advanced features like AI-powered citation extraction and relevance checking will be limited. Basic citation matching for author-year format will still function.

  5. Configure the application (optional): You can modify the default settings in config/config.py or create a .env file with the following options:

    SERVER_HOST=0.0.0.0
    SERVER_PORT=8002
    SERVER_RELOAD=true
    TEMP_DIR=temp_uploads
    MAX_UPLOAD_SIZE=10485760  # 10MB in bytes
    API_PREFIX=/api
  6. The application will automatically create required directories on startup. These directories (temp_uploads, reports_md, logs, pdf_cache) are included in .gitignore and will not be tracked by Git.

⚙️ Configuration

The application can be configured through the config/config.py file:

  • server_host: Host address for the API server (default: "0.0.0.0")
  • server_port: Port number for the API server (default: 8002)
  • max_upload_size: Maximum file upload size in bytes (default: 10MB)
  • temp_dir: Directory for temporary file storage (default: "temp_uploads")
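
The relationship between these settings and the environment variables shown in the installation steps can be sketched as below. This is an assumption-laden illustration using only the standard library. The real config/config.py may be organized differently (for example with Pydantic settings), and `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the values documented above.
    server_host: str = "0.0.0.0"
    server_port: int = 8002
    max_upload_size: int = 10 * 1024 * 1024  # 10MB in bytes
    temp_dir: str = "temp_uploads"


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return Settings(
        server_host=env.get("SERVER_HOST", Settings.server_host),
        server_port=int(env.get("SERVER_PORT", Settings.server_port)),
        max_upload_size=int(env.get("MAX_UPLOAD_SIZE", Settings.max_upload_size)),
        temp_dir=env.get("TEMP_DIR", Settings.temp_dir),
    )
```

Passing the environment in as a mapping keeps the loader easy to test without touching the real process environment.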

🏃‍♂️ Running the Application

Development Mode

python run_server.py

The API server will start on http://localhost:8002 by default.

Production Mode

For production deployment, use uvicorn with multiple workers:

uvicorn app.main:app --host 0.0.0.0 --port 8002 --workers 4

🌐 API Endpoints

Health Check

  • GET / - Root endpoint showing API information
  • GET /api/health - Check service health status

File Operations

  • POST /api/upload-only - Upload a document file without processing
  • GET /api/list-all-files - List all uploaded files in the temp_uploads directory
  • DELETE /api/file?file_path={path} - Delete a specific file by path

Citation Analysis

  • POST /api/full-report - Generate complete citation compliance report by uploading a file
  • POST /api/full-report-from-path - Generate report using file path with optional author format parameter
  • POST /api/extract-citations - Extract citations from document (form data input)
  • POST /api/extract-citations-json - Extract citations from document (JSON input)
  • POST /api/relevance-check - Perform citation relevance check with target content
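
The JSON variant of citation extraction has no example in this README, so the sketch below shows one plausible way to call it from Python with the standard library. The request body is an assumption: it reuses the `file_path` field name that the form-based /api/extract-citations endpoint accepts, and the actual JSON schema should be confirmed against the API routes. `build_extract_request` is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # default development address from this README


def build_extract_request(file_path: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /api/extract-citations-json."""
    # ASSUMPTION: the JSON body uses a "file_path" key, mirroring the form field.
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/extract-citations-json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it against a running server:
# with urllib.request.urlopen(build_extract_request("temp_uploads/document.docx")) as resp:
#     print(json.load(resp))
```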

Frontend Access

  • /frontend - Access the web-based user interface for uploading documents and viewing analysis results

💡 Usage Examples

Using cURL

Upload and analyze a document

curl -X POST "http://localhost:8002/api/full-report" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@path/to/your/document.docx"

Upload a file without processing

curl -X POST "http://localhost:8002/api/upload-only" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@path/to/your/document.docx"

List uploaded files

curl -X GET "http://localhost:8002/api/list-all-files"

Extract citations from a file

curl -X POST "http://localhost:8002/api/extract-citations" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx"

Perform relevance check

curl -X POST "http://localhost:8002/api/relevance-check" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx" \
  -d "target_content=Machine learning techniques in NLP" \
  -d "task_type=文章整体" \
  -d "use_full_content=false"

Generate report from file path

curl -X POST "http://localhost:8002/api/full-report-from-path" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "file_path=temp_uploads/document.docx" \
  -d "author_format=full"

Python Client Example

import requests

# Upload and analyze a document
with open('document.docx', 'rb') as f:
    response = requests.post(
        'http://localhost:8002/api/full-report',
        files={'file': f}
    )

result = response.json()
print(result)

# Upload a file without processing
with open('document.docx', 'rb') as f:
    response = requests.post(
        'http://localhost:8002/api/upload-only',
        files={'file': f}
    )

upload_result = response.json()
print(upload_result)

# List all uploaded files
response = requests.get('http://localhost:8002/api/list-all-files')
files_list = response.json()
print(files_list)

# Extract citations from a file
response = requests.post(
    'http://localhost:8002/api/extract-citations',
    data={'file_path': 'temp_uploads/document.docx'}
)
citations = response.json()
print(citations)

JavaScript/Fetch Example

// Upload and analyze a document
const formData = new FormData();
const fileInput = document.querySelector('#file-input');
formData.append('file', fileInput.files[0]);

fetch('http://localhost:8002/api/full-report', {
  method: 'POST',
  body: formData
})
.then(response => response.json())
.then(data => console.log(data));

// List all uploaded files
fetch('http://localhost:8002/api/list-all-files')
.then(response => response.json())
.then(data => console.log(data.files));

🏗️ Technical Architecture

PaperChecker follows a modular architecture with clear separation of concerns:

Core Components

  • Extractor Layer: Handles document parsing and content extraction for various formats (Word, PDF)
  • Checker Layer: Performs citation analysis, validation, and compliance checking
  • Processor Layer: Orchestrates the end-to-end analysis workflow
  • AI Services: Integrates with LLM providers for intelligent document analysis
  • Report Generator: Creates comprehensive compliance reports
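
The division of labor among these components can be illustrated with a toy pipeline in which a processor function runs an extractor, passes its output to each checker, and collects the findings into a report. All names here (`Report`, `run_pipeline`) are hypothetical and chosen only to mirror the layer names above. The project's actual classes and interfaces are not shown in this README.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Report:
    """Stand-in for the Report Generator's output."""
    citations: List[str] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)


def run_pipeline(
    document_text: str,
    extract: Callable[[str], List[str]],
    checks: Iterable[Callable[[List[str]], List[str]]],
) -> Report:
    """Processor role: run the extractor, then every checker, and gather the results."""
    citations = extract(document_text)
    report = Report(citations=citations)
    for check in checks:
        report.issues.extend(check(citations))
    return report
```

Because the extractor and checkers are passed in as plain callables, each layer can be swapped or tested in isolation, which is the kind of loose coupling the modular layout aims for.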

System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   User Client   │───▶│  FastAPI Server  │───▶│  AI Services    │
│                 │    │                  │    │ (DashScope,     │
│ (Browser/App)   │    │ • API Routes     │    │  OpenAI, etc.)  │
└─────────────────┘    │ • Request/Resp   │    └─────────────────┘
                       │ • Validation     │
                       └──────────────────┘
                                │
                       ┌──────────────────┐
                       │  Core Modules    │
                       │ • Extractor      │
                       │ • Checker        │
                       │ • Processor      │
                       │ • Reports        │
                       └──────────────────┘
                                │
                       ┌──────────────────┐
                       │  Utilities       │
                       │ • File Handler   │
                       │ • Format Utils   │
                       │ • Cache Manager  │
                       └──────────────────┘

Project Structure

PaperChecker/
├── api/                    # API route definitions
├── app/                    # Main application entry point
│   └── main.py             # FastAPI application
├── config/                 # Configuration files
│   └── config.py           # Settings and configuration
├── core/                   # Core processing modules
│   ├── ai/                 # AI-related utilities
│   ├── ai_services/        # AI service integrations
│   ├── checker/            # Citation checking logic
│   ├── extractor/          # Document extraction logic
│   ├── polish/             # Text polishing and enhancement
│   ├── processors/         # Document processing logic
│   └── reports/            # Report generation logic
├── front/                  # Frontend web interface
├── models/                 # Data models and schemas
├── temp_uploads/           # Temporary file storage
├── pdf_cache/              # Cached PDF processing results
├── reports_md/             # Generated report files
├── pids/                   # Process ID files
├── logs/                   # Application logs
├── tests/                  # Test suite
├── utils/                  # Utility functions
├── run_server.py           # Server startup script
├── requirements.txt        # Python dependencies
├── AI_CODING_GUIDELINES.md # Development guidelines
├── DEPLOYMENT_README.md    # Deployment instructions
├── design.md               # System design documentation
└── README.md               # This file

Key Technologies Used

  • FastAPI: Modern, fast web framework with async support
  • Pydantic: Data validation and settings management
  • python-docx: Word document processing
  • PyMuPDF: PDF processing capabilities
  • LangChain: Framework for developing applications with LLMs
  • Tenacity: Retry mechanism for robust operations
  • Semantic Scholar API: Academic paper metadata retrieval
  • Crossref API: Reference validation and enrichment

🧪 Testing

Run the test suite:

pytest tests/

🤝 Contributing

We welcome contributions to PaperChecker! Here's how you can contribute:

Getting Started

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run tests to ensure everything works (pytest tests/)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

Development Guidelines

Please read our AI Coding Guidelines for best practices on development:

  • Each new feature must include corresponding tests
  • Follow the "small steps, quick iterations" development approach
  • Reduce coupling between modules and increase reusability
  • Prioritize using existing code over creating duplicate functionality
  • Maintain clear documentation for all public interfaces

Code Standards

  • Follow PEP 8 style guide for Python code
  • Write clear, descriptive commit messages
  • Include docstrings for all public functions and classes
  • Add type hints where appropriate

Reporting Issues

When reporting issues, please include:

  • Clear description of the problem
  • Steps to reproduce the issue
  • Expected vs actual behavior
  • Environment details (OS, Python version, etc.)

📄 License

This project is licensed under the MIT License.

🐛 Issues and Bug Reports

If you encounter any issues or bugs, please open an issue on GitHub with:

  • A clear description of the problem
  • Steps to reproduce the issue
  • Expected vs actual behavior
  • Your environment details (OS, Python version, etc.)

🆘 Support

For support, you can:

  • Open an issue on GitHub
  • Check the documentation in this README
  • Look at the test examples in the tests/examples/ directory

🙏 Acknowledgments

Development Team

This project is developed and maintained by the Agent4S Project Team of the TaShan Interdisciplinary Innovation Association (他山学科交叉创新协会) at the University of Chinese Academy of Sciences (中国科学院大学).

Technical Acknowledgments

  • Built with FastAPI for high-performance API development
  • Uses advanced AI models for intelligent document analysis
  • Inspired by the need for better academic writing tools

🤝 Support the Project

If this project helps you or your organization, consider supporting it:

  • Star this repository
  • Share it with others who might benefit
  • Contribute code, documentation, or ideas
  • Sponsor the maintainers through GitHub Sponsors or other channels

📞 Contact

For questions, suggestions, or support, reach us through the channels below.

Follow Us

WeChat Official Account (微信公众号)

WeChat QR Code

Scan the QR code above to follow our WeChat Official Account for updates and news.

Douyin (抖音)

Search "他山学科交叉创新协会" on Douyin to find our Agent4S course videos and tutorials.

Learn More About Agent4S

Read our comprehensive survey paper: Agent4S: The Transformation of Research Paradigms from the Perspective of Large Language Models
