BioChemInsight is a powerful platform that automates the extraction of chemical structures and their corresponding bioactivity data from scientific literature. By leveraging deep learning for image recognition and OCR, it streamlines the creation of high-quality, structured datasets for cheminformatics, machine learning, and drug discovery research.
- Automated Data Extraction: Automatically identifies and extracts compound structures and biological activity data (e.g., IC50, EC50, Ki) from PDF documents.
- Advanced Recognition Core: Utilizes state-of-the-art DECIMER Segmentation models for image analysis and PaddleOCR for robust text recognition.
- `dots_ocr` as OCR Engine: For significantly improved OCR performance, you can use `dots_ocr` as the OCR engine. Refer to `DOCKER_DOTS_OCR/README.md` for setup and configuration. Note: running `dots_ocr` on an RTX 5090 GPU requires approximately 30 GB of VRAM.
- Recommended Visual Model: GLM-V4.5 is recommended for optimal results.
- Multiple SMILES Engines: Offers seamless switching between MolScribe, MolVec, and MolNexTR to convert chemical diagrams into SMILES strings.
- Flexible Page Selection: Process specific, non-continuous pages (e.g., "1-3, 5, 7-9, 12"), saving time and computational resources.
- Structured Data Output: Converts unstructured text and images into analysis-ready formats such as CSV and Excel.
- Modern Web UI: A React-based frontend with a FastAPI backend for intuitive PDF processing, real-time progress tracking, and interactive result visualization.
- Intelligent Data Merging: Automatically merges structure and bioactivity data based on compound IDs, providing seamlessly integrated results.
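As a concrete illustration of the flexible page selection, a specification such as `"1-3, 5, 7-9, 12"` can be expanded into individual page numbers with a few lines of Python. This is a hypothetical sketch; the project's actual parser may differ:

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1-3, 5, 7-9, 12" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1-3, 5, 7-9, 12"))  # → [1, 2, 3, 5, 7, 8, 9, 12]
```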
- AI/ML Model Training: Generate high-quality datasets for training predictive models in cheminformatics and bioinformatics.
- Drug Discovery: Accelerate structure-activity relationship (SAR) studies and lead compound optimization.
- Automated Literature Mining: Drastically reduce the manual effort and time required for curating data from scientific articles.
BioChemInsight employs a multi-stage pipeline to convert raw PDFs into structured data:
- PDF Preprocessing: The input PDF is split into individual pages, which are then converted into high-resolution images for analysis.
- Structure Detection: DECIMER Segmentation scans the images to locate and isolate chemical structure diagrams.
- SMILES Conversion: The selected recognition engine (MolScribe, MolVec, or MolNexTR) converts the isolated diagrams into machine-readable SMILES strings.
- Identifier Recognition: A visual model (recommended: GLM-V4.5) recognizes the compound identifiers (e.g., "Compound 1", "2a") associated with each structure.
- Bioactivity Extraction: PaddleOCR (or `dots_ocr`, if configured) extracts text from the specified assay pages, and large language models help parse and standardize the bioactivity results.
- Data Integration: All extracted information (compound IDs, SMILES strings, and bioactivity data) is merged into structured files (CSV/Excel) for download and downstream analysis.
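The final Data Integration stage amounts to a join on compound IDs. The sketch below shows the idea with plain dictionaries; the field names `compound_id` and `smiles` are illustrative stand-ins, not the pipeline's actual schema:

```python
def merge_by_compound_id(structures, assays):
    """Join SMILES records with bioactivity records on their shared compound ID."""
    activity_by_id = {row["compound_id"]: row for row in assays}
    merged = []
    for s in structures:
        row = dict(s)  # carries compound_id and smiles
        extra = activity_by_id.get(s["compound_id"], {})
        row.update({k: v for k, v in extra.items() if k != "compound_id"})
        merged.append(row)
    return merged

structures = [{"compound_id": "1", "smiles": "CCO"}]
assays = [{"compound_id": "1", "FRET EC50": "12 nM"}]
print(merge_by_compound_id(structures, assays))
# → [{'compound_id': '1', 'smiles': 'CCO', 'FRET EC50': '12 nM'}]
```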
```bash
git clone https://github.com/dahuilangda/BioChemInsight
cd BioChemInsight
```

The project requires a `constants.py` file for environment variables and paths. A template is provided.

```bash
# Rename the example file
mv constants_example.py constants.py
```

Then edit `constants.py` to set your API keys, model paths, and other necessary configurations.
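For orientation, a minimal `constants.py` might look like the following. The `*_SERVER_URL` and `DEFAULT_OCR_ENGINE` names appear elsewhere in this README; the URLs, ports, key, and model entries are illustrative placeholders only and must be adapted to your setup:

```python
# constants.py -- example values only; adapt every entry to your environment

# OCR microservices (variable names referenced in this README; URLs are placeholders)
PADDLEOCR_SERVER_URL = "http://localhost:8866"
DOTSOCR_SERVER_URL = "http://localhost:8867"
DEFAULT_OCR_ENGINE = "paddleocr"  # or "dots_ocr"

# Illustrative placeholders for API keys and model settings
LLM_API_KEY = "sk-your-key-here"
VISUAL_MODEL = "GLM-V4.5"  # recommended visual model
```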
```bash
conda install -c conda-forge mamba
mamba create -n chem_ocr python=3.10
conda activate chem_ocr
```

First, install PyTorch with CUDA support.

```bash
# Install CUDA tools and PyTorch
mamba install -c nvidia -c conda-forge cudatoolkit=11.8
# The PyTorch CUDA wheel index must be the primary index for this step;
# adding a mirror with -i would override it, so no mirror is used here.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

Next, install the remaining Python packages.
```bash
# Install core libraries (using a mirror for faster downloads)
pip install decimer-segmentation molscribe -i https://pypi.tuna.tsinghua.edu.cn/simple
mamba install -c conda-forge jupyter pytesseract transformers
pip install PyMuPDF PyPDF2 openai -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install web service dependencies
pip install fastapi uvicorn -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install Node.js and npm (for the frontend)
# On Ubuntu/Debian:
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs

# On macOS (using Homebrew):
# brew install node
```

BioChemInsight can be operated via an interactive web interface or directly from the command line.
Important: Start at least one OCR microservice before launching the pipeline. Choose between:

- `DOCKER_PADDLE_OCR` (recommended) for the PaddleOCR service; set `PADDLEOCR_SERVER_URL` in `constants.py`.
- `DOCKER_DOTS_OCR` for the DotsOCR service; set `DOTSOCR_SERVER_URL` in `constants.py`.

You can run both and switch between them by updating `DEFAULT_OCR_ENGINE`.
The modern React-based web interface provides an intuitive platform for processing documents with real-time progress tracking.
Step 1: Start the Backend API Server
From the project root directory, run:
```bash
uvicorn frontend.backend.main:app --host 0.0.0.0 --port 8000 --reload
```

Step 2: Start the Frontend Development Server
In a new terminal, run:
```bash
cd frontend/ui
npm install
NODE_OPTIONS="--max-old-space-size=8196" npm run dev
```

Step 3: Access the Interface
Open http://localhost:5173 in your web browser to access the interface. The backend API will be available at http://localhost:8000.
- PDF Upload: Upload and manage PDF files through the intuitive interface.
- Visual Page Selection: Click on page thumbnails to select pages for structure and assay extraction.
- Step-by-Step Processing:
- Step 1: Upload PDF and preview pages
- Step 2: Extract chemical structures with real-time progress
- Step 3: Extract bioactivity data with structure-constrained compound matching
- Step 4: Review and download merged results
- Real-time Progress Tracking: Monitor extraction progress with detailed status updates.
- Interactive Results: View, edit, and download structured data with integrated compound-activity matching.
- Automatic Data Merging: Seamlessly combines structure and bioactivity data based on compound IDs.
For batch processing and automation, the CLI is recommended.
The new syntax supports flexible, non-continuous page selections.
```bash
python pipeline.py data/sample.pdf \
    --structure-pages "242-267" \
    --assay-pages "270-272" \
    --assay-names "FRET EC50" \
    --engine molnextr \
    --output output
```

Flexible Page Selection Examples:

- Extract structures from non-continuous pages:

  ```bash
  python pipeline.py data/sample.pdf --structure-pages "242-250,255,260-267" --engine molnextr --output output
  ```

- Extract multiple assays from scattered pages:

  ```bash
  python pipeline.py data/sample.pdf --structure-pages "242-267" --assay-pages "30,35,270-272" --assay-names "IC50,FRET EC50" --engine molnextr --output output
  ```
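When scripting batch runs, it can help to assemble the CLI invocation programmatically. The helper below mirrors the argument names from the examples above; the helper itself is a hypothetical convenience, not part of the project:

```python
import shlex

def pipeline_command(pdf, structure_pages, assay_pages=None, assay_names=None,
                     engine="molnextr", output="output"):
    """Assemble a pipeline.py command line matching the CLI examples above."""
    cmd = ["python", "pipeline.py", pdf, "--structure-pages", structure_pages]
    if assay_pages:
        cmd += ["--assay-pages", assay_pages]
    if assay_names:
        cmd += ["--assay-names", assay_names]
    cmd += ["--engine", engine, "--output", output]
    return cmd

# Print a shell-safe command string (note the quoting around "FRET EC50")
print(shlex.join(pipeline_command("data/sample.pdf", "242-267",
                                  assay_pages="270-272", assay_names="FRET EC50")))
```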
For backward compatibility, the original start/end page syntax remains available.
```bash
python pipeline.py data/sample.pdf \
    --structure-start-page 242 \
    --structure-end-page 267 \
    --assay-start-page 270 \
    --assay-end-page 272 \
    --assay-names "FRET EC50" \
    --engine molnextr \
    --output output
```

The platform generates the following structured data files in the specified output directory:

- `structures.csv`: the detected compound identifiers and their corresponding SMILES representations.
- `assay_data.json`: the raw extracted bioactivity data for each specified assay.
- `merged.csv`: a combined file that merges chemical structures with their associated bioactivity data.
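Once a run finishes, the merged output can be loaded for downstream analysis with the standard library alone. The column names used below are stand-ins for whatever your run actually produces:

```python
import csv
from pathlib import Path

def load_merged(path):
    """Read merged.csv into a list of dicts for downstream analysis."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))

# Tiny stand-in file for demonstration (real columns depend on your run)
Path("merged.csv").write_text("compound_id,smiles,FRET EC50\n1,CCO,12 nM\n")
rows = load_merged("merged.csv")
print(rows[0]["smiles"])  # → CCO
```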
Deploy BioChemInsight in a containerized environment for consistency and portability.
```bash
docker build -t biocheminsight .
```

Option A: Launch the Web App (Default)

Run this command to start both the React frontend and FastAPI backend services.

```bash
docker run --rm -d --gpus all \
    -p 3000:3000 -p 8000:8000 \
    -e http_proxy="" \
    -e https_proxy="" \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/output:/app/output \
    -v $(pwd)/constants.py:/app/constants.py \
    biocheminsight
```

After launching, access the UI by visiting:

- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
Option B: Run the Command-Line Pipeline
If you need to execute a batch job using the CLI, override the default entrypoint with `--entrypoint python` and pass `pipeline.py` and its arguments after the image name.
```bash
docker run --rm --gpus all \
    -e http_proxy="" \
    -e https_proxy="" \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/output:/app/output \
    --entrypoint python \
    biocheminsight \
    pipeline.py data/sample.pdf \
    --structure-pages "242-267" \
    --assay-pages "270-272" \
    --assay-names "FRET EC50" \
    --engine molnextr \
    --output output
```

Option C: Enter the Container for an Interactive Session
To debug or run commands manually inside the container:
```bash
docker run --gpus all -it --rm \
    --entrypoint /bin/bash \
    -e http_proxy="" \
    -e https_proxy="" \
    -v $(pwd)/data:/app/data \
    -v $(pwd)/output:/app/output \
    --name biocheminsight_container \
    biocheminsight
```