A complete pipeline for extracting archaeological site information from academic papers and generating multi-channel remote sensing datasets for machine learning-based site detection.
This repository combines two complementary workflows designed to support reinforcement learning and deep learning approaches to archaeological site discovery:
- Site Extraction Pipeline - Extract archaeological site coordinates from academic PDFs using LLMs
- Dataset Generation Pipeline - Generate multi-channel satellite imagery datasets from known site locations
Academic Papers (PDFs)
↓
[Step 1: LLM Extraction]
↓
Site Coordinates (JSON/CSV)
↓
[Step 2: Dataset Generation]
↓
Multi-Channel Remote Sensing Dataset
↓
[Your RL/ML Model]
Project Origins: This work was developed as part of our participation in the OpenAI to Z Challenge on Kaggle, where we explored AI-powered archaeological discovery in the Amazon. Our competition writeup detailing the approach is available here. The pipeline has since evolved into a comprehensive framework for digitizing archaeological knowledge and preparing it for machine learning applications. Our approach was presented at CAA UK 2025 (Computer Applications and Quantitative Methods in Archaeology) held at the University of Cambridge, December 9-10, 2025.
Conference Materials:
A pre-generated dataset created using this pipeline is available on Hugging Face:
🤗 Archaeological Sites Dataset (CAA UK 2025)
The dataset provides multi-channel remote sensing data (Sentinel-2 + FABDEM + spectral indices) with balanced positive/negative samples and augmentations for training archaeological site detection models.
Below are example visualizations from each category in the dataset. Each sample is a 100×100 pixel multi-channel image with 11 bands (RGB composite shown for visualization).
Categories:
- Positives: Known archaeological sites (geoglyphs, mounds, settlements)
- Integrated Negatives: Areas surrounding positive sites, spatially close but archaeologically empty
- Landcover Negatives: Diverse landscapes (urban, water, cropland) to improve model robustness
- Unlabeled: Background samples for semi-supervised learning approaches
Important: Each step works as a standalone tool - you can use either one independently or combine them for the full workflow.
Step 1 only:
- Digitize archaeological site data from legacy publications
- Extract coordinates and metadata from PDFs
- Download satellite imagery for specific sites
- Export site databases for GIS or other applications
- No need to install Step 2 dependencies
Step 2 only:
- Generate training datasets from your own fieldwork coordinates
- Process site data from existing databases or catalogs
- Create balanced ML datasets from any site coordinate source
- No need to install Step 1 dependencies or OpenAI API
Both steps together:
- Complete end-to-end pipeline from literature to ML-ready datasets
- Seamless data handoff between extraction and generation
- Ideal for comprehensive archaeological ML projects
CAA_UK_2025/
│
├── 1_site_extraction/ # Step 1: Extract sites from academic papers
│ ├── app.py # Flask web server with LLM + GEE integration
│ ├── templates/
│ ├── static/
│ ├── .gitignore # Step 1 specific ignores
│ ├── requirements.txt # Step 1 dependencies
│ ├── CAA_UK_2025_Site_Extraction_Demo.ipynb # Demo Notebook
│ └── README.md # Detailed documentation for Step 1
│
├── 2_dataset_generation/ # Step 2: Generate training datasets
│ ├── config/
│ ├── scripts/
│ ├── src/
│ ├── .gitignore # Step 2 specific ignores
│ ├── requirements.txt # Step 2 dependencies
│ ├── README.md # Detailed documentation for Step 2
│ └── run_pipeline.py
│
├── README.md # This file
└── LICENSE # MIT License
Note: Each step maintains its own .gitignore and requirements.txt for independence and modularity.
- 📄 PDF text extraction from archaeological publications
- 🤖 LLM-powered analysis (GPT-4o) to identify sites and coordinates
- 🗺️ Multiple coordinate format support (DMS, decimal degrees, etc.)
- 🛰️ Satellite data download for extracted sites (Sentinel-2 + terrain)
- 🌐 Web interface for easy interaction
- 🎯 Balanced dataset creation with positives, negatives, and unlabeled samples
- 🔄 Geometric augmentation via rotation (configurable: 3x, 4x, 6x, 12x)
- 🌈 Radiometric augmentation for lighting/contrast variation
- 📊 11-channel data (6 spectral bands + 3 indices + elevation + slope)
- 📦 Production-ready format with metadata and validation tools
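To make the 11-channel layout concrete, the sketch below assembles one sample from 6 spectral bands, 3 indices, elevation, and slope using NumPy. It is an illustration only: the band order, the choice of NDVI, NDWI, and NDMI as the three indices, and the 10 m pixel size used for slope are assumptions, not the pipeline's documented configuration.

```python
import numpy as np

# Hypothetical band order; the pipeline's actual order is set in its own config.
BANDS = ["blue", "green", "red", "nir", "swir1", "swir2"]


def normalized_difference(a, b):
    """Return (a - b) / (a + b), with 0 where the denominator is 0."""
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    denom = a + b
    out = np.zeros_like(denom)
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out


def slope_degrees(elevation, pixel_size_m=10.0):
    """Approximate terrain slope in degrees from an elevation grid."""
    dz_dy, dz_dx = np.gradient(elevation.astype(np.float32), pixel_size_m)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))


def build_stack(spectral, elevation):
    """Stack 6 bands + 3 indices + elevation + slope into (H, W, 11)."""
    band = {name: spectral[..., i] for i, name in enumerate(BANDS)}
    # Assumed indices; swap in whichever three the pipeline actually uses.
    ndvi = normalized_difference(band["nir"], band["red"])
    ndwi = normalized_difference(band["green"], band["nir"])
    ndmi = normalized_difference(band["nir"], band["swir1"])
    layers = [spectral[..., i].astype(np.float32) for i in range(len(BANDS))]
    layers += [ndvi, ndwi, ndmi,
               elevation.astype(np.float32), slope_degrees(elevation)]
    return np.stack(layers, axis=-1)


# Synthetic inputs: random reflectances and an elevation ramp rising 1 m per pixel.
rng = np.random.default_rng(0)
spectral = rng.uniform(0.0, 0.4, size=(100, 100, 6))
elevation = np.tile(np.arange(100, dtype=np.float32), (100, 1))
stack = build_stack(spectral, elevation)
print(stack.shape)
```

Keeping the indices as extra channels rather than computing them inside the model means every sample carries the same fixed 11-channel layout, which is what the channel count in the feature list implies.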
For Step 1 Only:
- Python 3.7+
- OpenAI API key (get one here)
- Google Earth Engine account (sign up here)
- Modern web browser
For Step 2 Only:
- Python 3.7+
- Google Earth Engine account (sign up here)
For Both Steps:
- All of the above
Python Dependencies:
# Install for Step 1 only
cd 1_site_extraction
pip install -r requirements.txt
# Install for Step 2 only
cd 2_dataset_generation
pip install -r requirements.txt
# Or install for both if using full pipeline
cd 1_site_extraction
# Set up environment
cp .env.example .env
# Edit .env with your OpenAI API key and GEE project ID
# Place GEE service account JSON
cp ~/Downloads/your-gee-key.json ./gee_service_account.json
# Start web interface
python app.py
# Open http://localhost:5000
What you'll do:
- Upload archaeological paper PDFs
- Let GPT-4o extract site information
- Review extracted sites with coordinates
- Download satellite data for each site (optional)
- Export site list as JSON
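When reviewing extracted coordinates, it can help to normalise degrees-minutes-seconds (DMS) values to signed decimal degrees by hand and compare. The helper below is a minimal sketch of that conversion; the function name and the regular expression are illustrative and are not taken from `app.py`, and real papers use more notations than this pattern covers.

```python
import re

# Illustrative pattern: matches forms such as 9°52'35.4"S or 67 32 4.56 W.
_DMS = re.compile(
    r"""^\s*(?P<deg>\d{1,3})\s*[°\s]\s*
        (?P<min>\d{1,2})\s*['′\s]\s*
        (?P<sec>\d{1,2}(?:\.\d+)?)\s*["″]?\s*
        (?P<hem>[NSEW])\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a DMS string with a hemisphere letter to signed decimal degrees."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognised DMS coordinate: {text!r}")
    degrees = int(match["deg"])
    minutes = int(match["min"])
    seconds = float(match["sec"])
    if minutes >= 60 or seconds >= 60:
        raise ValueError(f"minutes/seconds out of range: {text!r}")
    value = degrees + minutes / 60 + seconds / 3600
    hemisphere = match["hem"].upper()
    limit = 90 if hemisphere in "NS" else 180
    if value > limit:
        raise ValueError(f"coordinate exceeds {limit} degrees: {text!r}")
    # South and West are negative in decimal degrees.
    return -value if hemisphere in "SW" else value


print(dms_to_decimal("9°52'35.4\"S"))
print(dms_to_decimal("67 32 4.56 W"))
```

The two example inputs correspond to the latitude and longitude of `site_001` in the sample CSV, so they make a quick sanity check on the sign convention.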
See detailed instructions: 1_site_extraction/README.md
Step 1 outputs JSON format, but Step 2 requires CSV input. Create known_sites.csv:
site_id,latitude,longitude,site_type
site_001,-9.8765,-67.5346,geoglyph
site_002,-10.1234,-68.4567,mound
site_003,-11.2345,-69.5678,settlement
Extract from your Step 1 JSON output:
- site_id: Unique identifier
- latitude: Decimal degrees (negative for S)
- longitude: Decimal degrees (negative for W)
- site_type: Optional classification
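The JSON-to-CSV handoff can be scripted. The sketch below assumes the Step 1 export is a JSON list of objects with `latitude`, `longitude`, and `type` keys; those key names are a guess for illustration, so adjust them to match your actual export. The `site_id` values are generated sequentially, which is also an assumption.

```python
import csv
import io
import json

# Hypothetical Step 1 export. The key names below are assumptions,
# not the documented schema of the extraction tool.
EXAMPLE_JSON = """
[
  {"name": "Site A", "latitude": -9.8765, "longitude": -67.5346, "type": "geoglyph"},
  {"name": "Site B", "latitude": -10.1234, "longitude": -68.4567, "type": "mound"}
]
"""

FIELDS = ["site_id", "latitude", "longitude", "site_type"]


def sites_to_rows(sites):
    """Map extracted site records to the columns Step 2 expects."""
    rows = []
    for i, site in enumerate(sites, start=1):
        lat = float(site["latitude"])
        lon = float(site["longitude"])
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            raise ValueError(f"coordinates out of range for entry {i}: {lat}, {lon}")
        rows.append({
            "site_id": f"site_{i:03d}",
            "latitude": lat,
            "longitude": lon,
            "site_type": site.get("type", ""),  # optional column
        })
    return rows


def write_known_sites(rows, fileobj):
    """Write rows as CSV with the known_sites.csv header."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = sites_to_rows(json.loads(EXAMPLE_JSON))
buffer = io.StringIO()
write_known_sites(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass `open("known_sites.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer. Rejecting out-of-range coordinates here catches swapped latitude/longitude pairs before they reach the imagery download.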
cd 2_dataset_generation
# Prepare input
mkdir -p inputs
cp /path/to/known_sites.csv inputs/
# Configure pipeline
cp config/settings.yaml.example config/settings.yaml
# Edit settings.yaml with your GEE project ID and parameters
# Authenticate GEE
earthengine authenticate
# Run full pipeline
python run_pipeline.py
What this generates:
- Multi-angle views of each site (rotation augmentation)
- Integrated negatives from surrounding landscape
- Diverse landcover negatives (urban, water, cropland)
- Unlabeled background samples
- Radiometric augmentations (brightness/contrast/noise)
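One way to read the rotation factors (3x, 4x, 6x, 12x) is as the number of evenly spaced viewing angles around each site. The sketch below works under that assumption; the function names are hypothetical and the pipeline's own implementation may choose angles differently.

```python
import math


def rotation_angles(factor):
    """Return `factor` evenly spaced angles in degrees, starting at 0."""
    if factor <= 0 or 360 % factor != 0:
        raise ValueError(f"factor must divide 360 evenly, got {factor}")
    step = 360 // factor
    return [k * step for k in range(factor)]


def rotate_offset(dx, dy, angle_deg):
    """Rotate a pixel offset from the site centre counter-clockwise."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (cos_t * dx - sin_t * dy, sin_t * dx + cos_t * dy)


for factor in (3, 4, 6, 12):
    print(factor, rotation_angles(factor))
```

Under this reading, 3x yields steps of 120°, 4x of 90°, 6x of 60°, and 12x of 30°. Rotating the sampling offsets and then resampling the imagery, rather than rotating an already cropped patch, would avoid empty corners, though whether the pipeline does this is not stated here.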
Output: outputs/dataset/ with:
- grid_metadata.parquet - Master metadata table
- grid_images/ - Individual 100×100×11 samples as NumPy arrays
- Ready for PyTorch/TensorFlow training
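Before training, it is worth checking that a sample has the expected shape and converting it to the channel-first layout PyTorch expects. The sketch below writes a synthetic sample to a temporary directory so that it runs on its own; the `.npy` extension and the file name are assumptions about how `grid_images/` is organised.

```python
import tempfile
from pathlib import Path

import numpy as np

EXPECTED_SHAPE = (100, 100, 11)  # height, width, channels


def load_sample(path):
    """Load one sample and return it as a float32 (channels, H, W) array."""
    array = np.load(path)
    if array.shape != EXPECTED_SHAPE:
        raise ValueError(f"{path}: expected {EXPECTED_SHAPE}, got {array.shape}")
    # Move the channel axis first: (H, W, C) -> (C, H, W).
    return np.ascontiguousarray(array.transpose(2, 0, 1), dtype=np.float32)


# Stand-in for a real file under outputs/dataset/grid_images/.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "grid_images" / "example_sample.npy"  # hypothetical name
    sample_path.parent.mkdir(parents=True)
    np.save(sample_path, np.zeros(EXPECTED_SHAPE, dtype=np.float32))
    chw = load_sample(sample_path)

print(chw.shape)
```

TensorFlow models typically take the channel-last layout, so the transpose can be skipped there. The metadata table can be read separately, for example with `pandas.read_parquet`, to look up each sample's label and category.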
See detailed instructions: 2_dataset_generation/README.md
If you use this dataset or pipeline in your research, please cite either the conference paper or the software:
@inproceedings{li2025fusing,
title={{Fusing Text and Terrain}: {An LLM}-Powered Pipeline for Preparing Archaeological Datasets from Literature and Remote Sensing Imagery},
author={Li, Linduo and Wu, Yifan and Wang, Zifeng},
booktitle={{CAA UK 2025}: Computer Applications and Quantitative Methods in Archaeology},
year={2025},
month={December},
address={University of Cambridge, UK},
organization={CAA UK},
note={Conference held 9--10 December 2025}
}
@software{archaeological_site_detection,
title={Archaeological Site Detection: From Papers to Datasets},
author={Li, Linduo and Wu, Yifan and Wang, Zifeng},
year={2025},
url={https://github.com/BostonListener/CAA_UK_2025}
}
This work builds upon data and resources from multiple sources:
Satellite and Terrain Data:
- Sentinel-2 satellite imagery provided by the European Space Agency (ESA) through the Copernicus Programme
- FABDEM elevation data provided by the University of Bristol
- Google Earth Engine served as the primary platform for geospatial data processing and analysis
Archaeological Data Sources:
We are deeply grateful to James Q. Jacobs for his invaluable contribution to archaeological research through his publicly accessible compilation of geoglyph locations. His meticulous curation of archaeological data has been instrumental in enabling this work. The archaeological site locations were sourced from his compilation, which synthesizes data from:
- Jacobs, J.Q. (2025). JQ Jacobs Archaeology. Last modified July 31, 2025. https://jqjacobs.net/archaeology/geoglyph.html
- Kalliola (2024)
- Tokarský (2025)
- Peripato et al.
- Erickson et al. (2008)
- A. Olmeda (Google Earth observations, 2022)
- R. Walker (Google Earth observations, 2022)
- Pärssinen et al. (LiDAR data)
- Neves (LiDAR data)
- Prümers (LiDAR data)
- IPHAN (Instituto do Patrimônio Histórico e Artístico Nacional, Brazil)
- Global Forest Watch (globalforestwatch.org)
Technical Infrastructure:
- OpenAI GPT-4o for LLM-powered text extraction and analysis
- The Kaggle OpenAI to Z Challenge for providing the initial impetus and platform for this research
We acknowledge that this pipeline stands on the shoulders of both cutting-edge technology and dedicated scholarly work in the archaeological community.
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: See README files in each subfolder for detailed guides
- Issues: Report bugs or request features via GitHub Issues
- Discussions: Share your results and ask questions in GitHub Discussions
- Email: linduo.li@ip-paris.fr




































