mutomo aims to build an open dataset for multidimensional key information extraction from scientific abstracts, and to fine-tune a small, local language model for structured, abstractive extraction tasks.
- Create a high-quality, annotated dataset covering research motivations, objectives, methods, impacts, and topics.
- Design an annotation workflow and task definition using Argilla.
- Fine-tune and evaluate a small, open-source LLM for efficient, on-premise extraction.
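
As a rough illustration of what one record in such a dataset could look like, the sketch below models the five dimensions listed above as a Python dataclass. The class name `AbstractAnnotation`, the field names, and the JSON Lines serialization are assumptions made for illustration; they are not the project's actual schema or its Argilla configuration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field

# Hypothetical field names derived from the five dimensions named in the goals;
# the project's real annotation schema may differ.
DIMENSIONS = ("motivation", "objective", "method", "impact", "topics")


@dataclass
class AbstractAnnotation:
    """One annotated abstract (illustrative structure only)."""

    abstract: str
    motivation: str = ""
    objective: str = ""
    method: str = ""
    impact: str = ""
    topics: list[str] = field(default_factory=list)

    def missing_dimensions(self) -> list[str]:
        """Return the dimensions that are still empty."""
        return [name for name in DIMENSIONS if not getattr(self, name)]

    def to_json(self) -> str:
        """Serialize the record as a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)


if __name__ == "__main__":
    record = AbstractAnnotation(
        abstract="Placeholder abstract text.",
        motivation="Placeholder motivation.",
        topics=["information extraction"],
    )
    print(record.missing_dimensions())
    print(record.to_json())
```

Keeping each dimension as a free-text field reflects the abstractive nature of the task, and a helper such as `missing_dimensions` could be used to flag incomplete annotations before they enter the dataset.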
To use conda for full reproducibility:

    conda env create -f environment.yml --name mutomo
    conda activate mutomo

To update/export your environment:

    conda env export --name mutomo > environment.yml

- Define task and annotation schema
- Add example annotations in Argilla
- Annotator onboarding & training
- Build annotated dataset
- Fine-tune and test a small LLM
- Share dataset and baseline models
Feel free to submit issues or pull requests!
This project is licensed under the Apache License 2.0.
If you find this work useful, please cite it. Citation details are coming soon.