Large language models (LLMs) have shown great potential to enhance student learning by serving as AI-powered teaching assistants (TAs). However, existing LLM-based TA systems face critical challenges, including the data privacy risks of third-party API-based solutions and poor effectiveness in courses with scarce teaching materials.
This project proposes an automated TA training system based on LLM agents, designed to train customized, lightweight, and privacy-preserving AI models. Unlike traditional cloud-based AI TAs, our system allows local deployment, reducing data security concerns, and includes three components:
- Dataset Agent: Constructing high-quality datasets with explicit reasoning paths
- Training Agent: Fine-tuning models via Knowledge Distillation, effectively adapting to limited-data courses
- RAG Agent: Enhancing responses by retrieving external knowledge
We validate our system in Synthetic Biology, an interdisciplinary field characterized by scarce structured training data. Experimental results and user evaluations demonstrate that our AI TA achieves strong performance, high user satisfaction, and improved student engagement, highlighting its practical applicability in real-world educational settings.
Synthetic biology is a cutting-edge field that integrates knowledge from biology, chemistry, engineering, and computer science. In recent years, applications ranging from lab-grown meat to CRISPR-Cas9 gene editing technology have been leading the "Third Biotechnology Revolution." However, the dissemination of synthetic biology knowledge faces two major challenges:
- Interdisciplinary complexity: Requires integration of knowledge from multiple domains, creating a steep learning curve
- Educational resource limitations: Shortage of teaching talent with cross-disciplinary knowledge and practical experience
Traditional AI teaching assistant solutions typically rely on cloud service APIs, which introduce data privacy risks and perform poorly when specialized teaching materials are limited. The InternTA project is designed to address these challenges.
InternTA adopts a three-layer agent architecture to achieve automated training, local deployment, and privacy protection:
The Dataset Agent is responsible for constructing high-quality training data with explicit reasoning paths:
- Data Sources: Extracts post-class questions, key terms, and fundamental concepts from the "Synthetic Biology" textbook
- Reasoning Path Construction: Generates explicit reasoning paths for each question
- Guided Teaching Design: For complex thought questions, designs guided responses rather than providing direct answers
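For illustration, a single record produced by the Dataset Agent might look like the following sketch. The field names (`question`, `reasoning_path`, `guided_response`) are hypothetical, not the project's actual schema; the point is that each sample pairs a question with an explicit reasoning path and a guided, rather than direct, answer.

```python
import json

# Hypothetical example of one Dataset Agent record: a post-class question
# paired with an explicit reasoning path and a guided (not direct) response.
record = {
    "question": "Why is a ribosome binding site (RBS) needed upstream of a coding sequence?",
    "reasoning_path": [
        "Translation initiation requires the ribosome to locate the mRNA.",
        "The RBS is the sequence the small ribosomal subunit recognizes.",
        "Without an RBS, the sequence is transcribed but poorly translated.",
    ],
    "guided_response": (
        "Think about what the ribosome must bind before translation can start. "
        "Which element of the expression cassette provides that binding site?"
    ),
}

print(json.dumps(record, indent=2))
```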
The Training Agent fine-tunes lightweight models using knowledge distillation techniques:
- Base Model: Uses DeepSeek-R1-Distill-Qwen-7B as the foundation model
- Fine-Tuning Tools: Employs PeftModel for efficient fine-tuning
- Knowledge Distillation: Transfers knowledge from larger parameter-scale models to lightweight models
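As a minimal sketch of the distillation objective (not the project's actual training code), the student model is trained to match the teacher's temperature-softened output distribution. Everything below is an illustrative toy in plain Python:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    A higher temperature exposes the teacher's relative preferences
    among non-argmax classes ("dark knowledge").
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student whose logits mirror the teacher's incurs zero loss;
# a mismatched student incurs a positive loss.
teacher = [2.0, 0.5, -1.0]
aligned = distillation_loss(teacher, [2.0, 0.5, -1.0])
shifted = distillation_loss(teacher, [0.0, 0.0, 0.0])
print(aligned, shifted)
```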
The RAG (Retrieval-Augmented Generation) Agent enhances answer quality by retrieving external knowledge:
- Knowledge Base Construction: Structured processing of "Synthetic Biology" textbook content
- Semantic Retrieval: Retrieves relevant knowledge points based on user questions
- Enhanced Generation: Combines retrieved knowledge to generate more accurate and in-depth answers
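The retrieval step can be sketched as follows. This toy uses bag-of-words cosine similarity over three hard-coded passages; the real RAG Agent retrieves semantically over the structured "Synthetic Biology" knowledge base, so treat every name and passage here as an illustrative assumption:

```python
import math
from collections import Counter

# Toy knowledge base standing in for structured textbook passages.
knowledge_base = [
    "A promoter is a DNA sequence where RNA polymerase binds to start transcription.",
    "CRISPR-Cas9 uses a guide RNA to direct the Cas9 nuclease to a target DNA site.",
    "A plasmid is a small circular DNA molecule used as a cloning vector.",
]

def vectorize(text):
    """Bag-of-words term counts (a stand-in for a semantic embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=1):
    """Return the k knowledge-base passages most similar to the question."""
    qv = vectorize(question)
    ranked = sorted(knowledge_base, key=lambda p: cosine(qv, vectorize(p)), reverse=True)
    return ranked[:k]

top = retrieve("How does CRISPR-Cas9 find its target DNA?")
print(top[0])
```

The retrieved passage is then prepended to the prompt so the model can ground its answer in it.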
InternTA system design emphasizes data privacy protection and deployment flexibility:
- Local Model Deployment: All models can run on local machines, avoiding data exposure
- API Token Authentication: Provides API access control mechanisms to secure the system
- Lightweight Design: Optimizes model size to run efficiently on ordinary hardware
Online demo: [E. Copi (Education)]
Local Deployment Method (NVIDIA GPU with 8GB or more VRAM):
```sh
# Clone the repository
git clone https://github.com/kongfoo-ai/internTA

# Go to the project directory
cd InternTA

# Install the dependencies
pip install -r requirements.txt

# Set the API access token (optional): create or edit the .env file in the
# project root directory and add API_TOKEN=your-secret-token

# Start the demo (the default port is 8080; change it if necessary)
sh run.sh

# View run logs
tail -f nohup.out
```

The InternTA API server supports authentication using Bearer tokens. To enable this feature:
- Set the `API_TOKEN` environment variable in the `.env` file in the project root directory:

  ```
  API_TOKEN=your-secret-token
  ```

- Include the Authorization header in your requests to the API:

  ```
  Authorization: Bearer your-secret-token
  ```

- If `API_TOKEN` is not set in the `.env` file, authentication will be skipped, and the API will allow all requests.

- You can test the authentication feature using the provided `test_auth.py` script:

  ```sh
  python test_auth.py
  ```
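The token check described above can be sketched as a small helper. This is a hypothetical illustration, not the server's actual code: requests are always allowed when no token is configured, and otherwise must carry a matching Bearer token.

```python
def is_authorized(auth_header, api_token):
    """Return True if the request may proceed.

    auth_header: value of the Authorization header, or None if absent.
    api_token:   configured API_TOKEN, or None if unset (auth disabled).
    """
    if not api_token:  # API_TOKEN not set: skip authentication
        return True
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return auth_header[len("Bearer "):] == api_token

print(is_authorized(None, None))                                    # auth disabled
print(is_authorized("Bearer your-secret-token", "your-secret-token"))
print(is_authorized("Bearer wrong-token", "your-secret-token"))
```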
Install dependencies.

```sh
pip install -r requirements.txt
```

Generate the high-quality training dataset.

```sh
cd data
python generate_data.py
```

Go to the project root directory and check whether a file named personal_assistant.json exists in the data directory.

```sh
cd $ROOT_PATH
ls -lh data
```

Fine-tune the model using the data generated by the Dataset Agent and the Xtuner tool.

```sh
sh train.sh
```

Observe the model weights in the train directory. The naming convention for the weight directories is pth_$NUM_EPOCH.

```sh
ls -lh train
```

Merge the fine-tuned adapter into the base model.

```sh
# Note: Pass the suffix of the directory containing the weights to be merged
# as a parameter, to specify which LoRA parameters to merge.
sh merge.sh $NUM_EPOCH
```

Test the final merged model in the final directory.

```sh
# Note: Modify the model path as needed
sh chat.sh
```

This section calculates the ROUGE similarity scores for responses generated by the InternTA model and generates evaluation results.

```sh
# Ensure your SynBio-Bench.json file is in the correct directory
pytest ./test/test_model_evaluation.py
```

This command processes the data file and writes the results to the test_results.csv file.
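For intuition, the comparison behind these scores can be sketched as a unigram-overlap (ROUGE-1) F1 in plain Python. This is a simplified stand-in for illustration only; the actual test suite may use a full ROUGE implementation:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if not cand or not ref or not overlap:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# 3 of 4 unigrams overlap in each direction: precision = recall = F1 = 0.75
score = rouge1_f1("a promoter starts transcription",
                  "a promoter initiates transcription")
print(score)
```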


