Train a personalized reply model from your Instagram chats using a local app + Colab GPU fine-tuning on Qwen/Qwen2.5-3B-Instruct.
RawArchive is a full workflow for style-learning from message history:
- You export Instagram messages as `.json`.
- You upload them to the local RawArchive web app.
- RawArchive parses chats and builds a training bundle (`bun_*`).
- Google Colab fine-tunes a LoRA adapter on top of Qwen 2.5 3B.
- You register the resulting `adapter.zip` and get a model ID (`mdl_*`).
- You chat with responses that follow the learned writing style.
Local PC handles ingestion/control/inference. Colab handles GPU training.
flowchart LR
subgraph LOCAL[Local PC]
A[Instagram JSON Export] --> B[Web UI Upload<br/>app/static/index.html]
B --> C[API Upload Endpoint<br/>POST /v1/datasets/instagram/upload]
C --> D[Parser and Normalizer<br/>app/parser.py]
D --> E[Bundle Builder<br/>app/dataset_builder.py]
E --> F[Bundle Artifact<br/>bun_*]
M[Model Register Endpoint<br/>POST /v1/models/register] --> N[Model ID<br/>mdl_*]
N --> O[Local Inference Chat<br/>scripts/chat_local.py]
end
subgraph CLOUD[Google Colab GPU Runtime]
G[Notebook<br/>colab/train_lora_ultrafast.ipynb] --> H[Download Bundle<br/>GET /v1/bundles/bun_xxx/download]
H --> I[Load Qwen Base Model<br/>Qwen/Qwen2.5-3B-Instruct]
I --> J[LoRA Fine-Tuning<br/>colab/train_lora.py]
J --> K[Export Adapter<br/>adapter.zip]
end
F --> G
K --> M
O --> P[Style-Matched Response]
- Input is Instagram export `.json`.
- Upload endpoint stores and validates data.
- Parser extracts senders, timestamps, and message text.
- Normalization removes invalid/empty records and prepares clean samples.
- You select target style/user from parsed chats.
- Builder creates prompt-response training pairs.
- Data is split into train/validation.
- Bundle is created with ID `bun_*`.
- Colab receives:
  - `BASE_URL` = Cloudflare URL to your local API
  - `BUNDLE_ID` = generated `bun_*`
- Notebook downloads bundle and loads `Qwen/Qwen2.5-3B-Instruct`.
- LoRA adapter layers are trained (not full model weights).
- Output is `adapter.zip`.
- You register adapter in UI Step 4.
- API issues model ID `mdl_*`.
- Local chat script loads base model + adapter and generates replies.
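The parsing, normalization, pairing, and split steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `app/parser.py` or `app/dataset_builder.py`. It assumes the export uses a top-level `messages` list with `sender_name`, `timestamp_ms`, and `content` fields, and that some text may be UTF-8 mis-decoded as Latin-1. The function names, the pairing rule, and the 10% validation fraction are all hypothetical.

```python
import random


def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was mis-decoded as Latin-1 (assumed export quirk)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, leave untouched


def parse_messages(raw: dict) -> list[dict]:
    """Extract sender, timestamp, and text; drop empty or malformed records."""
    cleaned = []
    for msg in raw.get("messages", []):
        sender = msg.get("sender_name")   # assumed field name
        ts = msg.get("timestamp_ms")      # assumed field name
        text = msg.get("content")         # assumed field name
        if not sender or not isinstance(ts, int) or not isinstance(text, str):
            continue
        text = fix_mojibake(text).strip()
        if text:
            cleaned.append({"sender": fix_mojibake(sender), "ts": ts, "text": text})
    cleaned.sort(key=lambda m: m["ts"])  # oldest first, so replies follow prompts
    return cleaned


def build_pairs(messages: list[dict], target: str) -> list[dict]:
    """Pair each message from someone else with the target user's immediate reply."""
    pairs = []
    for prev, cur in zip(messages, messages[1:]):
        if cur["sender"] == target and prev["sender"] != target:
            pairs.append({"prompt": prev["text"], "response": cur["text"]})
    return pairs


def split_pairs(pairs: list[dict], val_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and hold out a validation slice."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


# Tiny in-memory export used only for illustration.
example_export = {
    "messages": [
        {"sender_name": "Alex", "timestamp_ms": 3000, "content": "see you at 6"},
        {"sender_name": "Sam", "timestamp_ms": 2000, "content": "dinner tonight?"},
        {"sender_name": "Sam", "timestamp_ms": 1000, "content": "   "},
        {"sender_name": "Alex", "timestamp_ms": 500},
    ]
}
example_pairs = build_pairs(parse_messages(example_export), target="Alex")
```

Here the whitespace-only message and the record without `content` are dropped, leaving one pair in which Sam's question is the prompt and Alex's answer is the response.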
Use Accounts Center and choose JSON format.
In the mobile app:
- Open the Instagram app.
- Go to profile > menu > Settings and privacy.
- Open Accounts Center.
- Go to Your information and permissions.
- Open Download your information.
- Choose the Instagram account.
- Choose a date range (or all time), then select Messages.
- Set format to JSON.
- Submit request and wait for email/notification.
- Download the archive and extract the `.json` files.
In a browser:
- Open Instagram in a browser and log in.
- Go to More > Settings > Accounts Center.
- Open Your information and permissions.
- Click Download your information.
- Select account and data type Messages.
- Choose JSON format and submit request.
- Download the archive when ready, then extract the message `.json` files.
Notes:
- Export generation can take time depending on account size.
- Use JSON, not HTML, for RawArchive training.
- Windows + PowerShell
- Python 3.11+
- Google Colab account
- Internet access (for model downloads in Colab)
- `cloudflared` (required if Colab must reach your local API)
Project files used:
- `app/`
- `colab/`
- `scripts/`
- `tests/`
- `requirements.txt`
- `requirements.inference.txt`
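Create a virtual environment and install dependencies:

    py -3.11 -m venv .venv
    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.txt

Start the API:

    .\.venv\Scripts\Activate.ps1
    uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload

Start the tunnel in a separate terminal, since the API keeps running in the first one:

    .\.venv\Scripts\Activate.ps1
    cloudflared tunnel --url http://127.0.0.1:8000

Open http://127.0.0.1:8000 and: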
- Upload Instagram `.json` files.
- Build bundle.
- Copy generated `bun_*`.
Notebook: colab/train_lora_ultrafast.ipynb
Set:
- `BASE_URL` = `https://...trycloudflare.com`
- `BUNDLE_ID` = your `bun_*`
Run all cells, then download adapter.zip.
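For orientation, the notebook's download step presumably combines `BASE_URL` and `BUNDLE_ID` into the bundle endpoint from the architecture diagram (`GET /v1/bundles/bun_xxx/download`). The sketch below shows one way to build and check that URL before training. The helper names and the `bundle.bin` destination are hypothetical, and the actual notebook may do this differently.

```python
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def bundle_download_url(base_url: str, bundle_id: str) -> str:
    """Build the bundle download URL and reject values Colab cannot use."""
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("BASE_URL must be a full http(s) URL")
    if parsed.hostname in ("localhost", "127.0.0.1"):
        # Colab runs on a remote machine, so localhost points at Colab itself.
        raise ValueError("Colab cannot reach localhost; use the Cloudflare tunnel URL")
    if not bundle_id.startswith("bun_"):
        raise ValueError("BUNDLE_ID should look like bun_*")
    return f"{base_url.rstrip('/')}/v1/bundles/{bundle_id}/download"


def download_bundle(base_url: str, bundle_id: str, dest: str = "bundle.bin") -> str:
    """Stream the bundle to disk. The destination file name is an assumption."""
    url = bundle_download_url(base_url, bundle_id)
    with urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Validating the URL first turns the most common mistake, pasting `http://127.0.0.1:8000` into Colab, into an immediate error instead of a connection timeout.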
In RawArchive UI Step 4:
- Adapter URI: `local://C:/Users/{your-username}/{your-location}/data/models/adapter.zip`
- Validation Loss: numeric value
- Style Score: numeric value
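Because the Adapter URI uses forward slashes after the `local://` prefix, a path copied from Windows Explorer needs converting first. A small helper along these lines can do that. It is an illustrative sketch with a hypothetical function name, not part of RawArchive.

```python
from pathlib import PureWindowsPath


def to_local_uri(windows_path: str) -> str:
    """Convert an absolute Windows path to the local:// form shown above."""
    path = PureWindowsPath(windows_path)
    if not path.is_absolute():
        raise ValueError("expected an absolute Windows path, e.g. C:\\...\\adapter.zip")
    if path.suffix.lower() != ".zip":
        raise ValueError("expected a path to adapter.zip")
    # as_posix() swaps backslashes for forward slashes and keeps the drive letter.
    return "local://" + path.as_posix()
```

For example, `to_local_uri(r"C:\Users\me\proj\data\models\adapter.zip")` returns `local://C:/Users/me/proj/data/models/adapter.zip`.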
Install inference dependencies and start the local chat:

    .\.venv\Scripts\Activate.ps1
    pip install -r requirements.inference.txt
    python scripts\chat_local.py --model-id mdl_your_model_id

Run API:

    uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload

Run tunnel:

    cloudflared tunnel --url http://127.0.0.1:8000

Run tests:

    pytest -q

Private artifacts are intentionally ignored:
- `data/models/adapter.zip`
- `data/datasets/` (raw message content)
- `data/bundles/` (generated training artifacts)
- patterns like `messages*.json` and `conversation*.json`
Verify no private artifacts are tracked:

    git ls-files | Select-String -Pattern "adapter\.zip|attachment\.zip|messages?\.json|conversation.*\.json"

Expected result: no output.
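The same check can be run outside PowerShell with a short Python script. This sketch reuses the pattern from the command above and matches case-insensitively, as `Select-String` does by default. The function names are hypothetical.

```python
import re
import subprocess

# Same pattern as the PowerShell command; IGNORECASE mirrors Select-String's default.
PRIVATE_PATTERN = re.compile(
    r"adapter\.zip|attachment\.zip|messages?\.json|conversation.*\.json",
    re.IGNORECASE,
)


def find_private(paths: list[str]) -> list[str]:
    """Return the tracked paths that look like private artifacts."""
    return [p for p in paths if PRIVATE_PATTERN.search(p)]


def tracked_files() -> list[str]:
    """List files tracked by git in the current repository."""
    result = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling `find_private(tracked_files())` from the repository root should return an empty list. Note that the pattern as written matches `messages.json` but not numbered names such as `message_1.json`, so it may be worth widening if your export uses that naming.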
- Colab cannot call `localhost` directly.
- Always pass the Cloudflare URL as `BASE_URL`.
- Keep local API + tunnel running during Colab download/training steps.
- If registration fails, verify file path and `local://` prefix.
Yashas VM, Creator & Lead Developer
Made by @Yashas.VM
Co-Powered by Claude
