This repo contains the learner and coach materials for Workshop 6W (Unit 6: Optimisation).
Unit 6 is the "mathsy" part of AI6. We're not going to do calculus on a whiteboard, but we are going to use the real words you'll see in papers and tooling: loss, gradients (partial derivatives), stochastic gradient descent (SGD), learning rate, and generalisation.
The goal is simple: build enough intuition to answer practical engineering questions like:
- Is training actually working, or are we wasting compute?
- If it's not working, what's the first dial I should try turning?
- How do I explain what happened to a stakeholder without hand‑waving?
In this workshop we make Unit 6 feel real by watching optimisation happen inside a training loop.
Imagine you hire a graduate who already speaks excellent English. They understand grammar, tone, and context because they've read a huge amount of text.
A pretrained model is like that graduate: it already has strong general ability.
Now you hire them into a finance team. They still speak English, but they don't yet have good instincts for phrases like "guidance cut" or "beat expectations".
Fine‑tuning is onboarding them into your domain. You are not teaching them English again — you are sharpening their behaviour for your environment.
Under the hood, fine‑tuning is just optimisation. In each training step the model:
- makes a prediction
- gets scored by a loss function (a differentiable "how wrong was I?" number)
- uses the gradient (the slope — a partial derivative of the loss w.r.t. the weights) to work out which way is downhill
- updates the weights a tiny amount (using an optimiser like AdamW, which is an adaptive variant of SGD)
That "update loop" is what `trainer.train()` runs.
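The loop above can be sketched on a toy 1‑D problem. This is an illustration of the mechanics only — not the real `Trainer` internals, which operate on millions of weights at once:

```python
# One-dimensional "model": a single weight w, with a made-up loss
# loss(w) = (w - 3)**2 whose minimum sits at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the loss w.r.t. the weight: d/dw (w - 3)^2 = 2(w - 3)
    return 2.0 * (w - 3.0)

w = 0.0            # starting weight
learning_rate = 0.1

for step in range(50):
    g = gradient(w)             # which way is uphill?
    w = w - learning_rate * g   # step a tiny amount downhill

print(round(w, 4))  # converges towards 3.0
```

Every training step in the notebook follows this same shape — the only differences are the loss function (cross‑entropy instead of a parabola) and the optimiser (AdamW adapts the step size per weight instead of using one fixed `learning_rate`).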
Will fine‑tuning overwrite what the model already knows? Usually, no. With small learning rates and a small number of epochs, the updates are incremental. The model doesn't "wipe its brain"; it adjusts its instincts.
If training is aggressive (very high learning rate, too many epochs, extremely narrow data), the model can become overly specialised. This is sometimes called catastrophic forgetting. We control that risk by keeping training small and by watching validation signals.
You will fine‑tune a small text model for 3‑class financial sentiment and run short, controlled tuning experiments.
By the end you will be able to:
- Explain what `trainer.train()` is doing in optimisation terms (loss → gradients → weight updates)
- Read a training curve and diagnose learning rate too high / too low / just right
- Spot a generalisation gap (train improves but validation doesn't) and explain why it matters
- Choose a training configuration under a time/compute constraint and justify it
- Keep a lightweight experiment log (auditability + reproducibility)
When you train, you'll see two kinds of signals.
Loss is like altitude. It tells you whether you're going downhill mathematically (becoming less wrong according to the loss function).
Validation performance (accuracy / F1) is like a road test. It tells you whether the model behaves better on examples it hasn't seen.
The most important relationship is the generalisation gap:
The generalisation gap is the difference between training performance and validation performance. If training keeps improving but validation stalls or gets worse, the model is memorising instead of generalising.
Healthy pattern:
- training loss ↓ and validation loss ↓
Overfitting pattern:
- training loss ↓ but validation loss ↑ (gap is widening)
If metrics barely move:
- that can still be a useful result. It often means you don't have enough signal in the data (or you didn't train long enough for this learning rate).
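The patterns above can be checked mechanically. A toy sketch (the loss lists are made‑up numbers and the heuristics are illustrative, not a library function):

```python
# Given per-epoch loss histories, flag the common training patterns.
def diagnose(train_losses, val_losses):
    gap_start = val_losses[0] - train_losses[0]
    gap_end = val_losses[-1] - train_losses[-1]
    if train_losses[-1] < train_losses[0] and val_losses[-1] > val_losses[0]:
        return "overfitting: training improves while validation worsens"
    if gap_end > gap_start:
        return "gap widening: watch validation closely"
    return "healthy: both losses falling together"

print(diagnose([1.0, 0.6, 0.4], [1.1, 0.7, 0.5]))  # healthy
print(diagnose([1.0, 0.5, 0.2], [1.1, 1.0, 1.3]))  # overfitting
```

In the notebook you will read these signals off the training curve rather than compute them by hand, but the logic is the same: compare the direction of travel of the two lines, not their absolute values.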
Why loss and accuracy can move independently:
Loss cares about how confident the model was when it got something wrong. A model that said "90% positive" when the label was negative gets punished much harder than one that said "40% positive". Accuracy only asks whether the top-ranked answer was correct. It ignores the margins. So a model can be getting meaningfully less wrong on the inside (loss falling) while still picking the same answer on the outside (accuracy flat). Loss tends to move first.
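A quick worked example of that asymmetry, using cross‑entropy on a single wrong prediction (the two probabilities are the made‑up ones from the paragraph above):

```python
import math

def cross_entropy(p_correct):
    # Negative log of the probability the model assigned to the true class.
    return -math.log(p_correct)

# Both models pick the wrong class, so accuracy is identical (0/1 here),
# but the confident mistake is punished far harder by the loss.
confident_wrong = cross_entropy(0.10)  # said "90% positive", label was negative
hedged_wrong    = cross_entropy(0.60)  # said "40% positive", label was negative

print(round(confident_wrong, 2))  # ≈ 2.30
print(round(hedged_wrong, 2))     # ≈ 0.51
```

This is why a falling loss with flat accuracy is still progress: the model is reallocating probability mass in the right direction even before its top answer flips.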
Further reading: Neural Networks Part 6: Cross Entropy (StatQuest / Josh Starmer)
| Emoji | Meaning |
|---|---|
| 🎯 | Learning Objective — what you will achieve |
| 📋 | Expected Outputs — end result to aim for |
| 📝 | Task/Step — something to do |
| ⌨️ | Terminal — shell command to run |
| 📓 | Notebook — action to take in Jupyter |
| ✅ | Checkpoint — verify your progress |
| 🤔 | Reflect — think deeply about this |
| 💡 | Tip/Hint — helpful suggestion |
| ⚠️ | Warning — do not miss this |
| 📘 | Explanation — background theory |
| 🚀 | Extension — optional stretch challenge |
| 🎓 | Complete — activity finished |
Read the introduction above first to understand the scenario. You may also wish to look ahead to Activity 7 before starting Activity 2, as it asks you to gather your experiment log entries from earlier activities — there is no harm in knowing what you are building towards.
| Activity | Title | Focus |
|---|---|---|
| Activity 1 | Curve Diagnosis | Match symptoms to causes; stakeholder framing |
| Activity 2 | Environment Setup & Baseline | Set up the environment; record the "before" measurement |
| Activity 3 | Group Learning Rate Experiment | Run your group's preset; record a reproducible experiment log |
| Activity 4 | Share & Compare | Report results; whole-group discussion; engineering conclusion |
| Activity | Title | Focus |
|---|---|---|
| Activity 5 | Convergence Check | Run convergence cells; connect maths to compute cost (K18) |
| Activity 6 | Data Quality & Uncertainty | Identify data limitations; practise honest stakeholder caveats (B5) |
| Activity 7 | Reflection & Horizon Scan | Write Part C reflections; find Task 6 horizon scan source (S31/K28) |
| Activity | Title | Focus |
|---|---|---|
| Going Further | Extensions (optional) | Schedule comparison / freeze encoder / error analysis / W&B tracking / SGD comparison |
- `activities/` — numbered learner activity files (this scaffold)
- `lab/` — notebooks and sample data
- `handouts/` — learner worksheet + cheat sheet + plain‑English explainer + decision‑log template + glossary
- `slides/` — workshop slide deck
- `coach/` — coach playbook (timings, prompts, troubleshooting)
- `assets/` — training curve images used in Activity 1 and slides

| Handout | When to open it |
|---|---|
| `Activity_Worksheet.md` | Open at the start of Activity 1; keep it alongside the activities all day. This is your evidence document — Part B is your experiment log, Part C is your reflection, Part E is your horizon scan. |
| `Cheat_Sheet.md` | Quick reference throughout the day. If you forget what "generalisation gap" means or want to remind yourself of the convergence rule, look here first. |
| `Fine_Tuning_in_Plain_English.md` | Read before or after the opening talk track (09:15). Contains the "graduate hire" and "tailoring a suit" analogies in full — useful for Worksheet Part C Question 3. |
| `Unit6_Maths_Words_Glossary.md` | Vocabulary reference for K18 questions. One page covering loss, gradient, SGD, learning rate, convergence, generalisation, and regularisation. |
| `Tuning_Decision_Log_Template.md` | A structured six-section template for documenting your tuning decisions in a way that is auditable and explainable. Useful for the S9 evidence in Task 6. Works alongside the experiment log: log for recording live, Decision Log for the polished write-up. |
| `Optional_WandB.md` | Going Further Option D only. Experiment tracking with Weights & Biases. Only use if your organisation's policies permit external cloud logging services. |
This workshop is cloud‑neutral. You do not need AWS/Azure/GCP — you just need a browser, or Python and internet access to download model weights.
Pick the option that creates the least friction for your cohort. GitHub Codespaces is the recommended starting point for most learners — no local setup, no SSH, no firewall configuration.
Best for most learners. Requires only a browser and a free GitHub account. The repo is
public — any learner can be running in under 2 minutes with no local installation. Files
persist between sessions. Well within the GitHub Free tier for a 5-hour workshop.
See TROUBLESHOOTING.md for full setup steps and notes on cost, file persistence, and
organisations that may have Codespaces disabled.
Best when learners can install Python and you want them to practise a "real repo" workflow.
Clone the repo, create a virtual environment, and run `pip install -r requirements.txt`.
See the Setup section below.
Good alternative when learners do not have a GitHub account. Google's network reaches
huggingface.co reliably. Note that files are lost when the session ends — ask learners
to download their worksheet before closing. See TROUBLESHOOTING.md for setup steps.
For cohorts where the delivery environment is a managed Pluralsight sandbox. Full
step-by-step instructions — including the correct ordering of install steps, how to
handle the Jupyter token URL, Windows SSH guidance, and sandbox time-limit caveats —
are in TROUBLESHOOTING.md. Read that section before the day; the setup has more steps
than the other options.
If you are using Codespaces or Colab, skip this section — setup instructions for
those environments are in TROUBLESHOOTING.md.
```
git clone https://github.com/Corndel/AI6-W6.git
cd AI6-W6
```

Or, if you downloaded a zip, unzip it and `cd` into the resulting folder:

```
unzip AI6-W6-main.zip
cd AI6-W6-main
```

Create and activate a virtual environment, then install dependencies and launch Jupyter:

```
python -m venv .venv
source .venv/bin/activate   # macOS/Linux
# .venv\Scripts\activate    # Windows PowerShell
pip install -r requirements.txt
jupyter lab
```

Open: `lab/01_finetune_distilbert_optimisation_in_the_wild.ipynb`
The notebook contains speed dials (dataset size, max token length, epochs) so it runs on typical laptops.
If training is slow:
- reduce `TRAIN_SUBSET`
- reduce `MAX_LENGTH`
- set `EPOCHS = 1`
If your environment blocks transformer model downloads, use the backup notebook:
`lab/02_backup_sgd_text_classifier_learning_rate.ipynb`
It still teaches Unit 6 optimisation and learning‑rate tuning using TF‑IDF + SGDClassifier (real SGD). All activity tasks, the data quality discussion, and the horizon scan apply unchanged.
This workshop is intentionally built around the Unit 6.2 concepts:
- loss functions: "measuring how wrong we are"
- gradient descent: "walking downhill"
- learning rate: "how big are your steps?"
- schedules: "adapting over time"
- why training costs time + money
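The "how big are your steps?" idea can be seen numerically on the simplest possible loss — a toy illustration, not the workshop model:

```python
# Walk downhill on loss(w) = w**2 (minimum at w = 0) from w = 5,
# with three different learning rates.
def final_w(lr, steps=20, w=5.0):
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2w
    return w

print(final_w(0.01))  # too low: 20 steps later, barely moved
print(final_w(0.1))   # about right: close to the minimum
print(final_w(1.1))   # too high: each step overshoots, loss diverges
```

The same three regimes are exactly what you will diagnose from the training curves in Activity 1 — flat (too low), smoothly falling (about right), and exploding or oscillating (too high).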