An end-to-end knowledge distillation pipeline for fine-tuning a 9B-parameter Indonesian-Javanese language model. The core engineering challenge was that the GPU server could not have Python installed, requiring a fully Docker-based training workflow coordinated from a separate laptop.
Problem Statement
The production GPU server (RTX 5090) operated in a restricted environment with no Python runtime permitted. Standard fine-tuning workflows assume direct script execution on the training machine. The goal was to distill knowledge from a large 70B teacher model into a smaller 9B student model, deployable for low-latency inference — all without violating the server's environment constraints.
Solution: Three-Zone Architecture
The pipeline is split across three isolated zones, each with a clearly defined responsibility:
- Zone A — Teacher Inference (LM Studio): The 70B teacher model runs on a separate machine via LM Studio, exposing an OpenAI-compatible API on port 1234. No Python required here.
- Zone B — Orchestration (Laptop): All Python code — PDF loading, API calls to the teacher, data validation, variation generation — runs on the laptop. This zone produces the JSONL training dataset.
- Zone C — Training (Docker on GPU Server): The Unsloth container receives only the JSONL file and a config. It handles model loading, LoRA fine-tuning, GGUF export, and deployment — entirely inside Docker.
This separation means the server never needs Python installed, yet the full pipeline is reproducible and automated with a single command.
Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ ZONE A — TEACHER INFERENCE (LM Studio) │
│ │
│ llama-sahabat-ai 70B ──▶ OpenAI-compatible API (:1234) │
└──────────────────────────────────┬───────────────────────────────────┘
│ HTTP
┌──────────────────────────────────▼───────────────────────────────────┐
│ ZONE B — ORCHESTRATION (Laptop / Python) │
│ │
│ ClickHouse DB ──▶ fetch_faq_to_pdf.py ──▶ PDFs │
│ │ │
│ PDFs ──▶ PDFLoader ──▶ DataGenerator ──▶ train_dataset.jsonl │
│ │ │
│ generate_variations.py │
│ (8 strategies, ~5x expansion, │
│ dedup, anti-fabrication) │
│ │ │
│ train_dataset_expanded.jsonl │
└──────────────────────────────────────────────┬───────────────────────┘
│ Volume mount
┌──────────────────────────────────────────────▼───────────────────────┐
│ ZONE C — TRAINING (Docker + GPU, no Python on host) │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Unsloth Container (unsloth/unsloth:latest) │ │
│ │ Load gemma2-9B (4-bit) ──▶ Apply LoRA (r=64, α=128) │ │
│ │ SFTTrainer (200 steps, ~15–20 min on RTX 5090) │ │
│ │ ──▶ Merge ──▶ GGUF export (F16) ──▶ Training report │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ Auto-deploy to LM Studio │
└──────────────────────────────────────────────────────────────────────┘
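The Zone B → Zone C handoff can be sketched as a Compose service. This is a hypothetical fragment, not the project's actual file: the service name, paths, and entrypoint are assumptions, but it illustrates the key idea that the host needs only Docker, with the expanded dataset reaching the container via a volume mount.

```yaml
# Hypothetical sketch of the Zone C service (names and paths illustrative).
services:
  train:
    image: unsloth/unsloth:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      # Only the JSONL dataset and a config cross the zone boundary.
      - ./data/train_dataset_expanded.jsonl:/workspace/train_dataset_expanded.jsonl:ro
      - ./output:/workspace/output
    # Python exists inside the container, never on the host.
    command: python /workspace/train.py --config /workspace/config.yaml
```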
Data Augmentation Pipeline
274 base conversations (FAQs from ClickHouse)
│
▼
┌──────────────────────────────────────────────────────────┐
│ 8 Augmentation Strategies │
│ ├── Persona Shift │
│ ├── Adversarial Question │
│ ├── Refusal & Boundary │
│ ├── Creative Rephrase │
│ ├── Emotional Context │
│ ├── Negative Confirmation │
│ ├── Explicit Confirmation │
│ └── Out-of-Scope Deflection │
└─────────────────────────┬────────────────────────────────┘
│ ~5x expansion
▼
Category-Aware Balancing
(~20 topic categories, prioritize under-represented)
│
▼
Anti-Fabrication Validation
├── URL whitelist check
├── Fabricated payment method detection
└── Sentence-level negation awareness
│
▼
Deduplication (85% similarity threshold)
│
▼
~175 high-quality training conversations
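The anti-fabrication validation stage above can be sketched as follows. The whitelist contents, payment lexicon, and function name are illustrative assumptions, not the project's actual code; the point is the layering, and in particular the sentence-level negation check that lets a refusal mention an unapproved method without being flagged as fabrication.

```python
import re

# Illustrative sets; the real validator would load the project's approved lists.
URL_WHITELIST = {"help.example.id"}
APPROVED_PAYMENTS = {"transfer bank", "gopay", "ovo"}
PAYMENT_PATTERN = re.compile(r"\b(transfer bank|gopay|ovo|dana|kartu kredit)\b", re.I)
NEGATION_CUES = ("tidak", "bukan", "belum")  # Indonesian negation markers

def is_valid_variation(text: str) -> bool:
    # Layer 1: every URL host must be whitelisted.
    for host in re.findall(r"https?://([^/\s]+)", text):
        if host not in URL_WHITELIST:
            return False
    # Layer 2: payment mentions are checked sentence by sentence, so a
    # negated sentence ("kami tidak menerima dana") passes as a refusal
    # rather than being rejected as a fabricated payment method.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for method in PAYMENT_PATTERN.findall(sentence):
            negated = any(cue in sentence.lower() for cue in NEGATION_CUES)
            if method.lower() not in APPROVED_PAYMENTS and not negated:
                return False
    return True
```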
Key Features
- Automated Data Augmentation: 8 augmentation strategies (Persona Shift, Adversarial Question, Refusal & Boundary, Creative Rephrase, Emotional Context, Negative Confirmation, Explicit Confirmation, Out-of-Scope Deflection) expand the base dataset ~5x.
- Category-Aware Balanced Generation: Detects ~20 topic categories per conversation and allocates augmentation budget to under-represented categories, preventing training imbalance.
- Anti-Fabrication Validation: A multi-layer validator rejects generated variations containing non-whitelisted URLs, fabricated payment methods, or invented policies — using sentence-level negation awareness to distinguish refusals from hallucinations.
- Automatic Deduplication: Near-duplicate removal at an 85% similarity threshold keeps the dataset diverse after expansion.
- One-Command Pipeline: docker compose up triggers data copy → training → GGUF export → LM Studio deployment, with a diagnostic training report generated automatically at the end.
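The 85% near-duplicate filter mentioned above can be sketched with the standard library's difflib, as a greedy quadratic pass (the actual project may use a different similarity measure or a faster index):

```python
from difflib import SequenceMatcher

def deduplicate(conversations: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a conversation only if it is less than `threshold` similar to
    everything already kept. Greedy O(n^2), fine at ~1000 variations."""
    kept: list[str] = []
    for text in conversations:
        if all(SequenceMatcher(None, text, prev).ratio() < threshold
               for prev in kept):
            kept.append(text)
    return kept
```

Earlier items win ties, so ordering the input by quality before deduplicating keeps the best member of each near-duplicate cluster.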
Technical Details
Training runs inside the unsloth/unsloth:latest Docker image (32.5 GB). Four-bit quantization cuts the 9B model's VRAM footprint from ~18 GB to ~6 GB, and LoRA adapters are applied at rank 64 / alpha 128. The optimized configuration uses a batch size of 8 with 2 gradient accumulation steps, BF16 precision (native on Blackwell architecture), and 200 training steps, covering approximately 3–4 epochs over ~175 high-quality conversations. VRAM usage peaks at ~19–22 GB, safely within the RTX 5090's capacity.
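Condensed, the Zone C training step looks roughly like the sketch below, assuming the public Unsloth and TRL APIs. The checkpoint id, paths, and exact argument spellings are assumptions (the project's container script may differ), and this only runs on a CUDA machine inside the container:

```python
# Runs entirely inside the unsloth/unsloth:latest container; the host needs no Python.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit student: ~6 GB VRAM instead of ~18 GB at full precision.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",  # illustrative checkpoint id
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters at rank 64 / alpha 128.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=load_dataset(
        "json", data_files="train_dataset_expanded.jsonl")["train"],
    args=TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,  # effective batch of 16
        max_steps=200,
        bf16=True,                      # native on Blackwell
        output_dir="outputs",
    ),
)
trainer.train()
```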
The data generation side uses an OpenAI-compatible client with exponential-backoff retry logic, chunking source PDFs into 800-character segments before querying the teacher model. Nucleus sampling (top_p = 0.95) during variation generation promotes lexical diversity.
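The chunking and retry behaviour can be sketched in pure Python. Function and parameter names here are assumptions for illustration, not the project's actual helpers:

```python
import random
import time

CHUNK_SIZE = 800  # characters per segment sent to the teacher model

def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split extracted PDF text into fixed-size character segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def with_backoff(call, retries: int = 5, base: float = 1.0):
    """Retry a flaky API call with exponential backoff plus jitter.

    Waits base * (2**attempt + jitter) seconds between attempts and
    re-raises the last exception once retries are exhausted.
    """
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base * (2 ** attempt + random.random()))
```

A call site would wrap each teacher request, e.g. `with_backoff(lambda: client.chat.completions.create(...))`, so transient LM Studio hiccups do not abort a long generation run.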
GGUF export uses F16 precision via llama.cpp integration inside the Unsloth container, producing a deployment-ready model that is automatically copied to the LM Studio models directory.
Impact & Results
- 274 source FAQ entries expanded to ~175 deduplicated, high-quality training conversations after augmentation and validation.
- Full training completes in ~15–20 minutes on the RTX 5090.
- Automated pipeline eliminates manual steps between data generation and model deployment.
- Anti-fabrication validation measurably reduced hallucinated URLs and invented policies in the fine-tuned model's outputs compared to the base student model.