An end-to-end knowledge distillation pipeline for fine-tuning a 9B-parameter Indonesian-Javanese language model. The core engineering challenge was that the GPU server could not have Python installed, requiring a fully Docker-based training workflow coordinated from a separate laptop.

Problem Statement

The production GPU server (RTX 5090) operated in a restricted environment with no Python runtime permitted. Standard fine-tuning workflows assume direct script execution on the training machine. The goal was to distill knowledge from a large 70B teacher model into a smaller 9B student model, deployable for low-latency inference — all without violating the server's environment constraints.

Solution: Three-Zone Architecture

The pipeline is split across three isolated zones, each with a clearly defined responsibility:

  • Zone A — Teacher Inference (LM Studio): The 70B teacher model runs on a separate machine via LM Studio, exposing an OpenAI-compatible API on port 1234. No Python required here.
  • Zone B — Orchestration (Laptop): All Python code — PDF loading, API calls to the teacher, data validation, variation generation — runs on the laptop. This zone produces the JSONL training dataset.
  • Zone C — Training (Docker on GPU Server): The Unsloth container receives only the JSONL file and a config. It handles model loading, LoRA fine-tuning, GGUF export, and deployment — entirely inside Docker.

This separation means the server never needs Python installed, yet the full pipeline is reproducible and automated with a single command.

Architecture

  ┌──────────────────────────────────────────────────────────────────────┐
  │  ZONE A — TEACHER INFERENCE (LM Studio)                              │
  │                                                                      │
  │  llama-sahabat-ai 70B ──▶ OpenAI-compatible API (:1234)              │
  └──────────────────────────────────┬───────────────────────────────────┘
                                     │  HTTP
  ┌──────────────────────────────────▼───────────────────────────────────┐
  │  ZONE B — ORCHESTRATION (Laptop / Python)                            │
  │                                                                      │
  │  ClickHouse DB ──▶ fetch_faq_to_pdf.py ──▶ PDFs                      │
  │                                                │                     │
  │  PDFs ──▶ PDFLoader ──▶ DataGenerator ──▶ train_dataset.jsonl        │
  │                                                │                     │
  │                                  generate_variations.py              │
  │                                  (8 strategies, ~5x expansion,       │
  │                                   dedup, anti-fabrication)           │
  │                                                │                     │
  │                                  train_dataset_expanded.jsonl        │
  └──────────────────────────────────────────────┬───────────────────────┘
                                                 │  Volume mount
  ┌──────────────────────────────────────────────▼───────────────────────┐
  │  ZONE C — TRAINING (Docker + GPU, no Python on host)                 │
  │                                                                      │
  │  ┌────────────────────────────────────────────────────────────────┐  │
  │  │  Unsloth Container (unsloth/unsloth:latest)                    │  │
  │  │  Load gemma2-9B (4-bit) ──▶ Apply LoRA (r=64, α=128)           │  │
  │  │  SFTTrainer (200 steps, ~15–20 min on RTX 5090)                │  │
  │  │  ──▶ Merge ──▶ GGUF export (F16) ──▶ Training report           │  │
  │  └────────────────────────────────────────────────────────────────┘  │
  │                                              │                       │
  │                              Auto-deploy to LM Studio                │
  └──────────────────────────────────────────────────────────────────────┘

Data Augmentation Pipeline

  274 base conversations (FAQs from ClickHouse)
         │
         ▼
  ┌──────────────────────────────────────────────────────────┐
  │  8 Augmentation Strategies                               │
  │  ├── Persona Shift                                       │
  │  ├── Adversarial Question                                │
  │  ├── Refusal & Boundary                                  │
  │  ├── Creative Rephrase                                   │
  │  ├── Emotional Context                                   │
  │  ├── Negative Confirmation                               │
  │  ├── Explicit Confirmation                               │
  │  └── Out-of-Scope Deflection                             │
  └─────────────────────────┬────────────────────────────────┘
                            │  ~5x expansion
                            ▼
              Category-Aware Balancing
              (~20 topic categories, prioritize under-represented)
                            │
                            ▼
              Anti-Fabrication Validation
              ├── URL whitelist check
              ├── Fabricated payment method detection
              └── Sentence-level negation awareness
                            │
                            ▼
              Deduplication (85% similarity threshold)
                            │
                            ▼
              ~175 high-quality training conversations
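
The final deduplication stage can be sketched with the standard library; the actual similarity metric is not specified in this document, so stdlib difflib.SequenceMatcher is used here as a stand-in:

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the pipeline's 85% cutoff

def deduplicate(conversations: list[str]) -> list[str]:
    """Keep a conversation only if it is <85% similar to every kept one."""
    kept: list[str] = []
    for text in conversations:
        is_duplicate = any(
            SequenceMatcher(None, text.lower(), seen.lower()).ratio()
            >= SIMILARITY_THRESHOLD
            for seen in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept
```

The quadratic pairwise comparison is acceptable at this scale (~1,400 candidates after ~5x expansion); a larger corpus would call for embeddings or MinHash instead.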

Key Features

  • Automated Data Augmentation: 8 augmentation strategies (Persona Shift, Adversarial Question, Refusal & Boundary, Creative Rephrase, Emotional Context, Negative Confirmation, Explicit Confirmation, Out-of-Scope Deflection) expand the base dataset ~5x.
  • Category-Aware Balanced Generation: Detects ~20 topic categories per conversation and allocates augmentation budget to under-represented categories, preventing training imbalance.
  • Anti-Fabrication Validation: A multi-layer validator rejects generated variations containing non-whitelisted URLs, fabricated payment methods, or invented policies — using sentence-level negation awareness to distinguish refusals from hallucinations.
  • Automatic Deduplication: Near-duplicate removal at an 85% similarity threshold keeps the dataset diverse after expansion.
  • One-Command Pipeline: docker compose up triggers data copy → training → GGUF export → LM Studio deployment, with a diagnostic training report generated automatically at the end.
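
The category-aware budget allocation above could look roughly like the following sketch; inverse-frequency weighting is an assumption, not the project's documented scheme:

```python
from collections import Counter

def allocate_budget(categories: list[str], total_budget: int) -> dict[str, int]:
    """Split the augmentation budget across detected topic categories,
    giving under-represented categories a larger share
    (inverse-frequency weighting; an illustrative assumption)."""
    counts = Counter(categories)
    weights = {cat: 1.0 / n for cat, n in counts.items()}
    total_w = sum(weights.values())
    return {cat: round(total_budget * w / total_w) for cat, w in weights.items()}
```

A category seen once gets proportionally more variations than one seen many times, which is what prevents the expanded dataset from amplifying the source FAQ's existing imbalance.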
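
The anti-fabrication validator can be sketched as below. The whitelist entries, payment-method list, and negation cues are illustrative assumptions, not the project's actual rules:

```python
import re

URL_WHITELIST = {"help.example.co.id", "example.co.id"}          # assumed domains
KNOWN_PAYMENT_METHODS = {"bank transfer", "gopay", "ovo"}        # assumed list
NEGATION_CUES = ("tidak", "bukan", "belum", "not", "no longer")  # id/en cues

def _sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"[.!?]\s+", text) if s.strip()]

def validate(text: str) -> bool:
    """Reject generated variations that assert non-whitelisted URLs or
    unknown payment methods. Negation is checked per sentence, so a
    refusal ("we do not support X") passes while a claim ("use X") fails."""
    for sentence in _sentences(text):
        negated = any(cue in sentence.lower() for cue in NEGATION_CUES)
        # URL check: every mentioned domain must be whitelisted
        for domain in re.findall(r"https?://([\w.-]+)", sentence):
            if domain not in URL_WHITELIST and not negated:
                return False
        # Payment-method check: flag "pay via X" for unknown X
        m = re.search(r"pay(?:ment)? via ([\w ]+)", sentence.lower())
        if m and m.group(1).strip() not in KNOWN_PAYMENT_METHODS and not negated:
            return False
    return True
```

The sentence-level negation check is the key detail: without it, legitimate refusal examples (exactly what the Refusal & Boundary strategy generates) would be rejected alongside real hallucinations.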

Technical Details

Training runs inside the unsloth/unsloth:latest Docker image (32.5 GB), using 4-bit quantization to shrink the 9B student model's weights from ~18 GB to a ~6 GB VRAM footprint before applying LoRA adapters at rank 64 / alpha 128. The optimized configuration uses a batch size of 8 with 2 gradient accumulation steps, BF16 precision (native on the RTX 5090's Blackwell architecture), and 200 training steps, covering approximately 3–4 epochs over the ~175 high-quality conversations. VRAM usage peaks at ~19–22 GB, safely within the RTX 5090's 32 GB capacity.
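
The hyperparameters above can be collected into the kind of config Zone C consumes; the key names here are illustrative, not the project's actual schema:

```python
# Hypothetical training config mirroring the values stated above.
train_config = {
    "base_model": "gemma2-9b",
    "load_in_4bit": True,            # ~18 GB -> ~6 GB weight footprint
    "lora_r": 64,
    "lora_alpha": 128,               # alpha = 2 * r
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "precision": "bf16",             # native on Blackwell (RTX 5090)
    "max_steps": 200,
}

# Effective batch size seen by the optimizer per step:
effective_batch = (
    train_config["per_device_batch_size"]
    * train_config["gradient_accumulation_steps"]
)  # 8 * 2 = 16 examples per optimizer step
```

Gradient accumulation is what keeps per-step VRAM at the 8-sample level while the optimizer still updates on 16 samples at a time.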

The data generation side uses an OpenAI-compatible client with exponential-backoff retry logic, chunking source PDFs into 800-character segments before querying the teacher model. Nucleus sampling (top_p = 0.95) during variation generation promotes lexical diversity.
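
The chunking and retry behavior could look like this minimal stdlib sketch; the real loader may split more carefully at sentence boundaries rather than on fixed character offsets:

```python
import random
import time

CHUNK_SIZE = 800  # characters per segment sent to the teacher model

def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split extracted PDF text into fixed-size character segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # delay doubles each attempt: 1s, 2s, 4s, ... plus random jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A teacher request would then be wrapped as, e.g., `with_backoff(lambda: client.chat.completions.create(...))`, so transient LM Studio timeouts never abort a long generation run.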

GGUF export uses F16 precision via llama.cpp integration inside the Unsloth container, producing a deployment-ready model that is automatically copied to the LM Studio models directory.
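
The auto-deploy step reduces to copying the exported GGUF into LM Studio's models directory. A minimal sketch, assuming a flat `<models_dir>/<model_name>/` layout (the directory structure here is an assumption, not LM Studio's documented one):

```python
import shutil
from pathlib import Path

def deploy_gguf(gguf_path: Path, models_dir: Path, model_name: str) -> Path:
    """Copy an exported GGUF file into the LM Studio models directory.
    Layout <models_dir>/<model_name>/<file> is an illustrative assumption."""
    target_dir = models_dir / model_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / gguf_path.name
    shutil.copy2(gguf_path, target)  # preserves file metadata
    return target
```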

Impact & Results

  • 274 source FAQ entries expanded to ~175 deduplicated, high-quality training conversations after augmentation and validation.
  • Full training completes in ~15–20 minutes on the RTX 5090.
  • Automated pipeline eliminates manual steps between data generation and model deployment.
  • Anti-fabrication validation measurably reduced hallucinated URLs and invented policies in the fine-tuned model's outputs compared to the base student model.