A Dockerized Retrieval-Augmented Generation (RAG) system built as the foundational chatbot layer for a B2B FMCG e-commerce platform. The system enables intelligent customer interaction in Indonesian — resolving the core challenge that customers use colloquial and regional product names that don't match official catalog entries.
Problem Statement
In Indonesian B2B commerce, customers refer to products by colloquial names, brand abbreviations, or regional terms. A standard keyword search fails completely — "indomie kuning" or "mie soto" won't match "Indomie Mi Instan Rasa Kaldu Ayam" in a traditional catalog. The platform also needed to handle 35+ distinct customer intent types (from cart operations to profile inquiries) and answer FAQ questions without hardcoding lookup tables.
Solution: Unified RAG Architecture
The system uses a three-collection ChromaDB setup with semantic embeddings (sentence-transformers all-MiniLM-L6-v2) to resolve queries across three knowledge domains in parallel:
- Product collection — each product is embedded with its official name, colloquial aliases, pack size, and Indonesian description, enabling fuzzy semantic matching regardless of naming variation.
- FAQ collection — FAQs are sourced from ClickHouse and indexed into ChromaDB, enabling semantic question matching that handles paraphrase and spelling variation.
- Intent collection — 35+ e-commerce intent types with example phrases are embedded, allowing the system to classify the customer's action intent before routing to the LLM.
A single UnifiedRetriever searches all three collections and returns the best match ranked by relevance_score = 1.0 - cosine_distance. The UnifiedRAGOrchestrator then builds a targeted LLM context from the matched result and generates an Indonesian-language response.
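The ranking rule above can be sketched in plain Python. This is a minimal stand-in, not the project's actual code: the real system embeds with sentence-transformers (all-MiniLM-L6-v2) and queries ChromaDB, while the toy bag-of-words `embed` below exists only so the `relevance_score = 1.0 - cosine_distance` selection across three collections can be shown end to end.

```python
import math

def embed(text):
    """Toy stand-in for a sentence-transformer: bag-of-words token counts."""
    vec = {}
    for token in text.lower().split():
        vec[token] = vec.get(token, 0) + 1
    return vec

def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

class UnifiedRetriever:
    """Searches every collection and returns the single best match."""

    def __init__(self, collections):
        # collections: {"products": [(doc_id, text), ...], "faqs": [...], "intents": [...]}
        self.index = {
            name: [(doc_id, embed(text)) for doc_id, text in docs]
            for name, docs in collections.items()
        }

    def search(self, query):
        q = embed(query)
        best = None
        for name, docs in self.index.items():
            for doc_id, vec in docs:
                score = 1.0 - cosine_distance(q, vec)  # relevance_score
                if best is None or score > best["relevance_score"]:
                    best = {"collection": name, "id": doc_id, "relevance_score": score}
        return best
```

In production each collection lives in ChromaDB and the query embedding is computed once and reused across the three searches; the selection logic is the same.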
Architecture
```
User Query (Indonesian / colloquial)
        │
┌───────▼───────────────────────────────┐
│            UnifiedRetriever           │
│ Embedding → parallel ChromaDB search  │
└───┬───────────────┬───────────────┬───┘
    ▼               ▼               ▼
[Products]       [FAQs]         [Intents]
colloquial     ClickHouse       35+ types
name mapping     sourced
    │               │               │
    └───────┬───────┴───────────────┘
            │ best match by relevance_score
            ▼
┌───────────────────────────────────────┐
│         UnifiedRAGOrchestrator        │
│    LLM context build + response       │
│  Function calling: check_inventory()  │
└───────────────────────────────────────┘
            │
            ▼
  Response + Order Tracking (JSON)
```
Key Features
- Colloquial name resolution: Products are indexed with semicolon-separated colloquial alias lists. A query for "mie goreng bungkusan merah" correctly resolves to the official product via semantic similarity, with no explicit alias lookup table.
- LLM function calling for inventory: When a product query includes a quantity, the orchestrator enables a `check_inventory(sku, requested_quantity)` function call. The LLM invokes it automatically, and the result is captured by `OrderTracker` and persisted to JSON — simulating an order capture flow.
- ClickHouse FAQ integration: FAQ data is fetched from a ClickHouse database at index time and stored in ChromaDB, decoupling the vector search from live database queries during runtime.
- Docker-first deployment: The entire system runs inside a single Docker container. Indexing scripts are run as one-time setup commands inside the container, with ChromaDB persisted to a mounted volume. No Python installation required on the host server.
- Windows compatibility: All scripts include UTF-8 reconfiguration for Windows console output, and dependency versions are pinned to avoid onnxruntime DLL issues specific to Windows + torch 2.8.0.
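The inventory function-calling flow above can be sketched as follows. This is a hedged illustration, not the project's code: the tool-call dict mimics an OpenAI-style function call, and the in-memory stock table, the `OrderTracker` fields, and the file layout are assumptions made for the example.

```python
import json

STOCK = {"SKU-001": 120}  # assumed in-memory inventory for this sketch

def check_inventory(sku, requested_quantity):
    """Tool the LLM can call when a product query includes a quantity."""
    available = STOCK.get(sku, 0)
    return {
        "sku": sku,
        "requested_quantity": requested_quantity,
        "available": available,
        "fulfillable": available >= requested_quantity,
    }

class OrderTracker:
    """Captures fulfillable inventory checks and persists them to JSON."""

    def __init__(self, path):
        self.path = path
        self.orders = []

    def record(self, result):
        if result["fulfillable"]:
            self.orders.append(result)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.orders, f, ensure_ascii=False, indent=2)
        return result

def handle_tool_call(call, tracker):
    """Dispatch an LLM tool call of the form {"name": ..., "arguments": {...}}."""
    if call["name"] == "check_inventory":
        return tracker.record(check_inventory(**call["arguments"]))
    raise ValueError(f"unknown tool: {call['name']}")
```

The LLM decides when to emit the call; the orchestrator only dispatches it and hands the structured result back into the conversation context.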
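The Windows UTF-8 reconfiguration mentioned above amounts to a small guard at script startup. A minimal sketch (the `force_utf8` helper name is illustrative; the mechanism is the standard `TextIOWrapper.reconfigure` available since Python 3.7):

```python
import sys

def force_utf8(stream):
    """Reconfigure a text stream to UTF-8 if it supports it (Python 3.7+).

    Windows consoles often default to a legacy code page such as cp1252,
    which mangles Indonesian product names; the guard keeps the call safe
    when the stream has been replaced by an object without reconfigure().
    """
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return getattr(stream, "encoding", None)

# Apply to the standard streams before any console output.
for s in (sys.stdout, sys.stderr):
    force_utf8(s)
```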
Technical Challenges
The primary compatibility challenge was ChromaDB version locking. ChromaDB 1.3.4 causes segmentation faults on Windows when paired with torch 2.8.0 — the version required for sentence-transformers 2.7.0. The fix was to pin chromadb==0.5.0 and document the constraint explicitly. Similarly, onnxruntime was removed from requirements.txt entirely (Docker uses PyTorch backend; Windows developers install 1.16.3 locally), preventing DLL load failures in the Docker image.
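Expressed as a `requirements.txt` fragment, the constraints described above would look roughly like this (version numbers are the ones stated in this section; treat the exact file layout as illustrative):

```text
chromadb==0.5.0              # 1.3.4 segfaults on Windows with torch 2.8.0
sentence-transformers==2.7.0
torch==2.8.0
# onnxruntime intentionally omitted: Docker uses the PyTorch backend;
# Windows developers install onnxruntime==1.16.3 locally.
```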
Intent classification required careful data design: the intent collection indexes not just intent names but concrete example customer phrases, so the embedding captures linguistic variation rather than just categorical labels. This keeps the system adaptable — adding a new intent requires only adding example phrases to data/intent.txt and re-indexing.
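The indexing step for intent examples can be sketched as below. The actual format of `data/intent.txt` is not shown in this document, so the tab-separated `intent<TAB>phrase` layout here is an assumption for illustration; the point is that every example phrase becomes its own embeddable document.

```python
def load_intent_examples(lines):
    """Group example phrases by intent name.

    Assumed line format (hypothetical): "intent_name<TAB>example phrase".
    Blank lines and '#' comments are skipped.
    """
    intents = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        intent, _, phrase = line.partition("\t")
        intents.setdefault(intent, []).append(phrase)
    return intents

def to_documents(intents):
    """Flatten to (doc_id, text, metadata) triples ready for a vector index.

    Indexing phrases rather than intent names is what lets the embedding
    capture linguistic variation instead of categorical labels.
    """
    docs = []
    for intent, phrases in intents.items():
        for i, phrase in enumerate(phrases):
            docs.append((f"{intent}-{i}", phrase, {"intent": intent}))
    return docs
```

Adding a new intent is then just a matter of appending lines to the data file and re-running the indexing script.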
Impact
- Established the semantic retrieval baseline for the platform's chatbot — covering product search, FAQ, and intent classification in a single unified pipeline.
- 35+ intent types classified with no hardcoded keyword matching, using only semantic similarity against example phrases.
- Dockerized deployment makes the system reproducible across environments with a single `docker-compose up`.
- Colloquial name resolution removes the need for a manually maintained alias lookup table, reducing ongoing maintenance overhead.