TurboRAG
TurboRAG is a production-grade compressed vector retrieval engine with graph-augmented search, implementing the quantization techniques from Google Research's TurboQuant (ICLR 2026), Quantized Johnson-Lindenstrauss (QJL) (AAAI 2025), and PolarQuant (AISTATS 2026) papers by Amir Zandieh and Vahab Mirrokni.
Performance
Small Scale (1K vectors, 128-dim, 100 queries, k=10, 4-bit)
| Backend | Recall@10 | MRR | QPS | Memory |
|---|---|---|---|---|
| TurboRAG 4-bit | 1.000 | 1.000 | 6,209 | 0.08 MB |
| Exact float32 | 1.000 | 1.000 | 26,774 | 0.49 MB |
| FAISS Flat | 1.000 | 1.000 | 32,384 | 0.49 MB |
| FAISS HNSW | 1.000 | 1.000 | 23,640 | 0.55 MB |
| FAISS IVF-PQ | 0.990 | 0.990 | 27,438 | < 0.49 MB |
Large Scale (100K vectors, 384-dim, 200 queries, k=10, 3-bit)
Latest local reproducible runs as of 2026-04-20. Exact mode uses the weighted integer scorer — a LUT-free arithmetic kernel that exploits the affine-uniform quantizer to score packed 3-bit vectors with direct weight × level dot products instead of table lookups. Machine-agnostic: auto-vectorizes to SSE/AVX on x86 and NEON on ARM via -O3 -march=native.
| Backend | Recall@10 | QPS (single) | QPS (batch) | Memory | Notes |
|---|---|---|---|---|---|
| TurboRAG exact | 1.000 | 171 | 114 | 19.2 MB | Weighted integer scorer, 8 threads |
| TurboRAG fast | 0.965 | 158 | — | 19.2 MB | Binary sketch head + LUT refine |
| Exact float32 | 1.000 | 64 | — | 153.6 MB | NumPy brute force |
| FAISS Flat | 1.000 | 80 | — | 146.5 MB | Brute force |
| FAISS HNSW | 0.575 | 1,737 | — | 152.6 MB | High throughput, recall drop |
TurboRAG auto mode selects the best strategy per query: exact for small indexes, fast for large ones.
Exact mode uses the native C scorer and defaults to up to 8 threads. Override with TURBORAG_EXACT_THREADS=<n> when benchmarking.
Key advantages:
- 8× memory compression — 19.2 MB vs 153.6 MB float32 at 100K scale, with perfect recall in exact mode
- 171 QPS exact at 100% recall — the weighted integer scorer now beats the local float32 and FAISS Flat baselines on the latest reproducible 100K rerun
-
Guaranteed exact mode —
mode="exact"always gives perfect recall when accuracy is non-negotiable -
Machine-agnostic — no platform-specific intrinsics; compiler auto-vectorization works on x86, ARM, and any architecture with
-O3 -march=native - Production-hardened with atomic persistence, concurrency-safe HTTP service, and input validation
What Ships Today
- Core engine: Compressed vector index with adaptive two-stage search (binary sketch head + fused LUT refine), C scoring kernel with POPCNT, batch search, threaded shard scanning
- Graph retrieval: Entity extraction, community detection, hybrid dense+graph search with explainability
- Document ingestion: Token-aware chunking for PDF, markdown, and plain text with metadata propagation
- Sidecar adoption: Drop-in compatibility adapters for existing RAG systems — no database migration required
- HTTP service: Production-hardened REST API with CORS, metrics, request tracking, batch queries, text ingestion, concurrency-safe mutations
-
Atomic persistence:
save()clears stale shard files before writing, preventing deleted vectors from reappearing - MCP server: Tool-based agent integration over stdio (query, describe, ingest)
- Benchmark suite: Side-by-side comparison against exact float and FAISS backends
- CLI: Full command-line interface for import, query, benchmark, serve, and MCP modes
- Client SDKs: TypeScript/Node.js, Go, Ruby, and Rust clients for language-agnostic HTTP integration
- Docker: Production-ready multi-stage Dockerfile with pre-compiled C kernel
Package Layout
src/turborag/
__init__.py
_cscore.c # C scoring kernel (auto-compiled)
_cscore_wrapper.py # ctypes bridge with auto-compilation
adapters/ # LangChain-style and generic compatibility adapters
benchmark.py # Side-by-side benchmark harness
chunker.py # Token-aware PDF/MD/text chunking
cli.py # Click-based CLI with global logging options
compress.py # Rotation, quantization, LUT-based scoring
embeddings.py # Optional sentence-transformers integration
exceptions.py # Domain-specific exception hierarchy
fast_kernels.py # Vectorised LUT scoring (Python + C dispatch)
graph.py # Entity graph with persistence and community detection
hybrid.py # Dense + graph hybrid retrieval
index.py # TurboIndex with search, batch, delete, update
ingest.py # Dataset import and sidecar builder
mcp_server.py # MCP stdio server (query, describe, ingest tools)
service.py # Starlette HTTP service with CORS, metrics, batch
types.py # ChunkRecord, RetrievalResult dataclasses
tests/ # 104+ tests
Dockerfile # Multi-stage production build
Quick Start
import numpy as np
from turborag import TurboIndex
rng = np.random.default_rng(42)
index = TurboIndex(dim=8, bits=3, seed=7)
vectors = rng.normal(size=(100, 8)).astype(np.float32)
ids = [f"chunk-{i}" for i in range(len(vectors))]
index.add(vectors, ids)
results = index.search(vectors[0], k=5)
for chunk_id, score in results:
print(chunk_id, score)Install
From PyPI
# Core
pip install turborag
# Everything
pip install turborag[all]From Source
git clone https://github.com/ratnam1510/turborag.git
cd turborag
pip install -e '.[all,dev]'Individual extras: graph, embed, ingest, serve, mcp, all, dev. See docs/installation.md for details.
Run As A Service
turborag serve --index ./my_index --host 0.0.0.0 --port 8080 --workers 4Endpoints:
-
GET /health— health check -
GET /index— index configuration and stats -
GET /metrics— latency histograms and error counts -
POST /query— single query (vector or text) -
POST /query/batch— batch vector queries -
POST /ingest— add records with embeddings -
POST /ingest-text— raw text ingestion with auto-chunking
CORS is enabled by default. Use --cors-origins to restrict.
Docker
docker build -t turborag .
docker run -p 8080:8080 -v ./my_index:/data/index turborag \
turborag serve --index /data/index --host 0.0.0.0Side-By-Side Benchmark
# One-command local comparison
./scripts/benchmark_compare.sh
# Custom benchmark
turborag benchmark \
--index ./turborag_sidecar \
--queries ./queries.jsonl \
--turborag-mode exact \
--dataset ./corpus.jsonl \
--baseline exact \
--k 10For the current exact-mode path, use TURBORAG_EXACT_THREADS=8 or leave it unset to use the default native thread count.
See docs/benchmarking.md.
MCP Agent Integration
turborag mcp --index ./my_indexTools exposed: turborag_query, turborag_describe, turborag_ingest.
Sidecar Adoption (No Database Migration)
TurboRAG runs as a retrieval sidecar alongside your existing RAG database:
- Keep your current chunk/document store untouched
- Build a TurboRAG index from your existing embeddings
- Query TurboRAG for ranked IDs → hydrate from your existing DB
- Gradually shift traffic as confidence grows
Full guide: docs/current-rag-rollout.md.
Existing DB Integration (ID-Only, Low Memory)
If you already have text and metadata in your own database, TurboRAG can return ranked IDs only and skip local hydration entirely.
- Python adapter: build from embeddings and pass your existing backend client.
- HTTP service: use
POST /querywith"hydrate": false. - CLI: use
turborag query --ids-only.
This mode minimizes memory and keeps TurboRAG as a pure retrieval sidecar.
from turborag.adapters.compat import ExistingRAGAdapter
adapter = ExistingRAGAdapter.from_existing_backend(
embeddings=embeddings_matrix,
ids=chunk_ids,
query_embedder=embedder,
records_backend=your_existing_db_client,
bits=3,
)
hits = adapter.search_ids("What changed in capex guidance?", k=5)
# Hydrate hits from your own DB path.Known backend helper builders are available in turborag.adapters.backends:
- Postgres / Neon / Supabase Postgres:
build_postgres_fetch_records(...)/build_neon_fetch_records(...) - Supabase Python client:
build_supabase_fetch_records(...) - Pinecone:
build_pinecone_fetch_records(...) - Qdrant:
build_qdrant_fetch_records(...) - Chroma:
build_chroma_fetch_records(...)
Or configure plug-and-play adapter mode via CLI:
turborag adapt set neon --index ./turborag_sidecar --option dsn=${DATABASE_URL}
turborag serve --index ./turborag_sidecarShortcut command style (env-aware):
turborag adapt supabase --index ./turborag_sidecar
turborag serve --index ./turborag_sidecarAutomatic mode (detect backend from env):
turborag adapt --index ./turborag_sidecarIf you're in the index directory already:
turborag adaptNeed a quick starter command for a backend?
turborag adapt demo supabaseturborag serve --index ./turborag_sidecar --no-load-snapshot
curl -X POST http://localhost:8080/query \
-H 'content-type: application/json' \
-d '{"query_vector":[0.1,0.2,0.3],"top_k":5,"hydrate":false}'Verification
# Test suite (104+ tests)
pytest
# End-to-end smoke test
./scripts/smoke_test.sh