AI coding agents - Claude Code, GitHub Copilot, Gemini Code Assist - are remarkably good at writing code. They’re considerably worse at finding documentation.
When an agent needs to understand “how does tenant isolation work?” it typically falls back to grep, file-tree browsing, or reading every file until it finds something relevant. In a monorepo with 600+ documents across 13 component directories, that approach consumes 40,000-50,000 tokens per lookup. At typical LLM pricing, a single documentation-heavy session can cost more than the compute running the pipeline the agent is helping to build.
The obvious solution is a vector database. Embed your documentation, query by semantic similarity, return the top 5 results. Problem solved - except for the bill. Databricks Vector Search endpoints cost ~$400/month (roughly $4,800/year) with no scale-to-zero. Pinecone at production tiers runs $840-1,800/year (roughly $70-150/month). For documentation search - episodic, low-throughput, and perhaps 50 queries per day during active development - that’s renting a warehouse to store a bookshelf.
We found a different answer. Our semantic search system costs $2–10/month, uses no dedicated vector database, and runs entirely within the Databricks workspace.
The standard embedding search architecture requires a vector database to store embeddings and serve low-latency similarity queries. What we discovered is that Databricks SQL Warehouse - already present in every workspace - can compute cosine similarity directly on ARRAY<FLOAT> columns using higher-order array functions: zip_with, aggregate, transform.
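As a rough illustration of the idea, the sketch below pairs a query of that shape with a pure-Python reference for the same arithmetic. The table names `doc_chunks` and `query_embedding`, and the columns `file_path`, `heading`, and `embedding`, are assumptions made for this example rather than the schema described here, and the SQL string has not been run against a live warehouse.

```python
import math

# Hypothetical query: table and column names are illustrative assumptions.
# zip_with multiplies the two arrays element-wise, aggregate folds the
# products into a sum, and transform squares each element for the norms.
COSINE_SQL = """
SELECT d.file_path,
       d.heading,
       aggregate(
         zip_with(d.embedding, q.embedding,
                  (x, y) -> CAST(x AS DOUBLE) * CAST(y AS DOUBLE)),
         CAST(0 AS DOUBLE), (acc, v) -> acc + v)
       / (sqrt(aggregate(
                 transform(d.embedding, x -> CAST(x AS DOUBLE) * CAST(x AS DOUBLE)),
                 CAST(0 AS DOUBLE), (acc, v) -> acc + v))
          * sqrt(aggregate(
                 transform(q.embedding, x -> CAST(x AS DOUBLE) * CAST(x AS DOUBLE)),
                 CAST(0 AS DOUBLE), (acc, v) -> acc + v))) AS similarity
FROM doc_chunks d
CROSS JOIN query_embedding q
ORDER BY similarity DESC
LIMIT 5
"""


def cosine_similarity(a, b):
    """Reference implementation of what the SQL expression computes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, rows, k=5):
    """Rank (file_path, embedding) rows by similarity, highest first."""
    scored = [(cosine_similarity(query_vec, emb), path) for path, emb in rows]
    return sorted(scored, reverse=True)[:k]
```

The Python functions are only there to make the arithmetic easy to check locally; in the design described here the warehouse would do this work inside the query.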
This means the Delta table that stores your document chunks is your vector database. No separate service. No external API key. No always-on infrastructure. The SQL Warehouse scales to zero when idle; it costs nothing when nobody’s searching.
The full architecture:
A semantic search result costs the agent ~1,100 tokens. Compare that to a grep-based search: ~46,000 tokens per query. That’s a 42x reduction. An agent making 10 documentation lookups per session drops from ~464K tokens to ~10.6K.
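The session figures can be checked with a few lines of arithmetic. This sketch uses the benchmark averages from the results table (46,442 and 1,058 tokens) and assumes the 42x figure is derived from the rounded ~1,100-token result size; the constant and function names are illustrative.

```python
GREP_TOKENS_PER_QUERY = 46_442       # benchmark average for grep
SEMANTIC_TOKENS_PER_QUERY = 1_058    # benchmark average for semantic search
ROUNDED_SEMANTIC_TOKENS = 1_100      # the "~1,100 tokens" result size
LOOKUPS_PER_SESSION = 10


def session_tokens(tokens_per_query, lookups=LOOKUPS_PER_SESSION):
    """Total tokens an agent spends on documentation lookups in one session."""
    return tokens_per_query * lookups


grep_session = session_tokens(GREP_TOKENS_PER_QUERY)
semantic_session = session_tokens(SEMANTIC_TOKENS_PER_QUERY)
reduction = GREP_TOKENS_PER_QUERY / ROUNDED_SEMANTIC_TOKENS

print(f"grep: {grep_session:,} tokens per session")
print(f"semantic: {semantic_session:,} tokens per session")
print(f"reduction: about {reduction:.0f}x")
```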
We didn’t design this system top-down. We watched agents struggle with real queries and kept asking “why did that fail?”
Stage 1: Grep. Fast, no infrastructure, great for exact matches. Failed completely on conceptual queries - “how does tenant isolation work?” returns zero results when the phrase doesn’t appear verbatim.
Stage 2: Directory-level indexes. LLM-generated summaries at the directory level, manually curated. Dramatically better for navigational queries. Too expensive to maintain; still cost 25,000+ tokens per lookup because agents had to read full index files.
Stage 3: Top-level manifest. A root manifest pointing to all directory indexes. Best precision in our benchmark (P@5=0.200). But agents still had to read through metadata hierarchies, and the system didn’t scale to a codebase where hundreds of documents change weekly.
Stage 4: Flat semantic search. Single API call, 42x token reduction, zero maintenance. Precision dropped (P@5=0.117) because content chunks capture what paragraphs say, not what documents are about.
Stage 5: Hierarchical chunking with LLM summaries. Alongside content chunks, the pipeline generates document summaries (LLM-distilled descriptions of what each document covers and why an agent would consult it) and section summaries. These are embedded and participate in the same cosine similarity search. Conceptual query precision improved 18% on medium-difficulty queries. Cost per index build: ~$0.17.
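One way to picture the hierarchy is as three kinds of rows that land in the same table and are embedded the same way. The sketch below is a guess at that shape, not the pipeline's actual code: the `Chunk` fields, the `chunk_type` labels, and the `summarize` callable (standing in for the LLM call) are all hypothetical, and it assumes the documents are markdown with `#`-style headings.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    file_path: str
    heading: str
    chunk_type: str  # "doc_summary", "section_summary", or "content"
    text: str


def build_chunks(file_path: str, markdown: str,
                 summarize: Callable[[str], str]) -> List[Chunk]:
    """Emit one document summary, then a summary and a content chunk per section."""
    chunks = [Chunk(file_path, "", "doc_summary", summarize(markdown))]
    # Splitting on heading lines with a capturing group yields
    # [preamble, heading1, body1, heading2, body2, ...].
    parts = re.split(r"^(#{1,6} .*)$", markdown, flags=re.MULTILINE)
    for heading, body in zip(parts[1::2], parts[2::2]):
        title = heading.lstrip("#").strip()
        body = body.strip()
        if not body:
            continue  # skip headings with no text under them
        chunks.append(Chunk(file_path, title, "section_summary", summarize(body)))
        chunks.append(Chunk(file_path, title, "content", body))
    return chunks
```

Because every row carries the same fields, a single similarity query can rank summaries and content chunks together, which is the property the stage relies on.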
We tested three approaches across 15 queries spanning keyword lookups, conceptual questions, and cross-component architectural questions:
| Metric | Grep | Structured Index | Semantic Search |
|---|---|---|---|
| Avg Tokens | 46,442 | 39,594 | 1,058 |
| Avg Precision@5 | 0.107 | 0.200 | 0.127 |
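For readers unfamiliar with the metric, Precision@5 is conventionally the fraction of the top five results that are relevant, averaged over all queries. The sketch below follows that convention and divides by k even when fewer than k results come back; how the benchmark here handled that case is not stated, so treat it as an assumption. The function names are illustrative.

```python
def precision_at_k(ranked_paths, relevant_paths, k=5):
    """Fraction of the top-k ranked results that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for path in ranked_paths[:k] if path in relevant_paths)
    return hits / k


def mean_precision_at_k(runs, k=5):
    """Average precision@k over (ranked_paths, relevant_paths) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

Under this definition a score of 0.200 means that, on average, one of the five returned results was relevant.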
The structured index wins on precision (0.200 versus 0.127) because LLM reasoning about document purpose outperforms real-time similarity matching on the queries that depend on knowing what a document is about rather than what its paragraphs say. Semantic search wins on token economy (42x reduction) and on medium-difficulty conceptual queries, where embedding similarity bridges vocabulary gaps that keyword matching can’t close.
The takeaway: neither approach is universally best. For episodic documentation workloads, semantic search’s token efficiency and zero-maintenance properties make it the practical choice. Hierarchical summaries narrow the precision gap on the medium-difficulty conceptual queries where it matters most.
No new credentials to manage. No new vendor contracts. No new monitoring dashboards. The Delta table is governed by Unity Catalog. The SQL Warehouse is managed by the workspace. The Foundation Model API authenticates with the same Databricks tokens used by every other component.
The operational surface area of this system is three things: the Delta table, the weekly indexing job, and the search CLI. Engineers interact with it through a single command:
The agent gets ranked results - file path, heading, similarity score, content snippet - in ~1,100 tokens. It knows exactly which files to read and which sections to focus on.