MCP Doc Server

An MCP server that indexes documents into collections, extracts and chunks content, generates embeddings, and stores them in a vector database for semantic and hybrid search. Everything runs locally.

I built this for personal use so LLMs can better understand my project docs and requirements, and assist with deep, project-specific development tasks.

Document Indexing & Management

The system indexes documents from URLs or local files in PDF, HTML, Markdown, and DOCX formats using Docling. Documents are organized into collections, each with its own chunking parameters.

The indexing pipeline uses parallel processing for large documents and intelligent chunking with configurable size and overlap.
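A minimal sketch of sliding-window chunking with configurable size and overlap (the function and parameter names here are illustrative, not the server's actual API):

    def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
        """Split text into overlapping windows of roughly chunk_size characters."""
        if chunk_size <= overlap:
            raise ValueError("chunk_size must be larger than overlap")
        chunks = []
        start = 0
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += chunk_size - overlap  # step back by `overlap` to preserve context
        return chunks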

[Demo: indexing Stripe's API documentation]

Semantic Search

Search uses vector similarity over Qwen/Qwen3-Embedding-0.6B embeddings for fast retrieval across collections. Results can be reranked using keyword boost, length penalty, and position boost.

The system supports context windows to retrieve surrounding chunks, enables searching within specific documents or collections, and returns metadata such as page numbers, sections, and headings.
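A rough sketch of how such a rerank pass might combine those three signals (the result fields similarity, text, and chunk_index, and all the weights, are assumptions made for illustration):

    def rerank(results, query_terms, boost=0.1, length_pen=0.0005, pos_boost=0.05):
        """Rescore vector-search hits with three heuristic signals."""
        for r in results:
            score = r["similarity"]
            # Keyword boost: reward chunks that literally contain query terms.
            hits = sum(t.lower() in r["text"].lower() for t in query_terms)
            score += boost * hits
            # Length penalty: discount very long chunks that dilute relevance.
            score -= length_pen * max(0, len(r["text"]) - 1000)
            # Position boost: mildly favor chunks from earlier in the document.
            score += pos_boost / (1 + r["chunk_index"])
            r["score"] = score
        return sorted(results, key=lambda r: r["score"], reverse=True)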

[Demo: searching through a 130-page design document]

Local-First Architecture

Everything runs on your machine. The Python FastMCP server handles document processing, LanceDB stores vectors directly on disk, and SQLite manages metadata.

The embedding model is downloaded from Hugging Face and runs locally, so all computation happens on your hardware. This keeps your documents private and eliminates dependence on external services.
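A minimal sketch of this local stack, assuming the model is loaded via sentence-transformers and vectors are persisted in a LanceDB table on disk (the table name, path, and schema are illustrative):

    import lancedb
    from sentence_transformers import SentenceTransformer

    # Model weights are pulled from Hugging Face once, then cached locally.
    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    db = lancedb.connect("./data/vectors")  # vectors live directly on disk

    chunks = ["First chunk of a document...", "Second chunk..."]
    table = db.create_table(  # fails if the table exists; fine for a sketch
        "chunks",
        data=[{"text": t, "vector": model.encode(t).tolist()} for t in chunks],
    )

    # Query entirely on-device: embed the query, then run vector search.
    hits = table.search(model.encode("how do refunds work?")).limit(5).to_list()
    print([h["text"] for h in hits])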

Advanced Features

Building an efficient document indexing system required solving several performance and reliability challenges:

  1. Reducing redundant embedding computations
  2. Handling large documents without memory issues
  3. Extracting rich structural metadata for better search context

LRU Embedding Cache

I implemented an LRU cache that stores embeddings keyed by text content hashes. When the same or similar content appears (like repeated headers or boilerplate), the system retrieves the cached embedding instead of recomputing. This dramatically speeds up re-indexing and batch processing of documents with overlapping content.
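A simplified sketch of the idea, keying an OrderedDict-based LRU on a SHA-256 hash of the chunk text (the class and method names are hypothetical):

    import hashlib
    from collections import OrderedDict

    class EmbeddingCache:
        """LRU cache of embeddings keyed by a hash of the chunk text."""

        def __init__(self, max_size: int = 10_000):
            self.max_size = max_size
            self._store: OrderedDict[str, list[float]] = OrderedDict()

        def get_or_compute(self, text: str, embed_fn) -> list[float]:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if key in self._store:
                self._store.move_to_end(key)   # mark as most recently used
                return self._store[key]
            vector = embed_fn(text)            # cache miss: compute embedding
            self._store[key] = vector
            if len(self._store) > self.max_size:
                self._store.popitem(last=False)  # evict least recently used
            return vector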

Parallel Chunking Architecture

Large documents can generate hundreds or thousands of chunks. I built a parallel chunking system using Python's ThreadPoolExecutor that distributes chunk creation across multiple workers once a document exceeds 100 chunks. Metadata extraction happens concurrently, significantly reducing indexing time for large PDFs.
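In sketch form, the dispatch might look like this (names such as build_chunks and make_chunk are hypothetical; the 100-chunk threshold matches the description above):

    from concurrent.futures import ThreadPoolExecutor

    PARALLEL_THRESHOLD = 100  # switch to parallel workers above this count

    def build_chunks(sections, make_chunk, max_workers: int = 8):
        """Create chunks serially for small docs, in parallel for large ones."""
        if len(sections) < PARALLEL_THRESHOLD:
            return [make_chunk(s) for s in sections]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # executor.map preserves input order, so chunk order stays stable
            return list(pool.map(make_chunk, sections))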

Structural Metadata Extraction

Beyond just text, the system extracts rich structural information using Docling's document model. Each chunk carries metadata including page numbers, section hierarchies, and character spans. This enhances search results by providing context about where information appears, making it easier to navigate to exact locations and retrieve surrounding text.
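A sketch of pulling that metadata out of Docling's chunking output, based on my reading of its API (field paths such as meta.doc_items[0].prov[0] may vary by document type, and provenance can be empty for some items):

    from docling.document_converter import DocumentConverter
    from docling.chunking import HybridChunker

    doc = DocumentConverter().convert("design-doc.pdf").document
    for chunk in HybridChunker().chunk(doc):
        meta = chunk.meta
        prov = meta.doc_items[0].prov[0]  # provenance of the chunk's first item
        print({
            "text": chunk.text[:60],
            "page": prov.page_no,
            "charspan": prov.charspan,
            "headings": meta.headings,  # section hierarchy above this chunk
        })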
