
Crawler RAG DocHelper

A powerful Retrieval-Augmented Generation (RAG) system with Sparse-Dense Hybrid Search. Crawls documentation websites, ingests content into a vector database, and provides intelligent question-answering capabilities.

Overview

This project builds a pipeline from documentation crawling to LLM-powered question answering. It parses a site's `llms.txt` documentation index to identify relevant pages, scrapes them via Tavily, splits the content into chunks, embeds and stores them in a vector database, and answers queries with LangChain orchestrating Google Gemini.
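As a rough sketch of the first two stages, here is what index parsing and chunking might look like. The helper names, the markdown-link format assumed for `llms.txt`, and the chunking parameters are illustrative assumptions, not the project's actual code:

```python
import re

def parse_llms_txt(text):
    """Extract documentation page URLs from an llms.txt-style index.

    Assumes the index lists pages as markdown links: [Title](https://...).
    """
    return re.findall(r"\((https?://[^)\s]+)\)", text)

def chunk_text(text, chunk_size=500, overlap=50):
    """Split scraped page text into fixed-size character chunks with
    overlap, mirroring a simple character-based splitter."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded and upserted into the vector index; the overlap keeps sentences that straddle a chunk boundary retrievable from either side.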

Core Engineering Highlights

Sparse-Dense Hybrid Search Implementation

The Problem: Pure dense (semantic) search often retrieves loosely related passages because technical frameworks share overlapping vocabulary, and it can miss exact identifiers such as rare keywords or specific class names.

The Implementation: A hybrid retriever combines dense embeddings (HuggingFace) for semantic similarity and synonym variance with a sparse BM25 encoder for exact-term matching on rare keywords and specific class names. Both vector types live in a single Pinecone dotproduct index, with results blended at an alpha weight of `0.5`.
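A minimal sketch of the alpha weighting, following the convex-scaling convention Pinecone documents for hybrid queries against a single dotproduct index (the function name is illustrative):

```python
def hybrid_scale(dense, sparse, alpha=0.5):
    """Weight dense and sparse query vectors for a Pinecone dotproduct
    hybrid query: alpha=1.0 is pure semantic search, alpha=0.0 pure BM25.

    `sparse` uses the Pinecone sparse-vector shape:
    {"indices": [...], "values": [...]}.
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be in [0, 1]")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse
```

The scaled pair is then passed as the `vector` and `sparse_vector` arguments of a single `index.query(...)` call, so one round trip returns the blended ranking.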

LLM Multi-Level Content Caching

The Problem: Re-scraping hundreds of documentation pages on every run triggers rate limiting, slows down test iterations, and drives up API costs.

The Implementation: The ingestion step caches every scraped page in `crawled_data.json` under a deterministic hash of its URL. Running ingestion a second time resolves identical URLs from the cache and skips the network call to the Tavily Extract API entirely; re-processing costs only the compute time of re-chunking local text.
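A minimal sketch of the cache-before-fetch pattern, assuming a SHA-256 key over the URL and a JSON file store; `fetch_fn` stands in for the Tavily Extract call and all names here are illustrative:

```python
import hashlib
import json
from pathlib import Path

def url_key(url):
    """Deterministic cache key for a URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def fetch_with_cache(url, fetch_fn, cache_file=Path("crawled_data.json")):
    """Return cached page content if this URL was scraped before;
    otherwise call fetch_fn (e.g. a Tavily Extract wrapper) and persist."""
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    key = url_key(url)
    if key in cache:
        return cache[key]          # cache hit: no network call
    content = fetch_fn(url)        # cache miss: scrape once
    cache[key] = content
    cache_file.write_text(json.dumps(cache))
    return content
```

On a second run the `if key in cache` branch short-circuits the scraper, which is what makes repeated ingestion passes cheap and deterministic.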

Noise Dilution via Cohere Reranking

The Problem: Widening the nearest-neighbor search beyond the usual top-k pulls in marginally relevant chunks that dilute the context window sent to Gemini.

The Implementation: A Cohere rerank-english-v3.0 model runs immediately after the Pinecone search response. It rescores the expanded candidate set against the original query string and keeps only the most relevant chunks, reducing hallucinations in technical responses.
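A minimal sketch of the rerank-then-trim step. To keep it self-contained, `score_fn` stands in for the hosted reranker; in the real pipeline it would be a Cohere call such as `co.rerank(model="rerank-english-v3.0", query=query, documents=documents, top_n=top_n)`. The function name and signature here are assumptions:

```python
def rerank_results(query, documents, score_fn, top_n=3):
    """Rescore retrieved chunks against the original query and keep
    only the top_n most relevant, mimicking the post-Pinecone rerank.

    score_fn(query, doc) -> float stands in for the Cohere
    rerank-english-v3.0 relevance model.
    """
    ranked = sorted(documents, key=lambda d: score_fn(query, d), reverse=True)
    return ranked[:top_n]
```

Because the trimmed set is what reaches the prompt, a wider initial top-k can be used safely: recall improves at retrieval time while the reranker keeps the final context window tight.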