
Multi‑Source RAG — Ten Patterns with ADK


Pattern 1 — Multi‑Source Retrieval & Synthesis Orchestration

Pattern type: Architecture / Evaluation

Context / Background
Real questions often require complementary evidence scattered across multiple documents (narrative chapters, meeting segments). Systems must retrieve from diverse sources and synthesize long‑form answers.

Problem
Single‑source or factoid QA hides failure modes: low retrieval diversity, missing complementary evidence, and weak long‑form synthesis.

Forces

  • Recall vs. context budget
  • Complementary vs. overlapping evidence
  • Latency/cost vs. quality (reranking, longer prompts)
  • Abstractive coherence vs. extractive faithfulness

Solution (overview)
Split the pipeline into (A) multi‑source retrieval (dense + reranker) and (B) a dedicated synthesis step run by a reasoning model. Evaluate with MSRS‑style tasks that require multi‑doc integration.

Solution — Ten Steps

  1. Build a unified corpus with doc‑level and segment‑level metadata.
  2. Chunk to ~800–1200 tokens; store both chunk and doc embeddings.
  3. Use domain‑appropriate dense embeddings (story vs. meeting).
  4. Retrieve top‑K chunks per query; ensure source diversity (distinct docs).
  5. Rerank chunks (pairwise or cross‑encoder) to select complementary evidence.
  6. De‑duplicate and expand to full passages; enforce per‑doc quotas.
  7. Construct a synthesis prompt with citations and per‑doc keypoints.
  8. Run a reasoning LLM for long‑form synthesis; keep chain‑of‑thought internal.
  9. Post‑validate with factuality checks against retrieved snippets.
  10. Score with ROUGE‑1/2, BERTScore, and an LLM‑based rubric (G‑Eval‑style).

Implementation (ADK‑style, Python skeleton)

# Pseudocode (ADK‑style). DenseRetriever, VertexAIReranker, enforce_doc_diversity,
# build_synthesis_prompt, and Orchestrator are placeholders for your own stack.
from google.adk.agents import Agent
from google.adk.memory import InMemoryMemoryService
from google.adk.runtime import Orchestrator

retriever = DenseRetriever(index="story_or_meet.index", k=12)  # default k; overridden per call
reranker = VertexAIReranker(model="text-multilingual-rerank@latest")

class MultiSourceRAGAgent(Agent):
    def handle(self, query: str):
        initial = retriever.search(query, k=24)              # high‑recall set
        reranked = reranker.rerank(query, initial)
        diverse = enforce_doc_diversity(reranked, per_doc_max=3, top_k=8)
        prompt = build_synthesis_prompt(query, diverse)
        return self.call_reasoning_llm(prompt, model="gemini-2.5-pro")

orchestrator = Orchestrator(root=MultiSourceRAGAgent(), memory=InMemoryMemoryService())
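
The helpers above are stubs. A minimal sketch of build_synthesis_prompt, assuming each chunk exposes meta["doc_id"] and text (as in the selection helper of Pattern 3):

def build_synthesis_prompt(query: str, chunks) -> str:
    # Tag evidence by source document so the model can cite [D<doc_id>] inline.
    blocks = [f"[D{c.meta['doc_id']}] {c.text}" for c in chunks]
    evidence = "\n\n".join(blocks)
    return (
        "Answer the question using ALL complementary evidence below.\n"
        "Cite sources inline like [D<doc_id>].\n\n"
        f"Question: {query}\n\nEvidence:\n{evidence}\n"
    )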

Resulting Consequences

  • Exposes and reduces retrieval diversity gaps.
  • Stable quality across narrative and meeting domains.
    − Slightly higher latency due to reranking and reasoning LLM step.

Related Patterns: #2 Decontextualization, #5 Reasoning‑Model Synthesis, #8 Domain‑specific Retriever Pairing.


Pattern 2 — Query Decontextualization Gate

Pattern type: Data / Pre‑processing

Context / Background
Legacy datasets contain queries that implicitly rely on hidden context (e.g., “What’s the plot?”). This confuses retrieval and penalizes the wrong stage.

Problem
Ambiguous queries inflate difficulty and degrade retrieval precision; synthesis quality becomes noisy.

Forces

  • Ambiguity vs. realism
  • Rewrite effort vs. evaluation clarity
  • Risk of query drift

Solution (overview)
Rewrite queries into standalone forms before retrieval; add a light classifier that routes “underspecified” queries through a decontextualization step.

Solution — Ten Steps

  1. Define decontextualization rubric (who/what/when/where).
  2. Fine‑tune or prompt an LLM to rewrite underspecified queries.
  3. Detect underspecification with a small classifier (BERT/SFT) or rules.
  4. Preserve original intent via entity and constraint extraction.
  5. Add provenance tags (original vs. rewritten) for auditability.
  6. Re‑run retrieval on rewritten queries; keep both result sets for ablation.
  7. Penalize rewrites that materially change the ask (edit distance + entity diff).
  8. Log retrieval deltas (P@K/R@K/NDCG) and downstream score deltas.
  9. If deltas regress, fall back to human‑approved rewrite templates.
  10. Report decontextualization coverage and gains in eval dashboards.

Implementation (ADK‑style, micro‑agent)

class DecontextualizeAgent(Agent):
    def handle(self, query: str) -> str:
        if is_underspecified(query):
            return rewrite_query(query, model="gemini-2.5-flash")
        return query
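
is_underspecified and rewrite_query are placeholders; a minimal sketch, assuming a rule‑based detector and a generic call_llm wrapper (hypothetical):

def is_underspecified(query: str) -> bool:
    # Heuristic: very short queries, or queries leaning on pronouns, need context.
    pronouns = {"it", "this", "that", "they", "he", "she"}
    tokens = query.lower().split()
    return len(tokens) < 5 or any(t in pronouns for t in tokens)

def rewrite_query(query: str, model: str) -> str:
    # Ask the LLM for a standalone rewrite while preserving intent and entities.
    prompt = (
        "Rewrite this query so it is fully standalone. Preserve the original "
        f"intent and all named entities.\nQuery: {query}"
    )
    return call_llm(prompt, model=model)  # call_llm: placeholder LLM wrapper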

Resulting Consequences

  • Higher retrieval precision/recall; cleaner attribution.
    − Small risk of intent drift (mitigated by entity constraints).

Related Patterns: #1 Orchestration, #6 Oracle‑Gap Diagnostic.


Pattern 3 — Retriever Diversity & K‑Target Tuning

Pattern type: Retrieval Engineering

Context / Background
MSRS domains differ in optimal K and chunking. Narratives need more complementary slices; meetings benefit from concise, timestamped segments.

Problem
Fixed K and naïve chunking reduce complementary coverage or flood the context with redundancy.

Forces

  • Chunk size vs. semantic cohesion
  • K vs. latency and context window
  • Doc‑level vs. segment‑level granularity

Solution (overview)
Tune chunk size (~1K tokens), choose K to match average oracle set per domain, and enforce cross‑doc diversity with per‑doc caps before synthesis.

Solution — Ten Steps

  1. Profile oracle set sizes per domain (e.g., K≈8 story, K≈3 meet).
  2. Chunk at 800–1200 tokens; store title/speaker metadata.
  3. Use mean‑pooled embeddings over tokens; persist doc‑level vectors.
  4. Retrieve k_hi for recall (e.g., 24).
  5. Rerank to k_mid (e.g., 10) with cross‑encoder or Vertex Reranker.
  6. Enforce per‑doc cap (e.g., ≤3 from any doc).
  7. Expand to full passages if needed (keep citations tight).
  8. Run ablations over {k_hi, k_mid, per_doc_cap} (see the sweep sketch below).
  9. Track retrieval metrics alongside generation (linked runs).
  10. Ship tuned defaults per domain config.

Implementation (selection helper)

def select_diverse(chunks, per_doc_max=3, top_k=8):
    # Greedy pass over reranked chunks: cap picks per doc, stop once top_k collected.
    out, seen = [], {}
    for c in chunks:
        d = c.meta["doc_id"]
        if seen.get(d, 0) < per_doc_max:
            out.append(c)
            seen[d] = seen.get(d, 0) + 1
        if len(out) == top_k:
            break
    return out
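
For step 8, a minimal parameter sweep over {k_hi, k_mid, per_doc_cap}, assuming a run_retrieval_eval helper (hypothetical) that returns linked IR/generation scores:

import itertools

GRID = {"k_hi": [18, 24, 32], "k_mid": [6, 8, 10], "per_doc_cap": [2, 3, 4]}

def sweep(queries):
    results = []
    for k_hi, k_mid, cap in itertools.product(*GRID.values()):
        if k_mid >= k_hi:
            continue  # rerank depth must stay below the recall set size
        cfg = {"k_hi": k_hi, "k_mid": k_mid, "per_doc_cap": cap}
        results.append((cfg, run_retrieval_eval(queries, **cfg)))
    # Rank configs by NDCG (or a linked generation metric) and ship the winner.
    return sorted(results, key=lambda r: r[1]["ndcg"], reverse=True)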

Resulting Consequences

  • Better complementary coverage; lower redundancy.
    − Requires domain profiling and stored metadata discipline.

Related Patterns: #8 Domain‑specific Pairing, #4 Long‑Context Guardrail.


Pattern 4 — Long‑Context Guardrail vs. Selective Retrieval

Pattern type: Architecture

Context / Background
The temptation is to “just stuff” the entire corpus into a long‑context model. In meetings, this often exceeds context limits or degrades synthesis quality relative to a strong retriever.

Problem
Truncation, noise injection, and cost explode in long‑context‑only setups.

Forces

  • Context window vs. corpus size
  • Latency/cost vs. precision
  • Summarization sensitivity to irrelevant passages

Solution (overview)
Adopt a guardrail: default to selective retrieval; allow long‑context only when the corpus fits comfortably and shows non‑inferior performance in canary runs.

Solution — Ten Steps

  1. Compute per‑query corpus token count; preflight against window.
  2. Run canary evals: Long‑Context (LC) vs. Strong‑Retriever (SR) vs. Oracle.
  3. If LC ≤ SR − delta on G‑Eval, route to SR; else allow LC.
  4. For meetings, always SR unless corpus < 60% of window.
  5. Prefer reasoning LLM for synthesis even under LC.
  6. Trim boilerplate (agenda, greetings) via regex/learned filters.
  7. Collapse repetitive turns with speaker‑aware summarizers pre‑retrieval.
  8. Cache LC summaries; refresh on corpus change.
  9. Monitor window overflows; fail closed to SR.
  10. Periodically re‑test thresholds as models evolve.

Implementation (routing sketch)

if tokens(corpus) < 0.6 * CONTEXT_LIMIT and lc_performs_ok():
    mode = "long_context"
else:
    mode = "selective_retrieval"

Resulting Consequences

  • Prevents silent truncation losses; lowers costs.
    − Requires periodic A/B checks as LLMs change.

Related Patterns: #1 Orchestration, #6 Oracle‑Gap Diagnostic.


Pattern 5 — Reasoning‑Model Synthesis Step

Pattern type: Generation

Context / Background
Even with oracle (gold) docs, non‑reasoning LLMs miss major/minor details in multi‑doc synthesis.

Problem
Under‑integration of evidence; shallow abstractions.

Forces

  • Coherence vs. coverage
  • Abstractive style vs. lexical‑overlap metrics

Solution (overview)
Use a reasoning LLM for the synthesis step with a scaffold that enumerates per‑doc keypoints and forces cross‑document linkage.

Solution — Ten Steps

  1. Convert retrieved chunks → per‑doc bullets (auto‑salience).
  2. Require model to cite doc‑ids inline (e.g., [D3]).
  3. Ask for contrasts and agreements across docs.
  4. Add a checklist of required elements (entities, events, outcomes).
  5. Enforce sectioned output (Summary / Evidence / Gaps).
  6. Use low temperature; enable reasoning mode.
  7. Set a generous max output length (≥ the 95th percentile of reference summary lengths).
  8. Run self‑critique pass with evidence‑grounded rubric.
  9. Optionally compress to executive digest.
  10. Capture structured citations for audit.

Implementation (prompt scaffold, sketch)

SYNTH_PROMPT = """
You are a synthesis model. Given query Q and passages P[i] with doc_id,
write a coherent long‑form answer that integrates complementary info.
- Enumerate keypoints per doc.
- Link points across docs (agreements/contrasts).
- Cite doc‑ids inline, e.g., [D3].
- Sections: Summary, Evidence, Open Questions.
Q: {query}
P: {passages}
"""

Resulting Consequences

  • Higher G‑Eval (coherence/coverage) even if ROUGE‑2/BERTScore move modestly.
    − Slightly longer latency.

Related Patterns: #1, #3, #6.


Pattern 6 — Oracle‑Gap Diagnostic Harness

Pattern type: Evaluation / Diagnostics

Context / Background
We need to attribute failures to retrieval vs. generation.

Problem
Pipeline confounding hides which component to improve.

Forces

  • Retriever diversity vs. generator capability
  • Metric sensitivity (IR vs. summarization)

Solution (overview)
Run three controlled conditions per query: Oracle (gold docs), Strong Retriever, Long‑Context. Compare ROUGE‑1/2, BERTScore, and G‑Eval‑style rubric; add small‑scale human error tags.

Solution — Ten Steps

  1. Build eval dataset with query, gold summary, oracle doc‑ids.
  2. Implement runners for the three modes.
  3. Log IR metrics (P@K/R@K/NDCG/MAP) for SR.
  4. Compute ROUGE‑1/2/L and BERTScore F1.
  5. Run LLM‑as‑Judge with a fixed rubric (coverage, coherence, grounding).
  6. Sample 40 outputs for human error taxonomy (missing major/minor, hallucination, vagueness).
  7. Attribute gap: if Oracle ≫ SR → fix retrieval; if Oracle low → fix synthesis.
  8. Report per‑domain breakdown (story vs. meeting).
  9. Store run manifests for repro.
  10. Alert when regressions exceed thresholds.

Implementation (runner sketch)

for mode in ["oracle", "strong_retriever", "long_context"]:
    preds = run_mode(mode, dataset)
    scores[mode] = compute_scores(preds, refs)
report = compare(scores)
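
compute_scores is a placeholder; one way to cover the lexical and embedding metrics with the rouge_score and bert_score packages (a sketch; the LLM‑judge rubric is scored separately):

def compute_scores(preds, refs):
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score_fn

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_f = {k: 0.0 for k in ["rouge1", "rouge2", "rougeL"]}
    for pred, ref in zip(preds, refs):
        result = scorer.score(ref, pred)              # signature: (target, prediction)
        for k in rouge_f:
            rouge_f[k] += result[k].fmeasure / len(preds)
    _, _, f1 = bert_score_fn(preds, refs, lang="en")  # returns torch tensors
    return {**rouge_f, "bertscore_f1": f1.mean().item()}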

Resulting Consequences

  • Clear, actionable attribution of errors.
    − Requires curation of oracle doc sets.

Related Patterns: #4 Guardrail, #9 Contamination Safeguard.


Pattern 7 — Multi‑Document Necessity Enforcement (Dataset Construction)

Pattern type: Data / Benchmarking

Context / Background
Benchmarks are often solvable from a single doc, defeating the point of multi‑source RAG.

Problem
Systems “cheat” by relying on one document; evaluation under‑stresses retrieval diversity and synthesis.

Forces

  • Realism vs. construction cost
  • Complementary vs. redundant evidence

Solution (overview)
During dataset construction, ensure that answering the query requires at least two complementary documents; validate necessity by ablating each doc and observing score drops.

Solution — Ten Steps

  1. Start from long‑context, query‑focused MDS sources.
  2. Segment into documents; map queries to needed docs.
  3. Mark complementary roles (plot, setting; agenda, decision).
  4. Create oracle doc sets ≥2 docs per item.
  5. Verify necessity via leave‑one‑out ablations.
  6. Remove items solvable from a single doc.
  7. Human‑validate a sample per batch.
  8. Publish doc‑id mappings for reproducibility.
  9. Provide retrieval baselines.
  10. Version datasets with changelogs.

Implementation
A small script computes the score drop (Δscore) when each oracle doc is removed in turn; keep only items where removing any doc causes a significant drop, as sketched below.
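
A minimal leave‑one‑out check, assuming a score_answer helper (hypothetical) that synthesizes from a doc subset and scores against the gold summary:

def multi_doc_necessary(item, min_drop=0.05):
    # Keep the item only if removing any single oracle doc hurts the score.
    full = score_answer(item.query, item.oracle_docs, item.gold)
    for i in range(len(item.oracle_docs)):
        subset = item.oracle_docs[:i] + item.oracle_docs[i + 1:]
        if full - score_answer(item.query, subset, item.gold) < min_drop:
            return False  # some doc is dispensable -> not truly multi-doc
    return True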

Resulting Consequences

  • Realistic pressure on retrieval and synthesis.
    − Higher curation effort.

Related Patterns: #1, #6.


Pattern 8 — Domain‑Specific Retriever + Reranker Pairing

Pattern type: Retrieval Engineering

Context / Background
Narrative prose and meetings differ in structure, vocabulary, and evidence distribution.

Problem
A one‑size‑fits‑all retriever fails across domains; BM25 can look “good” on IR metrics yet underperform on generation for meetings.

Forces

  • Dense vs. sparse retrieval
  • Dialogue structure (speaker, timestamp) vs. prose
  • Reranker choice and depth

Solution (overview)
Choose per‑domain stacks (e.g., gemini‑embedding for STORY; NV‑Embed‑v2 plus a reranker for MEET), with reranking depth tuned to each domain.

Solution — Ten Steps

  1. Benchmark multiple dense models per domain.
  2. Keep BM25 as fallback for rare keywords.
  3. Train/choose a reranker aligned to domain style.
  4. Tune k_hi/k_mid per domain.
  5. Enforce doc diversity caps.
  6. Use speaker/timestamp fields in meeting retrieval.
  7. Penalize boilerplate turns in meetings during rerank.
  8. Maintain separate indices per domain.
  9. Log cross‑domain comparisons.
  10. Re‑validate quarterly as models update.

Implementation (config separation)

story:
  embed_model: gemini-embedding
  k_hi: 24
  k_mid: 10
  reranker: vertexai_rerank
meet:
  embed_model: nv-embed-v2
  k_hi: 18
  k_mid: 6
  reranker: vertexai_rerank
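
Loading the per‑domain stack at runtime might look like this (a sketch; assumes the block above lives in retrieval.yaml and reuses the placeholder DenseRetriever/VertexAIReranker classes from Pattern 1):

import yaml

def load_stack(domain: str, path: str = "retrieval.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)[domain]            # "story" or "meet"
    # cfg["embed_model"] is applied at index-build time; k_hi/k_mid drive retrieval.
    retriever = DenseRetriever(index=f"{domain}.index", k=cfg["k_hi"])
    reranker = VertexAIReranker(model="text-multilingual-rerank@latest")
    return retriever, reranker, cfg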

Resulting Consequences

  • Higher synthesis quality with stable retrieval.
    − Slightly increased ops overhead (two stacks).

Related Patterns: #3 Tuning, #4 Guardrail.


Pattern 9 — Data‑Contamination & Leakage Safeguard

Pattern type: Evaluation Hygiene

Context / Background
Pretraining leakage can inflate scores and mask weaknesses.

Problem
Contaminated corpora compromise credibility and comparability.

Forces

  • Detectability vs. cost
  • Model updates over time
  • Reproducibility

Solution (overview)
Run contamination checks, report multiple retrieval settings (Oracle, SR, LC), and maintain contamination manifests.

Solution — Ten Steps

  1. Hash and timestamp corpora; store digests.
  2. Use web‑scale overlap detectors against known model pretraining corpora when available.
  3. Re‑run evals whenever the release model is updated.
  4. Publish both SR and Oracle results.
  5. Record prompts and seeds for LLM‑as‑judge.
  6. Keep out‑of‑domain canaries.
  7. Maintain frozen “leaderboard” snapshots.
  8. Track drift across time.
  9. Share contamination notes with peers.
  10. Automate checks in CI.

Implementation
Simple overlap heuristics (URL/domain matches, n‑gram hashes) plus manual spot checks; a sketch of the n‑gram check follows.
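
One way to implement the n‑gram hash heuristic (13‑grams are a common choice; the flag threshold is corpus‑specific and given here only as an assumption):

import hashlib

def ngram_hashes(text: str, n: int = 13):
    # Hash sliding word n-grams for cheap set-overlap checks.
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(0, len(words) - n + 1))
    }

def overlap_ratio(corpus_text: str, reference_text: str, n: int = 13) -> float:
    a, b = ngram_hashes(corpus_text, n), ngram_hashes(reference_text, n)
    return len(a & b) / max(1, len(a))

# Flag for manual review if, say, more than 5% of corpus n-grams overlap.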

Resulting Consequences

  • Trustworthy comparisons; easier peer review.
    − Extra upfront work.

Related Patterns: #6 Diagnostic, #10 Scalable Construction.


Pattern 10 — Scalable Benchmark Construction Loop

Pattern type: Process / Dataset Ops

Context / Background
Constructing multi‑source tasks by hand is costly. MSRS shows a scalable way to bootstrap from existing long‑context MDS datasets.

Problem
Quality vs. throughput trade‑off; human validation budget.

Forces

  • Automation vs. manual quality
  • Coverage vs. necessity constraints

Solution (overview)
Bootstrap from long‑context, query‑focused datasets; enforce multi‑doc necessity; apply decontextualization; validate at scale with spot checks.

Solution — Ten Steps

  1. Pick base datasets (e.g., stories, meetings).
  2. Segment into documents; map queries to segments.
  3. Generate/clean decontextualized queries.
  4. Build oracle sets (≥2 docs).
  5. Index with dense embeddings.
  6. Calibrate K targets per domain.
  7. Create SR/LC/Oracle runners.
  8. Score with ROUGE/BERTScore/G‑Eval rubric.
  9. Human‑validate samples and error taxonomy.
  10. Release code/data with versioning.

Implementation
Provide scripts for segmentation, indexing, oracle mapping, and eval; a sketch of the oracle‑mapping step follows.
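
A compact sketch of the oracle‑mapping step, assuming segment_documents and maps_to_segment helpers (both hypothetical):

def build_oracle_sets(long_context_items):
    oracle = {}
    for item in long_context_items:
        docs = segment_documents(item.source_text)
        needed = [d.doc_id for d in docs if maps_to_segment(item.query, d)]
        if len(needed) >= 2:          # enforce multi-doc necessity (Pattern 7)
            oracle[item.id] = needed
    return oracle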

Resulting Consequences

  • Sustainable benchmark evolution; transferable to new domains.
    − Ongoing maintenance of mappings and oracles.

ADK‑Style Evaluation Harness (reproduces SR vs. LC vs. Oracle)

Inputs: queries.jsonl (id, query), oracle.jsonl (id, doc_ids[]), refs.jsonl (id, summary), index/ (FAISS/Vertex, plus metadata), corpus/ (text).

Modes

  • oracle: load oracle docs; pass to reasoning synthesizer.
  • strong_retriever: dense → rerank → diverse select.
  • long_context: pack entire corpus (guardrailed) or query‑scoped subcorpus.

Metrics

  • IR: P@K, R@K, NDCG, MAP (for SR mode)
  • Generation: ROUGE‑1/2/L, BERTScore F1
  • LLM‑as‑Judge rubric (coverage, coherence, grounding) → 0–100

Runner (skeleton)

from rag_eval import rouge, bertscore, geval  # local metric wrappers (cf. the Pattern 6 compute_scores sketch)

def run(mode, batch):
    if mode == "oracle":
        ctx = pull_oracle(batch)
    elif mode == "strong_retriever":
        ctx = retrieve_and_rerank(batch)
    elif mode == "long_context":
        ctx = pack_long_context(batch)
    return synthesize(ctx, model="gemini-2.5-pro")

scores = {}
for mode in ["oracle","strong_retriever","long_context"]:
    preds = run(mode, dataset)
    scores[mode] = {
        **rouge(preds, refs),
        **bertscore(preds, refs),
        "geval": geval(preds, refs, judge="gemini-2.5-pro")
    }
compare_and_report(scores)
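
compare_and_report can stay simple; a sketch that surfaces the oracle‑gap deltas used for attribution:

def compare_and_report(scores, key="geval"):
    oracle = scores["oracle"][key]
    sr = scores["strong_retriever"][key]
    lc = scores["long_context"][key]
    print(f"{key}: oracle={oracle:.1f} sr={sr:.1f} lc={lc:.1f}")
    # Large oracle-SR gap -> fix retrieval; low oracle score -> fix synthesis.
    print(f"oracle gap (retrieval headroom): {oracle - sr:.1f}")
    print(f"long-context vs. selective delta: {lc - sr:.1f}")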

Human Error Taxonomy (lightweight)
Sample 40 items and annotate: (a) missing many major details, (b) missing many minor details, (c) hallucination, (d) query misunderstanding, (e) vagueness.


ADK Components & Swappables

  • Embedding: gemini-embedding (story), nv-embed-v2 (meeting).
  • Reranking: Vertex AI Ranking API, LangChain Vertex Reranker wrapper.
  • Reasoning LLM: gemini-2.5-pro / gemini-2.5-flash for faster ablations.
  • Memory: InMemoryMemoryService for local runs; Memory Bank for persistence.
  • Orchestration: Root agent routes through #2 gate → #3 tuned retrieval → #5 synthesis.
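
Wiring the three stages together, in the same pseudocode style as the Pattern 1 skeleton (all components are the placeholder sketches above, not a fixed ADK API):

class RootRAGAgent(Agent):
    def __init__(self, gate, retrieve, synthesize):
        self.gate = gate                # Pattern 2: decontextualization gate
        self.retrieve = retrieve        # Pattern 3: tuned retrieval + diversity caps
        self.synthesize = synthesize    # Pattern 5: reasoning-model synthesis

    def handle(self, query: str):
        clean_query = self.gate.handle(query)
        evidence = self.retrieve(clean_query)
        return self.synthesize(clean_query, evidence)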