Critiqued Fan-In
A multi-agent synthesis pattern with external rubric calibration.
The pattern shape that emerged from the newsagent system, abstracted from
its news-monitoring instantiation. Reusable wherever multi-domain LLM
synthesis needs sustained quality discipline.
Intent
Produce a single calibrated artefact from multiple domain-specialist agents
in a way that (a) preserves depth in each domain, (b) forces deep synthesis
across domains, (c) detects when the synthesis is shallow or self-flattering,
and (d) keeps the discipline auditable rather than tacit.
Also Known As
- Specialist – Synthesizer – Critic
- Externally-Audited Iterative Synthesis
- Calibrated Fan-In
Context
You are building an LLM-based system that:
- Draws on multiple disjoint domains (legal/political/economic; security/trade; clinical/financial; legal/technical; etc.).
- Produces synthesis artefacts, not just retrieval results — the consumer wants a fused take, not a list of facts.
- Runs repeatedly over time, so artefact-to-artefact quality variance matters and silent regressions are expensive.
- Has tool access available (search, retrieval, code execution) for fact-gathering, but you want fact-gathering and fact-fusion to be separate concerns.
- Will be read or acted on by humans who need to know which claims are load-bearing and which are speculative.
This pattern is most useful when single-pass single-agent synthesis produces results that look good (LLMs are fluent) but audit poorly — restating sources, hedging on the hard claims, mistaking paraphrase for synthesis.
Problem
LLM-based synthesis has four endemic failure modes that compound when stacked naively:
- Single-pass shallowness. Asking one agent to “produce a deep analysis” yields a fluent surface read. The agent has no mechanism to deepen across iterations or notice what it has missed.
- Generalist dilution. A single agent asked to cover multiple domains either runs out of context for each domain or drops to lowest-common-denominator framing.
- Self-scoring inflation. When an LLM rates its own output (“on a scale of 1–10, how insightful is your analysis?”), it is systematically optimistic by ~2–3 points. Self-scored hill-climbing therefore plateaus invisibly — the model relabels and reframes without producing new claims, then certifies the rebranding as progress.
- Cross-run contradictions. Run-to-run, the same agent will produce mutually inconsistent facts (one run: “leader X assassinated”; next run: “leader X promoted”) without any mechanism flagging the contradiction. Downstream consumers cannot tell which is load-bearing.
The pattern resolves these by separating roles, formalising rubrics, and forcing external critique.
Forces
| Force | Tension |
|---|---|
| Depth ↔ breadth | A single agent covering many domains loses depth; many specialised agents lose synthesis. |
| Independence ↔ coordination | Specialists must run independently to preserve domain fidelity; their outputs must coordinate at fusion. |
| Iteration ↔ cost | More iterations improve depth but multiply latency and token cost. |
| Self-critique ↔ external critique | Self-critique is cheap and scales but is biased; external critique adds an agent but breaks the optimism. |
| Generality ↔ specificity | The pattern must be reusable across domains; the rubrics must be specific enough to score against. |
| Rigour ↔ velocity | A heavyweight discipline produces better artefacts but slows the pipeline; teams will skip steps under pressure. |
| Prompt agility ↔ code stability | Prompts evolve weekly; code shouldn’t churn at the same rate. |
Solution
Three generic roles, composed as a graph: a fan-out of independent Specialists into a Synthesizer (Orchestrator) into an Evaluator (Critic).
Each role has a single responsibility, communicates through named state keys rather than function arguments, and is instructed by an externalised prompt that can be edited without touching the wiring code.
The synthesis uses explicit scoring rubrics with two opposing objectives — one to maximise, one to minimise — applied at every iteration. Hill-climb thresholds (Δ on each objective) gate iteration-to-iteration progress. The Evaluator independently re-scores using the same rubrics with anti-inflation calibration anchors.
A fact-stability audit runs before publication: claims that contradict across upstream specialists are flagged with confidence tiers, and LOW-confidence claims are firewalled from load-bearing inferences.
Detailed Solution
Role 1 — Specialist
Purpose: produce structured facts and self-audited gaps for a single domain.
Responsibilities
– Owns a single domain (e.g., trade policy, geopolitics, clinical evidence, legal precedent).
– Has access to relevant tools (search, retrieval, structured-data APIs).
– Produces a structured brief at multiple layers (e.g., MACRO / MESO / MICRO).
– Self-audits: explicitly enumerates coverage gaps (parties or instruments expected but missing) and latent undercurrents (non-obvious second-order links).
– Attaches confidence tiers (HIGH / MEDIUM / LOW) to load-bearing factual claims, especially those subject to cross-run contradiction (leadership status, casualty counts, operational state).
– Writes its output to a dedicated state key, never overwriting another specialist’s namespace.
Anti-patterns
– A specialist that quietly produces a “generalist” view when its domain is sparse — silence about a missing fact is more useful than synthetic filler.
– A specialist that omits the coverage-gap audit because it found nothing — the audit is itself a signal.
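A minimal sketch of the brief a Specialist emits, assuming Python dataclasses; the type and field names are illustrative, not prescribed by the pattern:

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

@dataclass
class Claim:
    entity: str                      # who/what the claim is about
    text: str                        # the factual claim itself
    confidence: Confidence           # tier, mandatory on load-bearing claims
    sources: list[str] = field(default_factory=list)

@dataclass
class SpecialistBrief:
    domain: str                      # e.g. "trade policy"
    layers: dict[str, str]           # {"MACRO": ..., "MESO": ..., "MICRO": ...}
    claims: list[Claim]              # confidence-tier-tagged factual claims
    coverage_gaps: list[str]         # parties/instruments expected but missing
    latent_undercurrents: list[str]  # non-obvious second-order links
```

The brief is what lands in the specialist's dedicated state key; an empty coverage_gaps list is itself a signal the graders should notice.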
Role 2 — Synthesizer (Orchestrator)
Purpose: fuse specialist outputs into a single artefact, iteratively, under explicit rubric pressure.
Responsibilities
– Has no tools. Re-research would invite duplicate work and contradict the graph’s separation of concerns.
– Reads upstream state keys via template substitution.
– Runs N iterations (typically 3), each exploring multiple analytical layers and ending with a what-I-missed reflection.
– Self-scores each iteration on two opposing rubrics:
– J⁺ (maximize): insight density / depth — composed of sub-dimensions like causal-mechanism naming, cross-domain coupling, forward predictiveness, novelty against headline framing, source-anchoring.
– J⁻ (minimize): surface-restatement penalty — composed of sub-dimensions like paraphrase ratio, fact-recombination-only, headline framing reuse, hedging without commitment, citation theatre.
– Enforces a hill-climb rule: each iteration must improve J⁺ by at least some threshold and reduce J⁻ by at least some threshold versus the prior iteration. If not, rewrite with a different aperture before publishing.
– Runs a fact-stability audit before publication: cross-checks entities mentioned in multiple upstream briefs, flags contradictions, attaches confidence tiers, applies a resolution rule (LOW-confidence facts cannot be load-bearing for analytical claims).
Anti-patterns
– Synthesizer with tools: it will start re-researching instead of synthesizing, and the iterations will diverge from the upstream briefs.
– Synthesizer without explicit rubrics: “deeper” becomes hand-wavy and iterations plateau without anyone noticing.
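A sketch of the fact-stability audit's core check, reusing the Claim and SpecialistBrief sketches from Role 1. The contradiction test here (any textual divergence about the same entity) is deliberately naive and illustrative; in practice a model pass does the matching:

```python
from collections import defaultdict

_TIER_RANK = {"HIGH": 2, "MEDIUM": 1, "LOW": 0}

def fact_stability_audit(briefs: dict[str, SpecialistBrief]) -> list[dict]:
    """Flag entities whose claims diverge across specialist briefs."""
    by_entity: dict[str, list] = defaultdict(list)
    for source, brief in briefs.items():
        for claim in brief.claims:
            by_entity[claim.entity].append((source, claim))

    flagged = []
    for entity, entries in by_entity.items():
        if len({claim.text for _, claim in entries}) > 1:  # contradiction
            floor = min(
                (claim.confidence.name for _, claim in entries),
                key=_TIER_RANK.__getitem__,
            )
            flagged.append({
                "entity": entity,
                "claims": [(s, c.text, c.confidence.name) for s, c in entries],
                "confidence_floor": floor,  # a LOW floor firewalls the entity
                                            # out of load-bearing inferences
            })
    return flagged
```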
Role 3 — Evaluator (Critic)
Purpose: independently grade the synthesizer’s self-scores with explicit anti-inflation discipline.
Responsibilities
– Has no tools. Pure evaluation, no re-research.
– Reads both the upstream specialist briefs and the synthesizer’s output, so it can verify source-anchoring (claims must trace to specialist facts).
– Applies the same rubrics as the synthesizer, but with anti-inflation calibration anchors: explicit guidance about what a 10/10 actually means (paradigm-shifting, not just well-written), and a stated prior that the synthesizer’s self-scores are over-graded by 2–3 points.
– Produces a structured grading output: per-iteration calibrated scores, per-dimension justifications where it diverges from the synthesizer’s self-score, inflation diagnosis, concrete quoted passages that exhibit restatement or hedging or unanchored speculation, and a recalibrated hill-climb verdict.
– Provides a one-line recommendation for the next run.
Anti-patterns
– Evaluator using the same model as the synthesizer with no calibration anchors: it will rubber-stamp.
– Evaluator without access to the upstream briefs: cannot verify source-anchoring, becomes a stylistic critic only.
– Evaluator scoring without quoted evidence: produces opinion, not audit.
Communication and state
All inter-role communication happens through named, disjoint state keys. No role passes function arguments to another; every dependency is a state-key read.
| Producer | State key |
|---|---|
| Specialist A | state[brief_a] |
| Specialist B | state[brief_b] |
| Specialist N | state[brief_n] |
| Synthesizer | state[final_artefact] |
| Evaluator | state[evaluation] |
This makes the graph fully composable: any specialist can be added or removed by editing the fan-out stage and the synthesizer’s template, without touching the agent code paths.
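A sketch of the discipline in plain Python, assuming a dict as the shared state and str.format for template substitution; call_llm and the templates are placeholders, not a real framework API:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the provider-specific model call."""
    return f"<model output for {len(prompt)}-char prompt>"

SYNTH_TEMPLATE = (
    "BRIEF A:\n{brief_a}\n\nBRIEF B:\n{brief_b}\n\n"
    "Synthesize across the briefs above ..."
)
EVAL_TEMPLATE = (
    "BRIEFS:\n{brief_a}\n{brief_b}\n\nARTEFACT:\n{final_artefact}\n\n"
    "Re-score with anti-inflation anchors ..."
)

state: dict[str, str] = {}
state["brief_a"] = call_llm("<specialist A prompt>")   # each role owns one key
state["brief_b"] = call_llm("<specialist B prompt>")
state["final_artefact"] = call_llm(SYNTH_TEMPLATE.format(**state))
state["evaluation"] = call_llm(EVAL_TEMPLATE.format(**state))  # sees briefs + artefact
```

Adding a specialist is then a new key plus a template edit; no function signature changes anywhere.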
Iteration discipline and hill-climbing
Each synthesizer iteration is a structured pass with these components:
- Layered analysis — macro / meso / micro (or any domain-appropriate set of zoom levels).
- What I missed — explicit enumeration of 3–5 gaps in this iteration’s coverage.
- Score — J⁺ and J⁻ with per-dimension breakdown.
- Hill-climb check — vs. prior iteration; rewrite if thresholds unmet.
The structural insight: iteration N must address the gaps surfaced by iteration N−1’s what-I-missed section. This forces aperture changes between iterations rather than mere depth at the same aperture.
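A sketch of the hill-climb gate, assuming the per-iteration J⁺/J⁻ totals have been parsed out of the synthesizer's self-scores; the default thresholds mirror the Appendix A prompt (ΔJ⁺ ≥ +1.5, ΔJ⁻ ≤ −1) and are tunable defaults, not canonical constants:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IterationScore:
    j_plus: float   # insight density / depth: maximize (0-10)
    j_minus: float  # surface-restatement penalty: minimize (0-10)

def hill_climb_ok(prev: IterationScore, curr: IterationScore,
                  min_gain: float = 1.5, min_drop: float = 1.0) -> bool:
    """Both objectives must move by at least the threshold; otherwise the
    iteration is rewritten with a different aperture before publication."""
    return (curr.j_plus - prev.j_plus >= min_gain
            and prev.j_minus - curr.j_minus >= min_drop)

# e.g. hill_climb_ok(IterationScore(5.0, 6.0), IterationScore(7.0, 4.5)) -> True
```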
Externalised prompts
Each role’s prompt lives in a separate Markdown file, loaded at module import time. This keeps the prompt as artefact rather than as code-string:
project/
agents/
specialist_a.py # 25 lines: wiring only
specialist_b.py
synthesizer.py
evaluator.py
prompts/
specialist_a.md # full prompt
specialist_b.md
orchestrator.md
evaluator.md
The agent module loads the prompt via the language’s resource-loading idiom (e.g., importlib.resources in Python). A test asserts that the agent’s runtime instruction equals the on-disk file content, preventing silent re-inlining.
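A sketch of that idiom in Python, assuming prompts/ is importable as a package (here project.prompts); the module exposes the instruction as a constant the lock-in test can compare against:

```python
# agents/synthesizer.py -- wiring only; the prompt is data, not code.
from importlib import resources

# Loaded once at module import; the runtime instruction IS the on-disk file.
INSTRUCTION: str = (
    resources.files("project.prompts")
    .joinpath("orchestrator.md")
    .read_text(encoding="utf-8")
)
```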
Benefits:
– Prompts are editable without touching code; git blame stays clean on the wiring.
– Prompts can be diffed and reviewed as documents.
– Rubric tuning becomes a Markdown PR, not a Python PR.
– Prompts can be deployed independently of code in environments that support hot-reload.
Diagram (concrete instantiation)
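The newsagent instantiation from Known Uses, sketched below (state-key names are illustrative; the shape is the generic three-role graph from the Solution):

```
geopolitical_agent (google_search) ──► state[geopolitical_brief] ──┐
tariff_agent       (google_search) ──► state[tariff_brief] ────────┤
                                                                   ▼
               synthesizer (no tools; 3 scored iterations,
               J⁺/J⁻ rubrics, hill-climb gate, fact-stability audit)
                                   │
                                   ▼  state[final_artefact]
               evaluator (no tools; anti-inflation calibration anchors)
                                   │
                                   ▼  state[evaluation]
```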
Consequences
Benefits
- Synthesis quality is auditable. The hill-climb trajectory is a number sequence, not a vibe.
- Self-scoring inflation is caught. The Evaluator routinely catches iterations where the Synthesizer relabelled iter-2 content as “iter-3 novelty” and certified the rebranding as progress.
- Domain depth is preserved. Each Specialist stays in scope; cross-cutting synthesis is the Synthesizer’s job, not the Specialist’s.
- Cross-run consistency is enforced. Fact-stability audits surface contradictions and apply confidence tiers, so consumers know which claims to trust.
- Roles are modular. Specialists can be added or swapped without rewriting the Synthesizer; Evaluator can be swapped to a different model to reduce same-model bias.
- Prompt churn is decoupled from code churn. Externalised prompts mean a rubric-tuning PR doesn’t touch wiring code.
- The artefact is falsifiable. The Synthesizer’s iteration 3 includes explicit falsification triggers — observations that would invalidate the analysis. This is the discipline equivalent of a forecast confidence interval.
Liabilities
- Three stages = 3× minimum latency. A single specialist-synthesizer run is ~30s; adding the Evaluator can push wall-clock past 60s. Unsuitable for real-time interaction.
- Three stages = 3× cost. Each pass burns tokens at LLM rates; the Evaluator pass adds ~30% over a synthesizer-only design.
- Rubric design is hard. Bad rubrics produce false confidence. The dimensions must be operationalisable (each scoreable from the artefact alone) or the scoring degrades to vibes.
- Plateau detection requires the Evaluator. Without it, the Synthesizer’s claimed hill-climb is rarely the actual hill-climb. Cutting the Evaluator to save cost re-introduces the original problem.
- Same-model bias. If the Synthesizer and Evaluator share the same underlying model and weights, the Evaluator may share the Synthesizer’s blind spots. Mitigated by anti-inflation anchors but not eliminated. Best practice: pin the Evaluator to a different model when feasible.
- Cold-start cost on each rubric change. Calibrating the Evaluator’s anti-inflation anchors against a new rubric requires some manual tuning.
- The Specialists’ tool calls dominate the latency budget. If retrieval is slow, the parallel fan-out helps but does not eliminate the floor.
When NOT to apply
- Single-domain content, where there is no synthesis to fuse — just use one agent.
- High-frequency / low-latency systems where 60s wall-clock is unacceptable.
- Cost-sensitive deployments where 3× token cost is unjustified.
- One-off generation where iteration discipline is overkill — a single-pass agent suffices.
- Domains without scoring traction. If you can’t define a meaningful J⁺ / J⁻ rubric for the artefact, the hill-climb machinery cannot grip the synthesis and degrades to ceremony.
Implementation Notes
Choosing N iterations
Three is a practical default. The empirical pattern observed in the newsagent instantiation:
- Iter 1 → Iter 2: largest improvement; the synthesizer goes from surface description to coupled mechanisms.
- Iter 2 → Iter 3: smaller improvement, more inflation. The synthesizer is prone to relabel iter-2 content under a new header.
Two iterations under-deliver depth; four are usually wasted. The hill-climb rule lets you abort earlier if Δ thresholds are met sooner; this is the right escape valve.
Designing the rubrics
Good J⁺ / J⁻ dimensions are:
- Operationalisable — each one scoreable from the artefact text alone, without re-querying tools.
- Orthogonal — separate dimensions don’t trade off mechanically. If “depth” and “novelty” always move together, they are one dimension, not two.
- Anchored — each score level has a quoted example, not just an adjective.
A useful test: hand the rubric to a human who hasn’t read the artefact and ask them to score it. If two readers diverge by more than 1 point on most dimensions, the rubric is under-specified.
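A sketch of an anchored dimension as data rather than prose; the exemplars here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RubricDimension:
    name: str
    objective: str           # "maximize" (a J+ dimension) or "minimize" (J-)
    anchors: dict[int, str]  # score level -> quoted exemplar, not an adjective

CAUSAL_MECHANISM = RubricDimension(
    name="causal-mechanism naming",
    objective="maximize",
    anchors={
        0: "tensions rose in the region",       # narrative only
        1: "sanctions may pressure exports",    # plausible but unmechanised
        2: "quota expiry -> spot-price spike -> importer hedging",  # named chain
    },
)
```

If two human scorers given these anchors still diverge by more than a point, the anchors, not the scorers, need work.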
Anti-inflation calibration anchors
The Evaluator’s prompt should include explicit statements like:
- “A perfect 10 is reserved for paradigm-shifting analysis. Most competent output lands 6–8.”
- “Prior on the Synthesizer’s self-score: when it claims 10, the prior is over-grading by 2–3 points.”
Without these anchors, the Evaluator will be sympathetic to the Synthesizer’s framing and grade indistinguishably from self-scoring.
Confidence tiers
Three tiers (HIGH / MEDIUM / LOW) provide enough granularity. Five or more invite false precision; two collapse to “true / not-sure” and lose the firewall mechanism.
The resolution rule matters as much as the tiers: a LOW-confidence claim cannot be load-bearing for downstream inferences. The Synthesizer must downgrade the dependent claim or label it conditional. Without this rule, LOW-confidence claims propagate as if they were HIGH.
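A sketch of the firewall as an explicit check, assuming each analytical claim records which facts it depends on; the attribute names are illustrative:

```python
def apply_firewall(analytical_claims, facts_by_id):
    """LOW-confidence facts cannot be load-bearing: any dependent analytical
    claim is downgraded to conditional rather than shipped as-is."""
    for claim in analytical_claims:
        tiers = {facts_by_id[fid].confidence.name for fid in claim.depends_on}
        if "LOW" in tiers:
            claim.status = "CONDITIONAL"
            claim.caveat = "holds only if LOW-tier inputs are confirmed"
    return analytical_claims
```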
Test discipline
Three test categories to maintain:
- Wiring tests — agent name, tool list, state-key output. Catches structural drift.
- Prompt-content tests — assert that specific rubric tokens (e.g., “causal depth”, “hill-climb”) appear in each prompt. Catches silent prompt regressions where someone trims the rubric.
- Externalisation lock-in — assert that agent.instruction == read_file(prompt.md). Catches re-inlining.
The lock-in test is the one most often skipped and most useful.
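Sketches of the lock-in and prompt-content tests, assuming pytest, the layout from the Externalised prompts section, and the INSTRUCTION constant from the loading sketch above; the rubric tokens checked are examples:

```python
# tests/test_prompts.py
from importlib import resources

from project.agents import synthesizer  # exposes INSTRUCTION

def _on_disk(name: str) -> str:
    return resources.files("project.prompts").joinpath(name).read_text(encoding="utf-8")

def test_externalisation_lock_in():
    # Runtime instruction must equal the on-disk prompt, byte for byte.
    assert synthesizer.INSTRUCTION == _on_disk("orchestrator.md")

def test_rubric_tokens_survive_edits():
    # Prompt-content test: catches someone silently trimming the rubric.
    text = _on_disk("orchestrator.md")
    for token in ("J⁺", "J⁻", "hill-climb", "WHAT I MISSED"):
        assert token in text
```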
Known Uses
- newsagent — geopolitical + tariff news monitoring (this codebase). Two specialists (geopolitical, tariff) with google_search, one synthesizer running 3-iteration fractal chain-of-thought (FCoT) with J⁺ / J⁻ rubrics and a fact-stability audit, one evaluator with anti-inflation calibration.
- Clinical case review: specialists for radiology / pathology / labs / patient history, synthesizer for diagnostic synthesis, evaluator for clinical-rigour calibration.
- Investment memos: specialists for fundamentals / competitive landscape / macro / regulatory, synthesizer for thesis, evaluator for risk-attribution calibration.
- Legal-brief synthesis: specialists for case law / statutory / factual record, synthesizer for argument, evaluator for citation-rigour and precedent-applicability calibration.
- Threat intelligence: specialists for IoCs / TTPs / adversary profiles / asset exposure, synthesizer for risk narrative, evaluator for actionability calibration.
Variants
- Feedback-loop variant. After the Evaluator scores, route its one-line recommendation back to the Synthesizer for one more iteration. Useful when the artefact has high stakes; doubles latency.
- Specialist-of-specialists variant. A Specialist may itself be a sub-pipeline (e.g., a “legal” specialist that contains case-law and statutory sub-specialists). Recursive Critiqued Fan-In.
- Per-iteration evaluator variant. Evaluator runs after each Synthesizer iteration rather than only at the end. Higher cost; tighter feedback. Useful when iterations diverge widely in quality.
- Multi-model evaluator variant. Two Evaluators on different model families; their scores are averaged or their disagreements flagged. Strongest mitigation of same-model bias.
- Tool-using evaluator variant. Evaluator has limited tools to verify a Synthesizer claim against an external source. Risks scope creep; useful for high-stakes factual claims.
Anti-patterns
- Rubber-stamp Critic. Evaluator with no anti-inflation anchors, on the same model as the Synthesizer. Produces scores that mirror the Synthesizer’s self-scores and adds latency without adding signal.
- Rubric-less Synthesis. Synthesizer told to “iterate and improve” without a scoring discipline. Iterations plateau into recombination at iter 2.
- No fact-stability audit. Cross-run contradictions ship unflagged. Consumers cannot tell which is the load-bearing claim.
- Generalist Specialist. A “Specialist” whose scope is broad enough to overlap with another’s. Produces duplicate coverage that the Synthesizer must deduplicate, wasting iterations.
- Prompt as code-string. Prompts inlined as triple-quoted strings in Python. Editing requires code PRs; reviewers can’t focus on prompt content; rubric tweaks pollute git blame.
- Hill-climb without thresholds. “Each iteration should be deeper” without a numeric Δ floor — produces movement without measurable progress.
Related Patterns
- Map-Reduce — Critiqued Fan-In is a refinement where the reduce step is iterative, scored, and externally audited.
- Pipeline — Critiqued Fan-In is a pipeline whose middle stage is internally iterative.
- LLM-as-Judge — Critiqued Fan-In uses this technique specifically for scoring rubric application, paired with anti-inflation calibration to defuse its known bias.
- Constitutional AI / RLAIF — Critiqued Fan-In is a runtime, per-artefact version of the same impulse: separate the producer from the critic. Constitutional AI bakes the critic into training; Critiqued Fan-In runs it at inference.
- Self-Refine — Single-agent self-critique. Critiqued Fan-In separates self-critique (Synthesizer’s iteration what-I-missed) from external critique (Evaluator) because they catch different classes of failure.
- Multi-Agent Debate — Adversarial multi-agent reasoning. Critiqued Fan-In differs in that there is no debate: the Specialists are non-overlapping, the Synthesizer is sole-author, and the Evaluator is non-participating critic.
Summary
Critiqued Fan-In is a three-role multi-agent pattern: Specialists fan out by domain, a Synthesizer fuses through iterative scoring, and an Evaluator independently grades with anti-inflation discipline. Its central insight is that LLM synthesis quality is recoverable through explicit rubrics and external critique — but only if those are operationalised structurally, not aspirationally. The pattern’s cost is latency and tokens; its return is auditable depth, cross-run consistency, and a falsifiable artefact that survives review.
Appendix A — Worked Example: Investment Memo Synthesis
The simplest way to test whether a pattern is genuinely abstract is to apply
it to a domain it wasn’t designed for. Below: a full instantiation in equity
research, with concrete role decomposition, rubric adaptation, and prompt
sketches. The shape is identical to newsagent; almost everything else
shifts.
Context
- Team: equity-research desk at an investment firm.
- Artefact: weekly investment memo per candidate position — thesis + risks + position-sizing recommendation.
- Audience: investment committee, votes on positions during weekly review.
- Cadence: 5–15 memos per week across the team. Quality matters more than latency; a memo that takes 90 seconds to render is acceptable.
- Stakes: each memo, if accepted, drives a position whose size-weighted P&L compounds in the fund. Confirmation bias and tail-risk under-weighting are the dominant failure modes.
Role decomposition
Four specialists instead of two, each with primary-source access:
| Role | Domain scope | Primary tools |
|---|---|---|
| fundamentals_agent | Financials, unit economics, capital structure, management track record | EDGAR retrieval, earnings transcripts, internal model snapshot |
| competitive_agent | Market position, moat structure, competitor trajectory, customer concentration | Market-research APIs, web search, channel-check notes |
| macro_agent | Interest-rate path, sector cyclicals, currency exposure, commodity sensitivity | Bloomberg / FactSet feeds, economic calendars |
| regulatory_agent | Litigation docket, regulatory pipeline, antitrust risk, policy exposure | Court records, regulatory filings, policy-shop research |
Each writes to a dedicated state key (state[fundamentals_brief], etc.) with the same internal structure as newsagent specialists: MACRO/MESO/MICRO layering, COVERAGE GAPS, LATENT UNDERCURRENTS, confidence-tier-tagged claims.
| Role | Scope | State key |
|---|---|---|
| thesis_synthesizer | Fuses all four briefs into a thesis + position-sizing recommendation | state[memo] |
| risk_calibrator | External evaluator — grades the synthesizer with finance-specific anti-inflation calibration | state[evaluation] |
Rubric adaptation
This is the largest deliberate change from newsagent. The objectives swap entirely:
J⁺ — Thesis Sharpness (max)
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Causal Mechanism | Narrative (“they should do well”) | Plausible link | Falsifiable mechanism connecting fundamentals → revaluation |
| Cross-Domain Coupling | Single specialist’s view | Linked but not mechanised | Explicit chain (e.g., regulatory ruling → competitive realignment → margin expansion) |
| Named Catalyst | None | Vague timeframe | Specific event with quarter/date and pre-committed sign |
| Non-Consensus | Consensus thesis | Slight contrarian lean | Differentiates from sell-side median with stated reason |
| Source-Anchored | Speculative | Loosely supported | Every step traceable to a specialist brief |
J⁻ — Confirmation-Bias / Pitch-Deck Penalty (min)
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Cherry-Picking | Both confirming and disconfirming evidence cited | Disconfirming evidence acknowledged but discounted | Bull-case only |
| Hedge-Then-Recommend | Committed view, owns its risk | Some hedging | “But we still like it” after listing reasons not to |
| Narrative-Over-Numbers | Numbers carry the claim | Numbers support a narrative | Story carries the claim, numbers decorative |
| Anchoring to Current Price | Price-independent thesis | Some price reference | Thesis collapses if the stock is ±20% from here |
| Survivorship in Analogs | Both winning and losing analog cases cited | Mostly winning analogs | Only winning analogs (the “next Amazon” trap) |
The newsagent J⁻ minimises restatement; the investment-memo J⁻ minimises self-serving framing. The shape is identical: five 0–2 dimensions, summed to 0–10, lower is better.
Evaluator calibration anchors
The risk_calibrator prompt swaps the news-monitoring anti-inflation anchors for finance-specific ones:
“A J⁺ of 10/10 is reserved for memos that name a non-consensus catalyst, a falsifiable mechanism, and at least one cross-domain coupling. Most competent memos land 5–7.”
“Equity-research memos are systematically optimistic on management quality and systematically weak at tail-risk recognition (analysts under-imagine adverse scenarios). Apply a +1 J⁻ penalty to memos that conclude BUY without containing at least one specific, material disconfirming observation.”
“If the memo’s reasoning depends on the current stock price being a reference point, J⁻ Anchoring is at least 1, regardless of whether price is named.”
The two-layer optimism counter — LLM optimism plus analyst optimism — is unique to this domain. The newsagent evaluator only counters one.
Confidence-tier mapping
Source-type provenance, not source-count, is the right tier driver in finance:
| Tier | Sources |
|---|---|
| HIGH | Filed financials (10-K/10-Q), final court rulings, signed deal terms, official regulatory orders |
| MEDIUM | Sell-side consensus, management guidance, settled lawsuits without appeal, finalised regulatory comment periods |
| LOW | Management commentary on earnings calls, industry rumours, channel-check anecdotes, expert-network calls, unconfirmed press reports |
Resolution rule: A LOW-confidence claim cannot be load-bearing for a BUY/SELL conclusion. If the conclusion depends on a LOW-confidence input, either downgrade the conclusion to “WATCH” or label the recommendation explicitly conditional.
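The rule is small enough to be code rather than convention. A sketch, assuming the tier set is collected from the memo's load-bearing claims:

```python
def resolve_recommendation(recommendation: str, load_bearing_tiers: set[str]) -> str:
    """A LOW-confidence input cannot anchor a BUY/SELL conclusion."""
    if recommendation in ("BUY", "SELL") and "LOW" in load_bearing_tiers:
        return "WATCH"  # or keep the call but label it explicitly conditional
    return recommendation

assert resolve_recommendation("BUY", {"HIGH", "LOW"}) == "WATCH"
assert resolve_recommendation("BUY", {"HIGH", "MEDIUM"}) == "BUY"
```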
Falsification triggers (per memo)
The synthesizer must produce 2–3 named falsifiers, e.g.:
- “If next quarter’s gross margin contracts more than 80 bps, the operating-leverage thesis is wrong.”
- “If competitor X enters segment Y before our 18-month mark, the moat thesis is wrong.”
- “If regulation Z passes in its current form, the TAM revises down 30%+ and the position should be exited.”
These are tracked in a positions register post-publication. A falsifier hit triggers a memo re-open, not a quiet mute.
What changes vs newsagent
| Aspect | newsagent | Investment memos |
|---|---|---|
| Specialist count | 2 (geopolitical, tariff) | 4 (fundamentals, competitive, macro, regulatory) |
| Specialist tools | google_search | Mixed: EDGAR, market-data APIs, legal/regulatory search, channel-check notes |
| Cadence | Daily / ad-hoc | Weekly per position; 5–15 per week per team |
| Failure mode J⁻ targets | Surface restatement of headlines | Confirmation bias / pitch-deck framing |
| Evaluator anti-inflation priors | One: LLM optimism | Two: LLM optimism plus analyst-thesis optimism |
| Confidence-tier driver | Source breadth | Source primacy (filed > guided > anecdotal) |
| Resolution-rule consequence | Downgrade analytical claim | Downgrade recommendation (BUY → WATCH) |
| Output consumer | Reader | Investment committee voting on capital allocation |
The role count doubles for specialists; the role kinds (Specialist / Synthesizer / Evaluator) stay at three. The pattern is unchanged; the instantiation expands.
Implementation sketch
investment_agent/
├── pyproject.toml
├── investment_agent/
│ ├── agents/
│ │ ├── fundamentals_agent.py
│ │ ├── competitive_agent.py
│ │ ├── macro_agent.py
│ │ ├── regulatory_agent.py
│ │ ├── thesis_synthesizer.py
│ │ └── risk_calibrator.py
│ ├── prompts/
│ │ ├── fundamentals.md
│ │ ├── competitive.md
│ │ ├── macro.md
│ │ ├── regulatory.md
│ │ ├── thesis.md # synthesizer
│ │ └── risk_calibrator.md # evaluator
│ ├── tools/
│ │ ├── edgar.py
│ │ ├── market_data.py
│ │ └── legal_search.py
│ ├── workflow.py
│ └── orchestrator.py
└── tests/
└── ... # same three test categories: wiring, prompt content, externalisation lock-in
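A sketch of workflow.py's shape, assuming asyncio for the parallel fan-out; run_agent stands in for whatever framework-specific call executes a named agent against the shared state:

```python
# workflow.py -- fan-out, fan-in, critique; wiring only, prompts live on disk.
import asyncio

SPECIALISTS = {                       # state key -> agent name
    "fundamentals_brief": "fundamentals_agent",
    "competitive_brief": "competitive_agent",
    "macro_brief": "macro_agent",
    "regulatory_brief": "regulatory_agent",
}

async def run_memo(ticker: str, run_agent) -> dict[str, str]:
    """run_agent(name, **context) -> str is the framework-specific call."""
    state: dict[str, str] = {}

    # Fan-out: specialists run in parallel; each owns exactly one state key.
    briefs = await asyncio.gather(
        *(run_agent(name, ticker=ticker) for name in SPECIALISTS.values())
    )
    state.update(zip(SPECIALISTS, briefs))

    # Fan-in: the synthesizer reads all four briefs via template substitution.
    state["memo"] = await run_agent("thesis_synthesizer", **state)

    # Critique: the evaluator reads the briefs AND the memo.
    state["evaluation"] = await run_agent("risk_calibrator", **state)
    return state
```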
A short prompt sketch (synthesizer head)
What changes structurally is small; the J⁻ vocabulary is what carries the domain shift:
You are the lead PM authoring an investment memo for the weekly committee.
Four analysts have filed:
FUNDAMENTALS BRIEF: {fundamentals_brief}
COMPETITIVE BRIEF: {competitive_brief}
MACRO BRIEF: {macro_brief}
REGULATORY BRIEF: {regulatory_brief}
Synthesize using fractal chain-of-thought across three iterations. Each
iteration must address gaps from the prior iteration's WHAT I MISSED.
J⁺ — MAXIMIZE thesis sharpness (causal mechanism, cross-domain coupling,
named catalyst, non-consensus, source-anchoring).
J⁻ — MINIMIZE pitch-deck framing (cherry-picking, hedge-then-recommend,
narrative-over-numbers, price-anchoring, survivorship in analogs).
Hill-climb rule: ΔJ⁺ ≥ +1.5 and ΔJ⁻ ≤ −1 vs. prior iteration, else REWRITE.
Before publication run the fact-stability audit (HIGH/MEDIUM/LOW tiers on
every load-bearing claim). LOW-tier facts cannot anchor a BUY conclusion.
Produce: HEADLINE THESIS, RISK BUDGET, CATALYSTS, FALSIFIERS, SIZING.
The rest of the prompt mirrors orchestrator.md structurally — three iterations, what-I-missed reflections, scores per iteration, hill-climb thresholds, fact-stability audit, publication order. The dimensions of J⁺ and J⁻ change; the machinery does not.
Pattern-fit observations
- Specialist scope discipline matters more here. With four specialists, scope overlap is common (a regulatory ruling has macro consequences; a competitive shift affects fundamentals). The pattern’s response is the same: hard scope boundaries, with cross-cutting analysis owned solely by the synthesizer.
- The fact-stability audit becomes load-bearing. Management guidance (MEDIUM) often contradicts filed financials (HIGH); sell-side consensus (MEDIUM) often contradicts channel-check anecdotes (LOW). The audit is the mechanism that prevents the synthesizer from quietly resolving these in the bullish direction.
- The evaluator’s anti-inflation discipline does double duty. It fights both LLM self-grading optimism and analyst-thesis optimism. This is why the calibration anchors must be domain-specific. A generic anchor (“most memos land 6–8”) would miss the systematic finance-specific bias.
Domain-specific anti-patterns
- Specialist-as-pitch-deck. A fundamentals_agent that writes a brochure instead of a brief. The fix is the same as in newsagent: enforce the COVERAGE GAPS section, and grade specialists on whether they flag missing data rather than fill it.
- Synthesizer-with-price-target-bias. A thesis_synthesizer whose internal prior is “thesis ends with a price target,” which then back-fits the analysis to the target. Counter: the synthesizer prompt should explicitly forbid starting with a target.
- Evaluator-with-house-view-bias. A risk_calibrator that defers to the synthesizer because “the firm has long owned this.” Counter: pin the evaluator to a different model from the synthesizer, and seed the calibrator with the firm’s historical error patterns rather than current positions.
- Missing devil’s-advocate. If the firm’s culture rewards conviction, the evaluator role gets undercut socially. Counter: codify the evaluator’s output as a required field on the memo template — no committee vote without a calibrated risk-call.
One-paragraph summary of the transfer
The Critiqued Fan-In pattern moves from news monitoring to investment-memo
production with role-renaming and rubric-substitution. The synthesis machinery
(three iterations, J⁺ maximisation and J⁻ minimisation, hill-climb thresholds,
fact-stability audit, externally-calibrated evaluator) stays identical. What
shifts is what J⁻ minimises (here: confirmation bias instead of restatement),
what calibration anchors the evaluator applies (here: two-layer optimism
counter instead of one), and what the resolution rule downgrades when a
LOW-confidence fact is load-bearing (here: a BUY recommendation instead of an
analytical claim). The pattern’s three-role architecture, state-key
communication, and externalised-prompt governance carry over without
modification. If the same code skeleton ships unchanged and only the
prompts/ directory rewrites, the transfer is successful.
