Three-Way Empirical Comparison: Investment Thesis Production

Same prompt — Cross-Pacific AI Infra Stock-Investment Thesis (38 stocks / 6 countries / 10 layers) — produced three ways and measured. Numbers below are empirical from actual runs, not estimates.

Three differentiators, compounding:
(1) Multi-model ensemble — best of every model. Each subagent is routed to the model that is empirically strongest for its subtask: one class for structural reasoning and tool orchestration; another optimized for cost-per-token on high-throughput structured extraction at scale; a third with extended deliberation budget for long-context multi-stage synthesis; specialized subagents for parallel research and quality control. No single model is best at everything — we use each where it wins.
(2) Proprietary primary-source corpus. Curated earnings-call transcripts, MD&A filings, and corporate-event databases that general web tools cannot reach behind paywalls. Every claim ties back to a verbatim quote with date attribution.
(3) Domain-expert judgment in the loop. A quant-economics researcher with 14 years of experience (Cornell PhD · AFA 2026) sets the framework, audits the synthesis, and red-teams the conclusions, turning agent output into an investable thesis rather than a content artifact.

Neither single-model Deep Research nor single-agent baselines can match this compound effect.

① Agentic Sciences Pipeline

Multi-agent orchestration: primary orchestrator → parallel specialized subagents (extraction, synthesis, QC) → cost-routed across model classes. 26,288 words / 64 pages / 150+ dated verbatim quotes. Cross-corpus statistics + direct A-share MD&A reading. Each model used where it is best.

📄 Download thesis · free (67pp EN) · Chinese edition (54pp ZH)

Educational research only · Not investment advice.

② Google Deep Research

7,475 words / 28 pages / ~12 quotes (6 self-flagged "UNVERIFIED proxy"). 61 web sources, ~13% primary. Strong on macro framing.

📄 Download Google Deep Research raw output (.docx)

③ Native single agent (web tools + code)

8,706 words / 51 URL citations / 5 UNVERIFIED tags. Real-time web search + web fetch + code execution. Comparable to Deep Research in coverage and speed (~14 minutes vs Deep Research's ~10-15), with a lower hallucination rate.

📄 Read the native-agent raw output (.md)

Agentic Sciences — Orchestration Stages

Stage	What it does	Why orchestration matters here
Pipeline orchestration	Primary agent designs the corpus-build workflow, manages data acquisition, dispatches downstream subagents	Structural reasoning and tool use are best handled by an agent class optimized for agentic decision-making
Parallel research subagents	Spawned in parallel for gap analysis, fact-checking, comparison audits, focused deep-dives	Independent context per subagent — parallel work without polluting the orchestrator's reasoning trace
Bulk structured extraction	Thousands of source documents converted into structured records (sentiment, quotes, forward statements, risk mentions)	Cost-optimized routing — extraction is high-throughput / low-complexity, best handled by a fast cheap model class
Multilingual extraction	Chinese-language filings → structured English digests with explicit field extraction (chip partners, capex, AI revenue %)	Bilingual capability + structured-output reliability are domain-specific strengths to route to
Multi-stage deep synthesis	Final thesis sections generated independently with full deliberation budget on long-context inputs	Long-context + extended deep-thinking is a specialized capability deserving its own subagent class
Cross-section quality control	Reconciles long/short positions across independently-generated sections; flags contradictions; applies methodology caveats	Red-team / consistency check is a separate task type, best done by a different agent than the one that wrote the content
Bilingual rendering	Markdown → styled HTML → PDF with cover, ToC, multilingual font fallback	Print-quality typesetting is a deterministic stage, handled by classical tools rather than AI
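
The cross-section quality-control stage in the table above can be illustrated with a minimal sketch. It assumes each independently generated section has already been reduced to a list of (ticker, side) pairs; that layout, the function name, and the placeholder tickers are illustrative assumptions, not the pipeline's actual format.

```python
from collections import defaultdict


def find_position_conflicts(sections):
    """Return tickers recommended both long and short across sections.

    `sections` maps a section name to a list of (ticker, side) pairs,
    where side is "long" or "short". This layout is hypothetical.
    """
    seen = defaultdict(lambda: defaultdict(list))  # ticker -> side -> section names
    for name, positions in sections.items():
        for ticker, side in positions:
            seen[ticker][side].append(name)

    conflicts = {}
    for ticker, sides in seen.items():
        if "long" in sides and "short" in sides:
            conflicts[ticker] = {"long": sides["long"], "short": sides["short"]}
    return conflicts


# Placeholder tickers, not names from the thesis.
sections = {
    "top_longs": [("AAA", "long"), ("BBB", "long")],
    "pair_trades": [("BBB", "short"), ("CCC", "long")],
}
conflicts = find_position_conflicts(sections)
print(conflicts)
```

A flagged ticker is not necessarily an error (a pair trade may deliberately short a name that is long elsewhere), so a check like this would feed a human or reviewer-agent audit rather than reject output automatically.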

Why this architecture matters. No single model is best at all of: cheap-fast bulk digesting, deep multi-stage reasoning, agentic tool use, structural design, code generation, multilingual extraction, parallel research dispatch, quality control. The orchestration layer routes each subtask to the most appropriate specialized subagent for that stage. The compound effect — many specialized agents coordinating in parallel — is what neither single-call Deep Research nor single-agent baselines can match.
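
As a rough sketch of the routing idea, the mapping below sends each stage to a model-class label and batches tasks accordingly. The stage keys, class labels, and function names are all hypothetical; the source does not name the models or the orchestration API it uses.

```python
# Hypothetical stage -> model-class labels; none of these are real model names.
STAGE_TO_MODEL_CLASS = {
    "pipeline_orchestration": "agentic-reasoning",
    "parallel_research": "agentic-reasoning",
    "bulk_extraction": "fast-low-cost",
    "multilingual_extraction": "bilingual-structured-output",
    "deep_synthesis": "long-context-deliberation",
    "quality_control": "independent-reviewer",
    "rendering": "classical-tooling",  # deterministic stage, no model involved
}


def route(stage):
    """Return the model-class label for a stage, failing loudly on unknown stages."""
    if stage not in STAGE_TO_MODEL_CLASS:
        raise ValueError(f"no route configured for stage: {stage!r}")
    return STAGE_TO_MODEL_CLASS[stage]


def plan(tasks):
    """Group (stage, payload) tasks by the model class that should handle them."""
    batches = {}
    for stage, payload in tasks:
        batches.setdefault(route(stage), []).append(payload)
    return batches


tasks = [
    ("bulk_extraction", "transcript-001"),
    ("bulk_extraction", "transcript-002"),
    ("deep_synthesis", "section-longs"),
    ("quality_control", "section-longs"),
]
batches = plan(tasks)
print(batches)
```

Keeping the routing table as plain data makes it easy to re-point a stage at a different model class when measurements change, without touching the dispatch logic.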

Empirical Output Measurements

Metric	Agentic Sciences	Google Deep Research	Native single agent
Total words	26,288	7,475	8,706
Top long candidates	8	8	10
Pair trades	5 + anti-pair	5 + anti-pair	6 + anti-pair
Watchlist events	15	15	25
Source citations	corpus-scope (1,590 calls + 26 MD&A + 1,445 disclosure events)	61 web URLs	51 unique web URLs
Dated verbatim quotes	150+	~12 (~6 "UNVERIFIED proxy")	~30-40
Self-flagged UNVERIFIED tags	~5	~6+	5
Production time	Multi-day pipeline build + ~30 min synthesis	~10-15 minutes	~14 minutes (47 tool calls)
Output reproducibility	High — corpus is grep-able	Medium — re-search web	Medium — re-fetch URLs (live web)
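
One way to compare the three outputs on a common scale is quote density, that is, dated verbatim quotes per 1,000 words. The sketch below derives it from the word and quote counts in the table; the metric itself is an illustrative addition, and the Agentic Sciences figure treats "150+" as a floor of 150.

```python
# Word and quote counts from the table above. Quote counts are (low, high)
# ranges; "150+" is treated as a floor of 150, which is an assumption.
runs = {
    "Agentic Sciences": {"words": 26288, "quotes": (150, 150)},
    "Google Deep Research": {"words": 7475, "quotes": (12, 12)},
    "Native single agent": {"words": 8706, "quotes": (30, 40)},
}


def quotes_per_1000_words(words, quotes):
    """Return the (low, high) number of dated verbatim quotes per 1,000 words."""
    low, high = quotes
    return (1000 * low / words, 1000 * high / words)


density = {name: quotes_per_1000_words(**run) for name, run in runs.items()}
for name, (low, high) in density.items():
    print(f"{name}: {low:.2f} to {high:.2f} quotes per 1,000 words")
```

On these inputs the densities come out at roughly 5.7, 1.6, and 3.4 to 4.6 quotes per 1,000 words, so the quote gap is not just a function of document length.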

Surprise finding. The native single agent with tools produced output comparable in size and quality to Google Deep Research, and with a lower hallucination rate (5 UNVERIFIED tags vs Deep Research's 6+ "UNVERIFIED proxy" tags plus 1 outright product hallucination on Cambricon Siyuan 690 deployment claims). The original assumption that "a single agent without our corpus would produce thin general-knowledge output" was wrong: given web access and code tools, it built a credible thesis from scratch in ~14 minutes.

Capability Matrix (post-experiment)

Capability	Agentic Sciences	Google Deep Research	Native single agent
Read paywalled primary transcripts (CIQ / FactSet)	✅ 1,590 calls	❌ blocked by paywall	❌ blocked by paywall
Read original Chinese A-share annual reports / MD&A PDFs	✅ 26 PDFs digested	❌ relies on aggregator translations	⚠️ partial (can fetch English IR pages, not Chinese MD&A directly)
Real-time web search	❌ corpus snapshot	✅	✅
Iterative tool use (search → fetch → parse → re-search)	N/A (corpus pre-built)	limited single-pass research	✅ 47 tool calls in the comparison run
Run code to compute aggregates / parse HTML	N/A (Python pipeline pre-run)	❌	✅ code execution + Python
Cross-corpus statistical computation (mention rates etc.)	✅ measured across 1,590 calls	❌ asserts without measurement	❌ no corpus to measure
Citation discipline (source per claim)	corpus + ticker + date	61 sources, ~13% primary	51 URLs, mostly news + IR pages
Hallucination rate	~0 (corpus-bound)	1 outright + 1 timeline-conflated	0 outright (5 UNVERIFIED self-flags)
Coverage of A-share niche names	deep (MD&A direct read)	medium (via aggregators)	medium (via web; some thinness on iFlytek/Sugon)
Compliance defensibility	Pass	Marginal	Marginal (URLs may decay)

Test Cases: Same Question, Three Empirical Answers

Q1: What is iFlytek's primary AI chip partner? Provide source.

Agentic Sciences: Huawei Ascend. Source: iFlytek 2025-06-30 interim MD&A (PDF read directly, machine-translated). Spark large-model training on Ascend processors + Atlas SuperPoD. [verbatim from primary filing]

Google Deep Research: "Huawei Ascend 950 / Atlas 950 SuperPoD". Source: TrendForce + Chinese aggregators. The 950PR only entered mass production in 2026-04, and the 950DT/SuperPoD is scheduled for Q4 2026, so Deep Research conflated roadmap with shipped product. [partial, timeline-conflated]

Native single agent: Huawei Ascend. Source: web search result citing Liu Qingfeng public statements + secondary news. Did not assert a specific chip generation, so it avoided the timeline error. [correct, conservative]

Q2: 2026 aggregate hyperscaler capex. Specific number.

Agentic Sciences: ~$585B floor (built from each company's disclosed minimum in earnings calls). Conservative, anchored on actual management guidance. [disclosed floor]

Google Deep Research: $725B from a Tom's Hardware article citing an analyst aggregate estimate. [analyst aggregate]

Native single agent: $665-740B. Built bottom-up from individual company guides fetched live: GOOGL $180-190B (CNBC 2026-04-29), MSFT ~$190B, META $125-145B, AMZN ~$200B (TheNextWeb), ORCL ~$50B. Each line item has a source URL. [most defensible, bottom-up]

Q3: Tencent's exact quote on GPU rationing for external cloud customers?

Agentic Sciences: 2026-03-18: "Tencent Cloud continued to face revenue headwinds due to limited availability of GPU for external customers as we prioritize our internal needs." Plus a 2025-03-19 quote on internal allocation. Both verbatim from transcripts. [verbatim + dated]

Google Deep Research: Paraphrased the substance correctly but gave no exact verbatim line. [paraphrase]

Native single agent: Did not obtain this quote in this run. Web search did not surface it; finding it would require the actual transcript text on a free-tier IR page. The thesis cites Tencent's general GPU constraint context with secondary-source attribution. [general only]
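
The "verbatim + dated" standard in this test case can be checked mechanically. The sketch below is a minimal illustration, assuming quotes are stored as records with a ticker, call date, and text; the `Quote` class, the normalization rules, and the transcript snippet are hypothetical, while the quote text is the one reported above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    ticker: str
    call_date: str  # ISO date of the earnings call
    text: str


def normalize(s):
    """Unify curly quotes, collapse whitespace, and lowercase for comparison."""
    s = s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    return re.sub(r"\s+", " ", s).strip().lower()


def is_verbatim(quote, transcript_text):
    """True if the quote appears in the transcript, ignoring layout differences."""
    return normalize(quote.text) in normalize(transcript_text)


quote = Quote(
    ticker="0700.HK",
    call_date="2026-03-18",
    text=(
        "Tencent Cloud continued to face revenue headwinds due to limited "
        "availability of GPU for external customers as we prioritize our "
        "internal needs."
    ),
)

# Hypothetical transcript excerpt with different line wrapping.
transcript = (
    "Prepared remarks.\n"
    "Tencent Cloud continued to face revenue headwinds due to limited\n"
    "availability of GPU for external customers as we prioritize our\n"
    "internal needs.\n"
)
paraphrase = Quote("0700.HK", "2026-03-18", "Tencent is rationing GPUs for cloud customers.")

print(is_verbatim(quote, transcript), is_verbatim(paraphrase, transcript))
```

A substring match after normalization separates true verbatim quotes from paraphrases, which is the distinction this test case turns on.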

Q4: HBM mention rate in Chinese cloud earnings calls?

Agentic Sciences: 0% across the self-disclosure layer; 4% if analyst-question references are included. Measured across 4 cloud companies × 28 calls. A methodology footnote in the thesis reconciles the two figures. [measured statistic]

Google Deep Research: Asserts "exactly 0%" without showing methodology. Likely picked up the conclusion from secondary commentary. [borrowed assertion]

Native single agent: Discusses the bifurcation directionally but does not produce the mention-rate statistic. Acknowledges in the body that it cannot run mention-rate analysis without a transcript corpus. [honest gap]
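
A mention-rate statistic of this kind can be sketched as follows. The split into management and analyst turns mirrors the self-disclosure vs analyst-question distinction above, but the data layout, the regex, and the toy calls are assumptions for illustration; they are not the corpus or the methodology used in the thesis.

```python
import re

# Matches HBM, HBM3, HBM3E and similar tokens (illustrative pattern).
HBM = re.compile(r"\bHBM\w*", re.IGNORECASE)


def mention_rates(calls):
    """Return (management_rate, any_speaker_rate) as fractions of calls.

    Each call is a dict with a "turns" list of (role, text) pairs, where
    role is "management" or "analyst". This layout is hypothetical.
    """
    if not calls:
        raise ValueError("no calls to measure")
    mgmt_hits = 0
    any_hits = 0
    for call in calls:
        turns = call["turns"]
        if any(role == "management" and HBM.search(text) for role, text in turns):
            mgmt_hits += 1
        if any(HBM.search(text) for _, text in turns):
            any_hits += 1
    return mgmt_hits / len(calls), any_hits / len(calls)


# Toy data: four calls, one analyst question mentioning HBM.
calls = [
    {"turns": [("management", "Cloud revenue grew on AI demand.")]},
    {"turns": [("management", "Capex will rise next year."),
               ("analyst", "Any constraint from HBM3E supply?")]},
    {"turns": [("management", "We prioritize internal GPU needs.")]},
    {"turns": [("analyst", "How is inference pricing trending?")]},
]
mgmt_rate, any_rate = mention_rates(calls)
print(mgmt_rate, any_rate)
```

Counting calls rather than raw keyword hits keeps one long answer from dominating the statistic, and reporting both rates makes the gap between what management volunteers and what analysts ask about explicit.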

Q5: Cambricon 2025 financial milestone (specific number)?

Agentic Sciences: +4,347.82% YoY H1 2025 revenue growth to CNY 2.88B. Source: Cambricon 2025-06-30 interim MD&A (direct PDF read). [interim filing]

Google Deep Research: General "Cambricon revenue ramp" narrative, less granular. [directional]

Native single agent: RMB 6.5B 2025 revenue / RMB 2.06B net profit (first profitable year ever). Source: Fortune 2025-08-27 article. Different but valid datapoint: uses 2025 full-year figure, found via web search. [verified via news]
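
A growth figure this large is easier to sanity-check once converted back to a base. The sketch below does that arithmetic for the +4,347.82% and CNY 2.88B figures quoted above; the implied prior-period revenue is derived from those rounded numbers, not read from the filing, and the function names are illustrative.

```python
def yoy_growth_pct(current, prior):
    """Year-over-year growth in percent."""
    return (current / prior - 1) * 100


def implied_prior(current, growth_pct):
    """Prior-period value implied by a current value and a YoY growth rate."""
    return current / (1 + growth_pct / 100)


h1_2025_revenue = 2.88e9  # CNY, rounded figure quoted above
growth_pct = 4347.82      # percent, as quoted above

prior = implied_prior(h1_2025_revenue, growth_pct)
print(f"implied H1 2024 revenue: CNY {prior / 1e6:.1f}M")
print(f"round trip: {yoy_growth_pct(h1_2025_revenue, prior):.2f}%")
```

The implied base is on the order of CNY 65M, so the headline percentage reflects growth from a very small starting point.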

Q6: A-share AI infra company with US patent infringement case?

Agentic Sciences: Eoptolink (300502). AOI filed N.D. California 3:24-cv-08165, 2024-11-19. Source: public corporate disclosure event records. [specific case + court]

Google Deep Research: Likely captured given web news coverage. [probably correct]

Native single agent: Did not surface this specific case. It is niche enough to require disclosure-event-grade coverage, which web search does not return cleanly. [missed]

Where Each Genuinely Wins

Agentic Sciences wins decisively at:

Native single agent wins at:

Google Deep Research wins at:

Failure Modes (empirical)

Agentic Sciences thesis fails when:

Google Deep Research fails when:

Native single agent fails when:

Cost / Effort Matrix

Tool	One-time setup cost	Marginal cost per thesis	Best for
Agentic Sciences	High (build corpus pipeline: research-platform access + scraper + cleaner + summarizer + indexer; days of engineering)	Low (corpus is reusable; re-running the thesis is a single orchestration call)	Repeated investment processes where the same corpus is queried many times; compliance-grade defensibility
Google Deep Research	None (just open the deep research tool)	Free or marginal cost (consumer AI subscription)	One-shot thought-leadership; macro framing; quick on-demand topics
Native single agent	None (general-purpose AI agent)	Per-task tool calls (~$1-3 per thesis at typical pricing)	Ad-hoc deep-dives where iterative tool use beats single-pass; bottom-up data building

Optimal Workflow Across All Three

The full stack:
① Use the native single agent first to design the corpus and analytical framework. It excels at structural reasoning, methodology design, and red-team auditing.
② Once the corpus pipeline is built, use it to produce the Agentic Sciences-grade thesis as the auditable trading book: compliance-defensible, fact-checked, with cross-corpus statistics.
③ Run Google Deep Research monthly as a macro / regulatory news refresh layer to compensate for corpus snapshot lag.
④ For one-off questions or fast-turnaround pitches, the native single agent with tools alone is sufficient. In the comparison run it produced an 8,700+ word thesis in ~14 minutes with citation discipline competitive with Deep Research.