Three-Way Empirical Comparison
Same prompt — Cross-Pacific AI Infra Stock-Investment Thesis (38 stocks / 6 countries / 10 layers) — produced three ways and measured. Numbers below are empirical from actual runs, not estimates.
Methodology. Three independent production paths for the same investment thesis prompt:
(1) Agentic Sciences — proprietary multi-agent orchestration pipeline. A primary orchestrator agent designs the workflow and dispatches specialized subagents in parallel; each subagent is routed to the most appropriate underlying model for its subtask. Built on top of a custom-curated primary-source corpus.
(2) Google Deep Research — single-model deep research mode with web search.
(3) Native single agent (no corpus) — a general-purpose AI agent with web + code tools, but explicitly without access to the Agentic Sciences proprietary corpus. Single-agent execution; no orchestration layer.
Three differentiators, compounding:
(1) Multi-model ensemble — best of every model. Each subagent is routed to the model that is empirically strongest for its subtask: one class for structural reasoning and tool orchestration; another optimized for cost-per-token on high-throughput structured extraction at scale; a third with extended deliberation budget for long-context multi-stage synthesis; specialized subagents for parallel research and quality control. No single model is best at everything — we use each where it wins.
(2) Proprietary primary-source corpus. Curated earnings-call transcripts, MD&A filings, and corporate-event databases that general web tools cannot reach behind paywalls. Every claim ties back to a verbatim quote with date attribution.
(3) Domain-expert judgment in the loop. A 14-year quant-economics researcher (Cornell PhD · AFA 2026) sets the framework, audits the synthesis, and red-teams the conclusions — turning agent output into an investable thesis, not a content artifact.
Neither single-model Deep Research nor single-agent baselines can match this compound effect.
① Agentic Sciences Pipeline
Multi-agent orchestration: primary orchestrator → parallel specialized subagents (extraction, synthesis, QC) → cost-routed across model classes. 26,288 words / 64 pages / 150+ dated verbatim quotes. Cross-corpus statistics + direct A-share MD&A reading. Each model used where it is best.
Educational research only · Not investment advice.
② Google Deep Research
7,475 words / 28 pages / ~12 quotes (6 self-flagged "UNVERIFIED proxy"). 61 web sources, ~13% primary. Strong on macro framing.
③ Native single agent (web tools + code)
8,706 words / 51 URL citations / only 5 UNVERIFIED tags. Real-time web search + web fetch + code execution.
Comparable to Deep Research in coverage and speed (~14 minutes vs Deep Research's ~10-15), with a lower hallucination rate.
Agentic Sciences — Orchestration Stages
| Stage | What it does | Why orchestration matters here |
| Pipeline orchestration | Primary agent designs the corpus-build workflow, manages data acquisition, dispatches downstream subagents | Structural reasoning and tool use are best handled by an agent class optimized for agentic decision-making |
| Parallel research subagents | Spawned in parallel for gap analysis, fact-checking, comparison audits, focused deep-dives | Independent context per subagent — parallel work without polluting the orchestrator's reasoning trace |
| Bulk structured extraction | Thousands of source documents converted into structured records (sentiment, quotes, forward statements, risk mentions) | Cost-optimized routing — extraction is high-throughput / low-complexity, best handled by a fast cheap model class |
| Multilingual extraction | Chinese-language filings → structured English digests with explicit field extraction (chip partners, capex, AI revenue %) | Bilingual capability + structured-output reliability are domain-specific strengths to route to |
| Multi-stage deep synthesis | Final thesis sections generated independently with full deliberation budget on long-context inputs | Long-context + extended deep-thinking is a specialized capability deserving its own subagent class |
| Cross-section quality control | Reconciles long/short positions across independently-generated sections; flags contradictions; applies methodology caveats | Red-team / consistency check is a separate task type, best done by a different agent than the one that wrote the content |
| Bilingual rendering | Markdown → styled HTML → PDF with cover, ToC, multilingual font fallback | Print-quality typesetting is a deterministic stage, handled by classical tools rather than AI |
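The cross-section quality-control stage in the table above reconciles long/short positions across independently generated sections. A minimal sketch of that check follows, assuming each section has already been reduced to a ticker-to-side mapping. The function name, the data shape, and the tickers `AAA`/`BBB`/`CCC` are hypothetical illustrations, not the pipeline's actual interface.

```python
from collections import defaultdict


def find_position_conflicts(
    sections: dict[str, dict[str, str]],
) -> dict[str, dict[str, list[str]]]:
    """Return tickers that are 'long' in one section and 'short' in another.

    `sections` maps a section name to {ticker: "long" | "short"}.
    This input shape is an assumption made for illustration.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for section_name, positions in sections.items():
        for ticker, side in positions.items():
            seen[ticker][side.lower()].append(section_name)

    # Keep only tickers that appear on both sides of the book.
    return {
        ticker: {side: sorted(names) for side, names in sides.items()}
        for ticker, sides in seen.items()
        if "long" in sides and "short" in sides
    }


# Toy input with placeholder tickers.
sections = {
    "compute": {"AAA": "long", "BBB": "short"},
    "memory": {"AAA": "short", "CCC": "long"},
}
conflicts = find_position_conflicts(sections)
print(conflicts)
```

A flagged ticker is not necessarily an error (a pair trade may intentionally sit on both sides), so a real QC pass would likely hand these to a reviewer rather than auto-correct them.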
Why this architecture matters. No single model is best at all of: cheap-fast bulk digesting, deep multi-stage reasoning, agentic tool use, structural design, code generation, multilingual extraction, parallel research dispatch, quality control. The orchestration layer routes each subtask to the most appropriate specialized subagent for that stage. The compound effect — many specialized agents coordinating in parallel — is what neither single-call Deep Research nor single-agent baselines can match.
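The routing idea described above can be sketched as a lookup from stage to model class, plus a parallel dispatch step. This is only an illustration under stated assumptions: the stage keys, the model-class labels, and `run_subagent` are hypothetical placeholders, and the stand-in function records the routing decision instead of calling any real model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical stage -> model-class routing table. The labels are
# placeholders, not the pipeline's real configuration.
ROUTES = {
    "orchestration": "agentic-reasoning",
    "bulk_extraction": "fast-cheap",
    "multilingual_extraction": "bilingual-structured",
    "deep_synthesis": "long-context-deliberate",
    "quality_control": "independent-reviewer",
}


@dataclass(frozen=True)
class Subtask:
    stage: str
    payload: str


def route(task: Subtask) -> str:
    """Return the model class for a subtask; fail loudly on unknown stages."""
    try:
        return ROUTES[task.stage]
    except KeyError:
        raise ValueError(f"no route for stage {task.stage!r}") from None


def run_subagent(task: Subtask) -> dict:
    """Stand-in for a model call: records which class would handle the task."""
    return {"stage": task.stage, "model_class": route(task), "input": task.payload}


def dispatch(tasks: list[Subtask], max_workers: int = 4) -> list[dict]:
    """Run subtasks in parallel; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subagent, tasks))


tasks = [
    Subtask("bulk_extraction", "transcript batch 1"),
    Subtask("deep_synthesis", "section draft inputs"),
]
results = dispatch(tasks)
for r in results:
    print(r["stage"], "->", r["model_class"])
```

Keeping the routing table as plain data means a stage can be re-pointed at a different model class without touching the dispatch logic, which is one plausible way to realize "each model used where it is best".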
Empirical Output Measurements
| Metric | Agentic Sciences | Google Deep Research | Native single agent |
| Total words | 26,288 | 7,475 | 8,706 |
| Top long candidates | 8 | 8 | 10 |
| Pair trades | 5 + anti-pair | 5 + anti-pair | 6 + anti-pair |
| Watchlist events | 15 | 15 | 25 |
| Source citations | corpus-scope (1,590 calls + 26 MD&A + 1,445 disclosure events) | 61 web URLs | 51 unique web URLs |
| Dated verbatim quotes | ~150+ | ~12 (~6 "UNVERIFIED proxy") | ~30-40 |
| Self-flagged UNVERIFIED tags | ~5 | ~6+ | 5 |
| Production time | Multi-day pipeline build + ~30 min synthesis | ~10-15 minutes | ~14 minutes (47 tool calls) |
| Output reproducibility | High — corpus is grep-able | Medium — re-search web | Medium — re-fetch URLs (live web) |
Surprise finding. The native single agent with tools produced output comparable in size and quality to Google Deep Research, and with a lower hallucination rate (5 UNVERIFIED tags and no outright hallucinations, vs Deep Research's 6+ "UNVERIFIED proxy" tags plus 1 outright product hallucination on Cambricon Siyuan 690 deployment claims). The original assumption that "a single agent without our corpus would produce thin general-knowledge output" was wrong: given web access and code tools, it can build a credible thesis from scratch in ~15 minutes.
Capability Matrix (post-experiment)
| Capability | Agentic Sciences | Google Deep Research | Native single agent |
| Read paywalled primary transcripts (CIQ / FactSet) | ✅ 1,590 calls | ❌ blocked by paywall | ❌ blocked by paywall |
| Read original Chinese A-share annual reports / MD&A PDFs | ✅ 26 PDFs digested | ❌ relies on aggregator translations | ⚠️ partial (can fetch English IR pages, not Chinese MD&A directly) |
| Real-time web search | ❌ corpus snapshot | ✅ | ✅ |
| Iterative tool use (search → fetch → parse → re-search) | N/A (corpus pre-built) | limited single-pass research | ✅ 47 tool calls in the comparison run |
| Run code to compute aggregates / parse HTML | N/A (Python pipeline pre-run) | ❌ | ✅ code execution + Python |
| Cross-corpus statistical computation (mention rates etc.) | ✅ measured across 1,590 calls | ❌ asserts without measurement | ❌ no corpus to measure |
| Citation discipline (source per claim) | corpus + ticker + date | 61 sources, ~13% primary | 51 URLs, mostly news + IR pages |
| Hallucination rate | ~0 (corpus-bound) | 1 outright + 1 timeline-conflated | 0 outright (5 UNVERIFIED self-flags) |
| Coverage of A-share niche names | deep (MD&A direct read) | medium (via aggregators) | medium (via web; some thinness on iFlytek/Sugon) |
| Compliance defensibility | Pass | Marginal | Marginal (URLs may decay) |
Test Cases — Same Question, Three Empirical Answers
Q1: What is iFlytek's primary AI chip partner? Provide source.
Agentic Sciences: Huawei Ascend. Source: iFlytek 2025-06-30 interim MD&A (PDF read directly, machine-translated). Spark large-model training on Ascend processors + Atlas SuperPoD. [verbatim from primary filing]
Google Deep Research: "Huawei Ascend 950 / Atlas 950 SuperPoD". Source: TrendForce + Chinese aggregators. The 950PR only entered mass production 2026-04 and the 950DT/SuperPoD is scheduled for Q4 2026, so Deep Research conflated roadmap with shipped product. [partial: timeline-conflated]
Native single agent: Huawei Ascend. Source: web search result citing Liu Qingfeng's public statements + secondary news. Did not assert a specific chip generation, so it avoided the timeline error. [correct, conservative]
Q2: 2026 aggregate hyperscaler capex. Specific number.
Agentic Sciences: ~$585B floor (sum of each company's disclosed minimum from earnings calls). Conservative, anchored on actual management guides. [disclosed floor]
Google Deep Research: $725B, from a Tom's Hardware article citing an analyst aggregate estimate. [analyst aggregate]
Native single agent: $665-740B. Built bottom-up from individual company guides fetched live: GOOGL $180-190B (CNBC 2026-04-29), MSFT ~$190B, META $125-145B, AMZN ~$200B (TheNextWeb), ORCL ~$50B. Each line has a source URL. [most defensible: bottom-up]
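A bottom-up aggregate of this kind is just a sum of per-company low and high guides. The sketch below uses the five figures exactly as listed in the native single agent's answer, treating each "~" value as a point estimate (an assumption). Summed this way the listed figures give $745-775B, which is above the $665-740B range quoted for the run; the source does not say which inputs or adjustments produced the quoted range, so this is an arithmetic illustration, not a reconstruction of the agent's calculation.

```python
# (low, high) 2026 capex guides in $B, copied from the answer above.
# "~" figures are treated as point estimates (an assumption).
guides = {
    "GOOGL": (180, 190),
    "MSFT": (190, 190),
    "META": (125, 145),
    "AMZN": (200, 200),
    "ORCL": (50, 50),
}


def aggregate(guides: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of each company's guide."""
    low = sum(lo for lo, _ in guides.values())
    high = sum(hi for _, hi in guides.values())
    return low, high


low, high = aggregate(guides)
print(f"${low}-{high}B")  # prints $745-775B
```

Keeping each company as its own line item is what makes the figure auditable: a reader can check or swap a single guide without re-deriving the whole aggregate.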
Q3: Tencent's exact quote on GPU rationing for external cloud customers?
Agentic Sciences: 2026-03-18: "Tencent Cloud continued to face revenue headwinds due to limited availability of GPU for external customers as we prioritize our internal needs." Plus a 2025-03-19 quote on internal allocation. Both verbatim from transcripts. [verbatim + dated]
Google Deep Research: Paraphrased the substance correctly but gave no exact verbatim line. [paraphrase]
Native single agent: Did not obtain this quote in this run (web search did not surface it; doing so would require finding the actual transcript text on a free-tier IR page). The thesis cites Tencent's general GPU constraint context with secondary-source attribution. [general only]
Q4: HBM mention rate in Chinese cloud earnings calls?
Agentic Sciences: 0% across the self-disclosure layer; 4% if analyst-question references are included. Measured across 4 cloud companies × 28 calls. A methodology footnote in the document reconciles the two figures. [measured statistic]
Google Deep Research: Asserts "exactly 0%" without showing methodology. Likely picked up the conclusion from secondary commentary. [borrowed assertion]
Native single agent: Discusses the bifurcation directionally but does not produce the mention-rate statistic. Acknowledges in the body that it cannot run mention-rate analysis without a transcript corpus. [honest gap]
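The distinction between the two Agentic Sciences figures (management self-disclosure only vs including analyst questions) can be illustrated with a small mention-rate calculation. This is a sketch on toy data, assuming each call has already been split into management remarks and analyst questions; the `Call` structure, the regex, and the sample text are hypothetical and do not reproduce the corpus or its measured values.

```python
import re
from dataclasses import dataclass


@dataclass
class Call:
    company: str
    management_text: str  # prepared remarks and management answers
    analyst_text: str     # analyst questions


# Matches "HBM", "HBM3E", or the spelled-out term; an illustrative pattern.
PATTERN = re.compile(r"\bHBM\w*|high[- ]bandwidth memory", re.IGNORECASE)


def mention_rates(calls: list[Call]) -> tuple[float, float]:
    """Return (management-only rate, management-or-analyst rate) per call."""
    if not calls:
        raise ValueError("need at least one call")
    mgmt = sum(bool(PATTERN.search(c.management_text)) for c in calls)
    either = sum(
        bool(PATTERN.search(c.management_text) or PATTERN.search(c.analyst_text))
        for c in calls
    )
    return mgmt / len(calls), either / len(calls)


# Toy corpus: four invented calls, one with an analyst-only mention.
calls = [
    Call("A", "We expanded GPU capacity.", "Any update on HBM3E supply?"),
    Call("B", "Cloud revenue grew.", "How should we think about capex?"),
    Call("C", "AI demand remains strong.", "What is the margin outlook?"),
    Call("D", "We prioritized internal workloads.", "Any pricing changes?"),
]
mgmt_rate, either_rate = mention_rates(calls)
print(f"management-only: {mgmt_rate:.0%}, including analysts: {either_rate:.0%}")
```

Reporting both rates side by side makes the denominator and the speaker scope explicit, which is the methodology detail a bare "0%" assertion leaves out.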
Q5: Cambricon 2025 financial milestone (specific number)?
Agentic Sciences: +4,347.82% YoY H1 2025 revenue growth to CNY 2.88B. Source: Cambricon 2025-06-30 interim MD&A (direct PDF read). [interim filing]
Google Deep Research: General "Cambricon revenue ramp" narrative, less granular. [directional]
Native single agent: RMB 6.5B 2025 revenue / RMB 2.06B net profit (first profitable year ever). Source: Fortune 2025-08-27 article. Different but valid datapoint: uses the 2025 full-year figure, found via web search. [verified via news]
Q6: A-share AI infra company with US patent infringement case?
Agentic Sciences: Eoptolink (300502). AOI filed in N.D. California, 3:24-cv-08165, 2024-11-19. Source: public corporate disclosure event records. [specific case + court]
Google Deep Research: Likely captured given web news coverage. [probably correct]
Native single agent: Did not surface this specific case; it is niche enough to require disclosure-event-grade coverage, which web search does not return cleanly. [missed]
Where Each Genuinely Wins
Agentic Sciences wins decisively at:
- Cross-corpus statistical claims: the only one with the underlying data to actually measure things like "0% Chinese-cloud HBM mention" or "sanctions discourse climbs the stack 0%→21%→41% by layer".
- A-share original-language reading: direct PDF→text on Chinese filings yields supplier identifications (iFlytek→Huawei specifically; Sugon→Hygon stock-for-stock merger; Innolight 3-way customer mix) that web search cannot surface.
- Niche dated events: Eoptolink lawsuit, Innolight HK IPO consideration, Hygon's two consecutive +55% years. This coverage requires comprehensive disclosure-event capture.
- Verbatim quote density: 150+ dated transcript-grounded quotes vs the native single agent's ~30-40 (web-IR-page pulls) vs Deep Research's ~12 (paraphrased).
- Compliance defensibility: every numeric claim traces back to a corpus file path, not a URL that may decay.
Native single agent wins at:
- Speed-to-output: ~14 minutes from cold start to a complete 8,700-word thesis with 51 cited URLs. No multi-day corpus build required.
- Bottom-up aggregate construction: built the $665-740B 2026 capex number by fetching each company's individual guide live, with a citation for each. More defensible than Deep Research's analyst-aggregate borrowing.
- Iterative refinement: 47 tool calls means it could re-search when initial results were thin (e.g., A-share names), which Deep Research's single-pass model cannot.
- Recency: covers Q1 2026 calls held in the last 30 days. On par with Deep Research here, and both beat the corpus-bound thesis.
- Code-driven analysis: can run actual Python to compute aggregations, dedupe sources, and parse HTML, which Deep Research cannot.
Google Deep Research wins at:
- Macro narrative framing: "Gigawatt Hourglass" metaphor; rhetorically the most polished.
- Synthesis fluency: single-pass synthesis writes more cohesively than the native single agent's iterative-build style.
- Sovereign / regulatory primary documents: cites USTR statements, Dutch export-control documents, and US-China commission reports more readily than the native single agent managed in this run.
Failure Modes (empirical)
Agentic Sciences thesis fails when:
- Question requires post-corpus-cutoff data (e.g., SK Hynix Q1 2026 ₩52.58T print released after our Q1 2026 cutoff)
- Real-time regulatory / policy news matters
- Sell-side analyst consensus needed (corpus has primary only)
Google Deep Research fails when:
- Compliance asks "where exactly does this come from" — many "UNVERIFIED proxy" tags
- Specific product deployment timing matters — empirical hallucination on Cambricon Siyuan 690 + Ascend 950DT/SuperPoD timeline
- Cross-corpus statistical claims required
Native single agent fails when:
- Specific verbatim transcript quote required and the quote is behind paywall (subscription research platforms) — empirical: Tencent GPU rationing exact-quote test failed
- Niche A-share disclosure events required — empirical: Eoptolink lawsuit case not surfaced
- Cross-corpus statistical computation — empirical: HBM mention rate not derivable
- Original-language Chinese A-share interim reports — fetches English summaries but not the source PDFs
Cost / Effort Matrix
| Tool | One-time setup cost | Marginal cost per thesis | Best for |
| Agentic Sciences | High (build corpus pipeline: research-platform access + scraper + cleaner + summarizer + indexer; days of engineering) | Low (corpus is reusable; re-running a thesis is a single orchestration call) | Repeated investment processes where the same corpus is queried many times; compliance-grade defensibility |
| Google Deep Research | None (just open the deep research tool) | Free or marginal cost (consumer AI subscription) | One-shot thought-leadership; macro framing; quick on-demand topics |
| Native single agent | None (general-purpose AI agent) | Per-task tool calls (~$1-3 per thesis at typical pricing) | Ad-hoc deep-dives where iterative tool use beats single-pass; bottom-up data building |
Optimal Workflow Across All Three
The full stack:
① Use Native single agent first to design the corpus and analytical framework. It excels at structural reasoning, methodology design, and red-team auditing.
② Once the corpus pipeline is built, use it to produce the Agentic Sciences-grade thesis as the auditable trading book: compliance-defensible, fact-checked, and backed by cross-corpus statistics.
③ Run Google Deep Research monthly as a macro / regulatory news refresh layer to compensate for corpus snapshot lag.
④ For one-off questions or fast-turnaround pitches, Native single agent with tools alone is sufficient — empirically it produces 8,700+ word theses in 15 minutes with citation discipline competitive with Deep Research.