Avena Research · working paper · 2026-06-10

DELPHI: A Daily Longitudinal Survey of Machine Beliefs About a Real Asset Class

Henrik Kolstad · Avena Terminal, Oslo
CC BY 4.0 · DOI 10.5281/zenodo.19520064 · Live instrument: avenaterminal.com/delphi

Abstract

Surveys of expert expectations are foundational instruments in empirical finance. As large language models increasingly mediate investment research, the beliefs these models hold about asset markets have become market-relevant in their own right, yet no instrument records them. We introduce DELPHI, to our knowledge the first daily longitudinal survey in which the panelists are frontier AI models. Each day, an identical bank of forward-looking quantitative questions about European residential property is posed to multiple LLMs under identical answer-only prompting. We record per-model answers verbatim, aggregate them to a median consensus and a max − min dispersion per question, and publish two daily indices: a directionally normalized Consensus Index and a Disagreement Index. Every question carries a pre-specified public resolution source, so panel beliefs are eventually scored against realized outcomes, yielding a public calibration record of machine judgment on a real asset class. The time series is irreproducible by construction: a model's belief on date t can only be observed on date t. The record began 2026-06-10 and accumulates daily.

1. Introduction

Expectation surveys occupy a central place in macro-finance because beliefs move markets independently of fundamentals. The ZEW Indicator of Economic Sentiment has polled human financial analysts monthly since 1991; central banks run professional-forecaster surveys precisely because the distribution of expectations — not only its mean — carries information.

A new class of market participant has appeared. Large language models draft investment memos, screen markets, and answer the question “should I buy property in Spain?” millions of times a year. Their beliefs propagate into human decisions through every such interaction. Three properties make these beliefs worth recording systematically: they are influential (model-mediated research is a growing share of investment workflow), they are heterogeneous (different model families produce materially different quantitative beliefs, as our first panels demonstrate), and they are perishable — a model's belief on date t is only observable on date t. Unlike price data, the series cannot be backfilled, which makes a continuous record valuable in proportion to its length.

DELPHI — named for the Delphi survey method, whose round-one structure it implements with machine panelists — is, to our knowledge, the first instrument to record these beliefs daily against a fixed question bank with pre-registered resolution criteria.

2. Methodology

Question bank. Twelve forward-looking quantitative questions about European residential property, each typed as a probability (0–100%), a percentage change (−10%…+10%), or a 0–100 scale rating; each tagged with a directional sign (whether a high answer is bullish or bearish for the asset class), a horizon in months, and a resolution source — a named public statistic (ECB MFI interest-rate statistics; Eurostat house-price index; national statistics offices) against which the question resolves at horizon. The bank is version-controlled; any change increments the published version.
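As a minimal sketch of how one entry in such a bank could be represented, assuming a Python record type: the field names, the `Kind` enum, and the example entry are hypothetical and are not taken from the published bank. Only the three answer ranges and the tag set (sign, horizon, resolution source) come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PROBABILITY = "probability"   # answer in 0..100 percent
    PCT_CHANGE = "pct_change"     # answer in -10..+10 percent
    RATING = "rating"             # answer on a 0..100 scale


# Allowed answer range per question type, as stated in the text.
RANGES = {
    Kind.PROBABILITY: (0.0, 100.0),
    Kind.PCT_CHANGE: (-10.0, 10.0),
    Kind.RATING: (0.0, 100.0),
}


@dataclass(frozen=True)
class Question:
    qid: str                 # hypothetical identifier
    text: str
    kind: Kind
    sign: int                # +1 if a high answer is bullish, -1 if bearish
    horizon_months: int
    resolution_source: str   # named public statistic

    def __post_init__(self):
        if self.sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        if self.horizon_months <= 0:
            raise ValueError("horizon_months must be positive")

    def in_range(self, answer: float) -> bool:
        """True if the answer lies inside the range allowed for this type."""
        lo, hi = RANGES[self.kind]
        return lo <= answer <= hi


# Illustrative entry only: wording, sign, and source pairing are assumptions.
example = Question(
    qid="q-example",
    text="Probability that the ECB cuts rates within six months?",
    kind=Kind.PROBABILITY,
    sign=1,
    horizon_months=6,
    resolution_source="ECB MFI interest-rate statistics",
)
```

Freezing the record mirrors the version-control rule: a question is never edited in place, and any change would produce a new bank version.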

Panel and elicitation. The launch panel comprises three models from two independent providers, intentionally mixing retrieval-augmented and parametric-knowledge panelists. Each question is posed in a fresh context with an answer-only instruction (a single number, no reasoning) to suppress format drift. Panelists never see one another's answers, which makes each run a true Delphi round one. The operator's own analytics never participate: the referee does not play on the scoreboard.
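Answer-only elicitation implies that a reply is usable only if it is a single number. The sketch below shows one way a raw reply could be validated before aggregation. How DELPHI actually treats malformed replies is not stated here, so the function name `parse_answer`, the accepted syntax (optional sign, optional trailing percent sign), and the choice to return `None` are all assumptions.

```python
import re
from typing import Optional

# One number, optional sign, optional decimal part, optional trailing "%".
_NUMBER = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*%?\s*$")


def parse_answer(raw: str, lo: float, hi: float) -> Optional[float]:
    """Return the numeric answer, or None if the reply is not a single
    in-range number. The verbatim reply should still be stored as-is."""
    # Accept the typographic minus sign (U+2212) as an ordinary minus.
    match = _NUMBER.match(raw.replace("\u2212", "-"))
    if match is None:
        return None
    value = float(match.group(1))
    return value if lo <= value <= hi else None
```

A strict parser of this kind keeps replies that add reasoning or hedging out of the aggregate while leaving the verbatim record untouched.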

Aggregation. Per question: consensus = median; dispersion = max − min. Per day: the Consensus Index is the mean of bullishness-normalized answers (50 = neutral, higher = collectively bullish for European property); the Disagreement Index is the mean dispersion. Medians and ranges are preferred for robustness with small panels.
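The aggregation can be sketched as follows. The median and max − min steps follow the definitions above; the normalization is an assumption, since the exact mapping is not specified here. This sketch maps percentage-change answers linearly from −10…+10 onto 0–100, flips bearish-signed questions as 100 − x, averages the normalized per-question medians for the Consensus Index, and averages raw-scale dispersions for the Disagreement Index. All function names and the numbers in `panel` are illustrative, not DELPHI data.

```python
from statistics import mean, median


def to_bullish_scale(value: float, kind: str, sign: int) -> float:
    """Map an answer onto 0..100 with 50 neutral (assumed linear mapping)."""
    if kind == "pct_change":
        scaled = (value + 10.0) * 5.0      # -10..+10 -> 0..100
    else:
        scaled = value                     # probability and rating are 0..100
    return scaled if sign > 0 else 100.0 - scaled


def aggregate_question(answers):
    """Per-question consensus (median) and dispersion (max minus min)."""
    return {
        "consensus": median(answers),
        "dispersion": max(answers) - min(answers),
    }


def daily_indices(panel):
    """panel: list of (kind, sign, answers), one tuple per question."""
    consensus_terms, dispersions = [], []
    for kind, sign, answers in panel:
        agg = aggregate_question(answers)
        consensus_terms.append(to_bullish_scale(agg["consensus"], kind, sign))
        dispersions.append(agg["dispersion"])
    return {
        "consensus_index": mean(consensus_terms),
        "disagreement_index": mean(dispersions),
    }


# Made-up answers from a three-model panel on three questions.
panel = [
    ("probability", +1, [25, 60, 72]),
    ("pct_change", +1, [1.0, 2.0, 3.0]),
    ("rating", -1, [40, 50, 70]),
]
indices = daily_indices(panel)
```

Because the normalization is linear, normalizing the median gives the same value as taking the median of normalized answers, so the order of those two steps does not matter under this assumption.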

Integrity. Every run is event-sourced and replayable. Daily artifacts are committed to a Merkle root, timestamped under RFC 3161, and anchored to a Zenodo DOI. The full per-model, per-question, per-day record is public via API and mirrored daily to a public git repository whose commit history independently witnesses the series.
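To illustrate the commitment step, the sketch below computes a Merkle root over a day's artifacts. The tree construction used by DELPHI is not described here, so every detail is an assumption: SHA-256 as the hash, one-byte prefixes that separate leaf hashes from interior hashes, and duplication of the last node on odd-sized levels. RFC 3161 timestamping of the resulting root is a separate step and is not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(artifacts: list[bytes]) -> str:
    """Hex Merkle root over artifacts, in the order given."""
    if not artifacts:
        raise ValueError("need at least one artifact")
    # Leaf and interior hashes use different prefixes so that a leaf
    # can never be mistaken for an interior node.
    level = [_h(b"\x00" + a) for a in artifacts]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])        # duplicate last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing any single artifact changes the root, so publishing the timestamped root commits the operator to the full daily record without having to timestamp each file.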

Resolution and calibration. At each question's horizon the realized outcome is read from the pre-specified source. Probability questions are scored by Brier score; quantitative questions by absolute error. Accumulating resolutions yield per-model calibration curves: a public track record of machine judgment that complements knowledge benchmarks with a measure of foresight.
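The two scoring rules are standard and can be sketched directly. For a probability answer p (in percent) and a binary outcome o, the Brier score is (p/100 − o)²; for a quantitative answer it is the absolute difference from the realized value. The calibration-curve helper is an assumption about how resolved forecasts might be binned (equal-width bins, five by default); function names are illustrative.

```python
def brier_score(prob_pct: float, occurred: bool) -> float:
    """Squared error between a stated probability and the 0/1 outcome."""
    p = prob_pct / 100.0
    return (p - (1.0 if occurred else 0.0)) ** 2


def absolute_error(answer: float, realized: float) -> float:
    return abs(answer - realized)


def calibration_curve(forecasts, n_bins: int = 5):
    """forecasts: iterable of (prob_pct, occurred) for one model.

    Returns one (mean forecast, observed frequency, count) tuple per
    non-empty equal-width probability bin, in ascending bin order.
    """
    bins = [[] for _ in range(n_bins)]
    for prob_pct, occurred in forecasts:
        p = prob_pct / 100.0
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in last bin
        bins[idx].append((p, 1.0 if occurred else 0.0))
    curve = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(o for _, o in b) / len(b)
            curve.append((mean_p, freq, len(b)))
    return curve
```

A well-calibrated panelist produces points near the diagonal, where the mean forecast in each bin matches the observed frequency; lower Brier scores and lower absolute errors are better.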

3. First-panel findings

The inaugural panel (2026-06-10) opened at Consensus Index 53.3 (mildly bullish) with Disagreement Index 19.9. The widest split concerned the probability of ECB rate cuts within six months: 25% versus 72%, a 47-point spread between frontier models on the single most consequential variable for the asset class. Attributable inter-model disagreement of this size on a well-posed question is itself a finding about the epistemic state of deployed AI systems; whether it persists is what the daily series will show. We further observe round-number anchoring in a smaller panelist (identical values returned across unrelated questions) and systematic differences between retrieval-augmented and parametric beliefs. The live series at avenaterminal.com/delphi supersedes this section daily.
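One way to quantify the anchoring pattern described above is to measure how often a panelist repeats a single value across questions. The sketch below is an assumption about how such a check could look, not the procedure used for this section; the metric names and the multiple-of-five test for roundness are hypothetical choices.

```python
from collections import Counter


def anchoring_stats(answers):
    """answers: one panelist's numeric answers across the question bank."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    _, top_count = counts.most_common(1)[0]
    return {
        # Share of answers equal to the panelist's most frequent value.
        "repeat_share": top_count / len(answers),
        # Share of answers that are exact multiples of 5.
        "round_share": sum(1 for a in answers if a % 5 == 0) / len(answers),
    }
```

The multiple-of-five test is crude for percentage-change questions, whose range is only −10 to +10, so the repeat share across unrelated questions is the more informative of the two figures.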

4. Limitations

Panel size is small at launch; the architecture admits any model exposing an API. Answer-only elicitation trades reasoning transparency for comparability; alternative elicitations are a planned ablation. What an LLM “believes” is operationalized strictly as its answer under the fixed protocol — the protocol-conditional belief, which is precisely the quantity that propagates to users. Providers update models; version strings are recorded per response, making transitions visible breaks rather than silent drift.
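Because a version string is recorded with every response, model transitions can be located mechanically. A minimal sketch, assuming each panelist's history is available as date-ordered (date, version) pairs; the record layout and the version strings in the usage example are hypothetical.

```python
def version_breaks(records):
    """records: list of (date, version) pairs for one panelist, sorted by date.

    Returns (date, old_version, new_version) for every day on which the
    recorded version differs from the previous day's.
    """
    breaks = []
    for (_, prev), (day, cur) in zip(records, records[1:]):
        if cur != prev:
            breaks.append((day, prev, cur))
    return breaks


# Hypothetical history with one provider-side update.
history = [
    ("2026-06-10", "model-a-v1"),
    ("2026-06-11", "model-a-v1"),
    ("2026-06-12", "model-a-v2"),
]
breaks = version_breaks(history)
```

Marking these dates in the series lets a reader separate a change in belief from a change in the panelist itself.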

The central property bears repeating: the series cannot be reconstructed retroactively. Whatever its eventual scientific use — machine herding, calibration, the transmission of model beliefs into prices — the prerequisite is that someone recorded the beliefs at the time. That is what DELPHI does, daily.

Data availability

Paths are relative to avenaterminal.com. Live instrument: /delphi · JSON: /api/v1/delphi · RSS: /feed/delphi.xml · Daily git mirror: github.com/HenrikKolstad/avena-data · DCAT-AP: /catalog.jsonld · License CC BY 4.0 · DOI 10.5281/zenodo.19520064. Companion benchmark (PLAB): /benchmark.

Cite as

Kolstad, H. (2026). DELPHI: A Daily Longitudinal Survey of Machine Beliefs About a Real Asset Class. Avena Terminal Research. https://avenaterminal.com/papers/delphi. DOI 10.5281/zenodo.19520064.