Prompt Optimization: Cluster-Level LLM Crime Prediction
Goal: Find the prompt design that best predicts crime policy survey responses for 15 voter clusters
Model: claude-sonnet-4-5-20250929 (temperature=0)
Setup: 22 prompt variants × 15 clusters = 330 API calls. Each variant's predictions
applied to all 4,670 ANES respondents with complete crime data.
Held-out questions: Urban unrest (1–7), death penalty (1–4), crime spending (1–5)
Primary metric: Exact match accuracy (did the model predict the exact integer?)
Date: March 24, 2026
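Under this setup, each cluster-level prediction is broadcast to every respondent in that cluster and scored three ways. A minimal sketch of the reported metrics, using hypothetical arrays (the function name `score` is illustrative, not from the project code):

```python
def score(preds, truth):
    """Exact match, within-±1, and MAE for integer predictions vs. responses."""
    errs = [abs(p - t) for p, t in zip(preds, truth)]
    n = len(errs)
    return {
        "exact": sum(e == 0 for e in errs) / n,    # hit the right integer
        "within1": sum(e <= 1 for e in errs) / n,  # off by at most 1 point
        "mae": sum(errs) / n,                      # mean absolute error
    }

# Hypothetical example: one cluster predicted 3 on a 7-point question,
# broadcast to four respondents who answered 3, 2, 5, 3.
score([3, 3, 3, 3], [3, 2, 5, 3])
# {'exact': 0.5, 'within1': 0.75, 'mae': 0.75}
```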
Key Finding: Two Non-Dominated Designs. Clean Combo Best Overall, Modal Best on Exact Match
Across 22 variants, two designs stand out. In the two-way scatter of exact match
vs. within-±1, they are the only non-dominated designs: no other variant beats either of them on both metrics simultaneously:
- v22: Clean combo: 32.5% exact, 76.7% within-±1. Combining v01's 4-step reasoning structure with v05's visible CoT and v18's modal instruction achieves the best within-±1 of any variant and the second-highest exact match. It dominates v01 (32.1%, 74.4%) and v05 (30.6%, 76.1%) on both metrics simultaneously.
- v18: Modal prediction: 35.9% exact, 64.7% within-±1. Asking the model to predict the most common response produces the highest exact-match rate, but at a large cost to within-±1 coverage. Not beaten on exact match by any other variant.
On the within-±1 metric, a caveat applies: for questions with a narrow response
scale (specifically death penalty, 1–4), being within ±1 of any response is nearly automatic,
since a random prediction on a 4-point scale has a >50% chance of landing within one point of
the true answer. Within-±1 is most informative on the 7-point urban unrest scale, where chance
performance is lower. Exact match is therefore the more discriminating metric across all three
questions.
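The ">50% chance" claim is easy to verify. A minimal sketch for the fully uniform case, where both the prediction and the true answer are independent and uniform on a 1..k scale (the empirical rate against the actual ANES response distribution differs, but the scale-width effect is the same):

```python
from fractions import Fraction

def uniform_within1(k):
    """P(|pred - truth| <= 1) when prediction and truth are independent
    and uniform on a 1..k integer scale."""
    hits = sum(
        1
        for p in range(1, k + 1)
        for t in range(1, k + 1)
        if abs(p - t) <= 1
    )
    return Fraction(hits, k * k)

uniform_within1(4)  # 5/8  -> 62.5% on the 4-point death penalty scale
uniform_within1(7)  # 19/49 -> ~38.8% on the 7-point urban unrest scale
```

The gap between 62.5% and 38.8% is why within-±1 separates designs on the 7-point urban unrest question but not on the 4-point death penalty question.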
The clean combo works because its three components are complementary rather than conflicting:
the 4-step structure identifies which positions matter, the visible CoT commits the model to a
consistent interpretation before predicting, and the modal instruction pushes predictions toward
discrete integer peaks rather than rounded means. By contrast, earlier combination attempts (v21)
stacked redundant or conflicting techniques (two-stage ideology, archetypes, and rich scales on
top of CoT and modal) and underperformed all three non-dominated variants.
Deployment choice: v22 (clean combo) should replace v05 in the live chat.
It dominates v05 on both metrics (76.7% vs 76.1% within-±1; 32.5% vs 30.6% exact) while
preserving the visible reasoning structure that transfers naturally to the conversational context.
v18 remains the best option when exact match is the sole criterion.
Counterintuitively, more information is not always better: filtering to only
crime-adjacent positions (v04) or converting numbers to natural language (v08)
both hurt performance. And the explicit "don't moderate" instruction (v14) improves
exact match slightly (31.3%) but degrades within-±1.
Overview: Within-±1 vs. Exact Match
Each bubble is one prompt variant. X-axis = exact match accuracy (did the model hit the right integer?). Y-axis = within-±1 accuracy (was the model off by at most 1 point?). Bubble size = inverse MAE (larger = lower error).
Per-Question Breakdown: Within-±1 Accuracy
Within-±1 accuracy for each variant, split by crime question. Urban unrest (1–7 scale) is hardest to predict; crime spending (1–5) easiest.
Per-Question Breakdown: Exact Match Accuracy
Exact match accuracy for each variant, split by crime question. This is the more discriminating metric: unlike within-±1, it is not inflated on narrow scales such as death penalty (1–4), where being within one point is nearly automatic.
Full Rankings
Sorted by composite within-±1 accuracy (average across the 3 questions). Exact match remains the primary metric for the deployment decision; see the analysis below.
| # | Variant | Design | Category | Within±1 | Exact | MAE |
| --- | --- | --- | --- | --- | --- | --- |
Analysis
What Worked
1. Clean combo (v22, +2.3 pts within-±1 over baseline, best overall)
Combining v01's 4-step reasoning structure with v05's visible CoT and v18's modal instruction
achieves the best within-±1 (76.7%) and second-best exact match (32.5%) simultaneously.
This is the only variant that dominates both v01 and v05 on both metrics. The three components
are complementary (structure identifies relevant positions, CoT commits to a consistent
interpretation, modal pushes toward discrete peaks) and do not create conflicting instructions.
2. Modal prediction (v18, best exact match at 35.9%)
Asking the model for the most common (modal) response rather than a mean commits it to a discrete
integer peak, producing the highest exact-match rate. The tradeoff is steep: within-±1 drops to
64.7%. It is the right choice only when exact match is the sole criterion. Not dominated by
any other variant: even v22 loses to it on exact match (32.5% vs 35.9%).
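The mechanism behind this tradeoff can be shown on a toy distribution. The sketch below is illustrative only (hypothetical responses, not ANES data): when a cluster's responses have a plurality at one extreme but most mass elsewhere, predicting the mode maximizes exact hits while a rounded mean captures more within-±1 coverage:

```python
from collections import Counter

# Hypothetical 7-point response distribution for one cluster (illustrative
# only): a plurality answers 1, but most of the mass sits around 4-6.
responses = [1] * 4 + [4] * 3 + [5] * 3 + [6] * 3

def scores(pred, truth):
    n = len(truth)
    exact = sum(t == pred for t in truth) / n
    within1 = sum(abs(t - pred) <= 1 for t in truth) / n
    return exact, within1

mode = Counter(responses).most_common(1)[0][0]          # 1
rounded_mean = round(sum(responses) / len(responses))   # 49/13 ≈ 3.77 -> 4

# Mode wins exact match (4/13 vs 3/13); the rounded mean wins
# within-±1 (6/13 vs 4/13) by sitting inside the central mass.
mode_exact, mode_within1 = scores(mode, responses)
mean_exact, mean_within1 = scores(rounded_mean, responses)
```

This mirrors the v18 result: committing to a discrete peak buys exact hits at the cost of coverage.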
3. Rich scale descriptions (v19, +2.8 pts)
Spelling out every integer on the response scale ("1 = Primarily solve underlying problems of
racism and police violence … 7 = Use all available force") reduced ambiguity about what each
scale point means. The model could map the cluster's ideological profile more precisely to the
available response options.
4. Archetype reference table (v15, +1.5 pts)
Including a lookup table of typical crime stances by ideology (hard progressive → urban_unrest=1-2,
etc.) gave the model a concrete anchor. Combined with strong ideology inference from the cluster
name, this variant scored well without requiring any chain-of-thought.
What Didn't Work
1. "Don't moderate" instruction (v14, worst within-±1)
The explicit instruction to "predict extreme responses if data shows extreme views" backfired.
Rather than pushing predictions toward the true distribution, it pushed them toward the extremes
of the scale, which are often not where respondents actually cluster. The instruction was
taken too literally.
2. Crime-adjacent positions only (v04, -5.5 pts vs baseline)
Filtering to only the 13 most crime-relevant variables removed contextual signal that the model
apparently uses to triangulate ideological position. The full 43-variable context seems to help
even when variables aren't directly crime-related, as they constrain the ideological inference.
3. Natural language position descriptions (v08, -6.5 pts)
Converting numeric means to verbal descriptors ("strongly at high end") introduced ambiguity.
The same phrase means different things for different scale directions, and the model was
not given enough context to resolve the mapping. Raw numbers are unambiguous.
4. Deviation framing (v09, -6.6 pts)
Showing positions as "X pts above/below average" degraded performance. This representation
provides less absolute information than raw means, and the model can't easily reconstruct
where on the scale the cluster sits. Relative framing requires the model to mentally
add back a baseline it was never given.
5. Name + demographics only (v17, -7.1 pts)
Using only the cluster label and basic demographics confirms that the policy position data is
genuinely load-bearing. Even a descriptive name like "Hard-Right Nationalists" is less
informative than the actual profile: the model's label-based prior is noisier than the data.
The Exact Match vs. Within-±1 Tradeoff
There is a genuine tension between the two metrics, and only two variants are non-dominated:
- Best overall: v22 (32.5% exact, 76.7% within-±1). Dominates both v01 and v05 on both metrics. Best balance of both objectives.
- Best exact match: v18 (35.9%). Asking for the modal response commits the model to a discrete peak, producing the most exact hits but at a large cost to within-±1 coverage (64.7%).
Note that within-±1 is less discriminating on narrow scales. On the death penalty question (1–4),
a uniform random predictor already achieves ~57% within-±1, so high scores there carry less weight.
The urban unrest question (1–7) is where within-±1 meaningfully separates the designs.
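The non-dominated set can be checked mechanically from the (exact, within-±1) pairs reported above. A minimal sketch (only the variants whose pairs appear in this report are included):

```python
# (exact %, within-±1 %) pairs from the report; higher is better on both axes.
variants = {
    "v01": (32.1, 74.4),
    "v05": (30.6, 76.1),
    "v18": (35.9, 64.7),
    "v21": (27.6, 70.4),
    "v22": (32.5, 76.7),
}

def pareto_front(points):
    """Variants not beaten on both metrics simultaneously by any other."""
    front = []
    for name, (e, w) in points.items():
        dominated = any(
            e2 >= e and w2 >= w and (e2 > e or w2 > w)
            for other, (e2, w2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

pareto_front(variants)  # ['v18', 'v22']
```

Every other listed variant is dominated by v22, which beats v01, v05, and v21 on both axes at once.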
The Combination Attempts: What Failed and What Worked
v21 merged: CoT visible + two-stage ideology + modal prediction + rich scales + archetype reference
Result: 27.6% exact, 70.4% within-±1, MAE 1.225. Dominated by v22 on both metrics.
Combining five techniques introduced conflicting instructions: modal framing clashes with
chain-of-thought reasoning; ideology labels and archetype tables provide redundant anchors
the model weights inconsistently. Prompt features do not compose cleanly.
v22 clean combo: 4-step reasoning + visible CoT + modal prediction
Result: 32.5% exact, 76.7% within-±1, MAE 1.005. New best overall.
Restricting the combination to only the three non-dominated ingredients (v01, v05, v18)
avoids the conflicting instructions that sank v21. Each component is complementary:
structure selects relevant positions, CoT commits to interpretation, modal targets discrete peaks.
Final decision: Deploy v22 (clean combo). It dominates
v05 on both metrics (76.7% vs 76.1% within-±1; 32.5% vs 30.6% exact) while preserving the
visible reasoning structure that transfers naturally to conversational contexts.
v18 remains the choice when exact match is the sole criterion.
Implications for the Chat Feature
The v05 CoT reasoning design has been integrated into both the chat and
deliberation prompts. Rather than asking personas to announce their reasoning,
the instruction guides them to internally work through their positions before speaking, then
open with natural conversational openers like "Look, the way I see it…" or "Honestly, this comes
down to…". This preserves the calibration benefit of explicit reasoning without breaking the
conversational register of the chat feature.
All 22 Prompt Designs
Each entry shows the system prompt and the key structural elements of the user prompt. Policy position lists are abbreviated (all variants use the same 43-variable ANES dataset unless otherwise noted).
Appendix: Prompts for Selected Variants
Full prompt designs for the two non-dominated variants plus the three ingredients that compose
the clean combo, as compared in the Validation Report.
Ingredients of the clean combo (v22)