
Prompt Optimization: Cluster-Level LLM Crime Prediction

Goal: Find the prompt design that best predicts crime policy survey responses for 15 voter clusters
Model: claude-sonnet-4-5-20250929 (temperature=0)
Setup: 22 prompt variants × 15 clusters = 330 API calls. Each variant's predictions applied to all 4,670 ANES respondents with complete crime data.
Held-out questions: Urban unrest (1–7), death penalty (1–4), crime spending (1–5)
Primary metric: Exact match accuracy (did the model predict the exact integer?)
Date: March 24, 2026

Key Finding: Two Non-Dominated Designs. Clean Combo Best Overall, Modal Best on Exact Match

Across 22 variants, two designs stand out: v22 (clean combo) and v18 (modal prediction). In the two-way scatter of exact match vs. within-±1, no other variant beats either of them on both metrics simultaneously.

On the within-±1 metric, a caveat applies. For questions with a narrow response scale (specifically death penalty, 1–4), being within ±1 of any response is nearly automatic: a uniform random prediction on a 4-point scale has a better than 50% chance of landing within one point of the true answer. Within-±1 is most informative on the 7-point urban unrest scale, where chance performance is lower. Exact match is therefore the more discriminating metric across all three questions.
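The chance baselines are easy to verify. A minimal sketch, assuming both the prediction and the true answer are uniform over the scale (the report's own baseline figures may instead use the empirical response distribution, which shifts the numbers slightly):

```python
from itertools import product

def chance_within_one(k: int) -> float:
    """Probability that a uniform random prediction lands within +/-1 of a
    uniformly distributed true answer on a k-point scale."""
    pairs = list(product(range(1, k + 1), repeat=2))
    hits = sum(1 for pred, true in pairs if abs(pred - true) <= 1)
    return hits / len(pairs)

print(chance_within_one(4))  # 4-point death penalty scale -> 0.625
print(chance_within_one(7))  # 7-point urban unrest scale -> ~0.388
```

The 4-point baseline (62.5%) is why high within-±1 scores on the death penalty question carry little signal, while the 7-point baseline (~39%) leaves room for prompts to separate.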

The clean combo works because its three components are complementary rather than conflicting: the 4-step structure identifies which positions matter, the visible CoT commits the model to a consistent interpretation before predicting, and the modal instruction pushes predictions toward discrete integer peaks rather than rounded means. By contrast, earlier combination attempts (v21) stacked redundant or conflicting techniques (two-stage ideology, archetypes, and rich scales on top of CoT and modal) and underperformed all three non-dominated variants.

Deployment choice: v22 (clean combo) should replace v05 in the live chat. It dominates v05 on both metrics (76.7% vs 76.1% within-±1; 32.5% vs 30.6% exact) while preserving the visible reasoning structure that transfers naturally to the conversational context. v18 remains the best option when exact match is the sole criterion.

Counterintuitively, trimming or restating the context is not always better: filtering to only crime-adjacent positions (v04) and converting numbers to natural language (v08) both hurt performance. And the explicit "don't moderate" instruction (v14) improves exact match slightly (31.3%) but degrades within-±1.

Overview: Within-±1 vs. Exact Match

Each bubble is one prompt variant. X-axis = exact match accuracy (did the model hit the right integer?). Y-axis = within-±1 accuracy (was the model off by at most 1 point?). Bubble size = inverse MAE (larger = lower error).
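All three metrics reduce to simple per-respondent comparisons once each cluster's prediction is broadcast to its members. A minimal sketch of the scoring step (variable names and the tiny example data are illustrative, not from the actual pipeline):

```python
def score(preds: list[int], truths: list[int]) -> dict[str, float]:
    """Exact match, within-±1, and MAE over paired predicted/true answers."""
    n = len(truths)
    exact = sum(p == t for p, t in zip(preds, truths)) / n
    within1 = sum(abs(p - t) <= 1 for p, t in zip(preds, truths)) / n
    mae = sum(abs(p - t) for p, t in zip(preds, truths)) / n
    return {"exact": exact, "within1": within1, "mae": mae}

# Each respondent inherits their cluster's predicted answer:
cluster_pred = {0: 3, 1: 6}                     # hypothetical variant output
respondents = [(0, 3), (0, 2), (1, 6), (1, 4)]  # (cluster_id, true_answer)
preds = [cluster_pred[c] for c, _ in respondents]
truths = [t for _, t in respondents]
print(score(preds, truths))  # -> {'exact': 0.5, 'within1': 0.75, 'mae': 0.75}
```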

Per-Question Breakdown: Within-±1 Accuracy

Within-±1 accuracy for each variant, split by crime question. Urban unrest (1–7 scale) is the hardest to predict; crime spending (1–5) the easiest.

Per-Question Breakdown: Exact Match Accuracy

Exact match accuracy for each variant, split by crime question. This is the more discriminating metric: within-±1 is nearly automatic on narrow scales like death penalty (1–4), while exact match separates variants even on the wide urban unrest scale (1–7).

Full Rankings

Sorted by composite within-±1 accuracy (average across the 3 questions). The primary metric for the deployment decision is exact match; see the analysis below.

# | Variant | Design | Category | Within-±1 | Exact | MAE

Analysis

What Worked

1. Clean combo (v22, +2.3 pts within-±1 over baseline, best overall)
Combining v01's 4-step reasoning structure with v05's visible CoT and v18's modal instruction achieves the best within-±1 (76.7%) and second-best exact match (32.5%) simultaneously. This is the only variant that dominates both v01 and v05 on both metrics. The three components are complementary (structure identifies relevant positions, CoT commits to a consistent interpretation, modal pushes toward discrete peaks) and do not create conflicting instructions.
2. Modal prediction (v18, best exact match at 35.9%)
Asking the model for the most common (modal) response rather than a mean commits it to a discrete integer peak, producing the highest exact-match rate. The tradeoff is steep: within-±1 drops to 64.7%. It is the best choice only when exact match is the sole criterion. Still, it is not dominated by any other variant: v22 loses to it on exact match (32.5% vs 35.9%).
3. Rich scale descriptions (v19, +2.8 pts)
Spelling out every integer on the response scale ("1 = Primarily solve underlying problems of racism and police violence … 7 = Use all available force") reduced ambiguity about what each scale point means. The model could map the cluster's ideological profile more precisely to the available response options.
4. Archetype reference table (v15, +1.5 pts)
Including a lookup table of typical crime stances by ideology (hard progressive → urban_unrest=1-2, etc.) gave the model a concrete anchor. Combined with strong ideology inference from the cluster name, this variant scored well without requiring any chain-of-thought.
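The v15 archetype table amounts to a small ideology-to-typical-range lookup rendered into the prompt. A hypothetical fragment in that style (only the hard-progressive urban_unrest range comes from the report; the other entries are illustrative):

```python
# Hypothetical archetype reference table in the style of v15. Only the
# hard-progressive urban_unrest range (1-2) is taken from the report itself;
# the remaining ranges are made up for illustration.
ARCHETYPES = {
    "hard progressive": {"urban_unrest": (1, 2), "death_penalty": (1, 2)},
    "hard conservative": {"urban_unrest": (6, 7), "death_penalty": (3, 4)},
}

def render_archetypes(table: dict) -> str:
    """Render the lookup table as prompt-ready bullet lines."""
    lines = []
    for ideology, stances in table.items():
        stance_txt = ", ".join(f"{q}={lo}-{hi}" for q, (lo, hi) in stances.items())
        lines.append(f"- {ideology}: {stance_txt}")
    return "\n".join(lines)

print(render_archetypes(ARCHETYPES))
```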

What Didn't Work

1. "Don't moderate" instruction (v14, worst within-±1)
The explicit instruction to "predict extreme responses if data shows extreme views" backfired. Rather than pushing predictions toward the true distribution, it pushed them toward the extremes of the scale, which are often not where respondents actually cluster. The instruction was taken too literally.
2. Crime-adjacent positions only (v04, -5.5 pts vs baseline)
Filtering to only the 13 most crime-relevant variables removed contextual signal that the model apparently uses to triangulate ideological position. The full 43-variable context seems to help even when variables aren't directly crime-related, as they constrain the ideological inference.
3. Natural language position descriptions (v08, -6.5 pts)
Converting numeric means to verbal descriptors ("strongly at high end") introduced ambiguity. The same phrase means different things for different scale directions, and the model was not given enough context to resolve the mapping. Raw numbers are unambiguous.
4. Deviation framing (v09, -6.6 pts)
Showing positions as "X pts above/below average" degraded performance. This representation provides less absolute information than raw means, and the model can't easily reconstruct where on the scale the cluster sits. Relative framing requires the model to mentally add back a baseline it was never given.
5. Name + demographics only (v17, -7.1 pts)
Using only the cluster label and basic demographics confirms that the policy position data is genuinely load-bearing. Even a descriptive name like "Hard-Right Nationalists" is less informative than the actual profile: the model's label-based prior is noisier than the data.

The Exact Match vs. Within-±1 Tradeoff

There is a genuine tension between the two metrics, and only two variants are non-dominated: v22 (clean combo) and v18 (modal prediction).

Note that within-±1 is less discriminating on narrow scales. On the death penalty question (1–4), a uniform random predictor already achieves ~57% within-±1, so high scores there carry less weight. The urban unrest question (1–7) is where within-±1 meaningfully separates the designs.
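"Non-dominated" carries the usual Pareto sense here: a variant survives if no other variant is at least as good on both metrics and strictly better on at least one. A sketch of the filter, using the four scores quoted in this report (the other variants' scores are omitted):

```python
def non_dominated(variants: dict[str, tuple[float, float]]) -> set[str]:
    """Keep variants not Pareto-dominated on (exact, within1); higher is better."""
    def dominates(a: tuple, b: tuple) -> bool:
        # a dominates b if it is >= on both metrics and not identical
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return {
        name for name, s in variants.items()
        if not any(dominates(other, s) for other in variants.values())
    }

scores = {  # (exact match, within-±1), as quoted in the report
    "v05": (0.306, 0.761),
    "v18": (0.359, 0.647),
    "v21": (0.276, 0.704),
    "v22": (0.325, 0.767),
}
print(sorted(non_dominated(scores)))  # -> ['v18', 'v22']
```

v05 and v21 drop out because v22 beats them on both metrics; v18 and v22 each win one metric, so neither dominates the other.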

The Combination Attempts: What Failed and What Worked

v21 merged: CoT visible + two-stage ideology + modal prediction + rich scales + archetype reference
Result: 27.6% exact, 70.4% within-±1, MAE 1.225. Dominated by v22 on both metrics. Combining five techniques introduced conflicting instructions: modal framing clashes with chain-of-thought reasoning; ideology labels and archetype tables provide redundant anchors the model weights inconsistently. Prompt features do not compose cleanly.
v22 clean combo: 4-step reasoning + visible CoT + modal prediction
Result: 32.5% exact, 76.7% within-±1, MAE 1.005. New best overall. Restricting the combination to only the three non-dominated ingredients (v01, v05, v18) avoids the conflicting instructions that sank v21. Each component is complementary: structure selects relevant positions, CoT commits to interpretation, modal targets discrete peaks.
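The composition itself is just concatenation of three short instruction blocks. A hypothetical sketch of how the v22 ingredients assemble (the wording below is illustrative; the actual prompt texts are in the appendix):

```python
# Illustrative stand-ins for the three v22 ingredients; the real wording
# is given in the appendix of this report.
FOUR_STEP = (  # v01: 4-step reasoning structure
    "Work through these steps:\n"
    "1. Identify which of the cluster's positions bear on this question.\n"
    "2. Infer the cluster's overall stance on crime.\n"
    "3. Map that stance onto the response scale.\n"
    "4. Pick a single integer.\n"
)
VISIBLE_COT = "Write out your reasoning for each step before answering.\n"  # v05
MODAL = "Predict the MOST COMMON (modal) response in the cluster, not the average.\n"  # v18

def clean_combo_prompt(question: str, scale: str) -> str:
    """Assemble the v22-style prompt: structure + visible CoT + modal target."""
    return f"{FOUR_STEP}{VISIBLE_COT}{MODAL}Question: {question} (scale: {scale})"

print(clean_combo_prompt("urban unrest", "1-7"))
```

The point of the sketch is that the three blocks make non-overlapping demands (what to consider, how to reason, what statistic to report), which is why they compose without the instruction conflicts that sank v21.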

Final decision: Deploy v22 (clean combo). It dominates v05 on both metrics (76.7% vs 76.1% within-±1; 32.5% vs 30.6% exact) while preserving the visible reasoning structure that transfers naturally to conversational contexts. v18 remains the choice when exact match is the sole criterion.

Implications for the Chat Feature

The v05 CoT reasoning design has been integrated into both the chat and deliberation prompts. Rather than asking personas to announce their reasoning, the instruction guides them to internally work through their positions before speaking, then open with natural conversational openers like "Look, the way I see it…" or "Honestly, this comes down to…". This preserves the calibration benefit of explicit reasoning without breaking the conversational register of the chat feature.

All 22 Prompt Designs

Each entry shows the system prompt and the key structural elements of the user prompt. Policy position lists are abbreviated (all variants use the same 43-variable ANES dataset unless otherwise noted).

Appendix: Prompts for Selected Variants

Full prompt designs for the two non-dominated variants plus the three ingredients that compose the clean combo, as compared in the Validation Report.

Ingredients of the clean combo (v22)