Diagnostic Architecture · Practitioner Reference

The Ledger Health Check: statistical validation & stage mapping

24 questions → 8 categories → 2 ledgers → 4 measurable stages → 1 coupling gap. Complete scoring architecture, disparity analysis, and cycle localization.

IAXAI.ai · FourthPillar LLC · February 2026
Part I
The Structural Map
1.1 — The Architecture

How 24 questions serve four analytical objectives

The Health Check needs to accomplish four things simultaneously from a single survey administration. Each objective requires a different analytical lens on the same 24 data points.

The key design constraint: a single 24-question survey, administrable via any standard forms tool (Google Forms, Microsoft Forms, Typeform, SurveyMonkey), must produce all four analyses. No custom tooling required for data collection. Analysis can be done in a spreadsheet.

1.2 — The Triple Hierarchy

Questions nest three ways simultaneously

Each question belongs to three hierarchies at once. This is the structural innovation that allows one survey to serve four objectives.

Hierarchy 1: Ledger (2 groups of 12)

The primary diagnostic axis. Reality Ledger (R1–R12) measures shared truth. Delivery Ledger (D1–D12) measures owned action. The gap between them reveals the coupling state.

Hierarchy 2: Category (8 groups of 3)

The granular diagnostic. Each ledger contains 4 categories of 3 questions each. Categories identify which domain within a ledger is weakest — is it the facts, the tradeoffs, the ownership, or the authority?

Hierarchy 3: IAXAI Stage (4 groups of 6)

The cycle localization axis. By pairing two categories per stage, we can determine where in the I→A→X→A→I cycle execution breaks. This is the hierarchy that no other diagnostic provides.

I — Insight → R1–R3 + R7–R9
A — Alignment → R4–R6 + R10–R12
X — eXecution → D1–D3 + D4–D6
A — Accountability → D7–D9 + D10–D12
I — Intelligence → inferred from Δ
Part II
The Complete Question Map
2.1 — All 24 Questions, Triple-Tagged

Every question, every hierarchy, every stage

Scale: 1 = Never true, 2 = Rarely, 3 = Sometimes, 4 = Usually, 5 = Always true

ID | Question | Ledger | Category | Stage
R1 | When a problem surfaces, all stakeholders are working from the same set of facts. | Reality | Shared Facts | Insight
R2 | Data and status updates reach the people who need them without someone having to chase it down. | Reality | Shared Facts | Insight
R3 | New team members can find out what is actually happening without relying on tribal knowledge. | Reality | Shared Facts | Insight
R7 | Resource limitations (time, money, people) are acknowledged openly, not quietly absorbed. | Reality | True Constraints | Insight
R8 | Deadlines reflect actual capacity, not aspirational thinking. | Reality | True Constraints | Insight
R9 | When something is not going to work, people say so before it fails — not after. | Reality | True Constraints | Insight
R4 | Tradeoffs are stated out loud before decisions are made — not discovered after. | Reality | Honest Tradeoffs | Alignment
R5 | People feel safe raising bad news or contradicting the prevailing narrative. | Reality | Honest Tradeoffs | Alignment
R6 | When two priorities conflict, the organization resolves it explicitly rather than pretending both will get done. | Reality | Honest Tradeoffs | Alignment
R10 | Reports to leadership reflect what is actually happening, not a polished version of it. | Reality | No Spin | Alignment
R11 | The story told to investors, the board, or external partners matches internal reality. | Reality | No Spin | Alignment
R12 | People do not have to translate between what is said and what is meant in this organization. | Reality | No Spin | Alignment
D1 | Every active initiative has a single person who owns the outcome — not just the tasks. | Delivery | Explicit Ownership | eXecution
D2 | When something goes wrong, it is clear who is accountable without a blame conversation. | Delivery | Explicit Ownership | eXecution
D3 | Ownership is assigned at the start of work, not figured out as things unfold. | Delivery | Explicit Ownership | eXecution
D4 | People with accountability also have the authority to make decisions in their domain. | Delivery | Clear Authority | eXecution
D5 | A decision made by the right person stays decided — it does not get relitigated. | Delivery | Clear Authority | eXecution
D6 | Managers do not need to escalate routine decisions; they have real decision rights. | Delivery | Clear Authority | eXecution
D7 | It is clear which decisions require group input and which are made by an individual. | Delivery | Decision Rights | Accountability
D8 | Meetings end with explicit next steps and named owners, not vague consensus. | Delivery | Decision Rights | Accountability
D9 | Cross-team decisions have a defined process — they do not require a leader to broker every time. | Delivery | Decision Rights | Accountability
D10 | The same fire does not have to be fought more than once. | Delivery | Sustainable Rhythm | Accountability
D11 | Leaders can take time off without the system stalling. | Delivery | Sustainable Rhythm | Accountability
D12 | The pace of work is one the team can maintain for the next twelve months. | Delivery | Sustainable Rhythm | Accountability
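The triple tagging reduces to a small lookup table. A minimal sketch in Python (the `QUESTION_MAP` and `questions_for` names are illustrative, not part of the instrument):

```python
# Map each question ID to its (ledger, category, stage) triple,
# following the table above. Categories hold 3 questions; stages hold 6.
QUESTION_MAP = {
    **{q: ("Reality", "Shared Facts", "Insight") for q in ("R1", "R2", "R3")},
    **{q: ("Reality", "True Constraints", "Insight") for q in ("R7", "R8", "R9")},
    **{q: ("Reality", "Honest Tradeoffs", "Alignment") for q in ("R4", "R5", "R6")},
    **{q: ("Reality", "No Spin", "Alignment") for q in ("R10", "R11", "R12")},
    **{q: ("Delivery", "Explicit Ownership", "eXecution") for q in ("D1", "D2", "D3")},
    **{q: ("Delivery", "Clear Authority", "eXecution") for q in ("D4", "D5", "D6")},
    **{q: ("Delivery", "Decision Rights", "Accountability") for q in ("D7", "D8", "D9")},
    **{q: ("Delivery", "Sustainable Rhythm", "Accountability") for q in ("D10", "D11", "D12")},
}

def questions_for(stage: str) -> list[str]:
    """Return the six question IDs mapped to a given stage."""
    return [q for q, (_, _, s) in QUESTION_MAP.items() if s == stage]
```

One structure serves all three hierarchies: filter on the first element for ledger scores, the second for category scores, the third for stage scores.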
2.2 — Why This Mapping

The conceptual logic of the stage assignments

Insight — "Can you see what's real?"

Shared Facts + True Constraints → 6 questions

Insight is about whether the raw material of shared reality exists. Do people have access to the same facts (R1–R3)? Are the actual limitations visible rather than hidden (R7–R9)? If these score low, the organization cannot begin the cycle — it's operating on divergent or aspirational versions of reality. The diagnosis hasn't happened yet.

Alignment — "Do you agree on what it means?"

Honest Tradeoffs + No Spin → 6 questions

Alignment is about whether shared reality is agreed upon and honest. Are tradeoffs named explicitly (R4–R6)? Does what's reported internally match what's communicated externally (R10–R12)? If these score low, the facts may exist but they haven't been processed into shared commitments. The organization sees reality but hasn't converged on what to do about it.

eXecution — "Has truth converted to commitment?"

Explicit Ownership + Clear Authority → 6 questions

eXecution is the coupling point — the trunk of the tree. Has shared, agreed-upon reality been converted into named, empowered ownership? Do owners exist (D1–D3) and do they have the authority to act (D4–D6)? If these score low while Reality scores high, the organization is in Paralysis — strong roots, wilting canopy. The coupling is broken at the handoff.

Accountability — "Does the commitment deliver?"

Decision Rights + Sustainable Rhythm → 6 questions

Accountability is whether the delivery system functions under load. Is the decision-making process itself clear (D7–D9)? Can the system sustain without heroic effort (D10–D12)? If these score low while eXecution scores adequately, ownership was assigned but the system for maintaining and enforcing it doesn't hold. The canopy grew but can't sustain itself.

Intelligence — "Did the system learn?"

Not directly measured — inferred from longitudinal change

Intelligence is the meta-stage. It has no dedicated questions because it examines the coupling itself — the health of the loop. It is measured by change across administrations: did the coupling gap narrow? Did the weakest stage improve? Did variance decrease? Intelligence is the delta, not the snapshot. It's why we re-measure.

Part III
Scoring Architecture
3.1 — The Numbers

Every score the instrument produces

Level | Components | Range | What It Reveals
Overall | All 24 questions | 24–120 | Gross system health. Screening measure.
Ledger (×2) | 12 questions each | 12–60 | Which half of the coupled system is weaker.
Category (×8) | 3 questions each | 3–15 | Which domain within a ledger is weakest.
Stage (×4) | 6 questions each | 6–30 | Where in I→A→X→A the cycle breaks.
Coupling Gap | |Reality% – Delivery%| | 0–100% | The balance between the two systems.

Interpretation thresholds (percentage of max)

Range | Level | Meaning
80–100% | Strong | System is well-designed. Monitor for drift.
60–79% | Moderate | Real strengths but clear gaps. Debt accumulating in specific areas.
40–59% | Needs Work | System is under-designed. Leadership compensating for structural gaps.
Below 40% | Critical | System significantly incomplete. Leadership exhaustion is systemic.

These thresholds apply at every level: overall, ledger, category, and stage. A category at 80%+ with another at 40% tells a sharper story than the ledger average alone.
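Applied programmatically, the thresholds collapse into one banding function. A sketch (the function name is mine):

```python
def interpretation_level(pct: float) -> str:
    """Band a score (as % of its maximum) into the four interpretation
    levels. Applies at every level of the hierarchy: overall, ledger,
    category, and stage."""
    if pct >= 80:
        return "Strong"
    if pct >= 60:
        return "Moderate"
    if pct >= 40:
        return "Needs Work"
    return "Critical"
```

A category score of 12/15 is 80%, so it bands as "Strong"; the same function applied to a ledger score of 30/60 returns "Needs Work".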

3.2 — Failure Mode Determination

Three failure modes from two scores

Failure Mode | Condition | Root/Canopy Read | Primary Intervention
Paralysis | Reality ≥ 60%, Delivery < 60% | Strong roots, wilting canopy | Start at eXecution stage — assign ownership, match authority
Chaos | Reality < 60%, Delivery ≥ 60% | Big canopy, shallow roots | Start at Insight stage — establish shared facts before acting
Firefighting | Both < 50% | Both systems degraded | Start at coupling — both ledgers simultaneously
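The table reduces to two comparisons. A sketch in Python; note that the conditions as stated leave one region undefined (both ledgers between 50% and 60%), which this sketch returns as None rather than guessing:

```python
def failure_mode(reality_pct: float, delivery_pct: float):
    """Classify the three failure modes from the two ledger percentages.

    Returns None for combinations the table does not cover: both
    ledgers healthy (>= 60%), or both in the 50-60% band.
    """
    if reality_pct < 50 and delivery_pct < 50:
        return "Firefighting"
    if reality_pct >= 60 and delivery_pct < 60:
        return "Paralysis"
    if reality_pct < 60 and delivery_pct >= 60:
        return "Chaos"
    return None
```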
3.3 — Stage Localization

Objective 4: pinpointing where the cycle breaks

This is the analysis no other diagnostic produces. By computing a score for each of the four measurable stages, we identify where specifically the execution cycle fails.

── Stage Scores (individual or team mean) ──

Insight_Score = (R1 + R2 + R3 + R7 + R8 + R9) / 30 × 100
Alignment_Score = (R4 + R5 + R6 + R10 + R11 + R12) / 30 × 100
Execution_Score = (D1 + D2 + D3 + D4 + D5 + D6) / 30 × 100
Accountability_Score = (D7 + D8 + D9 + D10 + D11 + D12) / 30 × 100

── Weakest Stage = Primary Failure Point ──

Failure_Stage = stage with min(Insight, Alignment, Execution, Accountability)

── Stage Gap = difference between strongest and weakest ──

Stage_Gap = max(all stages) − min(all stages)
→ Gap > 20pts: the cycle is breaking at a specific point
→ Gap < 10pts: degradation is distributed, not localized
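The stage formulas above translate directly to code. A sketch, assuming `responses` maps question IDs ('R1' through 'D12') to 1–5 answers; `STAGE_QUESTIONS` is my name for the mapping:

```python
STAGE_QUESTIONS = {
    "Insight": ["R1", "R2", "R3", "R7", "R8", "R9"],
    "Alignment": ["R4", "R5", "R6", "R10", "R11", "R12"],
    "eXecution": ["D1", "D2", "D3", "D4", "D5", "D6"],
    "Accountability": ["D7", "D8", "D9", "D10", "D11", "D12"],
}

def stage_profile(responses: dict) -> dict:
    """Compute the four stage scores (% of the 30-point max), the
    weakest stage (primary failure point), and the stage gap."""
    scores = {
        stage: sum(responses[q] for q in qs) / 30 * 100
        for stage, qs in STAGE_QUESTIONS.items()
    }
    weakest = min(scores, key=scores.get)  # the stage name, not its score
    gap = max(scores.values()) - min(scores.values())
    return {"scores": scores, "failure_stage": weakest, "stage_gap": gap,
            "localized": gap > 20}  # gap > 20 pts: breaks at a specific point
```

The same function works on an individual's responses or on a dict of per-question team means.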

Reading the stage profile

Profile Pattern | Diagnosis | Intervention Entry Point
I low, A/X/A moderate+ | The organization can't see clearly. Facts are siloed, constraints hidden. Downstream stages are working from incomplete reality. | Begin at Insight: kill competing data sources, surface true constraints, establish single source of truth.
I adequate, A low, X/A moderate+ | Facts exist but aren't agreed upon. Tradeoffs are implicit. Internal narrative differs from external. Reality is available but not shared. | Begin at Alignment: force-rank priorities, document tradeoffs explicitly, eliminate spin.
I/A adequate, X low, A moderate | The coupling is broken. Shared truth exists but hasn't converted to owned commitment. Classic Paralysis — everyone sees it, no one owns it. | Begin at eXecution: name singular owners, match authority, document decision rights.
I/A/X adequate, A₂ low | Ownership exists but the delivery system can't sustain it. Decisions get relitigated. The same fires recur. Leaders are load-bearing walls. | Begin at Accountability: clarify decision process, break recurring cycles, establish sustainable rhythm.
All low, small gap | Distributed degradation. Firefighting. The leader is the system. | Begin at the coupling — both ledgers simultaneously. Apply all four Operator Rules.
Part IV
Disparity Analysis
4.1 — Objective 1: Executive View (Mode A)

What the individual administration produces

The executive completes all 24 questions alone. No guidance on answers — the value is in their honest perception. This produces:

── Executive Scores ──

Exec_Reality = sum(R1..R12) → score/60 → percentage
Exec_Delivery = sum(D1..D12) → score/60 → percentage
Exec_Overall = Reality + Delivery → score/120
Exec_Gap = |Reality% − Delivery%|
Exec_Failure_Mode = determined by ledger percentages
Exec_Stage[4] = 4 stage scores → weakest = failure point
Exec_Cat[8] = 8 category scores → weakest two = priority focus

After completion, the practitioner and the executive walk through the results together. The first question: "Does this match what you feel in your day-to-day?" Discrepancies between the scored result and the leader's gut feeling are themselves diagnostic. The score shows what the leader believes about the system. The gut shows what the leader experiences. When those diverge, it usually means the leader is compensating for structural gaps without realizing it.

4.2 — Objective 2: Team View (Mode B)

Aggregate perception and internal agreement

Each team member completes the 24 questions independently and anonymously. Responses are aggregated to produce both central tendency (what the team believes) and dispersion (how much they agree).

── Per Question (q = each of 24 questions) ──

Team_Mean[q] = average of all responses for question q
Team_SD[q] = standard deviation of responses for question q
Team_Min[q] = lowest response (the most concerned person)
Team_Max[q] = highest response
Team_Range[q] = Max − Min

── Per Category (c = each of 8 categories) ──

Cat_Mean[c] = mean of 3 constituent question means
Cat_SD[c] = pooled SD across 3 constituent questions

── Per Ledger ──

Team_Reality = sum of question means for R1..R12
Team_Delivery = sum of question means for D1..D12

── Per Stage ──

Stage_Mean[s] = mean of 6 constituent question means / 30 × 100
Stage_SD[s] = pooled SD across 6 constituent questions
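The per-question and per-category aggregates can be computed with the standard library alone. A sketch using sample SD (matching the spreadsheet `STDEV` formulas in Part VI); `rows` is assumed to be one dict per anonymous respondent:

```python
from statistics import mean, stdev

def question_aggregates(rows: list, q: str) -> dict:
    """Per-question team aggregates: mean, sample SD, min, max, range."""
    vals = [r[q] for r in rows]
    return {
        "mean": mean(vals),
        "sd": stdev(vals) if len(vals) > 1 else 0.0,
        "min": min(vals),
        "max": max(vals),
        "range": max(vals) - min(vals),
    }

def category_mean(rows: list, question_ids: list) -> float:
    """Category mean = mean of the constituent question means."""
    return mean(question_aggregates(rows, q)["mean"] for q in question_ids)
```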

The variance finding

The Most Important Metric in Team Mode

Standard deviation per question

Any question where SD > 1.2 (on the 5-point scale) means people experience that aspect of the system fundamentally differently. For context: if half the team answers 2 and half answers 4, SD ≈ 1.0. If the spread is wider — 1s and 5s — SD climbs above 1.4.
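These reference points are straightforward to verify. A quick check using population SD (`pstdev`); note that the spreadsheet `STDEV` sample formula runs slightly higher on small teams:

```python
from statistics import pstdev

# Half the team answers 2, half answers 4: SD of exactly 1.0.
assert pstdev([2, 2, 4, 4]) == 1.0

# A fully bimodal split of 1s and 5s: SD of 2.0, well past 1.4.
assert pstdev([1, 1, 5, 5]) == 2.0
```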

High variance is not a Delivery problem. It is a Reality Ledger failure. People are not seeing the same system. The Health Check has just demonstrated the very failure it's designed to detect — in real time, with their own data.

Sample size considerations

Team Size | Statistical Approach | Notes
N ≥ 8 | Full analysis: means, SDs, perception gaps, all thresholds apply | Preferred. Adequate for parametric statistics on Likert data.
N = 5–7 | Means and SDs valid but interpret cautiously. Flag where N is small. | SD thresholds still useful but single outliers have more influence.
N < 5 | Report medians and ranges rather than means/SDs. Treat as directional. | Too few for reliable variance measures. Use for conversation, not diagnosis.
4.3 — Objective 3: Disparity Analysis (Mode C)

Executive vs. team — and team vs. team

This is the most powerful deployment. The executive takes Mode A. The team takes Mode B. Then we compute two types of disparity.

Disparity Type 1: Perception Gap (Executive vs. Team)

── Per Question ──

Perception_Gap[q] = Exec_Score[q] − Team_Mean[q]

→ Positive gap: leader sees system as healthier than team does
→ Negative gap: leader more critical than team

── Thresholds ──

|Gap| ≥ 2.0 → Critical divergence. Leader and team see different systems.
|Gap| ≥ 1.5 → Notable divergence. Worth investigating.
|Gap| < 1.0 → Reasonable alignment on this dimension.

Four diagnostic patterns

Pattern | Condition | What It Means
Blind Spot | Exec ≥ 4, Team Mean ≤ 3 | The leader believes this works because it works for them. The team experiences a different reality. Most common and most dangerous pattern.
Shared Pain | Both ≤ 3 | Everyone agrees it's broken. Start here. Alignment already exists — move directly to Operator Rules.
False Confidence | Exec ≥ 4, Team SD > 1.2 | Appears functional from the top. Inconsistently experienced at the working level. Leader anchors on the successful instances; team lives the variance.
Unacknowledged Strength | Team Mean > Exec by ≥ 1.5 | Leader carries concern about something the team has already resolved. Frees attention for real gaps.
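The four patterns reduce to threshold checks on three numbers. A sketch; note the conditions are not mutually exclusive (a question can be both a Blind Spot and False Confidence when Exec ≥ 4), so this returns every pattern that fires:

```python
def diagnostic_patterns(exec_score: float, team_mean: float, team_sd: float) -> list:
    """Return every per-question diagnostic pattern whose condition fires."""
    patterns = []
    if exec_score >= 4 and team_mean <= 3:
        patterns.append("Blind Spot")
    if exec_score <= 3 and team_mean <= 3:
        patterns.append("Shared Pain")
    if exec_score >= 4 and team_sd > 1.2:
        patterns.append("False Confidence")
    if team_mean - exec_score >= 1.5:
        patterns.append("Unacknowledged Strength")
    return patterns
```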

Disparity Type 2: Intra-Team Fragmentation

── Per Question ──

Fragmentation[q] = Team_SD[q]

SD > 1.2 → Fragmented perception. People see this differently.
SD 0.8–1.2 → Normal variation. Some disagreement, not structural.
SD < 0.8 → Strong consensus. Team agrees on this dimension.

── Per Category ──

Cat_Fragmentation[c] = mean SD of 3 constituent questions

── Per Stage ──

Stage_Fragmentation[s] = mean SD of 6 constituent questions
→ High stage fragmentation = team doesn't agree on
whether this part of the cycle works. That IS the finding.
Why Variance Matters More Than the Mean

A category mean of 3.5 with SD of 0.5 means "everyone thinks this is mediocre." A category mean of 3.5 with SD of 1.4 means "some people think this is great, some think it's terrible, and the average is meaningless." The second case is a more urgent finding — because the disagreement itself is a Reality Ledger failure. People are experiencing different organizations.

4.4 — The Combined View: Comparative Report Structure

What the facilitated session works from

The Mode C report has five sections, each mapping to a specific conversation the practitioner facilitates.

Section | Content | The Conversation It Opens
1. Overview | Exec scores vs. Team means — overall, per ledger, per stage. Direction and magnitude of each gap. | "Here's what you see. Here's what the team sees. Let's talk about the distance between them."
2. Perception Gap | All 24 questions sorted by gap magnitude. Flagged: gap ≥ 2 (critical), gap ≥ 1.5 (notable). | "These are the specific dimensions where you and your team see different systems."
3. Fragmentation Map | All 24 questions sorted by team SD. Flagged: SD > 1.2 (fragmented). | "These are the dimensions where your team doesn't agree with each other. This is a Reality Ledger failure demonstrated in real time."
4. Stage Profile | Four stage scores (exec and team). Weakest stage highlighted. Stage gap computed. | "This is where the cycle breaks. This is where we start."
5. Priority Actions | If large gap: align before acting. If small gap + low scores: apply Operator Rules directly. Derived from the comparative analysis, not from either score alone. | "Here's the one thing to fix first. Here's who owns it. Here's when we measure again."
Part V
Statistical Validity
5.1 — Measurement Properties

Is the math sound?

Likert scale treatment

The 5-point Likert scale produces ordinal data. The standard practice in organizational survey research — supported by Carifio & Perla (2008), Norman (2010), and Sullivan & Artino (2013) — is to treat 5-point Likert items as interval data when computing means and standard deviations, provided items are aggregated into scales of 3 or more. The Health Check meets this criterion at every level: 3 items per category, 6 per stage, 12 per ledger.

Internal consistency

Each category contains 3 questions measuring the same construct. This is the minimum for computing Cronbach's alpha (α). Target: α ≥ 0.70 per category. Each stage contains 6 questions (two categories), which provides better reliability. After the first 20+ administrations, alpha should be computed per category and per stage. If any category falls below 0.65, the constituent questions may need revision — they may be measuring different constructs.
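Cronbach's alpha needs nothing beyond the standard library. A sketch using sample variance throughout; `item_scores` holds one list per question, each with that question's score for every respondent:

```python
from statistics import variance

def cronbach_alpha(item_scores: list) -> float:
    """Cronbach's alpha for k items: k/(k-1) * (1 - sum of item
    variances / variance of per-respondent totals)."""
    k = len(item_scores)
    totals = [sum(vals) for vals in zip(*item_scores)]  # per-respondent sums
    return k / (k - 1) * (1 - sum(variance(v) for v in item_scores) / variance(totals))
```

Three perfectly correlated items yield α = 1.0; items that vary independently drive α toward 0.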

Why 3 questions per category (not 4 or 5)

Three is the minimum for internal consistency measurement while keeping the total instrument at 24 questions (5-minute administration). Expanding to 4 per category would require 32 questions, increasing completion time to ~7 minutes and introducing fatigue effects. The 24-question design is optimized for executive tolerance — the people who most need to take it are the people with the least patience for surveys.

SD threshold of 1.2

On a 5-point scale, an SD of 1.2 spans 30% of the 4-point range from minimum to maximum. For context:

Team Distribution | Approximate SD | Interpretation
All respondents answer 3 or 4 | ~0.5 | Strong consensus. Minor variation.
Split between 2, 3, and 4 | ~0.8 | Normal variation. Not alarming.
Half answer 2, half answer 4 | ~1.0 | Emerging divergence. Worth noting.
Spread across 1–4 or 2–5 | ~1.2 | Threshold. Fragmented perception.
Bimodal: 1s and 5s | ~1.6+ | Fundamentally different experiences of the same system.

The 1.2 threshold catches meaningful disagreement without being overly sensitive. It fires when the team is genuinely split, not when there's normal variation in perspective.

Perception gap threshold of ≥ 2.0

A 2-point gap on a 5-point scale separates the executive and the team mean by half the 4-point range. This is a strong signal: the executive answering "Usually" while the team mean is "Rarely." The 1.5 threshold is flagged as "notable" — enough to investigate, not enough to alarm.

The coupling gap

Computed as |Reality% – Delivery%|, where each percentage is the ledger score divided by its maximum (60). Using percentages rather than raw scores normalizes for the fact that both ledgers have identical scales, making the gap directly interpretable.

Coupling_Gap = |Reality_Score/60 − Delivery_Score/60| × 100

── Interpretation ──

Gap < 10% → Ledgers are reasonably balanced. Focus on overall level.
Gap 10–20% → Imbalance emerging. One ledger pulling ahead or behind.
Gap > 20% → Single-ledger plateau territory. The stronger ledger was treated. The weaker is pulling it back.
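As a function, with the interpretation bands attached (the band labels are mine, paraphrasing the thresholds above):

```python
def coupling_gap(reality_score: int, delivery_score: int):
    """Coupling gap in percentage points between the two ledgers,
    each normalized to its 60-point maximum, plus its band."""
    gap = abs(reality_score / 60 - delivery_score / 60) * 100
    if gap < 10:
        band = "balanced"
    elif gap <= 20:
        band = "imbalance emerging"
    else:
        band = "single-ledger plateau territory"
    return gap, band
```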
5.2 — What the Instrument Does NOT Measure

Boundaries of the diagnostic

Intellectual honesty about what the Health Check doesn't do:

Limitations

It measures perception, not objective reality. The Health Check measures how people experience the system, not whether the system is objectively well-designed. This is a feature, not a bug — perception IS the operational reality. But it means the scores can be influenced by recent events (recency bias) or by organizational mood.

It doesn't measure the Intelligence stage directly. Intelligence (the learning loop) is inferred from longitudinal change, not from a single administration. One snapshot can localize the failure. Only repeated measurement can tell you whether the system is learning.

3 questions per construct is the minimum, not ideal. With 3 items, one poorly understood question can skew a category score significantly. This is acceptable for a 5-minute diagnostic but means individual category scores should be treated as directional, not precise. Stage scores (6 items) and ledger scores (12 items) are more reliable.

It doesn't explain causality. The Health Check identifies where the system is weak. It doesn't explain why. That's what the practitioner conversation is for — the facilitated session after the data is presented. The instrument creates the diagnostic map. The practitioner reads it.

Part VI
Implementation
6.1 — Survey Setup

Any standard forms tool — nothing custom required

Form structure

Field | Type | Purpose
Role identifier | Dropdown: "Executive" / "Team Member" | Separates Mode A from Mode B in the same data set. Enables comparative analysis without separate forms.
R1 through R12 | 5-point scale (radio or slider) | Reality Ledger questions. Labeled: Never / Rarely / Sometimes / Usually / Always
D1 through D12 | 5-point scale (radio or slider) | Delivery Ledger questions. Same labels.

Total: 25 fields (1 role identifier + 24 Likert items). No open-text fields. No conditional logic. Any forms tool that supports radio buttons can run this.

Administration sequence

Present Reality Ledger questions first (R1–R12), then Delivery Ledger (D1–D12). Within each ledger, present in category order. Do not randomize — the conceptual flow from Shared Facts → Honest Tradeoffs → True Constraints → No Spin creates a natural progression that aids honest reflection. Same for Delivery: Ownership → Authority → Decision Rights → Sustainable Rhythm builds from "who owns it" to "can the system sustain."

Framing language (pre-survey)

"This is a system diagnostic, not a performance review. Answer about the system, not about any individual. There are no right answers — only your honest experience of how the organization operates day to day."

6.2 — Spreadsheet Analysis

The formulas — ready to paste

After exporting responses to CSV/Excel, the analysis requires straightforward formulas. Below assumes row 1 = headers, row 2 = executive response, rows 3+ = team responses. Columns B through Y contain the 24 question responses (R1–R12 in B–M, D1–D12 in N–Y).

── LEDGER SCORES (for any individual row) ──

Reality_Score = SUM(B2:M2) ← sum of R1..R12 (row 2 = executive; adjust per row)
Delivery_Score = SUM(N2:Y2) ← sum of D1..D12
Reality_% = Reality_Score / 60 × 100
Delivery_% = Delivery_Score / 60 × 100
Coupling_Gap = ABS(Reality_% − Delivery_%)

── STAGE SCORES ──

Insight = (R1+R2+R3+R7+R8+R9) / 30 × 100
Alignment = (R4+R5+R6+R10+R11+R12) / 30 × 100
Execution = (D1+D2+D3+D4+D5+D6) / 30 × 100
Accountability = (D7+D8+D9+D10+D11+D12) / 30 × 100

── TEAM AGGREGATES (per question column, rows 3+) ──

Team_Mean[q] = AVERAGE(q3:qN)
Team_SD[q] = STDEV(q3:qN)
Team_Min[q] = MIN(q3:qN)
Team_Max[q] = MAX(q3:qN)

── PERCEPTION GAP (per question) ──

Gap[q] = Exec_Score[q] − Team_Mean[q]

── FLAGS ──

Blind_Spot[q] = IF(AND(Exec[q]>=4, Team_Mean[q]<=3), "BLIND SPOT", "")
Shared_Pain[q] = IF(AND(Exec[q]<=3, Team_Mean[q]<=3), "SHARED PAIN", "")
False_Conf[q] = IF(AND(Exec[q]>=4, Team_SD[q]>1.2), "FALSE CONFIDENCE", "")
Fragmented[q] = IF(Team_SD[q]>1.2, "FRAGMENTED", "")
6.3 — The Longitudinal Dimension (Mode D)

Objective 4+: measuring Intelligence through change

When the Health Check is administered a second time (recommended: 6–8 weeks after intervention begins), the Intelligence stage becomes measurable.

── Delta Analysis (T2 − T1) ──

Δ_Stage[s] = Stage_Score_T2[s] − Stage_Score_T1[s]
Δ_Coupling_Gap = Coupling_Gap_T2 − Coupling_Gap_T1
Δ_Variance[q] = Team_SD_T2[q] − Team_SD_T1[q]

── Intelligence Indicators ──

Coupling gap narrowing → system is re-coupling
Weakest stage improving → intervention is correctly targeted
Team variance decreasing → shared reality is strengthening
Perception gap narrowing → exec and team converging on same view

── Warning Signs ──

Coupling gap widening → intervention is single-ledger; plateau incoming
Treated stage improved but
untreated stage declined → coupling degradation; untreated side eroding
Variance increasing → shared reality is fragmenting, not converging
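The delta analysis is elementwise subtraction across snapshots. A sketch, assuming each administration is summarized as a dict of stage scores plus the coupling gap (the structure is illustrative):

```python
def intelligence_deltas(t1: dict, t2: dict) -> dict:
    """T2 − T1 deltas for stage scores and the coupling gap.

    Expects dicts like {"stages": {...}, "coupling_gap": float}.
    Positive stage deltas plus a negative gap delta are the
    'system is learning' signature described above.
    """
    stage_delta = {s: t2["stages"][s] - t1["stages"][s] for s in t1["stages"]}
    gap_delta = t2["coupling_gap"] - t1["coupling_gap"]
    return {
        "stage_delta": stage_delta,
        "coupling_gap_delta": gap_delta,
        "recoupling": gap_delta < 0,  # gap narrowing → system re-coupling
    }
```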

This is where the palindrome earns its name. Intelligence (the measured change) feeds the next Insight (what to diagnose next). The cycle either compounds or it doesn't — and the delta data tells you which.

Part VII
Integrity Check
7.1 — Does the Instrument Deliver All Four Objectives?

Validation matrix

Objective | Required Data | Produced By | Status
1. Executive View | Individual scores: overall, ledger, category, stage, failure mode | Mode A: single administration, standard scoring | ✓ Complete
2. Team View | Aggregate: means, SDs, medians per question/category/ledger/stage | Mode B: anonymous team, standard aggregation | ✓ Complete
3a. Exec vs. Team | Perception gaps, blind spots, shared pain, false confidence | Mode C: comparative analysis (gap = exec − team mean) | ✓ Complete
3b. Intra-Team | Per-question SD, fragmentation flags, consensus mapping | Mode B analysis: SD per question, threshold flagging | ✓ Complete
4. Cycle Localization | Stage scores, stage profile, weakest stage identification | Stage mapping: 6 questions per stage, min-stage = failure point | ✓ Complete
4+. Intelligence | Deltas across administrations: stage, coupling gap, variance | Mode D: longitudinal comparison (T2 − T1) | ✓ Complete
7.2 — Open Questions for Field Testing

What we'll learn from the first 5 administrations

Validate After First Deployments

Internal consistency. Compute Cronbach's alpha per category after N ≥ 20 individual responses. If any category α < 0.65, the constituent questions may be measuring different things and need revision.

Stage mapping validity. Does the stage localization match the practitioner's independent diagnosis? If the instrument says "stuck at Alignment" but the practitioner's field reading says "stuck at eXecution," either the mapping needs adjustment or the practitioner has a blind spot.

Threshold calibration. The 60% boundary between failure modes, the 1.2 SD threshold for fragmentation, and the 2.0 perception gap threshold are all reasonable starting points but should be refined with empirical data. They may need adjustment by industry (healthcare vs. financial services) or by organization size.

Question clarity. Any question where the team's SD is consistently high across multiple organizations may be poorly worded — people might be interpreting it differently, not experiencing the system differently. R5 ("People feel safe raising bad news") is the most likely candidate — "safe" means different things in different cultures.