Diagnostic Architecture · Practitioner Reference

The Ledger Health Check: statistical validation & stage mapping

24 questions → 8 categories → 2 ledgers → 4 measurable stages → 1 coupling gap. Complete scoring architecture, disparity analysis, and cycle localization.

IAXAI.ai · FourthPillar LLC · February 2026
Part I
The Structural Map
1.1 — The Architecture

How 24 questions serve four analytical objectives

The Health Check needs to accomplish four things simultaneously from a single survey administration. Each objective requires a different analytical lens on the same 24 data points.

The key design constraint: a single 24-question survey, administrable via any standard forms tool (Google Forms, Microsoft Forms, Typeform, SurveyMonkey), must produce all four analyses. No custom tooling required for data collection. Analysis can be done in a spreadsheet.

1.2 — The Triple Hierarchy

Questions nest three ways simultaneously

Each question belongs to three hierarchies at once. This is the structural innovation that allows one survey to serve four objectives.

Hierarchy 1: Ledger (2 groups of 12)

The primary diagnostic axis. Reality Ledger (R1–R12) measures shared truth. Delivery Ledger (D1–D12) measures owned action. The gap between them reveals the coupling state.

Hierarchy 2: Category (8 groups of 3)

The granular diagnostic. Each ledger contains 4 categories of 3 questions each. Categories identify which domain within a ledger is weakest — is it the facts, the tradeoffs, the ownership, or the authority?

Hierarchy 3: IAXAI Stage (4 groups of 6)

The cycle localization axis. By pairing two categories per stage, we can determine where in the I→A→X→A→I cycle execution breaks. This is the hierarchy that no other diagnostic provides.

I — Insight → R1–R3 + R7–R9
A — Alignment → R4–R6 + R10–R12
X — eXecution → D1–D3 + D4–D6
A — Accountability → D7–D9 + D10–D12
I — Intelligence → inferred from Δ
Part II
The Complete Question Map
2.1 — All 24 Questions, Triple-Tagged

Every question, every hierarchy, every stage

Scale: 1 = Never true, 2 = Rarely, 3 = Sometimes, 4 = Usually, 5 = Always true

ID | Question | Ledger | Category | Stage
R1 | When a problem surfaces, all stakeholders are working from the same set of facts. | Reality | Shared Facts | Insight
R2 | Data and status updates reach the people who need them without someone having to chase it down. | Reality | Shared Facts | Insight
R3 | New team members can find out what is actually happening without relying on tribal knowledge. | Reality | Shared Facts | Insight
R7 | Resource limitations (time, money, people) are acknowledged openly, not quietly absorbed. | Reality | True Constraints | Insight
R8 | Deadlines reflect actual capacity, not aspirational thinking. | Reality | True Constraints | Insight
R9 | When something is not going to work, people say so before it fails — not after. | Reality | True Constraints | Insight
R4 | Tradeoffs are stated out loud before decisions are made — not discovered after. | Reality | Honest Tradeoffs | Alignment
R5 | People feel safe raising bad news or contradicting the prevailing narrative. | Reality | Honest Tradeoffs | Alignment
R6 | When two priorities conflict, the organization resolves it explicitly rather than pretending both will get done. | Reality | Honest Tradeoffs | Alignment
R10 | Reports to leadership reflect what is actually happening, not a polished version of it. | Reality | No Spin | Alignment
R11 | The story told to investors, the board, or external partners matches internal reality. | Reality | No Spin | Alignment
R12 | People do not have to translate between what is said and what is meant in this organization. | Reality | No Spin | Alignment
D1 | Every active initiative has a single person who owns the outcome — not just the tasks. | Delivery | Explicit Ownership | eXecution
D2 | When something goes wrong, it is clear who is accountable without a blame conversation. | Delivery | Explicit Ownership | eXecution
D3 | Ownership is assigned at the start of work, not figured out as things unfold. | Delivery | Explicit Ownership | eXecution
D4 | People with accountability also have the authority to make decisions in their domain. | Delivery | Clear Authority | eXecution
D5 | A decision made by the right person stays decided — it does not get relitigated. | Delivery | Clear Authority | eXecution
D6 | Managers do not need to escalate routine decisions; they have real decision rights. | Delivery | Clear Authority | eXecution
D7 | It is clear which decisions require group input and which are made by an individual. | Delivery | Decision Rights | Accountability
D8 | Meetings end with explicit next steps and named owners, not vague consensus. | Delivery | Decision Rights | Accountability
D9 | Cross-team decisions have a defined process — they do not require a leader to broker every time. | Delivery | Decision Rights | Accountability
D10 | The same fire does not have to be fought more than once. | Delivery | Sustainable Rhythm | Accountability
D11 | Leaders can take time off without the system stalling. | Delivery | Sustainable Rhythm | Accountability
D12 | The pace of work is one the team can maintain for the next twelve months. | Delivery | Sustainable Rhythm | Accountability
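The triple tagging reduces to a small lookup table. A minimal sketch in Python (the `QUESTION_MAP` and `questions_for` names are illustrative, not part of the instrument):

```python
# Map each question ID to its (ledger, category, stage) triple,
# following the table above. Categories hold 3 questions; stages hold 6.
QUESTION_MAP = {
    **{q: ("Reality", "Shared Facts", "Insight") for q in ("R1", "R2", "R3")},
    **{q: ("Reality", "True Constraints", "Insight") for q in ("R7", "R8", "R9")},
    **{q: ("Reality", "Honest Tradeoffs", "Alignment") for q in ("R4", "R5", "R6")},
    **{q: ("Reality", "No Spin", "Alignment") for q in ("R10", "R11", "R12")},
    **{q: ("Delivery", "Explicit Ownership", "eXecution") for q in ("D1", "D2", "D3")},
    **{q: ("Delivery", "Clear Authority", "eXecution") for q in ("D4", "D5", "D6")},
    **{q: ("Delivery", "Decision Rights", "Accountability") for q in ("D7", "D8", "D9")},
    **{q: ("Delivery", "Sustainable Rhythm", "Accountability") for q in ("D10", "D11", "D12")},
}

def questions_for(stage: str) -> list[str]:
    """Return the six question IDs mapped to a given stage."""
    return [q for q, (_, _, s) in QUESTION_MAP.items() if s == stage]
```

One structure serves all three hierarchies: filter on the first element for ledger scores, the second for category scores, the third for stage scores.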
2.2 — Why This Mapping

The conceptual logic of the stage assignments

Insight — "Can you see what's real?"

Shared Facts + True Constraints → 6 questions

Insight is about whether the raw material of shared reality exists. Do people have access to the same facts (R1–R3)? Are the actual limitations visible rather than hidden (R7–R9)? If these score low, the organization cannot begin the cycle — it's operating on divergent or aspirational versions of reality. The diagnosis hasn't happened yet.

Alignment — "Do you agree on what it means?"

Honest Tradeoffs + No Spin → 6 questions

Alignment is about whether shared reality is agreed upon and honest. Are tradeoffs named explicitly (R4–R6)? Does what's reported internally match what's communicated externally (R10–R12)? If these score low, the facts may exist but they haven't been processed into shared commitments. The organization sees reality but hasn't converged on what to do about it.

eXecution — "Has truth converted to commitment?"

Explicit Ownership + Clear Authority → 6 questions

eXecution is the coupling point — the trunk of the tree. Has shared, agreed-upon reality been converted into named, empowered ownership? Do owners exist (D1–D3) and do they have the authority to act (D4–D6)? If these score low while Reality scores high, the organization is in Paralysis — strong roots, wilting canopy. The coupling is broken at the handoff.

Accountability — "Does the commitment deliver?"

Decision Rights + Sustainable Rhythm → 6 questions

Accountability is whether the delivery system functions under load. Is the decision-making process itself clear (D7–D9)? Can the system sustain without heroic effort (D10–D12)? If these score low while eXecution scores adequately, ownership was assigned but the system for maintaining and enforcing it doesn't hold. The canopy grew but can't sustain itself.

Intelligence — "Did the system learn?"

Not directly measured — inferred from longitudinal change

Intelligence is the meta-stage. It has no dedicated questions because it examines the coupling itself — the health of the loop. It is measured by change across administrations: did the coupling gap narrow? Did the weakest stage improve? Did variance decrease? Intelligence is the delta, not the snapshot. It's why we re-measure.

Part III
Scoring Architecture
3.1 — The Numbers

Every score the instrument produces

Level | Components | Range | What It Reveals
Overall | All 24 questions | 24–120 | Gross system health. Screening measure.
Ledger (×2) | 12 questions each | 12–60 | Which half of the coupled system is weaker.
Category (×8) | 3 questions each | 3–15 | Which domain within a ledger is weakest.
Stage (×4) | 6 questions each | 6–30 | Where in I→A→X→A the cycle breaks.
Coupling Gap | |Reality% – Delivery%| | 0–100% | The balance between the two systems.

Interpretation thresholds (percentage of max)

Range | Level | Meaning
80–100% | Strong | System is well-designed. Monitor for drift.
60–79% | Moderate | Real strengths but clear gaps. Debt accumulating in specific areas.
40–59% | Needs Work | System is under-designed. Leadership compensating for structural gaps.
Below 40% | Critical | System significantly incomplete. Leadership exhaustion is systemic.

These thresholds apply at every level: overall, ledger, category, and stage. A category at 80%+ with another at 40% tells a sharper story than the ledger average alone.
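Applied programmatically, the thresholds collapse into one banding function. A sketch (the function name is mine):

```python
def interpretation_level(pct: float) -> str:
    """Band a score (as % of its maximum) into the four interpretation
    levels. Applies at every level of the hierarchy: overall, ledger,
    category, and stage."""
    if pct >= 80:
        return "Strong"
    if pct >= 60:
        return "Moderate"
    if pct >= 40:
        return "Needs Work"
    return "Critical"
```

A category score of 12/15 is 80%, so it bands as "Strong"; the same function applied to a ledger score of 30/60 returns "Needs Work".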

3.2 — Failure Mode Determination

Three failure modes from two scores

Failure Mode | Condition | Root/Canopy Read | Primary Intervention
Paralysis | Reality ≥ 60%, Delivery < 60% | Strong roots, wilting canopy | Start at eXecution stage — assign ownership, match authority
Chaos | Reality < 60%, Delivery ≥ 60% | Big canopy, shallow roots | Start at Insight stage — establish shared facts before acting
Firefighting | Both < 50% | Both systems degraded | Start at coupling — both ledgers simultaneously
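The table reduces to two comparisons. A sketch in Python; note that the conditions as stated leave one region undefined (both ledgers between 50% and 60%), which this sketch returns as None rather than guessing:

```python
def failure_mode(reality_pct: float, delivery_pct: float):
    """Classify the three failure modes from the two ledger percentages.

    Returns None for combinations the table does not cover: both
    ledgers healthy (>= 60%), or both in the 50-60% band.
    """
    if reality_pct < 50 and delivery_pct < 50:
        return "Firefighting"
    if reality_pct >= 60 and delivery_pct < 60:
        return "Paralysis"
    if reality_pct < 60 and delivery_pct >= 60:
        return "Chaos"
    return None
```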
3.3 — Stage Localization

Objective 4: pinpointing where the cycle breaks

This is the analysis no other diagnostic produces. By computing a score for each of the four measurable stages, we identify where specifically the execution cycle fails.

── Stage Scores (individual or team mean) ──

Insight_Score = (R1 + R2 + R3 + R7 + R8 + R9) / 30 × 100
Alignment_Score = (R4 + R5 + R6 + R10 + R11 + R12) / 30 × 100
Execution_Score = (D1 + D2 + D3 + D4 + D5 + D6) / 30 × 100
Accountability_Score = (D7 + D8 + D9 + D10 + D11 + D12) / 30 × 100

── Weakest Stage = Primary Failure Point ──

Failure_Stage = stage with min(Insight, Alignment, Execution, Accountability)

── Stage Gap = difference between strongest and weakest ──

Stage_Gap = max(all stages) − min(all stages)
→ Gap > 20pts: the cycle is breaking at a specific point
→ Gap < 10pts: degradation is distributed, not localized
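The stage formulas above translate directly to code. A sketch, assuming `responses` maps question IDs ('R1' through 'D12') to 1–5 answers; `STAGE_QUESTIONS` is my name for the mapping:

```python
STAGE_QUESTIONS = {
    "Insight": ["R1", "R2", "R3", "R7", "R8", "R9"],
    "Alignment": ["R4", "R5", "R6", "R10", "R11", "R12"],
    "eXecution": ["D1", "D2", "D3", "D4", "D5", "D6"],
    "Accountability": ["D7", "D8", "D9", "D10", "D11", "D12"],
}

def stage_profile(responses: dict) -> dict:
    """Compute the four stage scores (% of the 30-point max), the
    weakest stage (primary failure point), and the stage gap."""
    scores = {
        stage: sum(responses[q] for q in qs) / 30 * 100
        for stage, qs in STAGE_QUESTIONS.items()
    }
    weakest = min(scores, key=scores.get)  # the stage name, not its score
    gap = max(scores.values()) - min(scores.values())
    return {"scores": scores, "failure_stage": weakest, "stage_gap": gap,
            "localized": gap > 20}  # gap > 20 pts: breaks at a specific point
```

The same function works on an individual's responses or on a dict of per-question team means.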

Reading the stage profile

Profile Pattern | Diagnosis | Intervention Entry Point
I low, A/X/A moderate+ | The organization can't see clearly. Facts are siloed, constraints hidden. Downstream stages are working from incomplete reality. | Begin at Insight: kill competing data sources, surface true constraints, establish single source of truth.
I adequate, A low, X/A moderate+ | Facts exist but aren't agreed upon. Tradeoffs are implicit. Internal narrative differs from external. Reality is available but not shared. | Begin at Alignment: force-rank priorities, document tradeoffs explicitly, eliminate spin.
I/A adequate, X low, A moderate | The coupling is broken. Shared truth exists but hasn't converted to owned commitment. Classic Paralysis — everyone sees it, no one owns it. | Begin at eXecution: name singular owners, match authority, document decision rights.
I/A/X adequate, A₂ low | Ownership exists but the delivery system can't sustain it. Decisions get relitigated. The same fires recur. Leaders are load-bearing walls. | Begin at Accountability: clarify decision process, break recurring cycles, establish sustainable rhythm.
All low, small gap | Distributed degradation. Firefighting. The leader is the system. | Begin at the coupling — both ledgers simultaneously. Apply all four Operator Rules.
Part IV
Disparity Analysis
4.1 — Objective 1: Executive View (Mode A)

What the individual administration produces

The executive completes all 24 questions alone. No guidance on answers — the value is in their honest perception. This produces:

── Executive Scores ──

Exec_Reality = sum(R1..R12) → score/60 → percentage
Exec_Delivery = sum(D1..D12) → score/60 → percentage
Exec_Overall = Reality + Delivery → score/120
Exec_Gap = |Reality% − Delivery%|
Exec_Failure_Mode = determined by ledger percentages
Exec_Stage[4] = 4 stage scores → weakest = failure point
Exec_Cat[8] = 8 category scores → weakest two = priority focus

After completion, the practitioner and the executive walk through the results together. The first question: "Does this match what you feel in your day-to-day?" Discrepancies between the scored result and the leader's gut feeling are themselves diagnostic. The score shows what the leader believes about the system. The gut shows what the leader experiences. When those diverge, it usually means the leader is compensating for structural gaps without realizing it.

4.2 — Objective 2: Team View (Mode B)

Aggregate perception and internal agreement

Each team member completes the 24 questions independently and anonymously. Responses are aggregated to produce both central tendency (what the team believes) and dispersion (how much they agree).

── Per Question (q = each of 24 questions) ──

Team_Mean[q] = average of all responses for question q
Team_SD[q] = standard deviation of responses for question q
Team_Min[q] = lowest response (the most concerned person)
Team_Max[q] = highest response
Team_Range[q] = Max − Min

── Per Category (c = each of 8 categories) ──

Cat_Mean[c] = mean of 3 constituent question means
Cat_SD[c] = pooled SD across 3 constituent questions

── Per Ledger ──

Team_Reality = sum of question means for R1..R12
Team_Delivery = sum of question means for D1..D12

── Per Stage ──

Stage_Mean[s] = mean of 6 constituent question means / 30 × 100
Stage_SD[s] = pooled SD across 6 constituent questions
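The per-question and per-category aggregates can be computed with the standard library alone. A sketch using sample SD (matching the spreadsheet `STDEV` formulas in Part VI); `rows` is assumed to be one dict per anonymous respondent:

```python
from statistics import mean, stdev

def question_aggregates(rows: list, q: str) -> dict:
    """Per-question team aggregates: mean, sample SD, min, max, range."""
    vals = [r[q] for r in rows]
    return {
        "mean": mean(vals),
        "sd": stdev(vals) if len(vals) > 1 else 0.0,
        "min": min(vals),
        "max": max(vals),
        "range": max(vals) - min(vals),
    }

def category_mean(rows: list, question_ids: list) -> float:
    """Category mean = mean of the constituent question means."""
    return mean(question_aggregates(rows, q)["mean"] for q in question_ids)
```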

The variance finding

The Most Important Metric in Team Mode

Standard deviation per question

Any question where SD > 1.2 (on the 5-point scale) means people experience that aspect of the system fundamentally differently. For context: if half the team answers 2 and half answers 4, SD ≈ 1.0. If the spread is wider — 1s and 5s — SD climbs above 1.4.
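These reference points are straightforward to verify. A quick check using population SD (`pstdev`); note that the spreadsheet `STDEV` sample formula runs slightly higher on small teams:

```python
from statistics import pstdev

# Half the team answers 2, half answers 4: SD of exactly 1.0.
assert pstdev([2, 2, 4, 4]) == 1.0

# A fully bimodal split of 1s and 5s: SD of 2.0, well past 1.4.
assert pstdev([1, 1, 5, 5]) == 2.0
```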

High variance is not a Delivery problem. It is a Reality Ledger failure. People are not seeing the same system. The Health Check has just demonstrated the very failure it's designed to detect — in real time, with their own data.

Sample size considerations

Team Size | Statistical Approach | Notes
N ≥ 8 | Full analysis: means, SDs, perception gaps, all thresholds apply | Preferred. Adequate for parametric statistics on Likert data.
N = 5–7 | Means and SDs valid but interpret cautiously. Flag where N is small. | SD thresholds still useful but single outliers have more influence.
N < 5 | Report medians and ranges rather than means/SDs. Treat as directional. | Too few for reliable variance measures. Use for conversation, not diagnosis.
4.3 — Objective 3: Disparity Analysis (Mode C)

Executive vs. team — and team vs. team

This is the most powerful deployment. The executive takes Mode A. The team takes Mode B. Then we compute two types of disparity.

Disparity Type 1: Perception Gap (Executive vs. Team)

── Per Question ──

Perception_Gap[q] = Exec_Score[q] − Team_Mean[q]

→ Positive gap: leader sees system as healthier than team does
→ Negative gap: leader more critical than team

── Thresholds ──

|Gap| ≥ 2.0 → Critical divergence. Leader and team see different systems.
|Gap| ≥ 1.5 → Notable divergence. Worth investigating.
|Gap| < 1.0 → Reasonable alignment on this dimension.

Four diagnostic patterns

Pattern | Condition | What It Means
Blind Spot | Exec ≥ 4, Team Mean ≤ 3 | The leader believes this works because it works for them. The team experiences a different reality. Most common and most dangerous pattern.
Shared Pain | Both ≤ 3 | Everyone agrees it's broken. Start here. Alignment already exists — move directly to Operator Rules.
False Confidence | Exec ≥ 4, Team SD > 1.2 | Appears functional from the top. Inconsistently experienced at the working level. Leader anchors on the successful instances; team lives the variance.
Unacknowledged Strength | Team Mean > Exec by ≥ 1.5 | Leader carries concern about something the team has already resolved. Frees attention for real gaps.
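The four patterns reduce to threshold checks on three numbers. A sketch; note the conditions are not mutually exclusive (a question can be both a Blind Spot and False Confidence when Exec ≥ 4), so this returns every pattern that fires:

```python
def diagnostic_patterns(exec_score: float, team_mean: float, team_sd: float) -> list:
    """Return every per-question diagnostic pattern whose condition fires."""
    patterns = []
    if exec_score >= 4 and team_mean <= 3:
        patterns.append("Blind Spot")
    if exec_score <= 3 and team_mean <= 3:
        patterns.append("Shared Pain")
    if exec_score >= 4 and team_sd > 1.2:
        patterns.append("False Confidence")
    if team_mean - exec_score >= 1.5:
        patterns.append("Unacknowledged Strength")
    return patterns
```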

Disparity Type 2: Intra-Team Fragmentation

── Per Question ──

Fragmentation[q] = Team_SD[q]

SD > 1.2 → Fragmented perception. People see this differently.
SD 0.8–1.2 → Normal variation. Some disagreement, not structural.
SD < 0.8 → Strong consensus. Team agrees on this dimension.

── Per Category ──

Cat_Fragmentation[c] = mean SD of 3 constituent questions

── Per Stage ──

Stage_Fragmentation[s] = mean SD of 6 constituent questions
→ High stage fragmentation = team doesn't agree on
whether this part of the cycle works. That IS the finding.
Why Variance Matters More Than the Mean

A category mean of 3.5 with SD of 0.5 means "everyone thinks this is mediocre." A category mean of 3.5 with SD of 1.4 means "some people think this is great, some think it's terrible, and the average is meaningless." The second case is a more urgent finding — because the disagreement itself is a Reality Ledger failure. People are experiencing different organizations.

4.4 — The Combined View: Comparative Report Structure

What the facilitated session works from

The Mode C report has five sections, each mapping to a specific conversation the practitioner facilitates.

Section | Content | The Conversation It Opens
1. Overview | Exec scores vs. Team means — overall, per ledger, per stage. Direction and magnitude of each gap. | "Here's what you see. Here's what the team sees. Let's talk about the distance between them."
2. Perception Gap | All 24 questions sorted by gap magnitude. Flagged: gap ≥ 2 (critical), gap ≥ 1.5 (notable). | "These are the specific dimensions where you and your team see different systems."
3. Fragmentation Map | All 24 questions sorted by team SD. Flagged: SD > 1.2 (fragmented). | "These are the dimensions where your team doesn't agree with each other. This is a Reality Ledger failure demonstrated in real time."
4. Stage Profile | Four stage scores (exec and team). Weakest stage highlighted. Stage gap computed. | "This is where the cycle breaks. This is where we start."
5. Priority Actions | If large gap: align before acting. If small gap + low scores: apply Operator Rules directly. Derived from the comparative analysis, not from either score alone. | "Here's the one thing to fix first. Here's who owns it. Here's when we measure again."
Part V
Statistical Validity
5.1 — Measurement Properties

Is the math sound?

Likert scale treatment

The 5-point Likert scale produces ordinal data. The standard practice in organizational survey research — supported by Carifio & Perla (2008), Norman (2010), and Sullivan & Artino (2013) — is to treat 5-point Likert items as interval data when computing means and standard deviations, provided items are aggregated into scales of 3 or more. The Health Check meets this criterion at every level: 3 items per category, 6 per stage, 12 per ledger.

Internal consistency

Each category contains 3 questions measuring the same construct. This is the minimum for computing Cronbach's alpha (α). Target: α ≥ 0.70 per category. Each stage contains 6 questions (two categories), which provides better reliability. After the first 20+ administrations, alpha should be computed per category and per stage. If any category falls below 0.65, the constituent questions may need revision — they may be measuring different constructs.
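Cronbach's alpha needs nothing beyond the standard library. A sketch using sample variance throughout; `item_scores` holds one list per question, each with that question's score for every respondent:

```python
from statistics import variance

def cronbach_alpha(item_scores: list) -> float:
    """Cronbach's alpha for k items: k/(k-1) * (1 - sum of item
    variances / variance of per-respondent totals)."""
    k = len(item_scores)
    totals = [sum(vals) for vals in zip(*item_scores)]  # per-respondent sums
    return k / (k - 1) * (1 - sum(variance(v) for v in item_scores) / variance(totals))
```

Three perfectly correlated items yield α = 1.0; items that vary independently drive α toward 0.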

Why 3 questions per category (not 4 or 5)

Three is the minimum for internal consistency measurement while keeping the total instrument at 24 questions (5-minute administration). Expanding to 4 per category would require 32 questions, increasing completion time to ~7 minutes and introducing fatigue effects. The 24-question design is optimized for executive tolerance — the people who most need to take it are the people with the least patience for surveys.

SD threshold of 1.2

On a 5-point scale, an SD of 1.2 spans 30% of the 4-point range from minimum to maximum. For context:

Team Distribution | Approximate SD | Interpretation
All respondents answer 3 or 4 | ~0.5 | Strong consensus. Minor variation.
Split between 2, 3, and 4 | ~0.8 | Normal variation. Not alarming.
Half answer 2, half answer 4 | ~1.0 | Emerging divergence. Worth noting.
Spread across 1–4 or 2–5 | ~1.2 | Threshold. Fragmented perception.
Bimodal: 1s and 5s | ~1.6+ | Fundamentally different experiences of the same system.

The 1.2 threshold catches meaningful disagreement without being overly sensitive. It fires when the team is genuinely split, not when there's normal variation in perspective.

Perception gap threshold of ≥ 2.0

A 2-point gap on a 5-point scale separates the executive and the team mean by half the 4-point range. This is a strong signal: the executive answering "Usually" while the team mean is "Rarely." The 1.5 threshold is flagged as "notable" — enough to investigate, not enough to alarm.

The coupling gap

Computed as |Reality% – Delivery%|, where each percentage is the ledger score divided by its maximum (60). Using percentages rather than raw scores normalizes for the fact that both ledgers have identical scales, making the gap directly interpretable.

Coupling_Gap = |Reality_Score/60 − Delivery_Score/60| × 100

── Interpretation ──

Gap < 10% → Ledgers are reasonably balanced. Focus on overall level.
Gap 10–20% → Imbalance emerging. One ledger pulling ahead or behind.
Gap > 20% → Single-ledger plateau territory. The stronger ledger was treated. The weaker is pulling it back.
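As a function, with the interpretation bands attached (the band labels are mine, paraphrasing the thresholds above):

```python
def coupling_gap(reality_score: int, delivery_score: int):
    """Coupling gap in percentage points between the two ledgers,
    each normalized to its 60-point maximum, plus its band."""
    gap = abs(reality_score / 60 - delivery_score / 60) * 100
    if gap < 10:
        band = "balanced"
    elif gap <= 20:
        band = "imbalance emerging"
    else:
        band = "single-ledger plateau territory"
    return gap, band
```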
5.2 — What the Instrument Does NOT Measure

Boundaries of the diagnostic

Intellectual honesty about what the Health Check doesn't do:

Limitations

It measures perception, not objective reality. The Health Check measures how people experience the system, not whether the system is objectively well-designed. This is a feature, not a bug — perception IS the operational reality. But it means the scores can be influenced by recent events (recency bias) or by organizational mood.

It doesn't measure the Intelligence stage directly. Intelligence (the learning loop) is inferred from longitudinal change, not from a single administration. One snapshot can localize the failure. Only repeated measurement can tell you whether the system is learning.

3 questions per construct is the minimum, not ideal. With 3 items, one poorly understood question can skew a category score significantly. This is acceptable for a 5-minute diagnostic but means individual category scores should be treated as directional, not precise. Stage scores (6 items) and ledger scores (12 items) are more reliable.

It doesn't explain causality. The Health Check identifies where the system is weak. It doesn't explain why. That's what the practitioner conversation is for — the facilitated session after the data is presented. The instrument creates the diagnostic map. The practitioner reads it.

Part VI
Implementation
6.1 — Survey Setup

Any standard forms tool — nothing custom required

Form structure

Field | Type | Purpose
Role identifier | Dropdown: "Executive" / "Team Member" | Separates Mode A from Mode B in the same data set. Enables comparative analysis without separate forms.
R1 through R12 | 5-point scale (radio or slider) | Reality Ledger questions. Labeled: Never / Rarely / Sometimes / Usually / Always
D1 through D12 | 5-point scale (radio or slider) | Delivery Ledger questions. Same labels.

Total: 25 fields (1 role identifier + 24 Likert items). No open-text fields. No conditional logic. Any forms tool that supports radio buttons can run this.

Administration sequence

Present Reality Ledger questions first (R1–R12), then Delivery Ledger (D1–D12). Within each ledger, present in category order. Do not randomize — the conceptual flow from Shared Facts → Honest Tradeoffs → True Constraints → No Spin creates a natural progression that aids honest reflection. Same for Delivery: Ownership → Authority → Decision Rights → Sustainable Rhythm builds from "who owns it" to "can the system sustain."

Framing language (pre-survey)

"This is a system diagnostic, not a performance review. Answer about the system, not about any individual. There are no right answers — only your honest experience of how the organization operates day to day."

6.2 — Spreadsheet Analysis

The formulas — ready to paste

After exporting responses to CSV/Excel, the analysis requires straightforward formulas. Below assumes row 1 = headers, row 2 = executive response, rows 3+ = team responses. Columns B through Y contain the 24 question responses (R1–R12 in B–M, D1–D12 in N–Y).

── LEDGER SCORES (for any individual row) ──

Reality_Score = SUM(B2:M2) ← sum of R1..R12 (row 2 = executive; adjust per row)
Delivery_Score = SUM(N2:Y2) ← sum of D1..D12
Reality_% = Reality_Score / 60 × 100
Delivery_% = Delivery_Score / 60 × 100
Coupling_Gap = ABS(Reality_% − Delivery_%)

── STAGE SCORES ──

Insight = (R1+R2+R3+R7+R8+R9) / 30 × 100
Alignment = (R4+R5+R6+R10+R11+R12) / 30 × 100
Execution = (D1+D2+D3+D4+D5+D6) / 30 × 100
Accountability = (D7+D8+D9+D10+D11+D12) / 30 × 100

── TEAM AGGREGATES (per question column, rows 3+) ──

Team_Mean[q] = AVERAGE(q3:qN)
Team_SD[q] = STDEV(q3:qN)
Team_Min[q] = MIN(q3:qN)
Team_Max[q] = MAX(q3:qN)

── PERCEPTION GAP (per question) ──

Gap[q] = Exec_Score[q] − Team_Mean[q]

── FLAGS ──

Blind_Spot[q] = IF(AND(Exec[q]>=4, Team_Mean[q]<=3), "BLIND SPOT", "")
Shared_Pain[q] = IF(AND(Exec[q]<=3, Team_Mean[q]<=3), "SHARED PAIN", "")
False_Conf[q] = IF(AND(Exec[q]>=4, Team_SD[q]>1.2), "FALSE CONFIDENCE", "")
Fragmented[q] = IF(Team_SD[q]>1.2, "FRAGMENTED", "")
6.3 — The Longitudinal Dimension (Mode D)

Objective 4+: measuring Intelligence through change

When the Health Check is administered a second time (recommended: 6–8 weeks after intervention begins), the Intelligence stage becomes measurable.

── Delta Analysis (T2 − T1) ──

Δ_Stage[s] = Stage_Score_T2[s] − Stage_Score_T1[s]
Δ_Coupling_Gap = Coupling_Gap_T2 − Coupling_Gap_T1
Δ_Variance[q] = Team_SD_T2[q] − Team_SD_T1[q]

── Intelligence Indicators ──

Coupling gap narrowing → system is re-coupling
Weakest stage improving → intervention is correctly targeted
Team variance decreasing → shared reality is strengthening
Perception gap narrowing → exec and team converging on same view

── Warning Signs ──

Coupling gap widening → intervention is single-ledger; plateau incoming
Treated stage improved but
untreated stage declined → coupling degradation; untreated side eroding
Variance increasing → shared reality is fragmenting, not converging
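The delta analysis is elementwise subtraction across snapshots. A sketch, assuming each administration is summarized as a dict of stage scores plus the coupling gap (the structure is illustrative):

```python
def intelligence_deltas(t1: dict, t2: dict) -> dict:
    """T2 − T1 deltas for stage scores and the coupling gap.

    Expects dicts like {"stages": {...}, "coupling_gap": float}.
    Positive stage deltas plus a negative gap delta are the
    'system is learning' signature described above.
    """
    stage_delta = {s: t2["stages"][s] - t1["stages"][s] for s in t1["stages"]}
    gap_delta = t2["coupling_gap"] - t1["coupling_gap"]
    return {
        "stage_delta": stage_delta,
        "coupling_gap_delta": gap_delta,
        "recoupling": gap_delta < 0,  # gap narrowing → system re-coupling
    }
```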

This is where the palindrome earns its name. Intelligence (the measured change) feeds the next Insight (what to diagnose next). The cycle either compounds or it doesn't — and the delta data tells you which.

Part VII
Integrity Check
7.1 — Does the Instrument Deliver All Four Objectives?

Validation matrix

Objective | Required Data | Produced By | Status
1. Executive View | Individual scores: overall, ledger, category, stage, failure mode | Mode A: single administration, standard scoring | ✓ Complete
2. Team View | Aggregate: means, SDs, medians per question/category/ledger/stage | Mode B: anonymous team, standard aggregation | ✓ Complete
3a. Exec vs. Team | Perception gaps, blind spots, shared pain, false confidence | Mode C: comparative analysis (gap = exec − team mean) | ✓ Complete
3b. Intra-Team | Per-question SD, fragmentation flags, consensus mapping | Mode B analysis: SD per question, threshold flagging | ✓ Complete
4. Cycle Localization | Stage scores, stage profile, weakest stage identification | Stage mapping: 6 questions per stage, min-stage = failure point | ✓ Complete
4+. Intelligence | Deltas across administrations: stage, coupling gap, variance | Mode D: longitudinal comparison (T2 − T1) | ✓ Complete
7.2 — Open Questions for Field Testing

What we'll learn from the first 5 administrations

Validate After First Deployments

Internal consistency. Compute Cronbach's alpha per category after N ≥ 20 individual responses. If any category α < 0.65, the constituent questions may be measuring different things and need revision.

Stage mapping validity. Does the stage localization match the practitioner's independent diagnosis? If the instrument says "stuck at Alignment" but the practitioner's field reading says "stuck at eXecution," either the mapping needs adjustment or the practitioner has a blind spot.

Threshold calibration. The 60% boundary between failure modes, the 1.2 SD threshold for fragmentation, and the 2.0 perception gap threshold are all reasonable starting points but should be refined with empirical data. They may need adjustment by industry (healthcare vs. financial services) or by organization size.

Question clarity. Any question where the team's SD is consistently high across multiple organizations may be poorly worded — people might be interpreting it differently, not experiencing the system differently. R5 ("People feel safe raising bad news") is the most likely candidate — "safe" means different things in different cultures.