About CAIT-NV
Development, methodology, and technical foundation
Introduction
CAIT-NV is an experimental answer to two questions. First: how much does item response theory improve accuracy over classical test theory? Significantly. Second: is adaptive testing cope? Not at all.
Origins & Development
CAIT was originally created by EqusG as an effort to "clone" WAIS-IV. It does so somewhat disappointingly. Our factor analysis shows that CAIT loads on g at ~.76, versus ~.90 for WAIS-IV. Removing the verbal items puts it down to ~.75, basically the same.
Instead of simply dismissing CAIT as a dogshit test, we took the initiative to adapt it to an IRT framework. Could IRT salvage CAIT? To make it appealing to as wide an audience as possible, we made it nonverbal, for international reach and cultural fairness, and shortened it drastically, a concession to the brainrot attention spans of people these days. Instead of the full test's 60 minutes, our adaptive version takes about 10. That is 6x faster, yet more accurate: our version has an average g-loading of ~.83 versus ~.76.
Spearman's Law of Diminishing Returns

This graph shows how big a difference IRT makes, and why adaptive testing is a no-brainer despite a small penalty. It illustrates the importance of weighting items by quality, which is exactly what IRT does. Adaptive testing then shortens administration time by selecting the most informative items for a given ability level. The main driving force behind SLODR is item quality: at the extremes, very low or very high IQs, item quality becomes more dubious, the measure gets muddier, and the g-loading drops. If every item is treated the same instead of letting high-quality items contribute more to the measure, accuracy will be suboptimal. This is why CTT is simply inferior, however practical it may be.
Test Design & Item Types
CAIT-NV borrows three item types from WAIS, each targeting different facets of g:
- Figure Weights (quantitative and fluid reasoning)
- Visual Puzzles (visual-spatial reasoning)
- Block Design (visuospatial construction)
The inclusion of multiple item types increases construct coverage and improves accuracy through method variance control.
Item Response Theory
CAIT-NV employs an IRT model to calibrate items and estimate ability. This model accounts for:
Item Discrimination (a-parameter)
Indicates how effectively an item distinguishes between examinees with different ability levels. Higher values mean the item provides more information about the latent trait, especially around a specific ability point.
Item Difficulty (b-parameter)
Represents the point on the ability scale (typically ranging from -3 to +3) where the item is most informative. This differs from classical difficulty (proportion correct), as it's expressed in terms of the latent trait.
Item Characteristic Curve (2PL)

The probability of a correct response (P) for an individual with ability θ on an item with discrimination a and difficulty b is given by:

P(θ) = 1 / (1 + e^(-a(θ - b)))
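As a quick illustration, the curve and its Fisher information (which drives the item selection described below) take only a few lines of code. This is an illustrative TypeScript sketch, not CAIT-NV's production code.

```typescript
// Two-parameter logistic (2PL) item response function.
// a: discrimination, b: difficulty, theta: examinee ability.
function prob2PL(theta: number, a: number, b: number): number {
  return 1 / (1 + Math.exp(-a * (theta - b)));
}

// Fisher information of a 2PL item at theta: I(theta) = a^2 * P * (1 - P).
// It peaks at theta = b, which is why b marks where the item is most informative.
function itemInfo2PL(theta: number, a: number, b: number): number {
  const p = prob2PL(theta, a, b);
  return a * a * p * (1 - p);
}

// Example: a discriminating item (a = 1.8) of average difficulty (b = 0)
console.log(prob2PL(0, 1.8, 0));     // 0.5  (even odds right at theta = b)
console.log(itemInfo2PL(0, 1.8, 0)); // 0.81 (information is maximal here)
```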
Given the item parameters, it is possible to find the most probable ability score for a response pattern using Bayesian inference.
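One standard way to do this is expected a posteriori (EAP) estimation: averaging θ over a prior-weighted grid of quadrature points. The sketch below assumes a standard-normal prior and a fixed grid; the estimator and prior actually used by CAIT-NV may differ.

```typescript
interface Item { a: number; b: number; }

const prob2PL = (theta: number, item: Item) =>
  1 / (1 + Math.exp(-item.a * (theta - item.b)));

// Expected a posteriori (EAP) ability estimate: the posterior mean of theta under a
// standard-normal prior, approximated on an evenly spaced quadrature grid.
function eapEstimate(items: Item[], responses: number[]): { theta: number; se: number } {
  const grid = Array.from({ length: 61 }, (_, i) => -4 + (8 * i) / 60);
  // Posterior weight at each grid point: prior density times response-pattern likelihood
  const weights = grid.map(t =>
    items.reduce(
      (lik, item, j) => lik * (responses[j] === 1 ? prob2PL(t, item) : 1 - prob2PL(t, item)),
      Math.exp(-0.5 * t * t),
    ),
  );
  const total = weights.reduce((s, w) => s + w, 0);
  const theta = grid.reduce((s, t, i) => s + t * weights[i], 0) / total;
  const variance = grid.reduce((s, t, i) => s + (t - theta) ** 2 * weights[i], 0) / total;
  return { theta, se: Math.sqrt(variance) }; // se = posterior standard deviation
}

// Example: three passes and one failure on items of rising difficulty
const administered: Item[] = [
  { a: 1.2, b: -1.0 },
  { a: 1.5, b: 0.0 },
  { a: 1.8, b: 0.5 },
  { a: 1.4, b: 1.5 },
];
console.log(eapEstimate(administered, [1, 1, 1, 0])); // theta roughly around +0.6, with its SE
```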
Adaptive Testing Algorithm
CAIT-NV implements a maximum information adaptive algorithm that tailors the test to each examinee, dynamically selecting items to maximize measurement precision at their estimated ability level. This creates a more efficient and personalized testing experience.
Adaptive Item Selection Process

The algorithm's key steps include:
- Initialization: An initial ability estimate (θ) is set (e.g., to 0, representing average ability), often with a slight randomization to vary initial item sequences.
- Ability Re-estimation: After each response, the examinee's ability is re-estimated using Bayesian inference, incorporating all prior responses.
- Optimal Item Selection: The next item is chosen to provide maximum information at the current ability estimate.
- Content Balancing: The algorithm ensures a balanced representation of all three item types throughout the test.
- Termination Criteria: Testing concludes when measurement precision reaches a target standard error (SE ≤ 0.5) or a maximum item count (e.g., 30 items) is administered.
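Putting these steps together, a compact version of the loop looks like the following. It is an illustrative TypeScript sketch, with a simple "least-used type first" content-balancing rule and an EAP updater for the re-estimation step; CAIT-NV's actual selection and balancing rules may be more elaborate.

```typescript
interface Item { id: number; a: number; b: number; type: "FW" | "VP" | "BD"; }

const p2pl = (t: number, it: Item) => 1 / (1 + Math.exp(-it.a * (t - it.b)));
const info = (t: number, it: Item) => { const p = p2pl(t, it); return it.a ** 2 * p * (1 - p); };

// EAP re-estimation of theta (and its SE) under a standard-normal prior.
function eap(admin: Item[], resp: number[]): { theta: number; se: number } {
  const grid = Array.from({ length: 61 }, (_, i) => -4 + (8 * i) / 60);
  const w = grid.map(t => admin.reduce(
    (lik, it, j) => lik * (resp[j] ? p2pl(t, it) : 1 - p2pl(t, it)),
    Math.exp(-0.5 * t * t)));
  const total = w.reduce((s, x) => s + x, 0);
  const theta = grid.reduce((s, t, i) => s + t * w[i], 0) / total;
  const variance = grid.reduce((s, t, i) => s + (t - theta) ** 2 * w[i], 0) / total;
  return { theta, se: Math.sqrt(variance) };
}

// Maximum-information selection with naive content balancing:
// prefer the least-administered item type, then the most informative item within it.
function nextItem(pool: Item[], used: Set<number>, theta: number): Item | undefined {
  const counts: Record<Item["type"], number> = { FW: 0, VP: 0, BD: 0 };
  pool.forEach(it => { if (used.has(it.id)) counts[it.type]++; });
  const targetType = (Object.keys(counts) as Item["type"][]).sort((x, y) => counts[x] - counts[y])[0];
  let candidates = pool.filter(it => !used.has(it.id) && it.type === targetType);
  if (candidates.length === 0) candidates = pool.filter(it => !used.has(it.id));
  return candidates.sort((x, y) => info(theta, y) - info(theta, x))[0];
}

// Full adaptive loop: initialize, select, re-estimate, and stop at SE <= 0.5 or 30 items.
function runAdaptiveTest(pool: Item[], answer: (it: Item) => number) {
  let theta = (Math.random() - 0.5) * 0.2; // start near average ability, slightly randomized
  let se = Infinity;
  const used = new Set<number>();
  const administered: Item[] = [];
  const responses: number[] = [];
  while (se > 0.5 && administered.length < 30) {
    const item = nextItem(pool, used, theta);
    if (!item) break; // item pool exhausted
    used.add(item.id);
    administered.push(item);
    responses.push(answer(item));
    ({ theta, se } = eap(administered, responses)); // Bayesian re-estimation after each response
  }
  return { theta, se, itemsUsed: administered.length };
}

// Usage: simulate an examinee with true ability +1.0 against a random 2PL pool
const pool: Item[] = Array.from({ length: 90 }, (_, i) => ({
  id: i,
  a: 0.8 + Math.random(),
  b: -3 + 6 * Math.random(),
  type: (["FW", "VP", "BD"] as const)[i % 3],
}));
console.log(runAdaptiveTest(pool, it => (Math.random() < p2pl(1.0, it) ? 1 : 0)));
```

In practice, content balancing and exposure control are usually more sophisticated than a least-used-type rule, but the select, re-estimate, and stop skeleton is the same.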
Adaptive vs. Full-length Test Correlation

The adaptive nature of CAIT-NV offers several advantages:
- Efficiency: Reduced test length (average of ~20 items) while maintaining high reliability.
- Personalization: Customized difficulty minimizes examinee frustration and boredom.
- Precision: Higher measurement precision across the ability spectrum, especially at the extremes.
- Security: Increased resistance to item memorization and certain test-taking strategies.
Bifactor Model Insights
To further understand the structure of cognitive abilities measured by CAIT-NV, a bifactor analysis was conducted. This analysis revealed that Figure Weights items are the purest measure of the general intelligence factor (g) among the three item types. Unlike Visual Puzzles and Block Design, which show loadings on both 'g' and specific (group) factors, Figure Weights items primarily load on 'g' with minimal influence from other specific abilities.
Bifactor Model of CAIT-NV Subtests

This finding is significant: it suggests Figure Weights items provide a clearer "signal" of 'g'. This insight into the test's factor structure informs the adaptive item selection algorithm, allowing for strategic weighting of items based on their g-loading. This enhances the test's ability to provide an accurate measure of general cognitive ability efficiently.
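How that weighting could enter the selection rule is not spelled out here; one hypothetical option, sketched below, is to scale each item's information by a per-item g-loading taken from the bifactor model, so that purer measures of g (such as Figure Weights) are favored when information is otherwise comparable. The gLoading field and the multiplicative weighting are assumptions for illustration, not CAIT-NV's documented implementation.

```typescript
interface Item { a: number; b: number; gLoading: number; } // gLoading: assumed per-item g-loading from the bifactor model

const prob2PL = (theta: number, it: Item) => 1 / (1 + Math.exp(-it.a * (theta - it.b)));

// Hypothetical selection index: Fisher information scaled by g-loading, so that
// items measuring g more purely win out when raw information is similar.
function gWeightedInfo(theta: number, it: Item): number {
  const p = prob2PL(theta, it);
  return it.gLoading * it.a * it.a * p * (1 - p);
}

// Example: a higher-g-loading item beats an otherwise identical item
console.log(gWeightedInfo(0, { a: 1.5, b: 0, gLoading: 0.85 })); // ~0.48
console.log(gWeightedInfo(0, { a: 1.5, b: 0, gLoading: 0.65 })); // ~0.37
```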
Scoring Methodology
CAIT-NV employs a two-stage scoring approach that combines the interpretability of traditional psychometric frameworks with the precision of modern IRT:
1. Observed Score Calculation
Initial raw scores (sum of correct responses) are transformed into scaled scores. This uses Luke Atronio's normalization framework, establishing a baseline on the traditional IQ scale (mean 100, SD 15) per Classical Test Theory principles.
2. Latent Ability Estimation
Building on the normalized score, this phase uses the IRT model to derive a more precise estimate of the latent g-factor. This leverages item parameters (difficulty, discrimination) and the specific pattern of responses, not just the total correct.
This dual-framework approach offers scores that are both familiar (IQ scale) and psychometrically robust. The adaptive algorithm, by maximizing information at the examinee's ability level, ensures high measurement efficiency. CAIT-NV stops when the standard error reaches 0.5 logits or after 30 items, whichever comes first. This precision yields reported IQ scores with a 95% confidence interval of approximately ±7-8 IQ points, comparable to or better than many longer, traditional intelligence tests.
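To make the reporting step concrete, the sketch below converts a latent ability estimate and its standard error onto the IQ scale. The IQ = 100 + 15·θ mapping and the normal-theory interval are assumed conventions for the example, not CAIT-NV's published scoring constants.

```typescript
// Convert an IRT ability estimate (theta) and its standard error into a reported IQ
// with a confidence interval. The 100 + 15 * theta mapping is an assumed convention.
function reportScore(theta: number, se: number, z = 1.96) {
  const iq = 100 + 15 * theta;
  const margin = z * se * 15; // half-width of the confidence interval on the IQ scale
  return {
    iq: Math.round(iq),
    lower: Math.round(iq - margin),
    upper: Math.round(iq + margin),
  };
}

// Example: theta = 1.2 with the average SE of 0.35 reported below
console.log(reportScore(1.2, 0.35)); // { iq: 118, lower: 108, upper: 128 }
```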
Reliability & Validity
In IRT-based adaptive tests like CAIT-NV, reliability is expressed through the standard error of measurement (SE), which can vary across the ability spectrum, rather than a single coefficient. This provides a more nuanced view of measurement precision.
Key Reliability Metrics
- Marginal reliability: 0.92
- Average standard error (SE): 0.35 logits
- Test-retest correlation (30-day interval): 0.89
- Internal consistency (information-weighted θ reliability): 0.94
Evidence of Validity
- Correlation with WAIS-IV FSIQ: 0.82
- Correlation with Raven's APM: 0.76
- Prediction of academic achievement: r = 0.65-0.72
- Factor analysis confirms high g-loading: ~0.83-0.85
The normative sample for CAIT-NV includes data from approximately 7,000 examinees from diverse international backgrounds. Statistical methods were applied to ensure the sample's representativeness. The non-verbal nature of the test facilitated this broad international sampling. Furthermore, Differential Item Functioning (DIF) analyses were conducted to identify and address potential bias across demographic groups (e.g., gender, ethnicity). Items showing significant DIF were revised or removed to promote test fairness. Ongoing validity studies continue to refine CAIT-NV.
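As an illustration of what a DIF screen involves, the sketch below computes the Mantel-Haenszel common odds ratio across strata of matched total score and converts it to the ETS delta metric. This is one common DIF method, shown for illustration; the specific procedure applied to CAIT-NV items is not detailed above.

```typescript
// Mantel-Haenszel DIF screen for one studied item. Each stratum is a 2x2 table of
// correct/incorrect counts for the reference and focal groups at a matched score level.
// MH delta values beyond roughly +/-1.5 are conventionally flagged for review.
interface Stratum {
  refCorrect: number;
  refIncorrect: number;
  focalCorrect: number;
  focalIncorrect: number;
}

function mantelHaenszelDelta(strata: Stratum[]): number {
  let num = 0, den = 0;
  for (const s of strata) {
    const total = s.refCorrect + s.refIncorrect + s.focalCorrect + s.focalIncorrect;
    if (total === 0) continue;
    num += (s.refCorrect * s.focalIncorrect) / total;
    den += (s.refIncorrect * s.focalCorrect) / total;
  }
  const oddsRatio = num / den;              // MH common odds ratio across strata
  return -2.35 * Math.log(oddsRatio);       // ETS delta metric; ~0 means negligible DIF
}

// Example: three score strata where the item behaves almost identically across groups
console.log(mantelHaenszelDelta([
  { refCorrect: 40, refIncorrect: 60, focalCorrect: 38, focalIncorrect: 62 },
  { refCorrect: 70, refIncorrect: 30, focalCorrect: 69, focalIncorrect: 31 },
  { refCorrect: 90, refIncorrect: 10, focalCorrect: 88, focalIncorrect: 12 },
])); // close to 0, i.e. negligible DIF
```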
Technical Implementation
CAIT-NV is delivered as a progressive web application (PWA), ensuring broad accessibility and a consistent user experience while maintaining measurement integrity. Key aspects of its architecture include:
Frontend
Built with React.js and Next.js for a responsive, fast-loading user interface. SVG-based item visualizations ensure clear rendering and scalability across all devices.
Backend
A serverless architecture using Next.js API routes, with Redis for efficient session management. IRT calculations are performed in real-time, optimized for sub-100ms response times.
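As a hypothetical sketch of what such a route might look like (not CAIT-NV's actual source), a Next.js API route can load the session from Redis, fold in the new response, re-estimate ability, and either finish or return the next item. The endpoint path, session shape, and the placeholder estimateAbility/selectNextItemId helpers are assumptions for illustration.

```typescript
// pages/api/respond.ts - hypothetical sketch of a response-handling route.
import type { NextApiRequest, NextApiResponse } from "next";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL as string);

interface Session { administered: number[]; responses: number[]; theta: number; se: number; }

// Placeholder stand-ins for the IRT logic (see the psychometric sketches earlier on this page).
const estimateAbility = (administered: number[], responses: number[]) => ({
  theta: responses.reduce((s, r) => s + (r ? 0.2 : -0.2), 0), // placeholder update rule
  se: 1 / Math.sqrt(Math.max(administered.length, 1)),        // placeholder precision
});
const selectNextItemId = (administered: number[], _theta: number) => administered.length; // placeholder

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const { sessionId, itemId, correct } = req.body as { sessionId: string; itemId: number; correct: 0 | 1 };

  // Load the examinee's session state from Redis and fold in the new response
  const raw = await redis.get(`session:${sessionId}`);
  if (!raw) return res.status(404).json({ error: "unknown session" });
  const session: Session = JSON.parse(raw);
  session.administered.push(itemId);
  session.responses.push(correct);

  // Re-estimate ability, persist the session, and apply the stopping rule (SE <= 0.5 or 30 items)
  const { theta, se } = estimateAbility(session.administered, session.responses);
  session.theta = theta;
  session.se = se;
  await redis.set(`session:${sessionId}`, JSON.stringify(session), "EX", 3600);

  if (se <= 0.5 || session.administered.length >= 30) {
    return res.status(200).json({ done: true, theta, se });
  }
  return res.status(200).json({ done: false, nextItemId: selectNextItemId(session.administered, theta) });
}
```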
Security Measures
Includes end-to-end encryption, rate limiting, input validation, and robust security headers to protect examinee data and test integrity, adhering to data protection best practices (e.g., GDPR principles).
Quality Assurance
Maintained through comprehensive test coverage, adherence to accessibility standards (WCAG 2.1), continuous monitoring of psychometric properties, and regular security audits to ensure a reliable testing environment.
References & Further Reading
- Baker, F. B., & Kim, S. H. (2017). The Basics of Item Response Theory Using R. Springer.
- Embretson, S. E., & Reise, S. P. (2013). Item Response Theory for Psychologists. Psychology Press.
- Kolen, M. J., & Brennan, R. L. (2014). Test Equating, Scaling, and Linking: Methods and Practices. Springer.
- van der Linden, W. J. (2018). Handbook of Item Response Theory, Volume Three: Applications. CRC Press.
- Wainer, H. (Ed.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Lawrence Erlbaum Associates.
- Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.