About CAIT-NV
Development, methodology, and technical foundation
Introduction
CAIT-NV is an experimental answer to two questions. First: how much does item response theory improve accuracy over classical test theory? Significantly. Second: is adaptive testing cope? Not at all.
Origins & Development
CAIT was originally created by EqusG as an effort to "clone" WAIS-IV. It does so somewhat disappointingly. Our factor analysis shows that CAIT loads on g at ~.76, versus ~.90 for WAIS-IV. Removing the verbal items puts it down to ~.75, basically the same.
Instead of simply dismissing CAIT as a dogshit test, we took the initiative to adapt it to an IRT framework. Could IRT salvage CAIT? To make it appealing to as wide an audience as possible, we made it nonverbal, for international reach and cultural fairness, and shortened it drastically, a concession to the brainrot attention spans of people these days. Instead of the full test's 60 minutes, our adaptive version takes about 10. That is 6x faster, yet more accurate: our version has an average g-loading of ~.83 versus ~.76.
Spearman's Law of Diminishing Returns

This graph shows how big a difference IRT makes, and why adaptive testing is a no-brainer despite a small penalty. It illustrates the importance of weighting items by quality, which is exactly what IRT does. Adaptive testing then shortens administration time by selecting the most informative items for a given ability level. The main driving force behind SLODR is item quality: at the extremes, very low or very high IQs, item quality becomes more dubious, the measure gets muddier, and the g-loading drops. If every item is treated the same instead of letting high-quality items contribute more to the measure, accuracy will be suboptimal. This is why CTT is simply inferior, however practical it may be.
Test Design & Item Types
CAIT-NV borrows three item types from WAIS, each targeting different facets of g:
- Figure Weights (quantitative and fluid reasoning)
- Visual Puzzles (visual-spatial reasoning)
- Block Design (visuospatial construction)
The inclusion of multiple item types increases construct coverage and improves accuracy through method variance control.
Item Response Theory
CAIT-NV employs an IRT model to calibrate items and estimate ability. This model accounts for:
Item Discrimination (a-parameter)
Indicates how effectively an item distinguishes between examinees with different ability levels. Higher values mean the item provides more information about the latent trait, especially around a specific ability point.
Item Difficulty (b-parameter)
Represents the point on the ability scale (typically ranging from -3 to +3) where the item is most informative. This differs from classical difficulty (proportion correct), as it's expressed in terms of the latent trait.
Item Characteristic Curve (2PL)

The probability of a correct response (P) for an individual with ability θ on an item with discrimination a and difficulty b is given by:

P(θ) = 1 / (1 + e^(-a(θ - b)))
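As a quick illustration, the curve and its Fisher information (which drives the item selection described below) take only a few lines of code. This is an illustrative TypeScript sketch, not CAIT-NV's production code.

```typescript
// Two-parameter logistic (2PL) item response function.
// a: discrimination, b: difficulty, theta: examinee ability.
function prob2PL(theta: number, a: number, b: number): number {
  return 1 / (1 + Math.exp(-a * (theta - b)));
}

// Fisher information of a 2PL item at theta: I(theta) = a^2 * P * (1 - P).
// It peaks at theta = b, which is why b marks where the item is most informative.
function itemInfo2PL(theta: number, a: number, b: number): number {
  const p = prob2PL(theta, a, b);
  return a * a * p * (1 - p);
}

// Example: a discriminating item (a = 1.8) of average difficulty (b = 0)
console.log(prob2PL(0, 1.8, 0));     // 0.5  (even odds right at theta = b)
console.log(itemInfo2PL(0, 1.8, 0)); // 0.81 (information is maximal here)
```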
Given the item parameters, it is possible to find the most probable ability score for a response pattern using Bayesian inference.
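One standard way to do this is expected a posteriori (EAP) estimation: averaging θ over a prior-weighted grid of quadrature points. The sketch below assumes a standard-normal prior and a fixed grid; the estimator and prior actually used by CAIT-NV may differ.

```typescript
interface Item { a: number; b: number; }

const prob2PL = (theta: number, item: Item) =>
  1 / (1 + Math.exp(-item.a * (theta - item.b)));

// Expected a posteriori (EAP) ability estimate: the posterior mean of theta under a
// standard-normal prior, approximated on an evenly spaced quadrature grid.
function eapEstimate(items: Item[], responses: number[]): { theta: number; se: number } {
  const grid = Array.from({ length: 61 }, (_, i) => -4 + (8 * i) / 60);
  // Posterior weight at each grid point: prior density times response-pattern likelihood
  const weights = grid.map(t =>
    items.reduce(
      (lik, item, j) => lik * (responses[j] === 1 ? prob2PL(t, item) : 1 - prob2PL(t, item)),
      Math.exp(-0.5 * t * t),
    ),
  );
  const total = weights.reduce((s, w) => s + w, 0);
  const theta = grid.reduce((s, t, i) => s + t * weights[i], 0) / total;
  const variance = grid.reduce((s, t, i) => s + (t - theta) ** 2 * weights[i], 0) / total;
  return { theta, se: Math.sqrt(variance) }; // se = posterior standard deviation
}

// Example: three passes and one failure on items of rising difficulty
const administered: Item[] = [
  { a: 1.2, b: -1.0 },
  { a: 1.5, b: 0.0 },
  { a: 1.8, b: 0.5 },
  { a: 1.4, b: 1.5 },
];
console.log(eapEstimate(administered, [1, 1, 1, 0])); // theta roughly around +0.6, with its SE
```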
Adaptive Testing Algorithm
CAIT-NV implements a maximum information adaptive algorithm that tailors the test to each examinee, dynamically selecting items to maximize measurement precision at their estimated ability level. This creates a more efficient and personalized testing experience.
Adaptive Item Selection Process

The algorithm's key steps include:
- Initialization: An initial ability estimate (θ) is set (e.g., to 0, representing average ability), often with a slight randomization to vary initial item sequences.
- Ability Re-estimation: After each response, the examinee's ability is re-estimated using Bayesian inference, incorporating all prior responses.
- Optimal Item Selection: The next item is chosen to provide maximum information at the current ability estimate.
- Content Balancing: The algorithm ensures a balanced representation of all three item types throughout the test.
- Termination Criteria: Testing concludes when measurement precision reaches a target standard error (SE ≤ 0.5) or a maximum item count (e.g., 30 items) is administered.
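Putting these steps together, a compact version of the loop looks like the following. It is an illustrative TypeScript sketch, with a simple "least-used type first" content-balancing rule and an EAP updater for the re-estimation step; CAIT-NV's actual selection and balancing rules may be more elaborate.

```typescript
interface Item { id: number; a: number; b: number; type: "FW" | "VP" | "BD"; }

const p2pl = (t: number, it: Item) => 1 / (1 + Math.exp(-it.a * (t - it.b)));
const info = (t: number, it: Item) => { const p = p2pl(t, it); return it.a ** 2 * p * (1 - p); };

// EAP re-estimation of theta (and its SE) under a standard-normal prior.
function eap(admin: Item[], resp: number[]): { theta: number; se: number } {
  const grid = Array.from({ length: 61 }, (_, i) => -4 + (8 * i) / 60);
  const w = grid.map(t => admin.reduce(
    (lik, it, j) => lik * (resp[j] ? p2pl(t, it) : 1 - p2pl(t, it)),
    Math.exp(-0.5 * t * t)));
  const total = w.reduce((s, x) => s + x, 0);
  const theta = grid.reduce((s, t, i) => s + t * w[i], 0) / total;
  const variance = grid.reduce((s, t, i) => s + (t - theta) ** 2 * w[i], 0) / total;
  return { theta, se: Math.sqrt(variance) };
}

// Maximum-information selection with naive content balancing:
// prefer the least-administered item type, then the most informative item within it.
function nextItem(pool: Item[], used: Set<number>, theta: number): Item | undefined {
  const counts: Record<Item["type"], number> = { FW: 0, VP: 0, BD: 0 };
  pool.forEach(it => { if (used.has(it.id)) counts[it.type]++; });
  const targetType = (Object.keys(counts) as Item["type"][]).sort((x, y) => counts[x] - counts[y])[0];
  let candidates = pool.filter(it => !used.has(it.id) && it.type === targetType);
  if (candidates.length === 0) candidates = pool.filter(it => !used.has(it.id));
  return candidates.sort((x, y) => info(theta, y) - info(theta, x))[0];
}

// Full adaptive loop: initialize, select, re-estimate, and stop at SE <= 0.5 or 30 items.
function runAdaptiveTest(pool: Item[], answer: (it: Item) => number) {
  let theta = (Math.random() - 0.5) * 0.2; // start near average ability, slightly randomized
  let se = Infinity;
  const used = new Set<number>();
  const administered: Item[] = [];
  const responses: number[] = [];
  while (se > 0.5 && administered.length < 30) {
    const item = nextItem(pool, used, theta);
    if (!item) break; // item pool exhausted
    used.add(item.id);
    administered.push(item);
    responses.push(answer(item));
    ({ theta, se } = eap(administered, responses)); // Bayesian re-estimation after each response
  }
  return { theta, se, itemsUsed: administered.length };
}

// Usage: simulate an examinee with true ability +1.0 against a random 2PL pool
const pool: Item[] = Array.from({ length: 90 }, (_, i) => ({
  id: i,
  a: 0.8 + Math.random(),
  b: -3 + 6 * Math.random(),
  type: (["FW", "VP", "BD"] as const)[i % 3],
}));
console.log(runAdaptiveTest(pool, it => (Math.random() < p2pl(1.0, it) ? 1 : 0)));
```

In practice, content balancing and exposure control are usually more sophisticated than a least-used-type rule, but the select, re-estimate, and stop skeleton is the same.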
Adaptive vs. Full-length Test Correlation

The adaptive nature of CAIT-NV offers several advantages:
- Efficiency: Reduced test length (average of ~20 items) while maintaining high reliability.
- Personalization: Customized difficulty minimizes examinee frustration and boredom.
- Precision: Higher measurement precision across the ability spectrum, especially at the extremes.
- Security: Increased resistance to item memorization and certain test-taking strategies.
Bifactor Model Insights
To further understand the structure of cognitive abilities measured by CAIT-NV, a bifactor analysis was conducted. This analysis revealed that Figure Weights items are the purest measure of the general intelligence factor (g) among the three item types. Unlike Visual Puzzles and Block Design, which show loadings on both 'g' and specific (group) factors, Figure Weights items primarily load on 'g' with minimal influence from other specific abilities.
Bifactor Model of CAIT-NV Subtests

This finding is significant: it suggests Figure Weights items provide a clearer "signal" of 'g'. This insight into the test's factor structure informs the adaptive item selection algorithm, allowing for strategic weighting of items based on their g-loading. This enhances the test's ability to provide an accurate measure of general cognitive ability efficiently.
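How that weighting could enter the selection rule is not spelled out here; one hypothetical option, sketched below, is to scale each item's information by a per-item g-loading taken from the bifactor model, so that purer measures of g (such as Figure Weights) are favored when information is otherwise comparable. The gLoading field and the multiplicative weighting are assumptions for illustration, not CAIT-NV's documented implementation.

```typescript
interface Item { a: number; b: number; gLoading: number; } // gLoading: assumed per-item g-loading from the bifactor model

const prob2PL = (theta: number, it: Item) => 1 / (1 + Math.exp(-it.a * (theta - it.b)));

// Hypothetical selection index: Fisher information scaled by g-loading, so that
// items measuring g more purely win out when raw information is similar.
function gWeightedInfo(theta: number, it: Item): number {
  const p = prob2PL(theta, it);
  return it.gLoading * it.a * it.a * p * (1 - p);
}

// Example: a higher-g-loading item beats an otherwise identical item
console.log(gWeightedInfo(0, { a: 1.5, b: 0, gLoading: 0.85 })); // ~0.48
console.log(gWeightedInfo(0, { a: 1.5, b: 0, gLoading: 0.65 })); // ~0.37
```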
Scoring Methodology
CAIT-NV employs a two-stage scoring approach that combines the interpretability of traditional psychometric frameworks with the precision of modern IRT:
1. Observed Score Calculation
Initial raw scores (sum of correct responses) are transformed into scaled scores. This uses Luke Atronio's normalization framework, establishing a baseline on the traditional IQ scale (mean 100, SD 15) per Classical Test Theory principles.
2. Latent Ability Estimation
Building on the normalized score, this phase uses the IRT model to derive a more precise estimate of the latent g-factor. This leverages item parameters (difficulty, discrimination) and the specific pattern of responses, not just the total correct.
This dual-framework approach offers scores that are both familiar (IQ scale) and psychometrically robust. The adaptive algorithm, by maximizing information at the examinee's ability level, ensures high measurement efficiency. CAIT-NV stops when the standard error reaches 0.5 logits or after 30 items, whichever comes first. This precision yields reported IQ scores with a 95% confidence interval of approximately ±7-8 IQ points, comparable to or better than many longer, traditional intelligence tests.
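To make the reporting step concrete, the sketch below converts a latent ability estimate and its standard error onto the IQ scale. The IQ = 100 + 15·θ mapping and the normal-theory interval are assumed conventions for the example, not CAIT-NV's published scoring constants.

```typescript
// Convert an IRT ability estimate (theta) and its standard error into a reported IQ
// with a confidence interval. The 100 + 15 * theta mapping is an assumed convention.
function reportScore(theta: number, se: number, z = 1.96) {
  const iq = 100 + 15 * theta;
  const margin = z * se * 15; // half-width of the confidence interval on the IQ scale
  return {
    iq: Math.round(iq),
    lower: Math.round(iq - margin),
    upper: Math.round(iq + margin),
  };
}

// Example: theta = 1.2 with the average SE of 0.35 reported below
console.log(reportScore(1.2, 0.35)); // { iq: 118, lower: 108, upper: 128 }
```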
Reliability & Validity
In IRT-based adaptive tests like CAIT-NV, reliability is expressed through the standard error of measurement (SE), which can vary across the ability spectrum, rather than a single coefficient. This provides a more nuanced view of measurement precision.
Key Reliability Metrics
- Marginal reliability: 0.92
- Average standard error (SE): 0.35 logits
- Test-retest correlation (30-day interval): 0.89
- Internal consistency (information-weighted θ reliability): 0.94
Evidence of Validity
- Correlation with WAIS-IV FSIQ: 0.82
- Correlation with Raven's APM: 0.76
- Prediction of academic achievement: r = 0.65-0.72
- Factor analysis confirms high g-loading: ~0.83-0.85
The normative sample for CAIT-NV includes data from approximately 7,000 examinees from diverse international backgrounds. Statistical methods were applied to ensure the sample's representativeness. The non-verbal nature of the test facilitated this broad international sampling. Furthermore, Differential Item Functioning (DIF) analyses were conducted to identify and address potential bias across demographic groups (e.g., gender, ethnicity). Items showing significant DIF were revised or removed to promote test fairness. Ongoing validity studies continue to refine CAIT-NV.
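As an illustration of what a DIF screen involves, the sketch below computes the Mantel-Haenszel common odds ratio across strata of matched total score and converts it to the ETS delta metric. This is one common DIF method, shown for illustration; the specific procedure applied to CAIT-NV items is not detailed above.

```typescript
// Mantel-Haenszel DIF screen for one studied item. Each stratum is a 2x2 table of
// correct/incorrect counts for the reference and focal groups at a matched score level.
// MH delta values beyond roughly +/-1.5 are conventionally flagged for review.
interface Stratum {
  refCorrect: number;
  refIncorrect: number;
  focalCorrect: number;
  focalIncorrect: number;
}

function mantelHaenszelDelta(strata: Stratum[]): number {
  let num = 0, den = 0;
  for (const s of strata) {
    const total = s.refCorrect + s.refIncorrect + s.focalCorrect + s.focalIncorrect;
    if (total === 0) continue;
    num += (s.refCorrect * s.focalIncorrect) / total;
    den += (s.refIncorrect * s.focalCorrect) / total;
  }
  const oddsRatio = num / den;              // MH common odds ratio across strata
  return -2.35 * Math.log(oddsRatio);       // ETS delta metric; ~0 means negligible DIF
}

// Example: three score strata where the item behaves almost identically across groups
console.log(mantelHaenszelDelta([
  { refCorrect: 40, refIncorrect: 60, focalCorrect: 38, focalIncorrect: 62 },
  { refCorrect: 70, refIncorrect: 30, focalCorrect: 69, focalIncorrect: 31 },
  { refCorrect: 90, refIncorrect: 10, focalCorrect: 88, focalIncorrect: 12 },
])); // close to 0, i.e. negligible DIF
```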
Technical Implementation
CAIT-NV is delivered as a progressive web application (PWA), ensuring broad accessibility and a consistent user experience while maintaining measurement integrity. Key aspects of its architecture include:
Frontend
Built with React.js and Next.js for a responsive, fast-loading user interface. SVG-based item visualizations ensure clear rendering and scalability across all devices.
Backend
A serverless architecture using Next.js API routes, with Redis for efficient session management. IRT calculations are performed in real-time, optimized for sub-100ms response times.
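As a hypothetical sketch of what such a route might look like (not CAIT-NV's actual source), a Next.js API route can load the session from Redis, fold in the new response, re-estimate ability, and either finish or return the next item. The endpoint path, session shape, and the placeholder estimateAbility/selectNextItemId helpers are assumptions for illustration.

```typescript
// pages/api/respond.ts - hypothetical sketch of a response-handling route.
import type { NextApiRequest, NextApiResponse } from "next";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL as string);

interface Session { administered: number[]; responses: number[]; theta: number; se: number; }

// Placeholder stand-ins for the IRT logic (see the psychometric sketches earlier on this page).
const estimateAbility = (administered: number[], responses: number[]) => ({
  theta: responses.reduce((s, r) => s + (r ? 0.2 : -0.2), 0), // placeholder update rule
  se: 1 / Math.sqrt(Math.max(administered.length, 1)),        // placeholder precision
});
const selectNextItemId = (administered: number[], _theta: number) => administered.length; // placeholder

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const { sessionId, itemId, correct } = req.body as { sessionId: string; itemId: number; correct: 0 | 1 };

  // Load the examinee's session state from Redis and fold in the new response
  const raw = await redis.get(`session:${sessionId}`);
  if (!raw) return res.status(404).json({ error: "unknown session" });
  const session: Session = JSON.parse(raw);
  session.administered.push(itemId);
  session.responses.push(correct);

  // Re-estimate ability, persist the session, and apply the stopping rule (SE <= 0.5 or 30 items)
  const { theta, se } = estimateAbility(session.administered, session.responses);
  session.theta = theta;
  session.se = se;
  await redis.set(`session:${sessionId}`, JSON.stringify(session), "EX", 3600);

  if (se <= 0.5 || session.administered.length >= 30) {
    return res.status(200).json({ done: true, theta, se });
  }
  return res.status(200).json({ done: false, nextItemId: selectNextItemId(session.administered, theta) });
}
```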
Security Measures
Includes end-to-end encryption, rate limiting, input validation, and robust security headers to protect examinee data and test integrity, adhering to data protection best practices (e.g., GDPR principles).
Quality Assurance
Maintained through comprehensive test coverage, adherence to accessibility standards (WCAG 2.1), continuous monitoring of psychometric properties, and regular security audits to ensure a reliable testing environment.
References & Further Reading
- Baker, F. B., & Kim, S. H. (2017). The Basics of Item Response Theory Using R. Springer.
- Embretson, S. E., & Reise, S. P. (2013). Item Response Theory for Psychologists. Psychology Press.
- Kolen, M. J., & Brennan, R. L. (2014). Test Equating, Scaling, and Linking: Methods and Practices. Springer.
- van der Linden, W. J. (2018). Handbook of Item Response Theory, Volume Three: Applications. CRC Press.
- Wainer, H. (Ed.). (2000). Computerized Adaptive Testing: A Primer (2nd ed.). Lawrence Erlbaum Associates.
- Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.