
Journal of Applied Measurement
A publication of the Department of Educational Psychology and Counseling
National Taiwan Normal University
Abstracts for all Volumes
Newly accepted!
Chinese Teachers’ Assessment Self-efficacy and the Effects of Gender and Teaching Experience
Jinxin Zhu
Accepted on: 8 July 2025
Abstract
Understanding teachers’ assessment self-efficacy is critical in teaching. Whereas past studies have reported unsatisfactory factor solutions for assessment self-efficacy, this study investigates the factorial structure and the psychometric properties of an instrument for teachers’ assessment self-efficacy. Moreover, past studies found mixed results regarding the effects of teaching experience and gender on assessment self-efficacy. This study addresses the gap with a sample of 158 in-service Chinese teachers (63.9% female) from Mainland China (22.2%) and Hong Kong (77.8%). Results showed a single factor underlying Chinese teachers’ assessment self-efficacy and acceptable psychometric properties of the instrument, including acceptable item fit, good category structure, and good reliability. However, the study found items with differential item functioning related to location and gender. When comparing teachers’ assessment self-efficacy across Hong Kong and Mainland China or across genders, researchers should pay special attention to these items. Finally, teaching experience is positively associated with assessment self-efficacy, but gender is not.
Using an Exploratory Item Response Modeling Approach to Develop a Teacher Continuing Professional Development Progress Variable
Jerred Jolin and Alexander Blum
Accepted on: 15 May 2025
Abstract
Continuing professional development (CPD) involves ongoing learning activities undertaken by teachers to enhance their professional practice. Despite the importance of CPD for improving schools and enhancing student learning, there is limited research on the nature of CPD participation among rural school teachers. To address this gap, we administered the Continuing Professional Development Questionnaire (CPDQ) to 220 K-12 teachers from 15 rural school districts. We analyzed the data with a unidimensional partial credit model (PCM) to produce a CPD participation profile (CPP) for the sample of teachers, in the form of an item-person map. The CPP was developed by arranging the calibrated CPDQ items in ascending order based on the location of the first item threshold, which is the transition point within each item where respondents were more likely to report engaging in a given CPD activity at any frequency than never engaging in it. We then used the CPP to develop a CPD Engagement Progress Variable (CEPV), which defines four progressively more involved levels of CPD engagement. In support of the validity of the CEPV, we present evidence for the reliability and validity of the CPDQ. These findings contribute to the understanding of CPD engagement in rural schools by providing a picture of what CPD engagement actually looks like for this sample of rural teachers. The findings also have practical implications, such as a proposed sequence for the delivery of CPD activities for rural teachers by local educational agencies and logical “next steps” for CPD activities based on a teacher’s location within the levels of the CEPV. The research that we report here is in the spirit of Professor Wen-Chung Wang’s work because it demonstrates the practical utility of using advanced measurement models in real-world educational settings to improve educational outcomes, which was a theme of much of his work.
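For orientation, the adjacent-category form of the partial credit model (Masters, 1982) underlying this analysis can be written as follows; the notation here is generic rather than the authors’ own.

    \log\left(\frac{P_{nix}}{P_{ni(x-1)}}\right) = \theta_n - \delta_{ix}, \qquad x = 1, \ldots, m_i,

where \theta_n is the location of teacher n, \delta_{ix} is the x-th threshold of item i, and the first threshold \delta_{i1} marks the point on the scale at which reporting any engagement in a given CPD activity becomes more likely than reporting none.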
Retrofitting the Partially Confirmatory Cognitive Diagnosis Modelling to Large-Scale Educational Assessments
Yi Jin and Jinsong Chen
Accepted on: 14 May 2025
Abstract
Cognitive diagnosis models are increasingly being applied in large-scale educational assessments. The construction of Q-matrices allows for the adaptation of non-diagnostic assessments for diagnostic use. In this study, we propose retrofitting a newly developed partially confirmatory diagnosis model to large-scale educational assessments, in which the Q-matrix can be partially specified by expert knowledge and partially inferred from response data. The efficacy of this framework is demonstrated by comparing it with the fully expert specification method and the PVAF-based Q-matrix validation method through real application scenarios. The results reveal the significant practical potential of our proposed approach.
A Systematic Review of Forced-Choice Measures and Item Response Theory Modelling
Xuelan Qiu and Jimmy de la Torre
Accepted on: 13 May 2025
Abstract
Forced-choice (FC) items have been used for decades to assess psychological traits such as personality, interests, and values because of their strengths in preventing response biases. Recently, there has been increased research interest in developing item response theory (IRT) models for FC items. This review paper discusses various formats of FC measures, with specific examples of tests, and the underlying theories of human preferential choice for forced-choice responding. In addition, it presents the existing IRT models and discusses their specification and properties. Based on the review, the paper suggests theoretical and application-oriented lines for future research on FC measures.
Relating Selected Response to Constructed Response Items: Systematic Effects of Item Format
Mark Wilson, Weeraphat Suksiri, Linda Morell, Jonathan Osborne, and Sara Dozier
Accepted on: 12 May 2025
Abstract
In this study we explore the relationship between constructed-response (CR) item types and selected-response (SR) item types. We constructed stem-equivalent sets of SR and CR items, designed to assess multiple (high) levels of competency in argumentation using the Construct-Modelling approach, which is based on a previously validated construct map for argumentation. We analyzed data obtained from 741 middle school and high school students who were randomly assigned to the two assessment conditions (i.e., CR and SR). Our findings indicate that the two assessment conditions generate two different, but correlated, psychometric dimensions. In particular, the item difficulty parameters from the SR items are highly correlated with those from the paired CR items, indicating that both sets are consistent with the original construct map for argumentation. However, the CR items were much harder for the students than the SR items, by the equivalent of a grade level, and appeared even more difficult to them. We interpret this finding to mean that, in the CR case, students are hampered by the requirement to write their responses in sentences that communicate their higher-level reasoning and capability. Thus, their facility with expression is a problem only when constructed-response items are used to assess student knowledge. We use these results to review the uses of the two item formats and find value in both.
An Examination of an Epistemic Cognition Developmental Progression: A Rasch Analysis
Man Ching Esther Chan and Mark Wilson
Accepted on: 12 May 2025
Abstract
In the era of post-truth and misinformation, there are continuing calls for an emphasis on critical thinking in school education. Epistemic cognition has been proposed as foundational to critical thinking inside and outside of the classroom. However, due to a lack of understanding of the construct and its development, teachers are not well equipped to foster effective epistemic cognition, and thus critical thinking, in the classroom. Drawing from previous literature, an assessment tool was piloted and subsequently administered to 168 Year 8 students (13- to 14-year-olds) in Melbourne, Australia, to examine the theorised development of epistemic cognition. Students’ responses were examined qualitatively using think-aloud protocols and quantitatively using Classical Test Theory and Rasch Modelling. The instrument showed good person-separation reliability (.76) and Cronbach’s alpha (.79). Based on the analysis, the student responses generally aligned with the theorised construct map, demonstrating strong construct validity. The findings offer empirical evidence for a developmental progression in epistemic cognition, which may be used to inform the teaching of critical thinking.
Assessing Differential Item Functioning in Computerized Adaptive Testing
Ching-Lin Shih, Kuan-Yu Jin, and Chia-Ling Hsu
Accepted on: 24 February 2025
Abstract
To implement computerized adaptive testing (CAT), monitoring the parameter stability of operational items and checking the quality of newly written items are critical. In particular, assessing differential item functioning (DIF) is a vital step in ensuring test fairness and improving test reliability and validity. This study used a series of simulation studies to investigate the performance in CAT of several nonparametric DIF assessment methods: the odds ratio (OR; Jin et al., 2018) approach, the modified Mantel–Haenszel method (Zwick et al., 1994a, 1994b; Zwick & Thayer, 2002), the modified logistic regression method (Lei et al., 2006), and the CAT version of the simultaneous item bias test (SIBTEST) method (Nandakumar & Roussos, 2001). The simulation results showed that the OR approach outperformed the other three methods in controlling false positive rates and producing high true positive rates when there were many DIF items in a test. Moreover, combining the OR approach with a scale purification procedure further improved DIF assessment in CAT when the percentage of DIF items exceeded 10%.
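For orientation, the odds-ratio family of DIF statistics builds on the classical Mantel–Haenszel common odds ratio computed across matched score strata k; the CAT-specific OR statistic of Jin et al. (2018) may differ in detail, so the form below is illustrative only.

    \hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / N_k}{\sum_k B_k C_k / N_k},

where A_k and B_k are the numbers of correct and incorrect responses to the studied item in the reference group at stratum k, C_k and D_k are the corresponding counts in the focal group, and N_k is the stratum total; values near 1 indicate little or no DIF.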
Exploratory Differential Item Functioning Assessment With Multiple Grouping Variables: The Application of the DIF-Free-Then-DIF Strategy
Jyun-Hung Chen, Chi-Chen Chen, and Hsiu-Ti Chao
Accepted on: 18 December 2024
Abstract
To ensure test fairness and validity, it is crucial for test practitioners to assess differential item functioning (DIF) simultaneously for all grouping variables to avoid omitted variable bias (OVB; Chao et al., 2018). In testing practice, however, we often face challenges due to insufficient information, such as the absence of DIF-free anchor items, while conducting DIF assessment. This scenario, referred to as exploratory DIF assessment involving multiple grouping variables, has received limited attention, highlighting the importance of accurately identifying DIF-free items as anchors for all grouping variables. To address this issue, this study proposed the parallel DIF-free-then-DIF (p-DFTD) strategy, which selects DIF-free items simultaneously for each grouping variable and utilizes them as anchors in the constant item method for DIF assessment. A comprehensive simulation study was conducted to evaluate the performance of the p-DFTD strategy. The findings revealed that the conventional approach of assessing DIF with one grouping variable at a time was vulnerable to OVB, leading to an inflation of Type I error rates. In contrast, the p-DFTD strategy successfully identified DIF-free anchor items and effectively controlled Type I errors while maintaining satisfactory statistical power in most conditions. The empirical analysis further supported these findings, showing that the p-DFTD strategy provided more accurate and consistent DIF detection compared to methods that do not account for all grouping variables simultaneously. In conclusion, the p-DFTD strategy, which demonstrated a robust performance in this study, holds promise as a reliable approach for test developers to conduct exploratory DIF assessments involving multiple grouping variables, thereby ensuring fairness and validity in testing practices.
Enhancing DIF Detection in Cognitive Diagnostic Models: An Improved LCDM Approach Using TIMSS Data
Su-Pin Hung, Po-Hsi Chen, and Hung-Yu Huang
Accepted on: 17 December 2024
Abstract
Cognitive diagnostic models (CDMs) are being increasingly utilized in educational research, especially for analyzing large-scale international assessment datasets comparing skill mastery across countries and gender groups. However, establishing test invariance before making such comparisons is critical to ensure that differential item functioning (DIF) does not distort the results. Earlier research on DIF detection within CDMs has predominantly dealt with non-compensatory models, leaving several influential factors ambiguous. This study addresses these issues with the log-linear cognitive diagnosis model (LCDM; Henson et al., 2009), enhancing its practical applicability. The aim is to improve the LCDM-DIF method and to evaluate the efficacy of the purification procedure for non-parametric methods that use total scores as matching criteria, namely logistic regression (LR) and Mantel-Haenszel (MH). Factors examined include test length, percentage of DIF items, DIF magnitude, group distributions, and sample size. Using data from the Trends in International Mathematics and Science Study (TIMSS) and factoring in the nationality and gender of the participants, an empirical study gauges the performance of the proposed methods. The results reaffirm that the model-based method surpasses the MH and LR methods in controlling Type I errors and achieving higher power rates. Additionally, the LCDM-based approach offers broader insights into the results. The study discusses its value, potential applications, and future research areas, emphasizing the significance of tackling issues related to contaminated matching criteria in DIF detection within CDMs.
How Does Knowledge About Higher Education Develop at the End of High School: A Longitudinal Analysis of 11th and 12th Graders
Maria Veronica Santelices, Ximena Catalán, Magdalena Zarhi, Juan Acevedo, and Catherine Horn
Accepted on: 12 December 2024
Abstract
The information and guidance available to secondary education students are positively related to access to higher education. The level of information, however, has been described as low, especially for students of low socioeconomic status whose parents have not attended higher education. We explore how knowledge about higher education changes between 11th and 12th grade, identifying possible differences across socio-demographic groups and possible effects of school activities. We use a complex multidimensional measure to capture Knowledge and Perception of Knowledge about Higher Education. Results from our study show that students exhibit low levels of information in both knowledge sub-dimensions. Despite positive changes observed from 11th to 12th grade, the low level of information persists in the last year of high school.
Making Multiple Regression Narratives Accessible: The Affordances of Wright Maps
Alexander Mario Blum, James M. Mason, Aaryan Shah, and Sam Brondfield
Accepted on: 12 December 2024
Abstract
Wright Maps have been an important tool in promoting meaning making about measurement results for measurement experts and substantive researchers alike. However, their potential to do so for latent regression results is underexplored. In this paper, we augmented Wright Maps with hypothetical group mean locations corresponding to realistic scenarios of interest. We analyzed data from an instrument measuring cognitive load experienced by medical fellows and residents while performing inpatient consults. We focused on Extraneous load (EL; i.e., distraction) and variables potentially associated with distraction. Through collaborative examination of the Wright Map, we found not only regions corresponding to construct levels but also a region with important practical consequences, namely that the third threshold represented a critical level of cognitive load, which could impact patient care. We augmented the Wright Map with the locations of two typical scenarios differing only in the novelty of the consult, representing the lowest and highest levels of novelty, respectively. These group locations were plotted on the Wright Map approximately 1.5 logits apart, allowing for a kind of visual relative effect size, as this difference can be perceived relative to other features of the Wright Map. In this case, both scenarios were within the same band of the Wright Map, leading to the practical interpretation that, although EL was significantly increased, the risk of cognitive overload was not. However, because of the problematic nature of the third threshold, a 1.5 logit difference does not have the same practical consequences along the entire scale; other realistic scenarios with higher initial EL are possible, in which increased novelty could lead to cognitive overload. This area of visualization techniques, along with a combinatorial view of a multiple-regression analysis, could be helpful in other substantive and practical contexts, and with more complex regression models.
Examining Illusory Halo Effects Across Successive Writing Assessments: An Issue of Stability and Change
Thomas Eckes and Kuan-Yu Jin
Accepted on: 22 October 2024
Abstract
Halo effects are a common source of rating errors in assessments. When raters assign scores to examinees on multiple performance dimensions or criteria, they may fail to discriminate between them, lowering each criterion’s information value regarding an examinee’s performance. Using the mixture Rasch facets model for halo effects (MRFM-H), we studied halo tendencies in four successive high-stakes writing assessments administered to 15,677 examinees over 10 months and involving 162 raters. The MRFM-H allows illusory halo due to judgmental biases to be separated from true halo due to the actual overlap between the criteria. Applying this model, we aimed to detect illusory halo effects in the first exam and to track the effects’ occurrence across the other three exams. We also ran the standard Rasch facets model (RFM) and computed raw-score correlational and standard deviation halo indices, rH and SDH, for comparison purposes. The findings revealed that (a) the MRFM-H fit the rating data better than the RFM in all four exams, (b) in the first exam, 11 out of 100 raters exhibited illusory halo effects, (c) the halo raters showed evidence of both stability and change in their rating tendencies over exams, (d) the non-halo raters mostly remained stable, (e) the rH and SDH statistics did not distinguish between the halo and non-halo raters, and (f) the illusory halo effects had a small but demonstrable impact on examinee rank orderings, which may have consequences for selection decisions. The discussion focuses on the model’s practical implications for performance assessments, such as rater training, monitoring, and selection, and highlights future research perspectives.
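For reference, the standard Rasch facets model (RFM) that serves as the comparison baseline can be written in a common adjacent-category form; the notation is generic rather than the authors’, and the MRFM-H extends this framework with latent rater classes so that illusory halo can be separated from true halo.

    \log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k,

where \theta_n is the ability of examinee n, \delta_i is the difficulty of criterion i, \alpha_j is the severity of rater j, and \tau_k is the k-th rating category threshold.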
Semantic Factor Analysis using LLM Transformers: No Respondents Needed
Rense Lange, Haniza Yon, and Labiib Marzuki
Accepted on: 17 April 2025
Abstract
Two Large Language Model (LLM) based transformers, MiniLM and distilBERT, were used to compute the semantic similarity between the texts of the fifty items of Temple University’s Big Five personality questionnaire, without administering these items to any test-takers. Overall, the MiniLM transformer performed noticeably better than distilBERT. When using MiniLM, all factors were recovered correctly when five factors were extracted and rotated based on the similarity of the items’ text. When ten factors were extracted, the cluster memberships of only two of the five factors agreed perfectly with Temple’s item classification derived via standard factor analysis, with Agreeableness being the most problematic. Given the rapid developments in LLMs, we expect that such item misclassifications will disappear as transformers become increasingly powerful and accurate. These findings suggest that tools could be developed to anticipate the factor structure of new sets of items. The success of LLMs may have important implications for measurement in the social sciences in general, while providing a new perspective on factor analysis. In particular, the results align with Wittgenstein’s notion of language “games,” suggesting that the factor structure of psychological questionnaires represents shared linguistic practice rather than reality per se.
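As a minimal sketch of the general approach (not the authors’ pipeline), one could embed item texts with a publicly available MiniLM sentence transformer, build an item-by-item semantic similarity matrix, and factor-analyze it; the item texts, model name, and analysis choices below are illustrative assumptions.

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.decomposition import FactorAnalysis

    # Hypothetical Big Five-style item texts (placeholders, not the actual questionnaire)
    items = [
        "I am the life of the party.",
        "I sympathize with others' feelings.",
        "I get chores done right away.",
        "I have frequent mood swings.",
        "I have a vivid imagination.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")   # a publicly available MiniLM variant
    embeddings = model.encode(items)                  # one vector per item text
    similarity = cosine_similarity(embeddings)        # item-by-item semantic similarity

    # Exploratory factor analysis of the similarity matrix (the paper's exact
    # extraction and rotation choices may differ from this simplification)
    fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
    factor_scores = fa.fit_transform(similarity)      # per-item factor scores
    print(factor_scores.round(2))
    print(fa.components_.round(2))                    # loadings of the extracted factors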
Development and Validation Study of the Preschool Mathematics Education Practices Survey
Toni May, Kristin Koskey, and Kathleen Provinzano
Accepted on: 5 March 2025
Abstract
Preschool mathematics skills have been shown to predict students' success in later mathematics and other cognitive domains. Yet, studies of what preschool teachers teach in terms of mathematics in their classrooms are limited. When such research is conducted, observational methods are typically applied as effective self-report measures are not widely available for use with preschool teachers. This validation study provided multiple sources of validity evidence for the Preschool Mathematics Education Practices Survey (PS-MEPS), aligned with the Head Start Early Learning Outcomes Framework. Both qualitative and quantitative data sources from various interested parties and experts were employed in a rigorous design-based research process applying the Rasch model with findings used iteratively to inform the PS-MEPS development and validity evidence. Strong evidence for content, response processes, consequences of testing, and internal structure was found for using the PS-MEPS with diverse preschool teachers.
Many-Facet Rasch Designs: How Should Raters be Assigned to Examinees?
Christine DeMars, Yelisey A. Shapovalov, and John D. Hathcoat
Accepted on: 16 January 2025
Abstract
In many-facet Rasch measurement, raters should be connected, and there are multiple ways to connect raters. This simulation contains two studies. In one study, two raters scored each examinee. Each rater was either paired with many different raters a few times, or repeatedly paired with only two other raters. The standard errors of both rater severity and examinee ability were higher when raters scored one examinee in common with many different raters compared to when they scored many examinees in common with two raters. However, the differences were small, especially for the standard error of examinee ability. In the next study, most examinees were scored by a single rater. Linking was accomplished either by assigning all raters to score the same small linking set of examinees, or by assigning a subset of papers to pairs of raters, with each rater participating in multiple pairs to provide links. Slightly smaller standard errors were achieved when all raters scored a common linking set, compared to paired ratings, keeping the total number of ratings constant. Overall, the design makes little difference, especially when examinees are double-scored.
Detecting Unfavorable Experience in the Face of Digital Disruption: Using Rasch Measures in PLS-SEM Analysis
Nor Irvoni Mohd Ishar, Zi Yan, Zali Mohd Nor, T. Ramayah, Rosmimah Mohd Roslin, and Trevor G. Bond
Accepted on: 20 December 2024
Abstract
In a modern digitized world, it has become increasingly difficult to meet consumer demands in a crowded market space. Providing exceptional services and experiences to discerning customers has become the key to gaining a competitive advantage. The experience economy now dictates the path for businesses wherein the goal is customer satisfaction and consequential loyalty. Yet, there are always unsatisfactory service experiences, and therefore, service providers are forced to contemplate the reasons. Based on a survey of 350 telco subscribers in Malaysia, this study seeks to understand the impact of adverse experiences on consumer behavior in the telecommunications sector. To demonstrate a hybrid data analysis method, the items identified as belonging to each construct were assessed for their fit to the requirements of the Rasch model. Subsequently, Partial Least Squares Structural Equation Modelling (PLS-SEM) was used to assess the convergent and discriminant validity of the reflective model before evaluating the structural model. The main results of the study show that unfavorable service experiences have a significant impact on dissatisfied customers' response behavior. The present study's contribution lies in the application of the Rasch analysis method to establish the measurement properties of each of the scales used in the study before person measures were entered into the SEM analysis.
Differences in School Leaders’ and Teachers’ Perceptions on School Emphasis on Academic Success: An Exploratory Comparative Study
Sijia Zhang and Cheng Hua
Accepted on: 16 October 2024
Abstract
This quantitative study examined how principals and teachers from all participating countries and regions perceive school emphasis on academic success (SEAS) differently. Participants (N = 26,302) were all principals and teachers who completed the SEAS scale from PIRLS 2021. A second-order confirmatory factor analysis and a many-faceted Rasch analysis were used to investigate the psychometric properties of the SEAS and whether differences existed in school leaders’ and teachers’ perceptions of this construct within and across countries. Results from the factor analysis yielded a three-factor solution, and the SEAS scale demonstrated satisfactory psychometric properties. Rasch analysis indicated good model-data fit and acceptable item-level fit statistics. Future studies are encouraged to explore the psychometric properties of SEAS and how SEAS impacts other school-related variables and student outcomes. This study explored a new instrument to measure academic emphasis and compared leaders’ and teachers’ perceptions of SEAS in an international setting.
Analysis of Multidimensional Forced-Choice Items using Rasch Ipsative Models with ConQuest
Xuelan Qiu and Dan Cloney
Accepted on: 24 September 2024
Abstract
Multidimensional forced-choice (MFC) items have been widely used to assess career interests, values, and personality because they help prevent response biases. This tutorial first introduces the typical types of MFC items and the item response theory models used to analyze them. It then shows how to analyze dichotomously and polytomously scored MFC items with paired statements using Rasch ipsative models in the computer program ACER ConQuest. The assessment of differential statement functioning using ConQuest is also demonstrated.
Modeling the Effect of Reading Item Clarity on Item Discrimination
Paul Montuoro and Stephen Humphry
Accepted on: 16 August 2024
Abstract
The logistic measurement function (LMF) satisfies Rasch’s criteria for measurement while allowing for varying discrimination among sets of items. Previous research has shown how the LMF can be applied in test equating. This article demonstrates the advantages of dividing reading test items into three sets and subsequently applying the LMF instead of the Rasch model. The first objective is to examine the effect of item clarity and transparency on item discrimination using a new technique for dividing reading items into sets, referred to as an item clarity review. In this article, the technique is used to divide items in a reading test with different levels of discrimination into three sets. The second objective is to show that, where three such sets exist, the subsequent application of the LMF leads to improved item fit compared to the standard Rasch model and the subsequent retention of more items. The item sets were shown to have different between-set discrimination but relatively uniform within-set discrimination. The results show that, in this context, reading test item clarity and transparency affect item discrimination. These findings and other implications are discussed.
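A schematic form consistent with this description (not necessarily the authors’ exact parameterization) writes the LMF as a Rasch-type model in which all items in a set s share a common discrimination:

    P(X_{ni} = 1) = \frac{\exp[\rho_s(\theta_n - \delta_i)]}{1 + \exp[\rho_s(\theta_n - \delta_i)]},

where \rho_s is the discrimination of item set s, \theta_n is person ability, and \delta_i is item difficulty; when \rho_s = 1 for every set, the model reduces to the standard Rasch model.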
Psychometric Properties of the Statistical Anxiety Scale and the Current Statistics Self-Efficacy Using Rasch Analysis in a Sample of Community College Students
Samantha Estrada Aguilera, Emily Barena, and Erica Martinez
Accepted on: 2 February 2024
Abstract
Community college students have rarely been the focus of study within statistics education research. This study aims to examine the psychometric properties of two popular scales used within statistics education, the Current Statistics Self-Efficacy (CSSE) and the Statistical Anxiety Scale (SAS), focusing on a population of community college students. A survey was conducted with N = 161 community college students enrolled in an introductory statistics course. The unidimensional structure of the CSSE was confirmed using a confirmatory factor analysis (CFA), and after selecting the rating scale model approach, we found no misfitting items and good reliability. Concurrent and discriminant validity were examined using the SAS. The SAS three-factor structure was also assessed by examining item fit. One item in the SAS subscale Fear of Asking for Help was flagged as misfitting. Overall, both the CSSE and SAS demonstrated sound psychometric properties when used with a population of community college students.
Using Explanatory Item Response Models to Evaluate Surveys
Jing Li and George Engelhard
Accepted on: 30 January 2024
Abstract
This study evaluates the psychometric quality of surveys by using explanatory item response models. The specific focus is on how item properties can be used to improve the meaning and usefulness of survey results. The study uses a food insecurity survey (HFSSM) as a case study, examining 500 households with data collected between 2012 and 2014 in the United States. Eleven items from the HFSSM are classified in terms of two item properties: referent (household, adult, and child) and content (worry, ate less, cut meal size, hungry, and did not eat for a whole day). A set of explanatory linear logistic Rasch models is used to explore the relationships between these item properties and item locations on the food insecurity scale. The results suggest that both the referent and the item content are significant predictors of item location on the food insecurity scale. The findings demonstrate that explanatory item response models are a promising method for examining the psychometric quality of surveys: they can enhance the meaning and usefulness of survey results by providing insight into the relationship between item properties and survey responses. This approach can help researchers improve the psychometric quality of surveys, ensure that surveys measure what they intend to measure, and lead to better-informed policy decisions and interventions aimed at tackling social issues such as food insecurity, poverty, and inequality.
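In a linear logistic Rasch model of the kind described here, item difficulty is decomposed into effects of the item properties; the form below is a schematic illustration, and the paper’s exact specification may differ.

    \log\frac{P_{ni}}{1 - P_{ni}} = \theta_n - \sum_{k} q_{ik}\,\eta_k,

where \theta_n is the latent food-insecurity level of household n, q_{ik} indicates whether item i carries property k (for example, a child referent or “ate less” content), and \eta_k is the estimated effect of property k on item location.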
Examining Equivalence of Three Versions of Mathematics Tests in China’s National College Admission Examination Using a Single Group Design
Chunlian Jiang, Stella Yun Kim, Chuang Wang, and Jincai Wang
Accepted on: 12 September 2023
Abstract
The National College Admission Examination, also known as Gaokao, is the most competitive examination in China because students’ scores on the Gaokao are used as the only criterion to screen applicants for college admission. Chinese students’ Gaokao scores are also accepted by many universities around the world. A one-syllabus-multiple-tests practice has been implemented since 1985, but little has been explored as to the extent to which the multiple tests are equivalent. This study examines the equivalence of three versions of the Gaokao mathematics tests and illustrates the methodological procedure using a single group design with an item response theory (IRT) approach. The results indicated that the three versions were comparable in terms of content coverage; however, most items were found to be easy for the students, so the inclusion of more challenging items is suggested to distinguish students with average and high mathematics competencies. Some differences were also noted in the differential item functioning analysis and the factor structure.
Examining the Psychometric Properties of the Spanish Version Life Orientation Test-Revised (LOT-R): Finding Multi-dimensionality and Issues with Translation for Reverse-Scored Items
Rosalba Hernandez, Kendon J. Conrad, John Ruiz, Melissa Flores, Judith T. Moskowitz, Linda C. Gallo, Erin L. Merz, Frank J. Penedo, Ramon A. Durazo-Arvizu, Angelica P. Gutierrez, Jinsong Chen, and Martha L. Daviglus
Accepted on: 18 September 2023
Abstract
The Life Orientation Test-Revised (LOT-R) is the most widely used instrument for assessing dispositional optimism. We examined the psychometrics of the LOT-R in a diverse sample of U.S. Hispanics/Latinos. Data analysis included 5,140 adults aged 18–74 in the Hispanic Community Health Study/Study of Latinos and Sociocultural Ancillary Study. We employed the Rasch measurement model using Winsteps software. The Rasch person reliability for the 6-item LOT-R and Cronbach’s alpha both had values of 0.54. When testing convergent validity, correlations were statistically significant but small to medium in magnitude. Neither the ratio of the percentage of variance explained by the measures to that explained in the first contrast nor the correlation of the subscales met the expected unidimensionality criterion. The item “I hardly expect things to go my way” displayed differential item functioning by language (Spanish vs. English), and reverse-scored items were found to be problematic. Use of the LOT-R in its present form in U.S. Hispanics/Latinos is unsupported by psychometric evidence.
Validity and Test-Length Reduction Strategies for Complex Assessments
Lance M. Kruse, Gregory E. Stone, Toni A. May, and Jonathan D. Bostic
Accepted on: 12 April 2023
Abstract
Lengthy standardized assessments decrease instructional time while increasing concerns about student cognitive fatigue. This study presents a methodological approach for item reduction within a complex assessment setting using the Problem Solving Measure for Grade 6 (PSM6). Five item-reduction methods were used to reduce the number of items on the PSM6, and each shortened instrument was evaluated through validity evidence for test content, internal structure, and relationships to other variables. The two quantitative methods (Rasch model and point-biserial) resulted in the best psychometrically performing shortened assessments but were not representative of all content subdomains, while the three qualitative (content preservation) methods resulted in psychometrically weaker assessments that retained all subdomains. Specifically, the ten-item Rasch and ten-item point-biserial shortened tests demonstrated the strongest overall validity evidence, but future research is needed to explore the psychometric performance of these versions in a new independent sample and the necessity of subdomain representation. The study provides a methodological framework that researchers can use to reduce the length of existing instruments while identifying how the various reduction strategies may sacrifice different information from the original instrument. Practitioners are encouraged to carefully examine the extent to which their reduced instrument aligns with their pre-determined criteria.
Impact of Violation of Equal Item Discrimination on Rasch Calibrations
Chunyan Liu, Wenli Ouyang, and Raja Subhiyah
Accepted on: 8 March 2023
Abstract
The Rasch model, a widely used item response theory (IRT) model, assumes equal discrimination across all items when estimating item difficulties and examinee proficiencies. However, the impact of item misfit due to violations of the equal item discrimination assumption on Rasch calibrations remains somewhat unclear. In this simulation study, we assess the effects of balanced and systematic variation in item discrimination on Rasch difficulty estimates and Rasch model fit statistics. Our findings suggest that item misfit due to unequal item discrimination can negatively impact item difficulty estimates and INFIT/OUTFIT statistics for both misfitting and well-fitting items. Test developers may find our results useful for improving the accuracy of item difficulty estimates and, ultimately, of the estimated examinee proficiencies.
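The contrast at issue can be stated schematically: Rasch calibration assumes a single, common discrimination, whereas the generating data may follow a two-parameter logistic (2PL) model with item-specific discriminations; the notation below is generic rather than the authors’.

    \text{Rasch: } P_i(\theta) = \frac{\exp(\theta - \delta_i)}{1 + \exp(\theta - \delta_i)}, \qquad \text{2PL: } P_i(\theta) = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}.

When the a_i vary across items but a Rasch model is fit, items with discriminations far from the average tend to misfit, and their difficulty estimates absorb part of the discrepancy.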
A Rasch Analysis of the Mindful Attention Awareness Scale—Children
Daniel Strissel and Julia A. Ogg
Accepted on: 19 December 2022
Abstract
The Mindful Attention Awareness Scale—Children (MAAS-C) was developed using traditional psychometric methods to measure dispositional mindfulness in children. The MAAS-C is based on the MAAS, a highly cited mindfulness scale for adults. This study extended that effort by applying the Rasch model to investigate the psychometric properties of the MAAS-C. Evidence from Rasch analyses conducted on the MAAS suggested local dependence of items and the need for modifications, including a rescoring algorithm. We aimed to examine how the MAAS-C performed when evaluated with Rasch analysis using a sample of 406 fifth- and sixth-grade children. All 15 items on the MAAS-C worked in the same direction; the fit statistics fell within a range suitable for productive measurement; a principal components analysis of residuals (PCAR) revealed an unpatterned distribution; and DIF was not found for any of the grouping variables. However, the items were neither evenly distributed nor well targeted for children with the highest levels of dispositional mindfulness. We demonstrated that the precision and item functioning of the MAAS-C could be improved by uniformly rescoring the response categories. Once the scale is rescored, the provided ordinal-to-interval conversion table (see Table 5) can be used to optimize scoring on the MAAS-C.
Psychometric Assessment of an Adapted Social Provisions Survey in a Sample of Adults with Prediabetes
Kathryn E. Wilson, Tzeyu L. Michaud, Cynthia Castro Sweet, Jeffrey A. Katula, Fabio A. Almeida, Fabiana A. Brito, Robert Schwab, and Paul A. Estabrooks
Accepted on: 17 November 2022
Abstract
The relevance of social support for weight management is not well documented in people with prediabetes. An important consideration is the adequate assessment of social provisions related to weight management. We aimed to assess the factor structure and measurement invariance of an adapted Social Provisions Scale specific to weight management (SPS-WM). Participants of a diabetes prevention trial (n = 599) completed a demographic survey and the SPS-WM. Confirmatory analyses tested the factor structure of the SPS-WM, and measurement invariance was assessed across gender, weight status, education level, and age. Removal of two items resulted in acceptable model fit, supporting six correlated factors for social provisions specific to weight management. Measurement invariance was supported across all subgroups. Results support score interpretations for these scales reflecting distinct components of social support specific to weight management, in alignment with those of the original survey.
Exploring the Impact of Open-Book Assessment on the Precision of Test-Taker and Item Estimates Using an Online Medical Knowledge Assessment
Stefanie A. Wind, Cheng Hua, Stefanie S. Sebok-Syer
Accepted on: 7 November 2022
Abstract
Researchers concerned with the psychometric quality of assessments often examine indicators of model-data fit to inform the interpretation and use of parameter estimates for items and persons. Fit assessment techniques generally focus on overall model fit (i.e., global fit), as well as fit indicators for individual items and persons. In this study, we demonstrate how one can also use item-level information from individual responses (e.g., the use of outside assistance) to explore the impact of such behaviors on model-data fit. This study’s purpose was to use data from an open-book format assessment to examine the impact of examinee help-seeking behavior on the psychometric quality of item and person estimates. Open-book assessment formats, where test-takers are permitted to consult outside resources while completing assessments, have become increasingly common across a wide range of disciplines and contexts. With our analysis, we illustrated an approach for evaluating model-data fit by combining residual analyses with test-takers’ self-reported information about their test-taking behavior. Our results indicated that the use of outside resources impacted the pattern of expected and unexpected responses differently for individual test-takers and individual items. Analysts can use these techniques across a variety of assessment contexts where information about test-taker behavior is available to inform assessment development, interpretation, and use, including evaluating psychometric properties following pilot testing, maintaining item banks, evaluating results from operational exam administrations, and making decisions about assessment interpretation and use.
Seeing Skill: Heuristic Reasoning when Forecasting Outcomes in Sports
Veronica U. Weser, Karen M. Schmidt, Gerald L. Clore, & Dennis R. Proffitt
Accepted on: 2 October 2022
Abstract
Expert advantage in making judgments about the outcomes of sporting events is well documented. It is not known, however, whether experts have an advantage in the absence of objective information, such as the current score or the relative skill of players. Participants viewed 5-second clips of modern Olympic fencing matches and selected the likely winners. Participants’ predictions were compared with the actual winners to compute accuracy. Study 1 revealed small but significant differences between the accuracy of experts and novices, but it was not clear what fencing behaviors informed participants’ judgments. Rasch modeling was used to select stimuli for Study 2, in which fencing-naïve participants rated the gracefulness, competitiveness, and confidence of competitors before selecting winners. With the stimulus set developed via Rasch modeling, fencing-naïve participants were able to identify winners at above-chance rates. The results further indicate that, in the absence of concrete information, competitiveness and confidence may be used as heuristics for the selection of winning athletes.
The Effects of Textual Borrowing Training on Rater Accuracy When Scoring Students’ Responses to an Integrated Writing Task
Kevin R. Raczynski, Jue Wang, George Engelhard, Jr., and Allan S. Cohen
Accepted on: 23 September 2022
Abstract
Integrated writing (IW) tasks require students to incorporate information from provided source material into their written responses. While the growth of IW tasks has outpaced research on scoring challenges that raters working in this assessment context experience, several researchers have reported that raters in their studies struggled to agree about whether students demonstrated successful integration of source material in their written responses. One suggestion offered for meeting this challenge was to provide rater training on textual borrowing, a topic not covered in all training programs. We randomly assigned 17 middle school teachers to two training conditions to determine whether teachers who completed an augmented training protocol specific to textual borrowing would score students’ responses to an IW task more accurately than teachers who completed a comparison training protocol that did not include instruction on textual borrowing. After training, all teachers scored the same set of 30 benchmark essays. We compared the teachers’ scores to professional raters’ scores, dichotomized based on whether the scores matched, and then analyzed the resulting data using a FACETS model for accuracy. As a group, the teachers who completed the augmented training scored more accurately than the comparison group. Policy implications for scoring rubric design and rater training are discussed.
The Development and Validation of a Thinking Skills Assessment for Students with Disability Using Rasch Measurement Approaches
Toshiko Kamei and Masa Pavlovic
Accepted on: 23 September 2022
Abstract
21st century skills such as thinking are gaining prominence in curricula internationally as fundamental to thriving and learning in an evolving global environment. This study developed and investigated the validity of a thinking skills assessment based on a learning progression for students with disability. It followed an established method (Griffin, 2007) previously used to develop assessments based on learning progressions of foundational learning skills for students with disability. An initial review of research and a co-design process with teachers with expertise in teaching students with disability were used to develop a set of assessment items based on a hypothetical criterion-referenced framework of thinking skills. This was followed by empirical exploration of the developed thinking skills assessment through a Rasch partial credit model (Masters, 1982) analysis using student assessment data from a field trial of the thinking skills assessment items involving 864 students. Subject matter expert (SME) review, person and item fit statistics, and reliability coefficients provided evidence to support the validity of the assessment for its intended purpose and supported arguments for a single underlying construct. A thinking skills assessment based on a learning progression for school-age students with disability was derived, drawing on teacher interpretation and calibration of student assessment data. The resulting assessment provides a practical tool that teachers can apply in the classroom to implement the teaching and learning of thinking skills for a cohort and level of learning not previously targeted.
Career Advancement Inventory: Assessing Decent Work among Individuals with Psychiatric Disabilities
Uma Chandrika Millner, Sarah A. Satgunam, James Green, Tracy Woods, Richard Love, Amanda Nutton, and Larry Ludlow
Accepted on: 14 April 2022
Abstract
Comprehensive assessments of the outcomes of vocational programs and interventions are necessary to ameliorate the significant employment disparities among individuals with psychiatric disabilities. In particular, measuring the attainment of decent work is critical for assessing their vocational outcomes. In the absence of existing vocational instruments that assess progress towards decent work among individuals with psychiatric disabilities, we developed the Career Advancement Inventory (CAI). The CAI was theoretically grounded in the Career Pathways Framework (CPF) and a review of focus group data and existing literature, and was constructed using an iterative scale development approach and a combination of classical test theory and item response theory principles, specifically Rasch modeling. The CAI includes five subscales: Self-Efficacy, Environmental Awareness, Work Motivation, Vocational Identity, and Career Adaptabilities. Rasch analyses indicated mixed results: some items in the subscales mapped onto the hierarchical, stage-like progression proposed by the CPF, while others did not. The results support the construct validity of the subscales, with the exception of Work Motivation, and contribute to the expansion of the theoretical propositions of the CPF. The CAI has the potential to be an effective career assessment for individuals with psychiatric disabilities and has implications for vocational psychology and vocational rehabilitation.
Extended Rater Representations in the Many-Facet Rasch Model
Mark Elliott, Paula J. Buttery
Accepted on: 14 March 2022
Abstract
Many-Facet Rasch Models (Eckes, 2009, 2015; Engelhard & Wind, 2017; Linacre, 1994) provide a framework for measuring rater effects for examiner-scored assessments, even under sparse data designs. However, the representation of a rater as a global scalar measure involves an assumption of uniformity of severity across the range of rating scales and criteria within each scale. We introduce extended rater representations of vectors or matrices of local measures relating to individual rating scales and criteria. We contrast these extended representations with previous work on local rater effects (Myford & Wolfe, 2003) and discuss issues related to their application, for raters and other facets. We conduct a simulation study to evaluate the models, using an extension of the CPAT algorithm (Elliott & Buttery, 2021). We conclude that extended representations more naturally and completely reflect the role of the rater within the assessment process and provide greater inferential power than the traditional global measure of severity. Extended representations also have applicability to other facets which may have non-uniform effects across items and thresholds.
Tracing Morals: Reconstructing the Moral Foundations Questionnaire in New Zealand and Sweden Using Mokken Scale Analysis and Optimal Scaling Procedure
Erik Forsberg, Anders Sjöberg
Accepted on: 21 March 2022
Abstract updated on: 8 April 2022
Abstract
The Moral Foundations Questionnaire, consisting of the Relevance subscale and the Judgment subscale, was constructed within the framework of classical test theory for the purpose of measuring five moral foundations. However, so far, no study has investigated the latent properties of the questionnaire. Two independent samples, one from the New Zealand Attitudes and Values Study (N = 3989) and one nationally representative sample from Sweden (N = 1004), were analysed using Mokken scale analysis and an optimal scaling procedure. The results indicate strong shared effects across both samples. Foremost, the Moral Foundations Questionnaire holds two latent trait dimensions, corresponding to the theoretical partitioning between Individualizing and Binding foundations. However, while the Relevance subscale was, overall, reliable in ordering respondents on level of ability, the Judgment subscale was not. Moreover, the dimensionality analysis showed that the Relevance subscale carries three cross-cultural homogeneity outlier items (items for loyalty and disorder concerns) in both samples. Lastly, while the test for local independence indicated adequate fit for the Individualizing trait dimension, the Binding dimension was theoretically ambiguous. Suggestions for improvements and future directions are discussed.
Measuring the Complexity of Equity-Centered Teaching Practice: Development and Validation of a Rasch/Guttman Scenario Scale
Wen-Chia C. Chang
Accepted on: 4 March 2022
Abstract
The TEES Scale (full name blinded for review) was developed to measure the complexity of teaching practice for equity by integrating Rasch measurement and Guttman facet theory. This paper extends the existing work to develop and validate an efficient, short-form TEES Scale that can be used for research and evaluation purposes. The Rasch rating scale model is used to analyze the responses of 354 teachers across the United States. Validity evidence, which addresses data/theory alignment, item and person fit, rating scale functioning, dimensionality, generalizability, and relations to external variables, is examined to support the adequacy and appropriateness of the proposed score interpretations and uses. The short-form TEES Scale functions well to measure teaching practice for equity and provides evidence for research or evaluation studies on whether and to what extent teachers or candidates learn to enact equity-centered practice. Limitations and future directions of the scale are discussed.
Using An Exploratory Quantitative Text Analysis (EQTA) to Synthesize Research Articles
Cheng Hua, Catanya Stager, Stefanie A. Wind
Accepted on: 4 January 2022
Abstract
An Exploratory Quantitative Text Analysis (EQTA) method was proposed to synthesize large sets of scholarly publications and to examine thematic characteristics in the Journal of Applied Measurement (JAM). After synthesizing 578 articles published in JAM from 2000 to 2020, the authors classified each article into five categories and compared differences across three phases: (1) word frequency analysis from the EQTA; (2) descriptive analysis of trends in research articles and classification counts; and (3) thematic analysis of word frequency between article classifications. We found that (1) the most frequently used words are Item, Rasch model, and Measure; (2) most articles’ authors are from North America (380/578; 65.74%), followed by Europe (68/578; 11.76%) and other countries (130/578; 22.5%); (3) the largest share of articles focuses on model comparisons (77/578; 13%), followed by methodological developments (69/578; 12%) and reviews/other (43/578; 7%); and (4) differences in classifications between application and methodology are displayed using pyramid plots. The EQTA revealed deeper insight into the nature of JAM publications, including common topics and areas of emphasis, and is worth recommending for future relevant research, as it is not limited to JAM.
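A minimal sketch of the word-frequency step described above (not the authors’ EQTA code): count terms across a corpus of abstracts and report the most frequent ones. The example abstracts and parameter choices are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical stand-ins for the JAM corpus
    abstracts = [
        "The Rasch model was used to examine item fit and person reliability.",
        "A many-facet Rasch analysis of rater severity in a writing assessment.",
        "Differential item functioning was assessed with the Mantel-Haenszel method.",
    ]

    vectorizer = CountVectorizer(stop_words="english")    # drop common function words
    counts = vectorizer.fit_transform(abstracts)          # document-by-term count matrix
    totals = counts.sum(axis=0).A1                        # total frequency of each term
    terms = vectorizer.get_feature_names_out()

    # Print the ten most frequent terms across the corpus
    for term, freq in sorted(zip(terms, totals), key=lambda pair: -pair[1])[:10]:
        print(f"{term}: {freq}")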
Examining Rating Designs with Cross-Classification Multilevel Rasch Models
Jue Wang, Zhenqiu Lu, George Engelhard Jr., Allan S. Cohen
Accepted on: 23 November 2021
Abstract
The scoring of rater-mediated assessments relies largely on human raters, and their ratings empirically reflect student proficiency in a specific skill. Incomplete rating designs are common in operational scoring procedures because raters do not typically score all student performances. Cross-classification mixed-effects models can be used to examine data with a complex structure. By incorporating Rasch measurement models into the multilevel framework, the cross-classification multilevel Rasch model (CCM-RM) can examine both students and raters on a single latent continuum and also estimate random effects for higher-level units. In addition, the CCM-RM provides flexibility for modeling characteristics of raters and features of student performances. This study investigates, through a simulation study, the effect of different rating designs on the estimation accuracy of the CCM-RM, with consideration of sample sizes and rater variance. We also illustrate the use of the CCM-RM for evaluating rater accuracy under different rating designs with data from a statewide writing assessment.
Effects of Item Misfit on Proficiency Estimates Under the Rasch Model
Chunyan Liu, Peter Baldwin, Raja Subhiyah
Accepted on: 12 November 2021
Abstract
When IRT parameter estimates are used to make inferences about examinee performance, assessment of model-data fit is an important consideration. Although many studies have examined the effects of violating IRT model assumptions, relatively few have focused on the effects of violating the equal discrimination assumption on examinee proficiency estimation conditional on true proficiency under the Rasch model. The findings of this simulation study suggest that systematic item misfit due to violating this assumption can have noticeable effects on proficiency estimation, especially for candidates with relatively high or low proficiency. We also consider the consequences of misfit for examinee classification and show that while the effects on overall classification (e.g., pass/fail) rates are generally very small, false-negative and false-positive rates can still be affected in important ways.
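A minimal, self-contained sketch of the kind of check the abstract describes (hypothetical, not the authors’ simulation design): generate responses from a 2PL model with unequal discriminations, estimate proficiency with a Rasch likelihood that treats the item difficulties as known, and inspect bias conditional on true proficiency.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    n_persons, n_items = 2000, 40
    theta = rng.normal(0, 1, n_persons)                    # true proficiencies
    b = rng.normal(0, 1, n_items)                          # item difficulties
    a = rng.lognormal(mean=0.0, sigma=0.3, size=n_items)   # unequal discriminations

    # Generate dichotomous responses from the 2PL model
    p = 1 / (1 + np.exp(-a * (theta[:, None] - b[None, :])))
    x = rng.binomial(1, p)

    def rasch_theta_mle(resp, diff):
        """Maximum-likelihood proficiency under the Rasch model, difficulties treated as known."""
        def negll(th):
            pr = 1 / (1 + np.exp(-(th - diff)))
            return -np.sum(resp * np.log(pr) + (1 - resp) * np.log(1 - pr))
        return minimize_scalar(negll, bounds=(-5, 5), method="bounded").x

    theta_hat = np.array([rasch_theta_mle(x[i], b) for i in range(n_persons)])
    bias = theta_hat - theta

    # Conditional bias in three regions of true proficiency (the abstract reports
    # larger effects for relatively high or low proficiency)
    for lo, hi in [(-3.0, -1.5), (-0.5, 0.5), (1.5, 3.0)]:
        mask = (theta >= lo) & (theta < hi)
        print(f"theta in [{lo}, {hi}): mean bias = {bias[mask].mean():.3f}")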