Abstract
The STAMP 4S and STAMP WS assessments, part of the STAMP (Standards-Based Measurement of Proficiency) family, include Writing and Speaking sections. Reliable and accurate scores are crucial for validating the intended uses of these tests.
This paper presents the results of a recent analysis of ratings in the Writing and Speaking sections for five STAMP 4S languages (Arabic, Spanish, French, Simplified Chinese, and Russian) and three STAMP WS languages (Amharic, Haitian Creole, and Vietnamese).
The analysis, which included over 23,000 responses, shows high scoring accuracy and reliability for both the Writing and Speaking sections, strongly supporting the validity of these scores for their intended purposes.
The Writing and Speaking Sections of STAMP
The STAMP family of tests assesses real-world language skills.
STAMP 4S evaluates four language skills, is accredited by the American Council on Education (ACE), and is currently available in 15 languages.
STAMP WS, also ACE-accredited, tests Writing and Speaking skills and is available in 37 languages.
Two key factors in validating a test’s results are reliability and accuracy. This paper examines the reliability and accuracy of ratings in the Writing and Speaking sections of STAMP, which are scored by trained raters on a scale from 0 (No Proficiency) to 8 (Advanced-Mid).
In the Writing and Speaking sections, examinees respond to three real-world prompts, aiming to showcase their language skills. Each response is scored independently by certified raters who undergo rigorous training and ongoing monitoring to ensure consistency and quality.
Typically, 80% of responses are rated by a single rater, whose score becomes official. In 20% of cases, at least two raters score a response, with a manager stepping in if there’s a disagreement. Ratings are done independently, without any knowledge of other responses or scores, ensuring unbiased results.
An examinee’s final score for Writing or Speaking is based on the highest level they can consistently demonstrate across the three prompts.
As shown in Figure 1, the official STAMP level is the highest level demonstrated in at least two of the three responses. For example, if an examinee receives Novice-Mid for their first response, Novice-High for their second, and Novice-High for their third, their final level is STAMP 3 (Novice-High). Alternatively, if they receive Intermediate-Low for the first response, Novice-High for the second, and Intermediate-Mid for the third, their final level is Intermediate-Low, since that is the highest level reached in at least two responses (the first and third responses are both at Intermediate-Low or higher).
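To make the rule concrete, below is a minimal sketch of how this best-two-of-three computation could be expressed in code, assuming each response has already been rated on the integer STAMP scale (0–8); the function name and structure are illustrative, not Avant’s actual scoring code.

```python
# Illustrative sketch of the best-two-of-three rule described above.
# Assumes each response has been rated on the integer STAMP scale (0-8).

def final_stamp_level(ratings: list[int]) -> int:
    """Return the highest STAMP level sustained in at least two of three responses."""
    if len(ratings) != 3:
        raise ValueError("Expected exactly three response-level ratings.")
    # The highest level supported by at least two ratings is the second-highest
    # (equivalently, the median) of the three ratings.
    return sorted(ratings, reverse=True)[1]

# Examples from the text:
print(final_stamp_level([2, 3, 3]))  # Novice-Mid, Novice-High, Novice-High -> 3 (Novice-High)
print(final_stamp_level([4, 3, 5]))  # Int-Low, Novice-High, Int-Mid       -> 4 (Intermediate-Low)
```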
Using three independent prompts in both the Writing and Speaking sections of STAMP has two main benefits:
- Broader Topic Coverage: Assessing examinees across different topics ensures that the awarded proficiency level is more likely to generalize to other real-world situations.
- Minimizing Rater Bias: Coupled with the scoring method, using multiple prompts helps reduce potential rating bias from individual raters.
Next, we will discuss the definitions of reliability and accuracy.
Reliability
Reliability refers to the consistency of measurement (Bachman & Palmer, 1996). In simple terms, it is how much we can trust that the test scores will remain the same if an examinee takes the test again at different times or takes different versions of the test, assuming their proficiency has not changed.
For example, if an examinee scores Intermediate-Low today and Intermediate-High tomorrow, without any change in their knowledge or mental state, it suggests the test may not be highly reliable. Similarly, if an examinee scores Advanced-Low on one version of a test and Intermediate-Mid on another, it indicates a lack of consistency, pointing to an issue with the test’s reliability.
One factor contributing to a test’s reliability is how it is scored. In the STAMP test, the Reading and Listening sections are made up of multiple-choice questions that are scored automatically by a computer. This ensures that if an examinee provides the same answers on different occasions, they will always receive the same score.
However, the Writing and Speaking sections are scored by human raters. This means that scores can vary depending on who rates the response. With well-trained raters, we expect score variations to be minimal, reducing the impact of leniency, strictness, or potential bias.
Accuracy
Examinees expect their scores to reflect only their proficiency in the construct being measured (in STAMP, proficiency in each language domain).
Accuracy refers to how well the awarded score represents an examinee’s true ability. For example, if an examinee submits a Speaking response at the Intermediate-High level but receives an Intermediate-Low score from two raters, the score is inaccurate. Even if two other raters assign Intermediate-Low two months later, the score remains inaccurate, although it is reliable (since it is consistent across raters and over time).
Figure 2 illustrates the difference between reliability and accuracy. Ideally, tests should be both reliable and accurate, as this ensures the validity of the scores and their intended use.
Statistics Commonly Used to Evaluate the Reliability and Accuracy of Scores by Raters
When responses are scored by human raters, as in the case of STAMP, it’s crucial to ensure that scores reflect the quality of the response itself, not the characteristics of the rater. In other words, scores should depend solely on the examinee’s demonstrated proficiency, not on rater leniency, strictness, or bias.
Language test providers often use statistics to show how much scores may vary based on the rater. Typically, this involves comparing ratings from two separate raters on the same response. Ideally, raters should agree as often as possible, which indicates a reliable scoring process.
However, reliability must also be accompanied by accuracy. Two raters may assign the same score, but both could be incorrect. In a well-developed test, the goal is for raters to consistently agree and be accurate in their scoring.
Perfect agreement between human raters is not always realistic. Despite training and expertise, even qualified raters may disagree at times—just like doctors, engineers, or scientists. The aim is to achieve high agreement that is defensible given the intended use of the scores.
Below are the statistical measures we use at Avant Assessment to evaluate the quality of ratings provided by our raters. While many companies report only exact and adjacent agreement, we assess additional measures to get a comprehensive view of rating quality. The measures reported in this paper include:
Exact Agreement:
This measure is reported as a percentage, showing how often Rater 1 and Rater 2 assigned exactly the same level to a response across the entire dataset analyzed. For example, if Rater 1 awards a STAMP level 5 to a response and Rater 2 also awards a STAMP level 5 to that same response, that counts as an instance of exact agreement. Feldt and Brennan (1989) suggest that when two raters are used, exact agreement should be at least 80%, with 70% considered acceptable for operational use.
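As an illustration only (not Avant’s production code), exact agreement can be computed from two parallel lists of STAMP levels as follows:

```python
def exact_agreement(rater1: list[int], rater2: list[int]) -> float:
    """Percentage of responses to which both raters assigned the same STAMP level."""
    assert len(rater1) == len(rater2) and rater1, "Need two equal-length, non-empty score lists."
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return 100 * matches / len(rater1)

# Example: exact_agreement([5, 3, 7], [5, 4, 7]) is about 66.7 (two of the three pairs match exactly).
```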
Exact + Adjacent Agreement:
This measure is reported as a percentage showing how often Rater 1 and Rater 2 assigned either the same level or an adjacent level to a response across the entire dataset.
For example, STAMP level 5 is adjacent to level 4 and level 6. If Rater 1 assigns level 4 and Rater 2 assigns level 5, it counts towards this measure because the levels are adjacent. According to Graham et al. (2012), when a rating scale has more than 5-7 levels, as with the STAMP scale, the exact + adjacent agreement should be close to 90%.
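A corresponding sketch for exact + adjacent agreement simply relaxes the comparison so that a difference of one level also counts; again, the function is illustrative rather than Avant’s actual implementation.

```python
def exact_plus_adjacent_agreement(rater1: list[int], rater2: list[int]) -> float:
    """Percentage of responses on which the two ratings differ by at most one STAMP level."""
    assert len(rater1) == len(rater2) and rater1, "Need two equal-length, non-empty score lists."
    close = sum(abs(a - b) <= 1 for a, b in zip(rater1, rater2))
    return 100 * close / len(rater1)

# Example: exact_plus_adjacent_agreement([4, 3, 7], [5, 3, 5]) is about 66.7
# (4 vs. 5 is adjacent, 3 vs. 3 is exact, 7 vs. 5 is neither).
```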
Quadratic Weighted Kappa (QWK)
Cohen’s kappa (κ) measures reliability between two raters while accounting for the possibility of agreement by chance. For example, on the 9-point STAMP scale (level 0 to level 8), two raters assigning levels at random would still agree about 11.11% of the time (1 in 9). At Avant, we also apply quadratic weights when calculating kappa, so larger discrepancies between scores receive higher penalties: a difference between STAMP level 3 and level 7, for instance, is more problematic than a difference between level 3 and level 4.
Williamson et al. (2012) recommend that quadratically weighted kappa (QWK) should be ≥ 0.70, while Fleiss et al. (2003) note that values above 0.75 indicate excellent agreement beyond chance. A QWK value of 0 means agreement is purely by chance, whereas a value of 1 indicates perfect agreement.
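As a sketch of how QWK can be computed in practice, the example below uses scikit-learn’s implementation of Cohen’s kappa with quadratic weights; fixing the label set to 0–8 is our assumption, so that the full STAMP scale is represented even when some levels do not occur in a given sample.

```python
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(rater1: list[int], rater2: list[int]) -> float:
    """Cohen's kappa with quadratic weights over the 9-point STAMP scale (0-8)."""
    # Quadratic weights penalize a 3-vs-7 disagreement far more than a 3-vs-4 one,
    # because the penalty grows with the squared distance between the two levels.
    return cohen_kappa_score(rater1, rater2, labels=list(range(9)), weights="quadratic")
```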
Standardized Mean Difference (SMD)
This measure shows how similarly two raters use a rating scale. It compares the difference in the mean of two sets of scores (Rater 1 vs. Rater 2), standardized by the pooled standard deviation of those scores. Ideally, neither rater should favor or avoid certain levels on the scale (e.g., avoiding STAMP 0 or STAMP 8). In other words, both raters should use the full range of the scale (STAMP 0 – STAMP 8), with scores reflecting the proficiency demonstrated in the response. The recommended value for this measure is ≤ 0.15 (Williamson et al., 2012), indicating that the distributions of both sets of scores are acceptably similar.
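The sketch below shows one common formulation of the standardized mean difference, pooling the two raters’ variances by simple averaging; the exact pooling formula used operationally is an assumption on our part.

```python
import statistics

def standardized_mean_difference(rater1: list[int], rater2: list[int]) -> float:
    """Difference between the raters' mean scores, divided by a pooled standard deviation."""
    mean_diff = statistics.mean(rater1) - statistics.mean(rater2)
    pooled_sd = ((statistics.variance(rater1) + statistics.variance(rater2)) / 2) ** 0.5
    return mean_diff / pooled_sd

# |SMD| <= 0.15 suggests the two raters are using the STAMP scale similarly on average.
```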
Spearman’s Rank-Order Correlation (ρ)
This measure indicates the strength of the association between two variables: the STAMP level assigned by Rater 1 and the level assigned by Rater 2. If raters are well-trained and understand the rating rubric, we expect both raters to assign similar levels—meaning the scores should move together. In other words, when Rater 1 assigns a high level, Rater 2 should also assign a high level, reflecting consistent evaluation of the same construct.
We use Spearman’s rank-order correlation coefficient instead of Pearson’s because Spearman’s is better suited for ordinal data, like STAMP proficiency levels. A correlation coefficient of 0.80 or above is considered strong in most fields (Akoglu, 2018).
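For completeness, here is a minimal sketch using SciPy’s Spearman rank-order correlation, which respects the ordinal nature of STAMP levels (the function name is ours):

```python
from scipy.stats import spearmanr

def rater_correlation(rater1: list[int], rater2: list[int]) -> float:
    """Spearman's rho between two sets of STAMP levels for the same responses."""
    rho, _p_value = spearmanr(rater1, rater2)
    return rho

# A value of 0.80 or higher is generally read as a strong association (Akoglu, 2018).
```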
2 STAMP Levels Apart
This measure, expressed as a percentage, shows how often two ratings for the same response differ by 2 STAMP levels (e.g., Rater 1 assigns STAMP level 4 and Rater 2 assigns STAMP level 6).
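A sketch of this measure, counting pairs of ratings that differ by exactly two STAMP levels (our reading of the definition above); as before, the code is illustrative only:

```python
def two_levels_apart(rater1: list[int], rater2: list[int]) -> float:
    """Percentage of responses on which the two ratings differ by exactly two STAMP levels."""
    assert len(rater1) == len(rater2) and rater1, "Need two equal-length, non-empty score lists."
    count = sum(abs(a - b) == 2 for a, b in zip(rater1, rater2))
    return 100 * count / len(rater1)

# Example: two_levels_apart([4, 3, 6], [6, 3, 5]) is about 33.3 (only the first pair is two levels apart).
```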
Reliability and Accuracy of Scores by Avant Raters Across Various Languages
We now focus on the quality of the ratings for the Writing and Speaking sections of STAMP 4S and STAMP WS, considering the statistics above across several representative languages. Below, we present results based on two different sets of comparisons:
Rater 1 vs. Rater 2
We compare the STAMP level awarded by Rater 1 to the level awarded by Rater 2 across numerous responses rated by at least two raters. This comparison speaks to the reliability of ratings assigned by two randomly assigned Avant raters. As noted earlier, two raters may agree on a score, yet both could be incorrect; for this reason, we do not report Exact Agreement between Rater 1 and Rater 2. Instead, for this comparison we focus on Exact + Adjacent Agreement, and we report accuracy measures by comparing the scores from Rater 1 (who rates solo 80% of the time) with the official scores.
Rater 1 vs. Official Score
To assess the accuracy of the levels assigned by Avant raters, we analyze instances where a response was rated by two or more raters. We compare the official score (derived from all individual ratings) to the score given by Rater 1 alone. This helps indicate how accurately a response is rated when only one rater is involved, which occurs 80% of the time.
Tables 1 and 2 present the statistical measures for the Writing and Speaking sections of five representative STAMP 4S languages.
Table 1
Writing Section of STAMP 4S

| Measure | Arabic | Spanish | French | Chinese Simplified | Russian |
|---|---|---|---|---|---|
| Number of Responses in Dataset | n = 3,703 | n = 4,758 | n = 4,785 | n = 4,766 | n = 3,536 |
| Exact Agreement (Rater 1 vs. Official Score) | 84.80% | 84.15% | 83.66% | 88.46% | 92.17% |
| Exact + Adjacent Agreement | 96.78% (98.62%) | 99.09% (99.79%) | 99.22% (99.79%) | 99.79% (99.91%) | 99.71% (99.88%) |
| Quadratic Weighted Kappa (QWK) | 0.93 (0.96) | 0.91 (0.95) | 0.91 (0.95) | 0.95 (0.96) | 0.95 (0.97) |
| Standardized Mean Difference (SMD) | 0.00 (0.01) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Spearman’s Rank-Order Correlation (ρ) | 0.94 (0.96) | 0.90 (0.95) | 0.91 (0.95) | 0.95 (0.97) | 0.94 (0.97) |
| 2 STAMP Levels Apart | 2.80% (1.24%) | 0.90% (0.20%) | 0.77% (0.20%) | 0.00% (0.00%) | 0.28% (0.11%) |

Note. Where two values are reported, the first is for Rater 1 vs. Rater 2 and the value in parentheses is for Rater 1 vs. Official Score.
Table 2
Speaking Section of STAMP 4S

| Measure | Arabic | Spanish | French | Chinese Simplified | Russian |
|---|---|---|---|---|---|
| Number of Responses in Dataset | n = 3,363 | n = 4,078 | n = 4,530 | n = 4,651 | n = 3,392 |
| Exact Agreement (Rater 1 vs. Official Score) | 84.96% | 80.37% | 80.19% | 82.24% | 88.30% |
| Exact + Adjacent Agreement | 96.07% (98.13%) | 98.13% (99.29%) | 98.54% (99.47%) | 99.31% (99.76%) | 98.99% (99.94%) |
| Quadratic Weighted Kappa (QWK) | 0.92 (0.95) | 0.92 (0.96) | 0.91 (0.95) | 0.94 (0.95) | 0.92 (0.96) |
| Standardized Mean Difference (SMD) | -0.02 (0.01) | 0.00 (0.00) | -0.01 (0.02) | 0.00 (0.00) | -0.01 (-0.01) |
| Spearman’s Rank-Order Correlation (ρ) | 0.93 (0.96) | 0.91 (0.95) | 0.92 (0.95) | 0.94 (0.96) | 0.91 (0.95) |
| 2 STAMP Levels Apart | 3.27% (1.42%) | 1.74% (0.00%) | 1.39% (0.00%) | 0.00% (0.00%) | 1.01% (0.00%) |

Note. Where two values are reported, the first is for Rater 1 vs. Rater 2 and the value in parentheses is for Rater 1 vs. Official Score.
Tables 3 and 4 show the statistical measures for the Writing and Speaking sections of three representative STAMP WS languages.
Table 3
Writing Section of STAMP WS
Table 4
Speaking Section of STAMP WS
Discussion
A high level of reliability and accuracy is fundamental to the validity of test scores and their intended uses. What is deemed minimally acceptable, however, depends on the specific field (medicine, law, sports, forensics, language testing, etc.), on the consequences of awarding an inaccurate level to a specific examinee’s set of responses, and on the rating scale itself. For example, agreement tends to be lower as the number of categories in a rating scale increases: more disagreement between any two raters can be expected if they must assign one of ten possible levels to a response than if they must choose among only four.
The statistics above for the Writing and Speaking sections of both STAMP 4S and STAMP WS show a high level of both reliability (Rater 1 vs. Rater 2) and accuracy (Rater 1 vs. Official Score). Across the eight languages evaluated, Exact + Adjacent Agreement between Rater 1 and Rater 2 is never lower than 96.78% for Writing and 96.07% for Speaking, and is often considerably higher. Cases in which two ratings were more than two STAMP levels apart were very seldom observed. Accuracy, as shown by Exact Agreement between Rater 1’s score and the Official Score, is never lower than 83.66% for Writing and 80.19% for Speaking, with Exact + Adjacent Agreement never lower than 98.62% for Writing and 98.13% for Speaking. The Quadratic Weighted Kappa (QWK) values show very high agreement both between Rater 1 and Rater 2 and between Rater 1 and the Official Scores, and the corresponding correlations are likewise very high. Finally, the Standardized Mean Difference (SMD) coefficients show that Avant raters use the STAMP scale in a very similar fashion.
These statistics provide evidence of the quality of Avant Assessment’s rater selection and training program, and of our methodology for identifying operational raters who may need to be temporarily removed from the rater pool and given targeted training. They show that when two raters differ in the STAMP level assigned to a response, the difference is rarely more than one STAMP level, and that both raters assign exactly the same level in the great majority of cases. Coupled with the fact that an examinee’s final, official score in the Writing or Speaking section of STAMP is based on their individual STAMP scores across three independent prompts, these results provide strong evidence that an examinee’s final Writing and Speaking scores can be trusted as a reliable and accurate representation of their level of language proficiency in these two domains.
References
Akoglu, H. (2018). User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91–93.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford University Press.
Feldt, L. S., & Brennan, R. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Wiley.
Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings.
Matrix Education (2022). Physics practical skills part 2: Validity, reliability and accuracy of experiments. Retrieved August 11, 2022.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.