How Does Avant Rate Speaking and Writing Responses?
Who rates the STAMP tests?
Human-rated responses in the Avant STAMP, PLACE, Arabic Proficiency Test, and Spanish Heritage Language tests are scored by Certified Avant Raters: language educators or speakers who meet the following minimum requirements:
LANGUAGE SKILL: Raters must maintain an advanced or higher level of language skill, determined by phone interview or by a test score from an approved assessment (i.e., STAMP4S, OPI, ILR Interview, MOPI, or Praxis/state teacher certification).
EDUCATION: Raters must hold a bachelor’s degree or higher.
TRAINING & CERTIFICATION: Raters must complete the language-specific Avant Rater Training Program and score 90% agreement on the certification assessment.
AVAILABILITY: Raters must be available to score a specified number of items (student responses) each week, as agreed between the language-specific Rating Manager and the rater.
How are Certified Avant Raters trained?
All raters must complete the Avant Rater Training Program and pass a certification test before they are allowed to score student responses. The training process includes five steps and generally takes about 11-13 hours of individual work plus about 2-3 hours with a Rater Training Manager to complete.
ACADEMIC PREPARATION: Rater candidates study Avant Rater Training materials that explain the proficiency levels based on the ACTFL Proficiency Guidelines and ILR Proficiency Descriptors and learn how Avant applies these levels to test taker responses. The goal of this step is to orient the rater candidates to the issues experienced in rating constructed response items (speaking and writing) and outline the range of scores that Avant’s system can assign to test taker responses. Step 1 is an independent study phase with an expected time commitment of 3-5 hours.
GUIDED REVIEW AND INTRODUCTION TO RATER CONNECTION SYSTEM (RC): The rater candidate meets with their assigned Rater Training Manager to discuss and clarify the key elements of the training materials, including the proficiency levels and the details used to identify each level. The Rater Training Manager then helps the prospective rater access Avant’s online Rater Connection System (see Step 3) and walks them through several responses, showing how to use the system. Step 2 is usually conducted as a virtual meeting between the rater candidate and the Rater Training Manager, with an expected time commitment of 2-3 hours.
PRACTICE WITH RATER CONNECTION SYSTEM (RC): Avant’s RC allows the rater candidate to score selected training/anchor responses and receive immediate feedback on their rating of each response. The purpose of this stage is to expose the rater candidate to many test-taker responses that have previously been scored by Avant Master Raters. Each training response includes a detailed annotation explaining why the item was scored the way it was. This process lets the rater candidate evaluate a variety of responses across all levels, with immediate feedback, in order to internalize and apply the scoring criteria. Step 3 is an independent online practice session with an expected time commitment of 4-5 hours, or longer if necessary, to complete the practice banks of responses.
GUIDED ANALYSIS OF RATER CONNECTION PRACTICE: Once the rater candidate has completed the training responses in the RC, the candidate meets with their assigned Rater Training Manager to discuss and clarify issues that came up during the practice scoring session. Specifically, responses that were not scored accurately are reviewed, and questions about the scoring criteria, the level descriptions, and their application to responses are answered. The Rater Training Manager can see which criteria the prospective rater struggled with and can quickly identify areas that need further support or training. At this point, the Rater Training Manager decides either to have the prospective rater repeat Step 3 of the Avant Rater Training Program or to move on to Step 5 for certification. Step 4 is usually conducted as a virtual meeting between the rater candidate and the Rater Training Manager, with an expected time commitment of one or more hours depending on the number of areas that must be reviewed.
CERTIFICATION: The final step in the Avant Rater Training Program consists of the rater candidate passing a certification test by obtaining 90% or higher agreement in scoring with Avant Master Raters. To complete this test, the rater candidate accesses the Rater Connection System and scores a certification bank of responses in a process that replicates the experience they will have when they score student responses in the live system. The certification bank consists of responses previously rated by Avant Master Raters, but without the annotations or comments provided during the training sessions. At the conclusion of the certification test, the rater candidate is notified of their score. The Rater Training Manager then meets with the rater candidate to identify rating issues, engaging them in retraining activities as necessary. Rater candidates who attain 90% or higher agreement with Avant Master Raters are designated as Certified Avant Raters and are qualified to rate active STAMP test responses.
LIVE RATING: After the rater candidate has completed all training elements and passed the certification test, they are given access to live responses in the Rater Connection System. The newly Certified Rater is then instructed to go into the system and rate one batch (25 responses), informing their Rater Manager when the batch is complete. The Rater Manager then goes into the Admin site to review each item scored by the Certified Rater and verify that the scores are accurate. When the Rater Manager is satisfied with the accuracy of the newly Certified Rater's ratings, the rater can continue rating. The Rater Manager continues to monitor the newly Certified Rater closely during the first few weeks of rating.
The Avant Rater Training Program has been developed and refined to establish high levels of quality and accuracy in all Avant raters. Spot training also occurs on an ongoing basis, as Avant's language-specific Rater Managers review Inter-Rater Reliability and accuracy statistics each day. The STAMP rating system facilitates constant monitoring of scoring trends and alerts the Rater Managers to scoring issues and anomalies so that just-in-time retraining can take place.
How are STAMP tests rated?
Human rating of Avant STAMP test item responses is conducted in the Rater Connection System's online environment. The reading and listening test items (multiple choice) are computer scored. The constructed responses (speaking and writing) are rated by Certified Avant Raters through a web-based interface. Specifically, Avant’s online, distributed rating system, the Rater Connection System, manages all student responses and facilitates scoring by distributing queues of 25 written or spoken responses to Certified Avant Raters when they log in to the system. Raters score each response in a step-by-step wizard process, carefully considering each of four scoring elements or criteria:
Is the response ratable and on task?
What text type or amount of level-specific language is evident?
What is the quality of the text in terms of the overall comprehensibility?
What is the overall accuracy of the response?
The scoring system tracks and calculates all ratings and then generates a composite score for each response, based upon these evaluation criteria.
What is Inter-Rater Reliability (IRR) and how is it monitored?
Inter-Rater Reliability (IRR) is a measure of how consistently Certified Raters apply Avant scoring criteria to student responses. Avant strives to maintain a high level of Inter-Rater Reliability through consistent comparison of ratings and delivery of ongoing training as needed. Specifically, Inter-Rater Reliability is tracked in the system by delivering 20% of all responses to a second rater for a blind second rating. This means that in each queue of 25 responses (the number of student responses scored in a batch) there are 5 responses that have previously been rated by another Certified Rater. The system then monitors how the second Certified Rater scores these responses. If the first and second Certified Raters assign different levels, the RC sends that response to a third Certified Rater, who arbitrates the score. Rater Managers can see which responses have received two scores and, more importantly, which responses received three scores, and can track how each response was rated across the three different raters. Rater Managers can see trends in scoring and direct just-in-time training to any Certified Rater in need of retraining. Rater Managers then collect these “challenging” responses and use them in training sessions.
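The second-rating and arbitration flow described above can be sketched in a few lines of Python. This is only an illustration, not Avant's implementation: the function and constant names are hypothetical, and only the 20% share, the queue size of 25, and the "differing levels go to a third rater" rule come from the text.

```python
# Values taken from the text above; the names are hypothetical.
SECOND_RATING_SHARE = 0.20  # share of responses that get a blind second rating
QUEUE_SIZE = 25             # responses delivered per rating queue


def second_rated_per_queue(queue_size: int = QUEUE_SIZE,
                           share: float = SECOND_RATING_SHARE) -> int:
    """Number of responses in one queue that already carry a first rating."""
    return round(queue_size * share)


def needs_arbitration(first_level: int, second_level: int) -> bool:
    """A third rater is called in whenever the two assigned levels differ."""
    return first_level != second_level


print(second_rated_per_queue())   # 5
print(needs_arbitration(3, 4))    # True
print(needs_arbitration(4, 4))    # False
```

How the third rater's score is combined with the first two is not specified in the text, so the sketch stops at the routing decision.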
How does Avant Measure Inter-Rater Reliability (IRR)?
Avant uses a two-criterion rubric to assign scores to spoken and written responses. The two criteria, as indicated above, are Text Type (amount of language) and Accuracy (comprehensibility). For our purposes, we place a higher weighting on the Text Type criterion for levels 1-6 (Novice-Low through Intermediate-High) and a more balanced weighting for levels 7 and 8 (Advanced-Low and Advanced-Mid). As Certified Raters evaluate student speaking and writing responses, they first determine the Text Type score from the following possible selections: Non-Ratable (0), Words (1), Phrases (2), Simple Sentences (3), Strings of Sentences (4), Connected Sentences (5), Emerging Paragraph (6), Paragraph Structure (7), Extended Paragraph (8). Once the Text Type score has been determined, the RC directs the rater to judge the Accuracy/Comprehensibility of the response relative to that Text Type score, choosing from below average, average, or above average. The RC combines the scores from both criteria to determine the final score/level for that response. Avant is then able to review the agreement among the Certified Raters in each language to determine the IRR percentage for any language over any period of time.
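As a rough illustration of an IRR percentage, the sketch below computes simple exact agreement over double-rated responses. It rests on assumptions: the text does not say whether Avant counts exact or adjacent agreement or which statistic it reports, and the function name and sample data are hypothetical.

```python
from typing import Iterable, Tuple


def exact_agreement_pct(pairs: Iterable[Tuple[int, int]]) -> float:
    """Percentage of double-rated responses where both raters chose the same level.

    Each pair is (first_rater_level, second_rater_level) on the 0-8 scale.
    Exact agreement is an assumption made for this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no double-rated responses to compare")
    agreed = sum(1 for first, second in pairs if first == second)
    return 100.0 * agreed / len(pairs)


# Hypothetical sample: five double-rated responses, one disagreement.
sample = [(3, 3), (4, 4), (5, 4), (2, 2), (6, 6)]
print(exact_agreement_pct(sample))  # 80.0
```

Filtering the pairs by language and date range before calling the function would give a percentage for a given language over a given period.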
How does Avant Measure Rating Accuracy and Monitor for Drift?
Just as important as Inter-Rater Reliability is the construct of Accuracy. The ideal is for all of the raters to be in agreement, producing a high IRR, but raters can agree with one another and still drift from the standards (scoring responses too high or too low), and we need to know about that situation as well. To address drift, Avant injects anchor items (passages that have been selected and pre-scored by each language's Rater Manager) into rating queues, and Rater Managers then monitor how the Certified Raters score these special responses. Just like the IRR responses, these are delivered blind, so the raters cannot identify them in any way. Rater Managers can then see whether raters are drifting from the standards and, based on this information, address any drift through retraining and support sessions. This is an important feature of our Rater Connection System, which can be set to deliver anchor items at predetermined intervals.
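One simple way to quantify drift on anchor items is the mean signed difference between a rater's level and the pre-scored anchor level. The sketch below is an assumption for illustration only: the text does not say which statistic Rater Managers actually review, and the function name and data are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def mean_signed_drift(anchor_results: List[Tuple[int, int]]) -> float:
    """Average of (rater_level - anchor_level) over a rater's anchor items.

    Positive values suggest the rater scores too high, negative values too low,
    and values near zero suggest the rater is on standard.
    """
    if not anchor_results:
        raise ValueError("no anchor items scored yet")
    return mean(rater - anchor for rater, anchor in anchor_results)


# Hypothetical rater who scored two of four anchor items one level too high.
print(mean_signed_drift([(4, 3), (5, 5), (4, 3), (6, 6)]))  # 0.5
```

A signed average is used here rather than an absolute one because the direction of the drift matters for retraining.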
How are the final levels for each skill derived and reported?
The Avant STAMP test delivers three speaking and three writing prompts to each test taker who is assigned that domain or phase of the test (i.e., speaking or writing). The final reported score is calculated from the two highest scores out of the three samples. Thus the final assigned level considers each response that was submitted and scored by Avant Certified Raters and determines the level that the test taker was able to maintain across the three tasks. For example, a test taker who receives a 3 (Novice-High) for the first response, a 4 (Intermediate-Low) for the second response, and a 3 (Novice-High) for the third response will receive a final score of 3 (Novice-High) for that domain. This indicates that, at MINIMUM, the student was able to maintain level 3 (Novice-High) proficiency. However, in this case one response was actually rated at a higher level, so a blue bar is included in the report to indicate that this student may be approaching the next higher level and to encourage the teacher to look at that specific response. Because the final score or level is derived from the outcome of all three responses, the system can absorb any single response that was scored inaccurately, or that the test taker was simply unable to respond to, and still report overall test taker ability accurately for each domain. Thus, using the two highest speaking or writing scores to assign the final speaking or writing level minimizes the reporting of false-negative or false-positive ratings for the overall domain score.
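The worked example above is consistent with taking the lower of the two highest scores as the final level. The sketch below encodes that reading; it is an inference from the single example in the text rather than a documented formula, and the function name and the "approaching next level" flag (standing in for the blue bar) are hypothetical.

```python
from typing import List, Tuple


def final_level(scores: List[int]) -> Tuple[int, bool]:
    """Return (final_level, approaching_next) for three response scores.

    Assumption: the final level is the lower of the two highest scores,
    i.e. the level sustained on at least two of the three responses.
    The flag is True when the top score exceeds that level.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three response scores")
    top_two = sorted(scores, reverse=True)[:2]
    level = min(top_two)
    approaching_next = max(top_two) > level
    return level, approaching_next


print(final_level([3, 4, 3]))  # (3, True)  - the example from the text
print(final_level([4, 4, 0]))  # (4, False) - one unanswered response is absorbed
```

Under this reading, a single unusually low or unusually high response cannot move the final level on its own, which matches the stated goal of limiting false negatives and false positives.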