HOW DOES AVANT RATE SPEAKING AND WRITING RESPONSES?
Tuesday, February 28, 2017 by David Bong
Responses in the Avant STAMP, PLACE, Arabic Proficiency Test, and Spanish Heritage Language tests that require human rating are scored by Certified Avant Raters, language educators/speakers who meet the following minimum requirements:
LANGUAGE SKILL: Raters must maintain an advanced or higher level of language skill (determined by phone interview or by a score on an approved assessment, e.g., STAMP4S, OPI, ILR Interview, MOPI, or Praxis/state teacher certification).
EDUCATION: Raters must hold a bachelor’s degree or higher.
TRAINING & CERTIFICATION: Raters must complete the language-specific Avant Rater Training Program and achieve 90% agreement on the certification assessment.
AVAILABILITY: Raters must be available to score a specified number of items (student responses) each week (determined by the language-specific Rating Manager and the rater).
All raters must complete the Avant Rater Training Program and pass a certification test before they are allowed to score student responses. The training process includes five steps and generally takes about 11-13 hours of individual work plus about 2-3 hours with a Rater Training Manager to complete.
Human rating of Avant STAMP test item responses is conducted in the Rater Connection online environment. The reading and listening test items (multiple choice) are computer scored. The constructed responses (speaking and writing) are rated by Certified Avant Raters through a web-based interface. Specifically, Avant’s online, distributed rating system, Rater Connection (RC), manages all student responses and facilitates scoring by distributing queues of 25 written or spoken responses to Certified Avant Raters when they log in to the system. Raters score each response in a step-by-step wizard process, carefully considering each of four scoring elements or criteria:
Is the response ratable and on task?
What text type, or amount of level-specific language, is evident?
What is the quality of the text in terms of overall comprehensibility?
What is the overall accuracy of the response?
The scoring system records all ratings and then generates a composite score for each response based on these evaluation criteria.
Inter-Rater Reliability (IRR) is a measure of how consistently Certified Raters apply Avant scoring criteria to student responses. Avant strives to maintain a high level of Inter-Rater Reliability through consistent comparison of ratings and ongoing training as needed. Specifically, Inter-Rater Reliability is tracked in the system by delivering 20% of all responses to a second rater for a blind second rating. This means that each queue of 25 responses (the number of student responses scored in a batch) contains 5 responses that have been previously rated by another Certified Rater. The system then monitors how the second Certified Rater scores these responses. If the first and second Certified Raters assign different levels, the RC sends that response to a third Certified Rater who arbitrates the score. Rater Managers can see which responses have received two scores and, more importantly, which have received three, and can track how each response was rated across the three different raters. Rater Managers can spot trends in scoring and direct just-in-time training to any Certified Rater in need of retraining. Rater Managers then collect and use these “challenging” responses in training sessions.
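As a rough illustration of this workflow, here is a minimal Python sketch of queue assembly and arbitration. The function and variable names are hypothetical, and the Rater Connection internals are not published; this only mirrors the 20% blind-second-rating and third-rater arbitration logic described above.

```python
import random

SECOND_RATING_SHARE = 0.20  # 20% of responses get a blind second rating

def build_queue(unrated, once_rated, queue_size=25):
    """Assemble a rating queue: mostly new responses plus a blind sample
    of responses already scored once by another Certified Rater."""
    n_second = int(queue_size * SECOND_RATING_SHARE)  # 5 of 25
    queue = (random.sample(once_rated, n_second)
             + random.sample(unrated, queue_size - n_second))
    random.shuffle(queue)  # the rater cannot tell which items are re-rated
    return queue

def reconcile(first_level, second_level, third_rater):
    """If two blind ratings disagree on the level, a third Certified
    Rater arbitrates; otherwise the agreed level stands."""
    if first_level == second_level:
        return first_level
    return third_rater()  # arbitration decides the final level
```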
Avant uses a two-criteria rubric to assign scores to spoken and written responses. The two criteria, as indicated above, are Text Type (amount of language) and Accuracy (comprehensibility). For our purposes, we place a higher weighting on the Text Type criterion for levels 1-6 (Novice-Low through Intermediate-High) and a more balanced weighting for levels 7 and 8 (Advanced-Low and Advanced-Mid). As Certified Raters evaluate student speaking and writing responses, they first determine the Text Type score from the following possible selections: Non-Ratable (0), Words (1), Phrases (2), Simple Sentences (3), Strings of Sentences (4), Connected Sentences (5), Emerging Paragraph (6), Paragraph Structure (7), Extended Paragraph (8). Once the Text Type criterion has been determined, the RC directs the rater to rate the Accuracy/Comprehensibility of the response at that Text Type score, choosing below average, average, or above average. The RC combines the scores from these two criteria to determine the final score/level for the response. Avant is then able to review the agreement of the Certified Raters in each language to determine the IRR percentage for any language over any period of time.
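The exact rule the RC uses to fold the Accuracy judgment into the Text Type score is not spelled out in this post, so the sketch below encodes the rubric as data and applies one plausible, purely illustrative combination rule: a "below average" accuracy judgment lowers the candidate level by one.

```python
TEXT_TYPES = {
    0: "Non-Ratable", 1: "Words", 2: "Phrases", 3: "Simple Sentences",
    4: "Strings of Sentences", 5: "Connected Sentences",
    6: "Emerging Paragraph", 7: "Paragraph Structure", 8: "Extended Paragraph",
}

ACCURACY_CHOICES = ("below average", "average", "above average")

def composite_level(text_type: int, accuracy: str) -> int:
    """Hypothetical combination rule: the Text Type score sets the
    candidate level, and a 'below average' accuracy judgment drops it
    one level. The real RC rule is not public; this only illustrates
    how two criteria can collapse into a single reported level."""
    if text_type == 0:
        return 0  # non-ratable responses receive no level
    if accuracy not in ACCURACY_CHOICES:
        raise ValueError(f"unknown accuracy judgment: {accuracy}")
    return max(1, text_type - 1) if accuracy == "below average" else text_type
```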
Equally important to Inter-Rater Reliability is rating accuracy. The ideal is for all raters to be in agreement, producing a high IRR, but if there is any drift from the standards (scoring responses too high or too low), we need to know about that as well. To detect drift, Avant injects anchor items (passages that have been selected and pre-scored by each language's Rater Manager) into rating queues, and Rater Managers monitor how the Certified Raters score these special responses. Just like the IRR responses, anchor items are delivered to raters blind, so raters cannot identify them in any way. Rater Managers are then able to see whether raters are drifting from the standards and can address any drift through retraining and support sessions. This is an important feature of our Rater Connection System, and it can be set to deliver anchor items at predetermined intervals.
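In the same illustrative spirit, here is a small Python sketch of anchor injection and a simple drift measure. The interval value and function names are assumptions made for the example, not documented RC parameters.

```python
def inject_anchors(queue, anchors, interval=10):
    """Insert pre-scored anchor responses into a rating queue at a fixed
    interval (illustrative value) so they are indistinguishable from
    ordinary responses from the rater's point of view."""
    out, pending = [], list(anchors)
    for i, response in enumerate(queue, start=1):
        out.append(response)
        if i % interval == 0 and pending:
            out.append(pending.pop(0))
    return out

def mean_drift(rater_levels, anchor_levels):
    """Mean signed difference between a rater's levels and the Rater
    Manager's pre-assigned anchor levels: positive means the rater is
    scoring high, negative means scoring low."""
    diffs = [r - a for r, a in zip(rater_levels, anchor_levels)]
    return sum(diffs) / len(diffs) if diffs else 0.0
```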
The Avant STAMP test delivers three speaking and three writing prompts to each test taker who is assigned that domain, or phase, of the test (speaking or writing). The final reported score is calculated from the two highest scores of the three samples. Thus the final assigned level considers each response that was submitted and scored by Avant Certified Raters and reflects the level the test taker was able to maintain across the tasks. For example, a test taker who receives a 3 (Novice-High) for the first response, a 4 (Intermediate-Low) for the second response, and a 3 (Novice-High) for the third response will receive a final score of 3 (Novice-High) for that domain. This indicates that, at minimum, the student was able to maintain level 3 (Novice-High) proficiency. In this case, however, one response was rated at a higher level, so a blue bar is included in the report to indicate that the student may be approaching the next higher level and to encourage the teacher to look at that specific response. Because the final score or level is derived from all three responses, the system can absorb any single response that was scored inaccurately, or that the test taker simply could not respond to, while maintaining accurate reporting of overall test taker ability for each domain. Thus, the process of using the two highest speaking or writing scores to assign the final level minimizes both false-negative and false-positive ratings for the overall domain score.
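Arithmetically, "the level maintained across the two highest of three scores" is simply the second-highest score. A short Python sketch (with a hypothetical function name) reproduces the example above, including the blue-bar "approaching" flag.

```python
def domain_result(scores):
    """Final domain level is the level maintained across the two highest
    of three rated responses, i.e. the second-highest score. The
    'approaching' flag corresponds to the blue bar on the score report:
    one response was rated above the final level."""
    if len(scores) != 3:
        raise ValueError("expected exactly three rated responses")
    ranked = sorted(scores, reverse=True)
    final = ranked[1]                # min of the two highest scores
    approaching = ranked[0] > final  # e.g. [3, 4, 3] -> final 3, flag True
    return final, approaching

# The example from the text: responses rated 3 (Novice-High),
# 4 (Intermediate-Low), and 3 (Novice-High) yield a final level of 3
# with the blue "approaching" indicator.
print(domain_result([3, 4, 3]))  # -> (3, True)
```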