TEST RELIABILITY AND VALIDITY
Tuesday, October 25, 2016 by David Bong
These are two of the most misunderstood terms in language testing. Both are very important in determining whether a particular test is appropriate in a given situation.
Simply stated, reliability means that if you give the same test to the same student, s/he will get the same score. This is not easy to accomplish. For computer-scored questions (items) in reading and listening, a test developer needs to conduct a statistical analysis of the items, a process called psychometric analysis. The analysis is conducted on data from a number of test takers, who ideally have a wide range of skill levels. If the item is a good one, the analysis will confirm that it consistently separates test takers who are at or above its level from those who are below it. In other words, if it is an intermediate-low item, novice-level test takers will consistently get it wrong, and intermediate and above test takers will consistently get it right. The more consistently an item performs this way, the better it is at differentiating test takers’ language skill.
The analysis will also place each item on a spectrum from easy to hard. The result shows that not all intermediate-low items are created equal: some items at the same level are harder than others. That degree of difficulty within a level needs to be taken into account when building the test. A computer-scored test that consists of a well-laid-out set of items that have been psychometrically identified as good items should be a highly reliable test of those skills.
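As a rough illustration of the kind of item analysis described above, here is a minimal Python sketch. It assumes the two simplest classical statistics: item difficulty (the share of test takers who answered correctly) and item discrimination (the correlation between getting the item right and doing well on the rest of the test). The function names and the response data are hypothetical, and a real psychometric analysis would use far more test takers and usually more sophisticated models.

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """Share of test takers who got the item right (near 1 = easy, near 0 = hard)."""
    return mean(row[item] for row in responses)


def item_discrimination(responses, item):
    """Correlation between the item score and the score on all other items.

    A clearly positive value means stronger test takers tend to get the
    item right and weaker ones tend to get it wrong.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        # Everyone answered the same way, so the item separates nobody.
        return 0.0
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean((x - m_item) * (y - m_rest)
               for x, y in zip(item_scores, rest_scores))
    return cov / (sd_item * sd_rest)


# Hypothetical data: one row per test taker (weakest first),
# one column per item, 1 = correct, 0 = wrong.
responses = [
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

for i in range(4):
    print(f"item {i}: difficulty={item_difficulty(responses, i):.2f}, "
          f"discrimination={item_discrimination(responses, i):.2f}")
```

In this made-up data set, the first item is answered correctly by everyone, so it tells the developer nothing about level, and the last item is answered correctly about as often by weak test takers as by strong ones. The two middle items behave the way a good item should, and the later of the two is the harder one.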
Although there are some computer-scored writing and speaking tests, creating a reliable test of speaking and writing generally requires very consistent human scoring. First of all, several raters need to score the same tests for there to be any way to measure the reliability of the rating. The degree of consistency is determined by calculating what is called inter-rater reliability (IRR), in other words, how consistent the scoring is among different raters. If the IRR is high, then the reliability of the test is high and you can rely on the test score to be consistent regardless of who rated it.
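There is more than one way to calculate IRR, and the right choice depends on the rating scale and the number of raters. As a minimal sketch, the Python below assumes just two raters who each assign one level to the same set of responses, and computes two common statistics: simple percent agreement and Cohen's kappa, which corrects for the agreement the raters would reach by chance. The level labels and ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of responses on which the two raters gave the same level."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters pick the same level
    # if each simply followed their own overall distribution of levels.
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters always used the same single level
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same eight speaking samples
# (NH = novice-high, IL = intermediate-low, IM = intermediate-mid).
rater_a = ["IL", "IL", "IM", "NH", "IM", "IL", "NH", "IM"]
rater_b = ["IL", "IM", "IM", "NH", "IM", "IL", "NH", "IL"]

print(f"percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(rater_a, rater_b):.2f}")
```

Kappa comes out lower than raw percent agreement because some matches would happen even if the raters were guessing. That is why percent agreement on its own can overstate how consistent the scoring really is.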
Validity is a much less precise or scientific concept. Simply stated, a test is valid if it measures the appropriate things for the use it is being put to. If a teacher wants to know whether learners memorized their French vocabulary homework, s/he would give them a set of questions about the homework. S/he wouldn’t ask them about the history of China. If you want to measure learners’ proficiency levels, you should ask them real-world questions that they haven’t specifically prepared for, at a variety of levels, to see what they can really do with the language. This would be a valid approach to measuring a test taker’s ability to accomplish real-world tasks (that is, proficiency).