Your Organization’s Examination Cannot Be Valid

Are you using terms like validity and reliability correctly? Are you practicing care when communicating about credentialing programs that are based on examination results? This article will help you think twice.

Thursday, February 21, 2019

By Robert C. Shaw, Jr., PhD, Vice President, Examinations, National Board for Respiratory Care

Excuse me? . . . What is that you say?

Yes, the examination cannot be valid.

But we have invested lots of time and money into our examinations while receiving many fancy reports. Our program is accredited; how can you even think about telling us our examination is invalid?

The word validity should be attributed to inferences made from examination results. All I am saying is that you have been incorrectly using the word.

Oh…

Primary Guide for Users of Tests

This article encourages people who are involved in occupational credentialing programs to avoid loosely throwing around measurement vocabulary. Vital guidance comes from the Standards for Educational and Psychological Testing, which is jointly produced by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (2014). The group that produced the most recent update of the Standards tried to encourage proper vocabulary use. Perhaps they too have suffered through cringe-inducing misuses of the word validity. I say this because the opening paragraph of the very first section, entitled 1. Validity, contains the following content on page 11 of the print edition:

The first sentence sets the stage when it states, “Validity refers to the degree to which evidence and theory support interpretations of test scores for proposed uses of tests.”
Details build as the paragraph continues, “The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself.”
Toward the end of the paragraph, one finds, “Statements about validity should refer to particular interpretations for specified uses.”
In case the reader still misses the point, the last sentence in the paragraph states, “It is incorrect to use the unqualified phrase, the validity of the test.” That sentence is a final, earnest attempt to encourage accurate communication.

Consequences

Examinations are taken by people. Those who manage the program communicate with those who take an examination or use the results. These people deserve precise information that deepens their comprehension. Using measurement vocabulary incorrectly clouds this comprehension and leaves a false impression.

If someone took away the impression that it was the examination that was valid, then he or she could conclude that the examination could be given to anyone. As an example, my employer operates credentialing programs for general respiratory therapists and therapists in four specialties. If validity could be attributed to the general examination, then the general examination could be used for each specialty, right? Realizing the fault in that logic may help someone see that the word validity should not be presented as a characteristic of an examination.

Maybe an Analogy Will Help

A blood sample was taken from me for the purpose of running laboratory tests that my physician had ordered. While discussing the results, if the physician’s nurse told me, “A valid machine produced your results,” then the statement incorrectly attributes validity to the machine. It is proper to say, “The physician reached a valid conclusion that you need no new medications after evaluating the lab results.”

There is a lot to unpack in that last statement, which rests on a checklist of assumptions:

Personnel in the laboratory verified the accuracy of values produced by the machine from which my results came.
My physician could evaluate a set of lab results leading to a clinically useful conclusion.
The physician’s conclusion was integrated into my care plan in a way that makes sense.

Only after learning that all these pieces were in place should the word validity be linked to the use of test results.

The laboratory machine is analogous to an examination. The nurse is the communication conduit. The physician is like the board that oversees processes and makes decisions. Those decisions can be labeled as valid only when each part of the process is verified. One should call the interpretation of an examination result valid only after running through a checklist that at least includes the following:

The proper people took the examination.
Examination items stimulated responses about proper content without entangling irrelevant constructs to an important degree.
Examination responses were submitted in a standardized environment.
The combination of low error among test scores plus the placement of the passing standard led to sufficiently accurate decisions.

If one is willing to call a decision valid according to this checklist, then it is redundant to also cite accuracy, as in “Our examination is valid and reliable.” Doing so compounds the error of improperly attributing validity to the examination by restating a point about accuracy that is already known.
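
The last checklist item, the combination of score error and the placement of the passing standard, can be made concrete with a small simulation. What follows is a minimal sketch and not a method described in the Standards. It assumes normally distributed true scores and errors on a standardized scale, and the function name, the error shares, and the cut scores are hypothetical values chosen only for illustration.

```python
import random


def decision_accuracy(error_share, cut_score, n_examinees=20000, seed=1):
    """Fraction of simulated examinees whose observed pass/fail result
    matches the result their error-free (true) score would have produced.

    error_share is the assumed proportion of observed score variance that
    is error; observed scores have a variance of about 1 by construction.
    """
    rng = random.Random(seed)
    true_sd = (1.0 - error_share) ** 0.5
    error_sd = error_share ** 0.5
    matches = 0
    for _ in range(n_examinees):
        true_score = rng.gauss(0.0, true_sd)
        observed_score = true_score + rng.gauss(0.0, error_sd)
        # A decision is accurate when both scores fall on the same side
        # of the passing standard.
        if (true_score >= cut_score) == (observed_score >= cut_score):
            matches += 1
    return matches / n_examinees


# Hypothetical scenarios: low versus high error, and a passing standard
# at the middle of the score distribution versus far below it.
for error_share in (0.05, 0.30):
    for cut_score in (0.0, -1.5):
        accuracy = decision_accuracy(error_share, cut_score)
        print(f"error share {error_share:.2f}, cut {cut_score:+.1f}: "
              f"accuracy {accuracy:.3f}")
```

Under these assumptions, accuracy tends to fall as error grows and as the passing standard moves toward the region where many scores cluster, which is why the checklist treats the two together.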

Reliability

By now, the reader can hopefully accept that there is no such thing as a reliable examination either. The word reliability is best limited to discussions about a set of scores one has in hand. There is an understandable temptation to envision reliability as a stable feature of an examination after observing consistently high reliability values across administrations. However, if the same set of items is administered to a very small group, then observing a reliability value that is noticeably lower or higher should be no surprise. Hence, the word reliability can be attributed to a set of scores but should not be linked to the examination on which the scores are based.
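
The point about small groups can be illustrated with simulated data. The sketch below computes coefficient alpha, one common classical reliability estimate, for a large simulated group and then for several small groups drawn from it, all answering the same items. Everything here is an assumption made for illustration: the response model, the 40 items, the group sizes, and the random seed are hypothetical and do not describe any actual examination.

```python
import math
import random
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of examinee rows of item scores."""
    n_items = len(responses[0])
    item_variances = [
        pvariance([row[i] for row in responses]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def simulate_group(n_examinees, n_items, rng):
    """Simulate right/wrong (1/0) responses with a simple logistic model."""
    rows = []
    for _ in range(n_examinees):
        ability = rng.gauss(0.0, 1.0)
        row = []
        for j in range(n_items):
            # Item difficulties are spread evenly from -1.5 to +1.5.
            difficulty = -1.5 + 3.0 * j / (n_items - 1)
            p_correct = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
            row.append(1 if rng.random() < p_correct else 0)
        rows.append(row)
    return rows


rng = random.Random(2019)
large_group = simulate_group(2000, 40, rng)
alpha_large = cronbach_alpha(large_group)

# Same items, but only 15 examinees at a time.
small_alphas = [cronbach_alpha(rng.sample(large_group, 15)) for _ in range(20)]

print(f"alpha, 2000 examinees: {alpha_large:.2f}")
print(f"alpha, groups of 15:   {min(small_alphas):.2f} to {max(small_alphas):.2f}")
```

The items never change in this sketch, yet the statistic moves from one small group to the next, which is consistent with treating reliability as a property of a set of scores.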

A useful thought experiment starts by imagining an examination that has produced reliable scores. What if a form is given to a group that is mismatched with the examination content? For example, what if the examination for general respiratory therapists is given to a group from one of the specialties? Should one expect a reliability of .00? No, one should not. Because general and specialty content is related, it is probable that the reliability index for this set of scores would be high enough to otherwise contribute to confidence in the results. Despite this observation, an informed person should conclude that pass/fail decisions about the specialty would be invalid. Knowledge of the mismatch between test takers and examination content is enough to invalidate such hypothetical decisions. The thought experiment also reveals a key point about the relationship between reliability and validity. Although sufficiently high reliability would be marked off in the validation checklist, it is possible to observe invalid results even when examination scores look reliable.

Take the thought experiment one step further by imagining that the general respiratory therapy examination is given to a group of healthcare practitioners outside of respiratory care (for example, pharmacists or laboratory scientists) or to a group outside of healthcare (for example, architects or engineers). Reliability values are likely to fall progressively as the mismatch becomes more exaggerated, but the index could still be greater than .00. Hence, one should take care to avoid interpreting a reliability statistic as if it were a validity index.

The temptation to do so is enabled when the people taking the examination and the examination content are well matched. However, the typical reliability statistic is just an evaluation of score consistency across replications of examination administrations as stated on page 33 of the Standards for Educational and Psychological Testing (2014). The Standards go on to say on page 34 that a given psychometrician may take an approach based on one of three theories (classical, generalizability, item response) to estimate reliability, but each approach yields a value inversely related to the error. In other words, a reliability value only tries to estimate the quantity of error. Documenting that a set of scores has a small error component is crucial to building confidence, but there are other steps to work through before one can justifiably label the use of examination results as valid.
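
One familiar way to see the inverse relationship between reliability and error is the standard error of measurement from classical test theory, commonly written as the score standard deviation multiplied by the square root of one minus reliability. The sketch below is illustrative only. The function name and the standard deviation of 10 score points are assumptions, not values taken from the Standards or from any actual program.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical score standard deviation of 10 points.
for reliability in (0.70, 0.80, 0.90, 0.95):
    sem = standard_error_of_measurement(10.0, reliability)
    print(f"reliability {reliability:.2f} -> SEM {sem:.2f} points")
```

As the reliability value rises, the estimated error shrinks. Nothing in the calculation refers to who took the examination or how the results will be used, which is why the statistic cannot stand in for a validity argument.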

Avoid Cringe-Inducing Moments

Consider the following examples linked to points raised in this article:

A little more time and space are needed to frame each point truthfully. A person listening to or reading each statement may still walk away thinking, “They just told me the examination is valid” or “the examination is reliable.” However, avoiding the shorthand manner of communication that encourages these false impressions is worth the effort. Doing so frames the communicator as one who really knows his or her business, and perhaps the listener will at least understand that confidence in results is based on a complicated system.

Summary

The article title was purposefully written to provoke engagement. The article was written to encourage care when communicating about occupational credentialing programs that are based on examination results. The words validity and reliability deserve care when used in communication because they describe complex relationships.

Reference

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Published by American Educational Research Association, Washington, DC. ISBN 978-0-935302-35-6.