Test validity and test reliability are measures that ensure your tests are both fair and accurate. This is especially important in candidate testing, where the results of a test have a direct impact on someone’s ability to land a new job. It’s important to understand the difference between reliability and validity, but the terms are broad and often used vaguely, so this post aims to give you a much better understanding of the following:

  • What is test validity?
  • What are the different types of test validity?
  • Why is test validity important?
  • What is test reliability?
  • What are the different types of test reliability?
  • Why is test reliability important?
  • Tips to check your candidate tests are both valid and reliable

What is test validity?

Test validity refers both to the attribute a test measures and to how accurately the test measures that attribute. Validity is usually assessed in three ways: internal validity, external validity and ecological validity.

What are the different types of test validity?

Construct validity, sometimes known as “internal validity”, is the ability of the test to actually measure what it claims to measure, together with evidence that this attribute matters for successful performance in the job. Some candidate tests have low construct validity because the traits and characteristics, as defined by the test provider, don’t fully align with the user’s definition. Let’s say you wanted to use a test to measure someone’s integrity. The test provider’s definition would have to align closely with your own for you to get a valid measure of this trait. If your definition of high integrity differs, and it may well, then the output will not be particularly valid.

External validity is the ability of a test’s output to be generalised to the ‘real world’ or to a wider population. An off-the-shelf candidate test provider will usually have tested their method on a large, diverse population in order to say confidently that it has strong external validity and can be used in a range of contexts. Nevertheless, this does not mean it will definitely translate to your unique work environment, because such a test may have low ecological validity.

Ecological validity is the ability of the output of a test to translate to a specific context or situation. Ecologically valid candidate tests are highly appropriate for specific work contexts. For example, working in a call centre may require certain traits, behaviours or characteristics. However, those characteristics may present themselves very differently in different call centres due to cultural differences. An experiment that is high in ecological validity will reflect as much as possible of the specific context being measured. The experiment designer will aim to reproduce as many of the ecological and environmental factors from that environment as possible, to get more reliable, and therefore more predictive, responses from participants.

Typically, ecological validity and external validity are at odds with each other. The point of building an ecologically valid test is that its validity is specific to one context and is not meant to be generalised across other organisations or contexts.

Why is test validity important?

Test validity is critical in being able to trust the results of a test. In other words, validity gives meaning to the test scores. Validity is also a term used to indicate the link between test performance and job performance.

Predictive validity, sometimes known as ‘criterion-related validity’, describes the degree to which you can make predictions about people based on their test scores. In other words, it indicates the usefulness of the test. The predictive validity of a test is measured by the validity coefficient. It is reported as a number between 0 and 1.00 that indicates the magnitude of the relationship, “r”, between test scores and a measure of job performance (the criterion). Test scores on ThriveMap’s personalised pre-hire assessments, for example, have been proven to be up to 5 times more predictive of new hire performance than traditional methods such as CV sifting and interviewing.

What is test reliability?

Test reliability is the degree to which a test produces similar scores each time it is used.

A simple example would be a weighing scale that keeps giving different readings for the same item. In this case, we would conclude that the scale cannot be considered ‘reliable’. The same is true when assessing the reliability of a candidate test. If we evaluate one participant on a specific attribute a number of times using the same method, and each attempt gives us a drastically different output, we can reasonably conclude that the test method has low reliability.

What are the different types of test reliability?

Test-retest reliability is what we usually think of when testing reliability. It indicates the level of repeatability obtained by giving the same test twice at different times to candidates.

Parallel forms reliability uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge or skill. The two sets of questions are given to the same sample of people within a short period of time and an estimate of reliability is calculated from the two sets. Put simply, you’re trying to find out if test A measures the same thing as test B. In other words, you want to know if test scores stay the same when you use different instruments.

Inter-rater reliability refers to the likelihood that multiple assessors will give candidates the same score and make the same test decision. It is worth measuring because evaluators will not necessarily interpret results the same way; raters may disagree as to how well certain responses demonstrate the construct or skill being assessed.
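One common way to put a number on inter-rater agreement is Cohen’s kappa, which compares how often two raters agree with how often they would be expected to agree by chance. The sketch below is a minimal illustration for two raters making pass/fail decisions. The function name and the data are hypothetical, and other statistics may suit your situation better (for example, when there are more than two raters).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of candidates")
    n = len(rater_a)
    # Proportion of candidates on which the two raters gave the same decision.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label for everyone
    return (observed - expected) / (1 - expected)

# Hypothetical decisions from two assessors on the same eight candidates.
rater_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

A kappa of 1 means the raters always agree, while a value near 0 means they agree no more often than chance would predict.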

Internal consistency reliability reflects the extent to which items within a test measure various aspects of the same characteristic or construct. For example, say you want to test someone’s verbal reasoning ability and you have 5 different questions to evaluate it. A candidate’s results on each question should be similar or the same. If they are, it suggests that all the items measure the same characteristic reliably and can be used interchangeably. A wide variety of statistical tests are available to measure internal consistency; one of the most widely used is Cronbach’s Alpha.
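To make this more concrete, here is a minimal sketch of Cronbach’s Alpha using the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the scores are invented for illustration, and the sketch assumes every candidate answered every item and that total scores are not all identical.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one row per candidate, one column per test item."""
    k = len(item_scores[0])
    if k < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two candidates")
    items = list(zip(*item_scores))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)

# Hypothetical scores (1 to 5) for six candidates on five verbal reasoning items.
scores = [
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [3, 3, 4, 3, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 2, 2],
    [3, 4, 3, 3, 4],
]

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Values closer to 1 suggest the items move together; in this made-up data set candidates who do well on one item tend to do well on the others, so alpha comes out high.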

Why is test reliability important?

Reliable tests produce dependable, repeatable, and consistent information about people. In order to meaningfully interpret test output and make useful hiring decisions, you need a reliable test.

The reliability of a test is indicated by something called a reliability coefficient. It is denoted by the letter “r,” and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00 indicating perfect reliability. You will usually see the reliability of a test as a decimal, for example, r = .67 or r = .91. The larger the reliability coefficient, the more consistent the test scores.

A quick heads up on this: don’t expect to find a test with perfect reliability. Also, customised tests won’t come with any reliability data up front, as they haven’t been built yet. Don’t let this put you off. You can measure the reliability coefficient of any new test you create after you’ve launched it, and a good test partner will be able to refine the test to improve its reliability over time.

Tips to check your candidate tests are both valid and reliable

  1. Conduct a job analysis. Any test should be directly linked to the attributes required to be successful in the job. Determining how closely a test matches those attributes will require a job analysis. Job analysis is a systematic process used to identify the tasks, duties, responsibilities and working conditions associated with a job, and the knowledge, skills, abilities and other characteristics required to perform that job. When purchasing an off-the-shelf solution, you’ll have to work out whether its measures of success correlate with how the job works in your organisation; if they don’t, it’s best to build your own personalised assessment.
  2. Conduct a cultural analysis. If you’re building your own customised candidate tests, you’ll want to go beyond a standard job analysis by partnering with experts to identify how work happens within your unique culture and work environment. You’ll want to reflect these factors in your test questions for maximum predictive validity.
  3. Ask your vendor for evidence. Off-the-shelf, ready-to-use candidate tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted.
  4. Conduct your own validation studies. If you develop your own tests or procedures, you will need to conduct your own validation studies.

We hope this has helped to give you a better understanding of what the terms test reliability and validity mean. In summary, it’s important to note that “off the shelf” tests are often “pre-validated”. However, they ignore the contextual and cultural nuances of an organisation, so they will score low on ecological validity for the following reasons:

  1. The questions themselves may not be representative of the actual capabilities required in your specific role.
  2. Your definition of the attributes you’d like to measure may not match the test provider’s definitions.
  3. Your standards for “what good looks like” may differ from the test provider’s standards.
  4. The candidate is not immersed in a real-life scenario, so their answers may not be representative of the choices they’d actually make in the job.

If you’re thinking of using candidate testing then read our blog post on whether to buy one or build your own.