Pre-Employment Test Validity vs Test Reliability

Posted by Chris Platts on 18 March 2020

Test validity and test reliability are measures that ensure pre-employment tests are both fair and accurate. These measures are especially important in the field of candidate testing where the results have a direct impact on someone’s ability to secure a new job.

These terms are unhelpfully vague so this post aims to give you a much better understanding of the following areas:

What is test validity?
What are the different types of pre-employment test validity?
Why is validity important?
What is test reliability?
What are the different types of pre-employment test reliability?
Why is test reliability important?
Tips to check your pre-employment tests are both valid and reliable

What is test validity?

Validity refers both to which attribute a test measures and to how accurately the test measures that attribute. Ultimately, it’s a scientific answer to the question “Can I trust this test?”.

It is usually assessed in three ways: construct validity, external validity and ecological validity.

What are the different types of pre-employment test validity?

Construct validity, sometimes loosely referred to as “internal validity”, is the ability of a test to actually measure what it claims to measure. It’s also used to validate that the attribute being measured is important for successful performance in the job. Some candidate tests, such as personality questionnaires, can have poor construct validity because the traits and characteristics being measured are defined by the test provider and therefore may not fully align with the employer’s definition.

An example: let’s say you wanted to use a test to measure someone’s integrity. The test provider’s definition would have to align closely with your own definition for you to get a valid measure of this trait. If your definition of someone with high integrity differs from the test provider’s, then the output will not be particularly valid.

External validity is the ability of a test’s output to be generalised to the ‘real world’ or to a wider population. A pre-written, off-the-shelf candidate test will usually have been validated against a large, diverse population so that the provider can confidently say it has strong external validity and can be used in a range of contexts. Nevertheless, this does not mean it will definitely translate to your unique work environment, because such a test may have low ecological validity (see next point).

Ecological validity is the ability of a test to translate to a specific context or environment. Ecologically valid assessments are highly appropriate and usually tailored to specific work contexts.

For example, working in a call centre may require certain traits, behaviours or characteristics. However, those characteristics may present themselves very differently in different call centres due to cultural and operational differences.

An assessment that is high in ecological validity will reflect as much as possible of the specific context in which someone works. The assessment designer will aim to reproduce as many of the ecological factors from the specific environment as possible, to get more reliable, and therefore more predictive, responses from participants.

Typically, ecologically valid tests and externally valid tests are at odds with each other. The purpose of building an ecologically valid test is that its validity is unique to a specific context and is not meant to be generalised across other organisations or contexts.

Why is validity important?

Test validity is critical in being able to trust the results of a test. In other words, validity gives meaning to test scores. Validity is also a term used to indicate the link between test performance and job performance.

Predictive validity, sometimes known as ‘criterion-related validity’, describes the degree to which you can make predictions about people based on their test scores. In other words, it indicates the usefulness of the test. The predictive validity of a test is measured by the validity coefficient. It is reported as a number between 0 and 1.00 that indicates the magnitude of the relationship, “r”, between the test and a measure of job performance (the criterion).
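
To make the validity coefficient more concrete, here is a minimal Python sketch that computes r as a Pearson correlation between test scores and a later measure of job performance. The function name `pearson_r` and the sample figures are made up for illustration; this is not ThriveMap data or ThriveMap’s method. A correlation can in principle be negative, but for a useful test you would expect a value between 0 and 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sample has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: test scores at hiring, and manager ratings six months later.
test_scores = [52, 61, 70, 74, 83, 90]
performance = [2.9, 3.1, 3.0, 3.8, 4.1, 4.4]

validity_coefficient = pearson_r(test_scores, performance)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```

In practice you would want far more than six people in the sample before drawing any conclusions from the coefficient.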

ThriveMap’s ecologically valid pre-hire assessments, for example, have been proven to be up to 5 times more predictive of new hire performance than traditional methods such as CV sifting or generic testing.

What is test reliability?

Test reliability is the degree to which a test produces similar scores each time it’s used.

A simple example would be a weighing scale that keeps giving different readings for the same item. In this case, we would conclude that the scale cannot be considered ‘reliable’. The same is true when assessing the reliability of a candidate test. If we evaluate one participant on a specific attribute a number of times using the same method, and each instance gives us a drastically different output, we could reasonably deduce that the test method has low reliability.

What are the different types of pre-employment test reliability?

Test-retest reliability is what we usually think of when testing reliability. It indicates the level of repeatability obtained by giving the same test to the same candidates at two different times.
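
As a rough illustration, test-retest reliability is commonly estimated by correlating the scores from the two sittings. The sketch below assumes five candidates who took the same test twice; the scores and the helper name `correlation` are invented for the example.

```python
from statistics import mean, pstdev


def correlation(first, second):
    """Pearson correlation between scores from two sittings of the same test."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need the same candidates, at least two, in both sittings")
    mean_first, mean_second = mean(first), mean(second)
    # Population covariance divided by the product of population standard deviations.
    covariance = mean(
        (a - mean_first) * (b - mean_second) for a, b in zip(first, second)
    )
    return covariance / (pstdev(first) * pstdev(second))


# Hypothetical scores for the same five candidates, a few weeks apart.
first_sitting = [12, 15, 9, 20, 17]
second_sitting = [13, 14, 10, 19, 18]

test_retest_r = correlation(first_sitting, second_sitting)
print(f"test-retest reliability r = {test_retest_r:.2f}")
```

The same calculation would apply to parallel forms: you would simply pass in the scores from form A and form B instead of the two sittings.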

Parallel forms reliability uses one set of questions divided into two equivalent sets (“forms”), where both sets contain questions that measure the same construct, knowledge or skill. The two sets of questions are given to the same sample of people within a short period of time and an estimate of reliability is calculated from the two sets. Put simply, you’re trying to find out if test A measures the same thing as test B. In other words, you want to know if test scores stay the same when you use different instruments.

Inter-rater reliability refers to the likelihood that multiple assessors will give candidates the same score and make the same test decision. Measuring inter-rater reliability is useful because evaluators will not always interpret results in the same way; raters may disagree as to how well certain responses demonstrate the construct or skill being assessed.
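
One common way to put a number on inter-rater agreement between two assessors is Cohen’s kappa, which adjusts the raw agreement rate for the agreement you would expect by chance. The article does not say which statistic ThriveMap uses, so treat this as one possible approach; the function name and the pass/fail decisions below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same candidates."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of candidates")
    n = len(rater_a)
    # Proportion of candidates on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters only ever use one label")
    return (observed - expected) / (1 - expected)


# Hypothetical decisions from two assessors on the same eight candidates.
assessor_1 = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
assessor_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(assessor_1, assessor_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

A kappa of 1 means the assessors always agree, while a value near 0 means they agree no more often than chance would predict.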

Internal consistency reliability reflects the extent to which items within a test measure various aspects of the same characteristic or construct. For example, you want to test someone’s verbal reasoning ability and you have 5 different questions to evaluate it. The outcome of each question should be similar or the same; if they are then it means all the items measure the same characteristic reliably and can be used interchangeably.

A wide variety of statistical tests are available to measure internal consistency; the one we use at ThriveMap is Cronbach’s Alpha. This process checks the correlation between questions loading onto the same factor. Cronbach’s Alpha (CA) values typically range from 0 to 1.0. In most cases, the value should be at least 0.70. You might consider deleting a question if doing so dramatically improves your CA.
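
For readers who want to see the calculation, here is a minimal sketch of Cronbach’s Alpha using the standard formula (number of questions, the variance of each question, and the variance of candidates’ total scores). It also shows the “alpha if question deleted” check mentioned above. The function names and the response data are assumptions for illustration only, not ThriveMap’s implementation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per candidate, one column per question."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two questions and two candidates")
    # Transpose the rows so that each tuple holds all answers to one question.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all total scores are identical")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def alpha_if_deleted(responses):
    """Alpha recomputed with each question removed in turn (needs 3+ questions)."""
    k = len(responses[0])
    return [
        cronbach_alpha([row[:i] + row[i + 1:] for row in responses])
        for i in range(k)
    ]


# Hypothetical 1-5 scores from six candidates on four questions.
responses = [
    [4, 5, 4, 2],
    [2, 2, 3, 5],
    [5, 4, 5, 3],
    [1, 2, 1, 4],
    [3, 3, 4, 1],
    [4, 4, 5, 5],
]

print(f"alpha = {cronbach_alpha(responses):.2f}")
for number, alpha in enumerate(alpha_if_deleted(responses), start=1):
    print(f"alpha without question {number} = {alpha:.2f}")
```

In this made-up data the fourth question does not move with the other three, so the overall alpha falls below 0.70 and rises well above it once that question is dropped. That is the kind of pattern that might prompt you to review or remove a question.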

Why is test reliability important?

Reliable tests produce dependable, repeatable, and consistent information about people. In order to meaningfully interpret test output and make useful hiring decisions, you need a reliable test.

The reliability of a test is indicated by something called a reliability coefficient. It is denoted by the letter “r,” and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00 indicating perfect reliability. You will usually see the reliability of a test as a decimal, for example, r = .67 or r = .91. The larger the reliability coefficient, the more consistent the test scores.

A quick heads-up on this: don’t expect to find a test with perfect reliability. Also, customised tests won’t come with any reliability data up front, as they haven’t been built yet. Don’t let this put you off. You can measure the reliability coefficient of any new test you create after you’ve launched it. A great test partner will also be able to make changes to the test to improve its reliability over time.

Tips to check your pre-employment tests are both valid and reliable

Conduct a job analysis. Any test should be directly linked to the attributes required to be successful in the job. Determining the specific attributes to measure will require a job analysis: a systematic process used to identify the tasks, duties, responsibilities and working conditions associated with a job, and the knowledge, skills, abilities and other characteristics required to perform that job. When purchasing an off-the-shelf solution, you’ll have to work out whether its measures of success correlate with how the job works in your organisation.
Conduct a cultural analysis. If you’re building your own customised candidate tests, you’ll want to go beyond a standard job analysis by partnering with experts to identify how work happens within your unique culture and work environment. You’ll want to reflect these factors in your test questions for maximum predictive validity.
Ask your vendor for evidence. Off-the-shelf, ready-to-use candidate tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted.
Conduct your own validation studies. If you develop your own tests or procedures, you will need to conduct your own validation studies. Your assessment partner may be able to help you with this.

We hope this has helped to give you a better understanding of what the terms test reliability and test validity mean. In summary, it’s important to note that “off the shelf” tests are often “pre-validated”. However, they ignore the contextual and cultural nuances of an organisation, so they will tend to score low on ecological validity for the following reasons:

The questions themselves may not be representative of the actual capabilities required in your specific role.
Your definition of the attributes you’d like to measure may not match the test provider’s definitions.
Your standards for desired behaviour may vary from the test provider’s standards.
The candidate is not immersed in a real-life scenario, so their answers may not be representative of the choices they’d actually make in the job.

If you’re thinking of using candidate testing then read our blog post on whether to buy one or build your own.
