Validity and Reliability in R Statistics: Criteria for Judging a Good Measurement Tool

A measurement tool with high validity measures what it is intended to measure. High reliability is necessary for high validity, and it means the tool gives stable results under repeated measurement.

Original Korean article: Validity/Reliability R Statistics: Criteria for judging a good measurement tool

Validity and reliability are the key criteria for judging a measurement tool. High reliability does not guarantee high validity, so two things must be checked: whether the measurement fits the purpose of the study, and whether repeated measurements produce consistent results. This article explains the difference between the two concepts and the criteria to check in actual research.

Ⅰ. Validity

Validity refers to how accurately a measurement tool or method in research actually measures what it is intended to measure.

Content Validity
Concept: Content validity evaluates whether a measurement tool covers all of the content that matters for the research topic or purpose.
Example: For a test of students’ math skills, evaluating content validity means checking whether the test contains only addition and subtraction problems or also covers other mathematical concepts such as multiplication, division, and geometry.

Criterion-related Validity
Concept: Criterion-related validity evaluates a measurement tool through its correlation with a specific criterion (an external measure).
Concurrent Validity: Comparison with a criterion measured at the same time. For example, if a new depression test correlates highly with an existing, validated depression test, it has high concurrent validity.
Predictive Validity: Comparison with a criterion measured in the future. For example, if college entrance exam scores are a good predictor of job achievement after graduation, the exam has high predictive validity.

Construct Validity
Concept: Construct validity evaluates whether a measurement tool actually reflects the theoretical construct it is meant to capture.
Example: Reviewing construct validity means checking whether a questionnaire intended to measure ‘self-esteem’ is composed of questions that actually reflect self-esteem. Statistical techniques such as factor analysis can be used for this purpose.

Ecological Validity
Concept: Ecological validity refers to whether research results apply equally in the real world.
Example: If a memory test performed in a laboratory shows the same memory patterns that appear in everyday life, it has high ecological validity.
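
The concurrent-validity check described under criterion-related validity usually comes down to a correlation coefficient. The following is a minimal sketch in base R, not a prescribed procedure: the data are simulated, and the names `depression_level`, `established_test`, and `new_test` are hypothetical and chosen only to mirror the depression-test example.

```r
# Simulated (hypothetical) scores for 200 respondents.
set.seed(123)
n <- 200
depression_level <- rnorm(n, mean = 50, sd = 10)  # unobserved true level

# Assumption: both tests measure the same true level with independent error.
established_test <- depression_level + rnorm(n, sd = 4)
new_test         <- depression_level + rnorm(n, sd = 5)

# Concurrent validity: Pearson correlation between the new test and the criterion.
concurrent_r <- cor(new_test, established_test)
print(round(concurrent_r, 2))

# cor.test() adds a confidence interval for the correlation.
print(cor.test(new_test, established_test)$conf.int)
```

With real data, `established_test` would be the scores from the validated instrument collected at the same time. For predictive validity the same call applies, with the criterion measured later.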

Ⅱ. Reliability

Reliability refers to how consistently a measurement tool or method produces results, that is, the degree to which similar results are obtained when the measurement is repeated under the same conditions.

Internal Consistency
Concept: Internal consistency evaluates how well the items in a measurement tool reflect the same concept.
Example: If a questionnaire consists of 10 questions that are all meant to measure ‘self-esteem,’ internal consistency is high only when the questions correlate highly with one another. Cronbach’s α coefficient is often used to evaluate this.

Test-Retest Reliability
Concept: Test-retest reliability evaluates how consistent the results are when the same measurement tool is applied to the same subjects again after a time interval.
Example: When a psychological test is administered to the same person twice, two months apart, and the two scores are similar, the test has high test-retest reliability.

Parallel-Forms Reliability
Concept: Parallel-forms reliability evaluates the consistency between two different forms of a measurement tool designed to measure the same concept.
Example: Given a type A and a type B test paper that both evaluate mathematical ability, if the same students obtain similar scores on the two papers, parallel-forms reliability is high.

Inter-Rater Reliability
Concept: Inter-rater reliability refers to how consistent the results are when different evaluators independently evaluate the same object.
Example: When several psychologists watch a recording of a counseling session for the same patient and each rates the level of depression, similar ratings indicate high inter-rater reliability.

Split-Half Reliability
Concept: Split-half reliability evaluates the consistency of an entire test by dividing the data from one administration into two halves and correlating the scores of the halves.
Example: In a cognitive ability test consisting of 20 questions, if the scores on the first 10 questions correlate highly with the scores on the last 10 questions, split-half reliability is high.
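
Internal consistency and split-half reliability can both be computed directly from an item-response matrix. The sketch below uses base R only and simulated data. The helper `cronbach_alpha()` and all object names are hypothetical, and the Spearman-Brown correction applied to the half-test correlation is a common convention that the text above does not mention. To keep the example small, a 10-item test is split 5/5.

```r
# Simulated (hypothetical) responses: 150 respondents, 10 items.
set.seed(42)
n_resp  <- 150
n_items <- 10
trait   <- rnorm(n_resp)  # latent trait, e.g. self-esteem

# Assumption: each item equals the latent trait plus independent noise.
items <- sapply(seq_len(n_items), function(i) trait + rnorm(n_resp, sd = 1))

# Cronbach's alpha:
#   k / (k - 1) * (1 - sum of item variances / variance of the total score)
cronbach_alpha <- function(x) {
  k         <- ncol(x)
  item_vars <- apply(x, 2, var)
  total_var <- var(rowSums(x))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}
alpha_hat <- cronbach_alpha(items)

# Split-half reliability: correlate first-half and second-half totals,
# then apply the Spearman-Brown correction to estimate full-length reliability.
first_half  <- rowSums(items[, 1:5])
second_half <- rowSums(items[, 6:10])
r_half      <- cor(first_half, second_half)
split_half  <- 2 * r_half / (1 + r_half)

print(round(c(alpha = alpha_hat, half_r = r_half, split_half = split_half), 2))
```

Test-retest and parallel-forms reliability follow the same pattern as the split-half step: correlate the two sets of total scores with `cor()`.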

Good articles to read together

1. What is research? [R Statistics]
2. Variables and Measurements [R Statistics]
3. Measurement error [R Statistics]
5. Research method [R Statistics]
Importance and usage of pipe operator %>%

Key Checklist

Is the measurement tool appropriate for the research purpose?
Do repeated measurements produce similar results?
Could this be a situation where reliability is high but validity is low?
Has the validity been confirmed through existing research or expert review?
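
The third checklist item, high reliability with low validity, can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of real data: a hypothetical tool consistently measures an unrelated characteristic (`other_trait`) instead of the intended one (`target_trait`). In practice the target trait is not directly observable, so a validated criterion measure would take its place.

```r
set.seed(7)
n <- 300
target_trait <- rnorm(n)  # what the study intends to measure
other_trait  <- rnorm(n)  # an unrelated characteristic

# Hypothetical tool that consistently measures the wrong thing.
time1 <- other_trait + rnorm(n, sd = 0.3)
time2 <- other_trait + rnorm(n, sd = 0.3)

test_retest_r <- cor(time1, time2)         # reliability: consistency over time
validity_r    <- cor(time1, target_trait)  # validity: agreement with the target

print(round(c(test_retest = test_retest_r, validity = validity_r), 2))
```

A high test-retest correlation alongside a near-zero correlation with the target is the pattern the checklist warns about: the tool is stable, but it does not measure what the study needs.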
