Assessment Reliability and Validity

If you’ve participated in professional development within an educational setting recently, you’ve likely heard the term data-driven instruction.

Data-driven instruction is common in education today. In this approach, student performance data is gathered frequently through a variety of assessments and used to guide instructional programs and practices.

In order for these assessments to provide useful data for drawing conclusions, they must be both reliable and valid.

In this video, we’ll define the term assessment and describe some common examples. We’ll also describe assessment reliability and validity and explain why both are important.

Types of Assessments

First, let’s define the term assessment and explore some forms that assessments can take.

An assessment is a way of collecting data about student learning. Assessment results are used for a variety of purposes, often to guide instructional practices to improve student performance or evaluate instructional programs.

Standardized tests often come to mind when the term assessment is used, as states commonly require them to assess students’ performance and growth at specific intervals.

However, assessments can also include unit, chapter, and lesson tests and quizzes; projects; writing pieces; presentations; independent practice questions; exit tickets; and more.

Assessments may include selected-response questions, where students are given answer options to choose from. Examples of these types of questions include multiple-choice, true or false, and matching questions.

Assessments may also include constructed-response questions, where students are given a prompt and must construct their own responses. Essay questions are examples of constructed-response questions.

Both question types come with their own considerations for ensuring reliability and validity.

Reliability

Reliability refers to the consistency of assessment results and includes the following considerations.

First, an assessment should have consistent results over time when taken under the same conditions. If a student completes the same assessment on two different days under similar conditions, the results should be about the same.

Next, multiple versions of the same assessment should produce consistent results. For example, some tests contain question banks, where multiple questions are created to assess the same knowledge or skill. Test takers may receive different questions depending on which ones are randomly selected. Other tests have multiple versions, such as version A and version B, with different versions given to different students or at different times.

If a student takes version A of a test one day, and version B of the same test on another day under similar conditions, the results should be about the same.
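
In practice, this kind of consistency is often summarized with a correlation coefficient between the two sets of scores, such as Pearson’s r. Here is a minimal Python sketch using made-up scores for ten students; the data, and any cutoff you might apply to the result, are purely illustrative:

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Hypothetical scores for the same ten students on version A and
# version B of a test, taken under similar conditions.
version_a = [78, 85, 92, 64, 70, 88, 95, 59, 81, 73]
version_b = [80, 83, 90, 66, 72, 85, 97, 61, 79, 75]

# A high positive correlation means students kept roughly the same
# relative standing across forms -- evidence of reliability.
r = correlation(version_a, version_b)
print(f"Parallel-forms reliability (Pearson's r): {r:.2f}")
```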

Additionally, if assessment items are manually graded, different raters should assign similar scores to the same student response. For example, if an assessment contains an essay question scored with a rubric, different raters should give the same student the same score. Providing clearly articulated rubric criteria for each score point and training scorers with annotated sample responses at each score point help ensure reliability.
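
Rater agreement can also be checked numerically. One widely used statistic is Cohen’s kappa, which measures how much two raters agree beyond what chance alone would produce. Below is a minimal sketch with hypothetical rubric scores for ten essays:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters scoring the same set of responses."""
    n = len(rater1)
    # Observed agreement: fraction of responses scored identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected chance agreement, from each rater's score distribution.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rubric scores (1-4) from two raters on ten essays.
rater1 = [4, 3, 2, 4, 1, 3, 3, 2, 4, 1]
rater2 = [4, 3, 2, 3, 1, 3, 4, 2, 4, 1]

print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.2f}")
```

A kappa near 1 indicates strong agreement, while values near 0 suggest the raters agree no more often than chance would predict, which would point back to unclear rubric criteria or insufficient scorer training.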

It is important to note that taking the same assessment under different conditions can affect results for reasons other than reliability issues. For example, if students are hungry, not feeling well, or in environments that are too hot or too cold, they may score lower on the same assessment than they did previously.

Validity

Validity refers to how well an assessment measures what it is supposed to measure. There are multiple considerations regarding assessment validity. Let’s take a look at a few now.

Content validity refers to whether the assessment items adequately represent the areas the assessment is designed to measure. For example, if an assessment is designed to assess objectives from a yearlong sixth-grade math curriculum, then questions from each of the units in the course should be adequately represented. If the assessment includes questions from only units one and two when there are ten units in the course, it would not be a valid assessment of the whole course.
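
One simple way to catch this kind of gap is to tally the test’s items against the blueprint of units the assessment is supposed to cover. The sketch below uses invented unit tags and item counts for illustration:

```python
from collections import Counter

# Hypothetical test: each item is tagged with the unit it assesses.
item_units = ["unit1", "unit1", "unit2", "unit2", "unit2",
              "unit1", "unit2", "unit1", "unit2", "unit1"]
course_units = [f"unit{i}" for i in range(1, 11)]  # ten units in the course

coverage = Counter(item_units)
missing = [u for u in course_units if coverage[u] == 0]

print("Items per unit:", dict(coverage))
# Eight uncovered units -> weak content validity for the whole course.
print("Units with no items:", missing)
```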

Additionally, the questions or prompts on the assessment should be aligned with the objectives the assessment is designed to measure. An assessment designed to measure students’ ability to add and subtract fractions with unlike denominators should contain questions that require students to demonstrate these skills. Unrelated questions should not be included.

Care should also be taken when writing assessment questions to ensure that they measure what they are designed to measure without inadvertently assessing something else. This is known as construct validity. For example, a math problem designed to assess a third grader’s ability to find the perimeter of a rectangle should not be written at an eighth-grade reading level; otherwise, a student might miss the problem because of difficulty comprehending the question rather than an inability to find the perimeter.
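
Automated readability formulas can help flag this problem before a test is administered. The sketch below computes a rough Flesch-Kincaid grade-level estimate using a crude vowel-group syllable counter, so its output should be treated as a screening signal rather than a precise measurement; the sample item is invented:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a passage."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# A hypothetical perimeter item written far above a third-grade level.
item = ("Ascertain the perimeter of the quadrilateral depicted, "
        "given that its longitudinal dimension is 8 cm and its "
        "latitudinal dimension is 3 cm.")
print(f"Estimated grade level: {flesch_kincaid_grade(item):.1f}")
```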

Predictive validity relates to how well assessment results predict success on a related, future criterion. For example, a passing score on an end-of-year assessment in Algebra 1 may be used to predict success in Algebra 2 the following semester.
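
Predictive validity is typically examined by correlating assessment scores with the later outcome. In the sketch below, the outcome is binary (passed Algebra 2 or not), so Pearson’s r reduces to a point-biserial correlation; all of the numbers are hypothetical:

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Hypothetical end-of-year Algebra 1 scores and whether each student
# later passed Algebra 2 (1 = passed, 0 = did not).
algebra1_scores = [92, 85, 78, 64, 88, 59, 73, 95, 70, 81]
passed_algebra2 = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]

# With a binary outcome, Pearson's r is the point-biserial correlation.
r = correlation(algebra1_scores, [float(p) for p in passed_algebra2])
print(f"Predictive validity (point-biserial r): {r:.2f}")
```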

Why It Matters

Assessment data is commonly used to guide student instruction. For example, a third-grade teacher may identify a small group of students who need targeted instruction on measuring liquid volumes based on their performance in this skill on a recent assessment.

Assessment data can also be used to evaluate instructional programs. For example, if large numbers of students perform poorly on an assessment after an instructional program is implemented with integrity by highly trained teachers, a district may determine that the program is not effective in meeting instructional goals. They may then select a replacement program.

Standardized tests also have particularly high stakes, affecting ratings and funding for schools, decisions about when interventions are needed, and, in some cases, whether students are promoted or eligible for graduation.

These reasons highlight why it is important for assessments to be both reliable and valid. Assessments must provide useful data in order to guide instructional practices and decisions.


Review

Let’s review what we learned in this video.

  • An assessment is a way of collecting data about student learning.
  • Assessment results are used for a variety of purposes, often to guide instructional practices to improve student performance or evaluate instructional programs.
  • Reliability refers to the consistency of assessment results. An assessment should have consistent results over time when taken under the same conditions, and multiple versions of the same assessment should produce consistent results.
  • Additionally, if assessment items are manually graded, different raters should assign similar scores to the same student responses.
  • Validity refers to how well an assessment measures what it is supposed to measure. This includes ensuring that an assessment adequately represents all of the areas it is designed to measure. It also includes ensuring that unrelated content is not included.
  • Assessments are used to make instructional decisions, and they can have high stakes. It is important for assessments to be both reliable and valid in order for the data they produce to be useful in making these decisions.

Questions

Let’s cover a couple of questions before we go:

An eighth-grade end-of-year standardized test contains one writing prompt. Two students in different classrooms produce nearly identical responses that are comparable across all criteria of the rubric. One rater assigns the first student a score of 4, the highest score possible. A second rater assigns the second student a score of 2. Is this an issue with reliability or validity? Why?

The correct answer is reliability.

This example demonstrates an issue with reliability. If the student responses are nearly identical and comparable across all criteria of the rubric, the two raters should have assigned the students the same score. Instead, the raters interpreted the responses very differently.

A fifth-grade science teacher creates a test to assess the concepts covered in unit one of the science course. The goal is to determine if students have met the objectives from unit one. She includes five questions at the end of the test that cover content from unit two in order to get a sense of what students already know about the topic. Is this an issue with reliability or validity?

The correct answer is validity.

This example demonstrates an issue with validity. If the test is supposed to assess the objectives from unit one of the course, then content from unit two should not be included.

That’s all for this video. Thanks for watching, and happy studying!



by Mometrix Test Preparation | Last Updated: August 30, 2024