Virginia Journal of Education

One Piece of the Puzzle

Why standardized test results should just be one factor in a comprehensive system of teacher evaluation.

Some education policymakers have concluded that the best way to evaluate, pay and dismiss teachers is to carefully examine the standardized test scores of their students and base personnel decisions on those numbers. In a briefing paper entitled “Problems With the Use of Student Test Scores to Evaluate Teachers,” the Economic Policy Institute (EPI) takes issue with that thinking, maintaining that test scores are just one factor in what should be a comprehensive evaluation. Here, excerpted from that paper, are some of EPI’s reasons for that belief:

While there are many reasons for concern about the current system of teacher evaluation, there are also reasons to be skeptical of claims that measuring teachers’ effectiveness by student test scores will lead to the desired outcomes. To be sure, if new laws or district policies specifically require that teachers be fired if their students’ test scores do not rise by a certain amount or reach a certain threshold, then more teachers might well be terminated than is now the case. But there is no current evidence to indicate either that the departing teachers would actually be the weakest teachers, or that the departing teachers would be replaced by more effective ones. Nor is there empirical verification for the claim that teachers will improve student learning if teachers are evaluated based on test score gains or are monetarily rewarded for raising scores.

The limited existing indirect evidence on this point, which emerges from the country’s experience with the No Child Left Behind (NCLB) law, does not provide a very promising picture of the power of test-based accountability to improve student learning. NCLB has used student test scores to evaluate schools, with clear negative sanctions for schools (and, sometimes, their teachers) whose students fail to meet expected performance standards. We can judge the success (or failure) of this policy by examining results on the National Assessment of Educational Progress (NAEP), a federally administered test with low stakes, given to a small (but statistically representative) sample of students in each state.

The NCLB approach of test-based accountability promised to close achievement gaps, particularly for minority students. Yet although there has been some improvement in NAEP scores for African-Americans since the implementation of NCLB, the rate of improvement was not much better in the post- than in the pre-NCLB period, and in half the available cases, it was worse. Scores rose at a much more rapid rate before NCLB in fourth grade math and in eighth grade reading, and rose faster after NCLB in fourth grade reading and slightly faster in eighth grade math. Furthermore, in fourth and eighth grade reading and math, white students’ annual achievement gains were lower after NCLB than before, in some cases considerably lower.

A recent careful econometric study of the causal effects of NCLB concluded that during the NCLB years, there were noticeable gains for students overall in fourth grade math achievement, smaller gains in eighth grade math achievement, but no gains at all in fourth or eighth grade reading achievement. The study did not compare pre- and post-NCLB gains. The study concludes, “The lack of any effect in reading, and the fact that the policy appears to have generated only modestly larger impacts among disadvantaged subgroups in math (and thus only made minimal headway in closing achievement gaps), suggests that, to date, the impact of NCLB has fallen short of its extraordinarily ambitious, eponymous goals.”

Such findings provide little support for the view that test-based incentives for schools or individual teachers are likely to improve achievement, or for the expectation that such incentives for individual teachers will suffice to produce gains in student learning. Research and experience indicate that approaches to teacher evaluation that rely heavily on test scores can lead to narrowing and over-simplifying the curriculum, and to misidentifying both successful and unsuccessful teachers. These and other problems can undermine teacher morale, as well as provide disincentives for teachers to take on the neediest students. When attached to individual merit pay plans, such approaches may also create disincentives for teacher collaboration. These negative effects can result from both the statistical and the practical difficulties of evaluating teachers by their students’ test scores.

A second reason to be wary of evaluating teachers by their students’ test scores is that so much of the promotion of such approaches is based on a faulty analogy—the notion that this is how the private sector evaluates professional employees. In truth, although payment for professional employees in the private sector is sometimes related to various aspects of their performance, the measurement of this performance almost never depends on narrow quantitative measures analogous to test scores in education. Rather, private-sector managers almost always evaluate their professional and lower-management employees based on qualitative reviews by supervisors; quantitative indicators are used sparingly and in tandem with other evidence. Management experts warn against significant use of quantitative measures for making salary or bonus decisions. The national economic catastrophe that resulted from tying Wall Street employees’ compensation to short-term gains rather than to longer-term (but more difficult to measure) goals is a particularly stark example of a system design to be avoided.

Other human service sectors, public and private, have also experimented with rewarding professional employees by simple measures of performance, with comparably unfortunate results. In both the United States and Great Britain, governments have attempted to rank cardiac surgeons by their patients’ survival rates, only to find that they had created incentives for surgeons to turn away the sickest patients. When the U.S. Department of Labor rewarded local employment offices for their success in finding jobs for displaced workers, counselors shifted their efforts from training programs leading to good jobs to more easily found unskilled jobs that might not endure, but that would inflate the counselors’ success data. The counselors also began to concentrate on those unemployed workers who were most able to find jobs on their own, diminishing their attention to those whom the employment programs were primarily designed to help.

A third reason for skepticism is that in practice, and especially in the current tight fiscal environment, performance rewards are likely to come mostly from the redistribution of already-appropriated teacher compensation funds, and thus are not likely to be accompanied by a significant increase in average teacher salaries (unless public funds are supplemented by substantial new money from foundations, as is currently the situation in Washington, D.C.). If performance rewards do not raise average teacher salaries, the potential for them to improve the average effectiveness of recruited teachers is limited and will result only if the more talented of prospective teachers are more likely than the less talented to accept the risks that come with an uncertain salary. Once again, there is no evidence on this point.

And finally, it is important for the public to recognize that the standardized tests now in use are not perfect, and do not provide unerring measurements of student achievement. Not only are they subject to errors of various kinds, but they are narrow measures of what students know and can do, relying largely on multiple-choice items that do not evaluate students’ communication skills, depth of knowledge and understanding, or critical thinking and performance abilities. These tests are unlike the more challenging open-ended examinations used in high-achieving nations in the world. Indeed, U.S. scores on international exams that assess more complex skills dropped from 2000 to 2006, even while state and local test scores were climbing, driven upward by the pressures of test-based accountability.

This seemingly paradoxical situation can occur because drilling students on narrow tests does not necessarily translate into broader skills that students will use outside of test-taking situations. Furthermore, educators can be incentivized by high-stakes testing to inflate test results. At the extreme, numerous cheating scandals have now raised questions about the validity of high-stakes student test scores. Without going that far, the now widespread practice of giving students intense preparation for state tests—often to the neglect of knowledge and skills that are important aspects of the curriculum but beyond what tests cover—has in many cases invalidated the tests as accurate measures of the broader domain of knowledge that the tests are supposed to measure. We see this phenomenon reflected in the continuing need for remedial courses in universities for high school graduates who scored well on standardized tests, yet still cannot read, write or calculate well enough for first-year college courses. As policymakers attach more incentives and sanctions to the tests, scores are more likely to increase without actually improving students’ broader knowledge and understanding.

The Economic Policy Institute is a nonprofit, Washington, D.C.-based think tank that was established in 1986 to broaden the discussion about economic policy to include the interests of low- and middle-income workers. Its paper, “Problems With the Use of Student Test Scores to Evaluate Teachers,” was written by a team of leading education experts, including Linda Darling-Hammond, Diane Ravitch and Richard Rothstein. Excerpts here are used with the permission of EPI, and the entire paper can be found on EPI’s website.


