John Thompson, teacher and historian, writes here about one of the most controversial education issues of our time: mandated systems of test-based teacher evaluation. This was a central aspect of Race to the Top, and it was hated by large numbers of teachers.
Thompson writes:
“The obituaries for the idea that value-added teacher evaluations can improve teaching and learning are pouring in. Probably the most important of those studies are the ones conducted by well-known proponents of data-driven accountability for individuals.
“Before summarizing the meager possible benefits and the huge potential downsides of value-added evaluations, let’s recall that these incredibly expensive systems were promoted as a way to improve student outcomes by .50 standard deviations (sd) by removing the bottom-ranked teachers! In Washington D.C., for instance, a $65 million grant that kicked off the controversial IMPACT system was supposed to raise test scores by 10% per year! Of course, that raises the question of why pro-IMPACT scholars don’t mention its $140 million budget for just the first five years.
“As reported by Education Week’s Holly Yettick, a study funded by the Gates Foundation and authored by Morgan Polikoff and Andrew Porter “found no association between value-added results and other widely accepted measures of teaching quality.” Polikoff and Porter applied the methodology of the Gates Measures of Effective Teaching (MET) project to a sample of students in the Gates experiment, and found no such links: “Nor did the study find associations between ‘multiple measure’ ratings, which combine value-added measures with observations and other factors.”
“Polikoff, a vocal advocate for corporate reform, acknowledged, “the study’s findings could represent something like the worst-case scenario for the correlation between value-added scores and other measures of instructional quality. … ‘In some places, [value-added measures] and observational scores will be correlated, and in some places they won’t.’”
“Before moving on to another study by pro-VAM scholars which calls such a system into question, we should note other studies reviewed by Yettick that help explain why the value-added evaluation experiment was so wrong-headed. Yettick cites two studies in the American Educational Research Journal. The first, by Noelle A. Paufler and Audrey Amrein-Beardsley, concludes, “elementary school students are not randomly distributed into classrooms. That finding is significant because random distribution of students is a technical assumption underlying some value-added models.” In the second AERJ article, Douglas Harris concludes, “Overall, however, the principals’ ratings and the value-added ratings were only weakly correlated.”
“Moreover, Yettick reports that “Brigham Young University researchers, led by assistant professor Scott Condie, drew on reading and math scores from more than 1.3 million students who were 4th and 5th graders in North Carolina schools between 1998 and 2004” and they “found that between 15 percent and 25 percent of teachers were misranked by typical value-added assessments.”
“Finally, Marianne P. Bitler and her colleagues made a hilarious presentation to the Society for Research on Educational Effectiveness showing that “teachers’ one-year ‘effects’ on student height were nearly as large as their effects on reading and math.” While they found that the reading and math results were more consistent from one year to the next than the height outcomes, they advised caution on using value-added measures to quantify teachers’ impact.
“Bitler’s study should produce belly laughs as she makes the point, “Taken together, our results provide a cautionary tale for the interpretation and use of teacher VAM estimates in practice.” Watching other advocates for test-driven accountability twisting themselves into pretzels in order to avoid confronting the facts about Washington D.C.’s IMPACT should at least prompt grins.
“Getting back to the way that pro-VAM researchers are now documenting its flaws, Melinda Adnot, Thomas Dee, Veronica Katz, and James Wyckoff spin their NBER paper as if it doesn’t argue against D.C.’s IMPACT evaluation system. Despite the prepublication public relations effort to soften the blow, their “Teacher Turnover, Teacher Quality, and Student Achievement” admits that the benefits of the teacher turnover incentivized by IMPACT are less than “significant.”
“The key results are revealed on page 18 and afterwards. Adnot et al. conclude, “We find that the overall effect of teacher turnover in DCPS conservatively had no effect on achievement.” But they add that “under reasonable assumptions,” it might have increased achievement. (As will be addressed later, I doubt many teachers would accept as reasonable the assumptions that have to be made in order to claim that IMPACT improved student achievement.)
“The paper’s abstract and opening (most read) pages twist the findings before admitting “To be clear, this paper should not be viewed as an evaluation of IMPACT.” It then characterizes the study as making “an important contribution by examining the effects of teacher turnover under a unique policy regime.”
“In fact, the paper notes, “IMPACT targets the exit of low-performing teachers,” and “virtually all low-performing teacher turnover [prompted by it] is concentrated in high-poverty schools.” That, of course, suggests that an exited teacher with a low value-added might actually be ineffective, or that the teacher was punished for a value-added score that might be an inaccurate estimate caused by circumstances beyond his or her control.
“Their estimates show that exiting those low value-added teachers improves student achievement in high-poverty schools by .20 sd in math, and that the resulting exit of 46% of low-performing teachers “creates substantial opportunity to improve achievement in the classrooms of low-performing teachers.” The bottom line, however, is: “We estimate that the overall effect of turnover on student achievement in high-poverty schools is 0.084 [in math] and 0.052 in reading.” Both estimates may be “statistically distinguishable from zero,” but they would only be “significant at the 10 percent level.”
“So, why were the total gains so negligible?
“The NBER study concludes that IMPACT contributed to the increase in the attrition rate of Highly Effective teachers to 14%. It admits that some high-performing teachers find IMPACT to be “demotivating or stressful” and that the loss of top teachers hurts student performance. It acknowledges, “This negative effect reflects the difficulty of replacing a high-performing teacher.”
“The study doesn’t address the biggest elephant in the room: the effect of value-added evaluations on the instructional effectiveness of the vast majority of D.C. teachers. If high-performing teachers leave because of the “stress and uncertainty of these working conditions,” wouldn’t other teachers be “dissatisfied with IMPACT and the human capital strategies in DCPS writ large?” If the attrition rate of the top teachers in higher-poverty schools is 40% higher than that of their counterparts in lower-poverty schools, does that indicate that the harm done by the evaluations is also greater in high-challenge schools? And the NBER paper finds that “teachers exiting at the end of our study window were noticeably more effective than those exiting after IMPACT’s first year.” Shouldn’t that prompt an investigation into whether the stress of IMPACT is wearing teachers down?
“Adnot, Dee, Katz, and Wyckoff thus continue the tradition of reformers showcasing small gains linked to value-added evaluations and IMPACT-style systems while brushing aside the harm. On the other hand, they admit that IMPACT had advantages that similar regimes don’t have in many other districts. D.C. had the money to recruit outsiders, and 55% of replacement teachers came from outside of the district. Few other districts have the ability to dispose of teachers as if we were tissue paper.
“Even with all of those advantages provided by corporate reformers in D.C. and other districts with the Gates-funded “teacher quality” roll of the dice, an incredible amount of stress has been dumped on educators as they and their students became lab rats in an expensive and risky experiment. The reformers’ most unreasonable assumption was that these evaluations would not promote teach-to-the-test instructional malpractice. They further assumed that the imposition of an accountability system that is biased against high-challenge schools would not drive too much teaching talent out of the inner city. They never seem to ask whether they themselves would tackle the additional challenges of teaching in a low-performing school when there is a 15 to 25% chance PER YEAR of being misevaluated.
“Now that these hurried, top-down mandates are being retrospectively studied, even pro-VAM scholars have found minimal or no benefits, offset by some obvious downsides. I wonder if they will try to tackle the real research question, try to evaluate IMPACT and similar regimes, and thus address the biggest danger they pose. In an effort to exit the bottom 5% or so of teachers, did the test and punish crowd undermine the effectiveness of the vast majority of educators?”
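The Bitler team’s height placebo that Thompson describes is easy to reproduce in a toy simulation. The sketch below is my own illustration, not code from any of the studies cited: it assigns students to teachers at random, gives every teacher zero true effect on a standardized “height” outcome, and then computes a naive class-mean “value added” for each teacher. Sampling noise alone produces apparent teacher “effects” of roughly 0.2 standard deviations, about the size often reported for teacher effects on reading and math.

```python
import random
import statistics

random.seed(0)
N_TEACHERS, CLASS_SIZE = 500, 20

# Placebo outcome: standardized "height" scores.
# By construction, teachers have ZERO true effect on this outcome.
naive_vam = []
for _ in range(N_TEACHERS):
    classroom = [random.gauss(0, 1) for _ in range(CLASS_SIZE)]
    # A naive "value added" estimate: the class-mean outcome.
    naive_vam.append(statistics.mean(classroom))

spread = statistics.stdev(naive_vam)
print(f"SD of estimated teacher 'effects' on height: {spread:.2f}")
# Sampling noise alone yields roughly 1/sqrt(20), about 0.22 sd, of apparent
# "teacher effects" on an outcome no teacher can influence.
```

A real VAM adjusts for prior scores and demographics, but the core problem this sketch illustrates remains: with one year of data and a classroom-sized sample, the noise in the estimate is large relative to the effect being measured.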
When will the time arrive when we can say that VAM is stupid, just plain stupid? Lipstick on a pig. Snake oil (and snake oil salesmen and women). Scientistic nonsense?
So why does serious discussion of such nonsense persist? Because of the money and the advantage it provides the children of rich reformers, whose time and childhood are not squandered because they attend private schools that would never waste a penny or a school moment on such…stupidity.
This seems like good news:
“In an acknowledgement that Georgia went too far with a 2013 law that mandated the heavy use of tests in teacher job evaluations, the state House of Representatives voted unanimously Tuesday to roll back the use of tests.
The House vote follows a unanimous Senate vote last month. Though the bill was tweaked and must return to the Senate for a final signoff, its sponsor, Sen. Lindsey Tippins, R-Marietta, says he sees no reason why it won’t pass.”
They have “test fatigue” in Georgia 🙂
I shudder to think how much they spent on it. Couldn’t they have evaluated it small-scale first?
http://getschooled.blog.myajc.com/2016/03/15/georgia-lawmakers-suffer-from-test-fatigue/
Since Gates was unable to prove that teachers are the most significant force in producing improved scores, he is moving on to CBE. The advantage to Gates is that CBE is free of the influence of teachers. Life is much easier when “Robbie the Robot” is in charge, and it has great potential to bring lots of cash for Gates and company. Once again, the tainted politicians are allowing Gates to sell his unvetted wares, and they even wrote CBE into ESSA without a shred of evidence! Our children are guinea pigs for billionaires! http://www.schoolsmatter.info/2016/03/get-evidence-before-you-pass-law-not.html
“Polikoff and Porter applied the methodology of the Gates Measures of Effective Teaching (MET) project to a sample of students in the Gates experiment, and found no such links: “Nor did the study find associations between ‘multiple measure’ ratings, which combine value-added measures with observations and other factors.”
The MET methodology was a farce, and the leader of that work, economist Thomas Kane and his colleagues, ought to be called out for malpractice, along with Gates, who not only paid about $68 million for that project but has recently given grants to doctoral students who will recycle and analyze and keep massaging the deeply flawed data from that project. I read one of those pathetic studies recently in Educational Researcher.
Flawed is an understatement. Charlotte Danielson’s framework was used in the MET project, but not for observations in real classrooms. Never mind. Just build out a lot of claims about observations and VAM based on raters who watched videos of teaching, made and chosen by teachers who got to keep the equipment as a perk. The number of “trainings” and observations required to get decent “reliability” from the raters was ridiculous, not possible in real schools.
I have read enough of the publicity and research from economist Thomas Kane, including the $64 million-plus Gates-funded Measures of Effective Teaching (MET) project, to know that he and others who contributed to that project are key players in marketing VAM. Their reputations as experts in education are built on statistics, fancy number crunching, and inferential leaps through thin air. Kane’s testimony in Congress spurred the “multiple measures” thing in teacher evaluation, with totally illogical schemes for weighting these and producing a single score/judgment.
In fact, results from VAM have been so aggrandized and become so ubiquitous that other measures are said to be valid if they have the slightest correlation with VAM. For example, economist Ron Ferguson’s student survey of teachers, developed for the MET project (now branded the Tripod survey), gained “credibility” in this way. Danielson markets her framework, proudly noting its use in the project.
None of these high-profile sources of data have predictive validity for full-spectrum learning: learning in every subject and every grade and for every group of students now gathered into accountability schemes. Yet all are marketed as if some sort of magic for promoting better education can be found in numbers.
The scores of students on statewide tests are essential for calculating VAM. These annual tests are still foisted on students and teachers by federal and state regulations. There is not an ounce of regulation in the testing industry. Test makers offer no guarantees of the integrity or validity of the data produced by the tests.
As far as I can determine, one of the first uses of VAM dates back to the 1970s dissertation of economist Eric Hanushek, who is even more notorious for pushing VAM than Thomas Kane. Hanushek is one of the most widely cited researchers in the charter industry. He is one of the three economists who invented the total fiction that students gain or lose X days of learning, based on standard deviations in the “growth” measures calculated by VAM.
It is 2016. VAM and the MET project live on and continue to do damage. For another, much earlier debunking of the MET project, see:
Rothstein, J., & Mathis, W. J. (2013). Have we identified effective teachers? Culminating findings from the Measures of Effective Teaching project (Review). Boulder, CO: National Education Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-METfinal-
Laura, Thanks for your insights and the way you connect the dots. You prove that when a lie is repeated enough, it becomes accepted as the truth, especially when policymakers are “blinded by faux science.”
“VAM is flawed”
VAM is fraud
Not simply flawed
Those who know
Just won’t let go
One of the many utter FAILURES:
NCLB: FAIL
AYP: FAIL
RTTT: FAIL
CCSS: FAIL
PARCC: FAIL
SBAC: FAIL
VAM: FAIL
EdTPA: FAIL
TFA: FAIL
CBE: FAIL
USDOE: FAIL
BROWN: FAIL
KING: FAIL
DUNCAN: FAIL
COLEMAN: FAIL
CUOMO: FAIL
RHEE: FAIL
GATES: FAIL
A track record of complete and utter FAILURE.
Why do they have even a shred of credibility anywhere with anyone?
One more.
SLOs: FAIL
The reform crowd cannot point to one single initiative that worked to improve teaching or learning. They can’t even improve their own test scores without manipulating the data.
Is there any veteran teacher (20+ years) who has noticed an overall improvement in math skills? reading skills? writing skills? listening skills? speaking skills? HOTS? problem solving skills? creativity? critical thinking skills?
Can test-based (NCLB/CCSS) reform even pass the smell test?
You may be interested to know that this blog entry is blocked by the DCPS netadmin.
DCPS is rolling out a rebranded IMPACT system next year, called LEAP. It is not based on research.
This post is an excellent, critical response to the NBER paper. Thank you.
There is an inescapable truth about student performance in DCPS. It is a national embarrassment. The initial promise of IMPACT remains unrealized; the persistence of IMPACT is testament to something about the culture of the district in general.
Perhaps this blog post, and other responses to the NBER paper, will inspire deeper subsequent study.
(I am a former education professor who returned to K-12 and is teaching in DCPS. I am “highly effective” and have elected to leave at the end of the year.)
Bryce, I am not surprised that the DC schools are blocking criticism of their big reform policy. It is junk science. Bogus.
This should read, The Utter Failure of Standardized Testing (period!) There is no “norm” in human beings, so how can there be a norm-referenced test that has any meaning at all? There is no average when you look at more than a single factor, so how can a test label learners as above and below average? Why aren’t people questioning the tests themselves rather than even arguing about their use to evaluate teachers?
The saddest thing is that this silly value-added along with Marzano’s rubric on the teacher evaluation system has made so many outstanding veteran teachers feel worthless and useless. The reality is that no matter how hard we try, teachers cannot make a student learn when the student does not want to learn.
The dentist should not lose his job when the same patient comes back with constant toothaches, just like a teacher should not lose his/her job when Billy’s scores come back low. Well, let’s see…..Billy stays up playing video games till midnight every night. Billy skips breakfast every morning. Billy never completes any of his homework. Oh, yes, Billy has also missed 24 days of school. Billy refuses to write anything down as you teach. Billy does not care about his education. But, the poor teacher is desperate for Billy to do well. Billy will never do well until Billy and his family make education a priority in their home.
I believe educators need less emphasis on standardized testing in schools. Even students in the youngest grades are subject to testing, which is not always developmentally appropriate for every child. Student test scores should not be used to determine the effectiveness of a teacher. By connecting tests to teacher evaluation, teachers are feeling less joy in their profession. Testing, in general, is taking the joy out of teaching for teachers and learning for students. I believe this is making many teachers leave the profession, and those who stay feel added stress.
Many teachers start to teach solely to the test, and unplanned teachable moments and mini-lessons diminish. In my personal experience, I’ve known teachers who spend more time teaching the standards that will be on a particular test because they want the students to do well on the assessments, which affect their evaluation. This practice is not fair to the students or the teachers. Nor does it give the teacher an overall knowledge of what their students truly know. What does it matter if students memorize how to do well on the test so a teacher can look good on an evaluation, if they can’t apply the skills in their own lives? This current format will leave many students behind as adults.
I believe there should be some district-created standardized testing. This would hold teachers and students accountable for what is expected of them to teach and to learn during the year. These assessments should not be connected to teachers’ evaluations, so that their true purpose can be implemented fully. These tests should cover all of the standards in the curriculum. There would be no worry about teachers teaching to the test, because all standards should be covered during the semester/year and student test scores would not negatively affect the teacher. With this change, students can be celebrated for what they know, and teachers can be celebrated for their instruction.
As Diane posted about a week ago, there is no reason at all to even waste our time arguing about standardized tests. They are totally invalid because 1.) learning (other than memorization) can’t be measured; and 2.) standardized tests based on the Bell Curve are invalid when applied to human beings. (Please everyone…read Todd Rose’s recent book The End of Average for the sordid tale of where the whole concept of average and standard began.)