What exactly is “standardization” in assessment design?

I’m going to do my best to keep this really short and concise and write according to The Notorious RBG: “Get it right, keep it tight.”

Peter Greene made a claim that the correct number of standardized tests is zero.
I presented a counterclaim that standardization isn’t the problem.
Greene expanded on his claim to clarify his intentions around the tests.

While reading Peter’s updated claim, I realized that at no point was the word “standardized” actually defined. We both gave our opinions on what it means:
From Greene: “Standardized” when applied to a test can mean any or all (well, most) of the following: mass-produced, mass-administered, simultaneously mass-administered, objective, created by a third party, scored by a third party, reported to a third party, formative, summative, norm-referenced or criterion referenced.

From Me: Welp, first, minus ten to me because I didn’t state a definition; I asked questions that implied one. So, to restate the intention of my questions: standardized means doing the same thing for a group of students. The “thing” can be the nature of the task, the amount of time, the scoring criteria, or the directions to the students.

From Standards for Educational and Psychological Testing, AKA “The Testing Standards” (basically the sourcebook for writing a quality measure of student learning), written by AERA, APA, and NCME (2014):

 A test is a device or procedure in which a sample of an examinee’s behavior in a specific domain is obtained and subsequently evaluated and scored by a standardized process. Tests differ on a number of dimensions… but in all cases, however, tests standardize the process by which test takers’ responses to test materials are evaluated and scored. 
According to the alpha and omega, a test by its very nature is standardized, which makes the phrase “standardized test” redundant, it seems.

From the Code of Fair Testing Practices, which is a supplementary document for the Testing Standards.

The Code applies broadly to testing in education regardless of the mode of presentation, so it is relevant to conventional paper-and-pencil tests, computer-based tests, and performance tests.… Although the Code is not intended to cover tests prepared by teachers for use in their own classrooms, teachers are encouraged to use the guidelines to help improve their testing practices.

From Stanford’s primer on performance-based assessments:

[Describing performance-based assessments] Teachers can get information and provide feedback to students as needed, something that traditional standardized tests cannot do.

…. in the early years of performance assessment in the United States, Vermont introduced a portfolio system in writing and mathematics that contained unique choices from each teacher’s class as well as some common pieces. Because of this variation, researchers found that teachers could not score the portfolios consistently enough to accurately compare schools. The key problem was the lack of standardization of the portfolios.

Here, the authors use standardized in two ways: first to refer to the multiple choice test we tend to picture when we hear “standardized test” and then to refer to the process of creating a uniform approach to scoring student writing samples.

From Handbook of Test Development, edited by Downing & Haladyna:

The test administration conditions – standard time limits, proctoring to ensure no irregularities, environmental conditions conducive to test taking, and so on – all seek to control extraneous variables in the experiment and make conditions uniform and identical for all examinees. Without adequate control of all relevant variables affecting test performance, it would be difficult to interpret examinee test scores uniformly and meaningfully. This is the essence of the validity issue for test administration.

Now, for the kicker. Why does any of this matter? Because of assessment literacy. Peter and I are reading the same book but we’re not on the same page, as it were. He’s a teacher; I’m out of the classroom, working with teachers around assessment design. This isn’t an issue of “He’s right and I’m wrong” or “I’m the expert, trust me.” It’s more compelling, instead, to consider the implications – and there are many – of how we talk about testing and assessment, from teacher preparation, to academic writing, to communicating with parents and the public. I suspect that until the profession agrees on a common glossary, we’re going to keep nibbling at the edges.


Unintended Consequences of Making Standardized Tests the Enemy

President Obama’s “2%” video has generated a number of claims, counterclaims, rants, and praise. Larry Ferlazzo rounded up many of them if you’re looking for texts that support your opinion or challenge it. There’s a lot to be said about it, and there’s a text written for basically every possible counterclaim or supporting claim. My general takeaway vacillates between “meh” (unless Congress changes ESEA, state-wide annual tests will persist) and “yeah! let’s talk about what healthy assessment systems look like!” The challenge remains, though, around how to deal with mandates – to leverage them to support student learning, to ignore them, to advocate that parents opt their children out to send a message, or door #4. Regardless, it’s a deeply personal decision every educator, school, and district must make.

That, however, is not the issue at hand or why I dusted off my semi-irregular blog. I get ranty about semantics and I own it. It comes from a place of absolute adoration for the teaching profession. Since my first ed course in college, I’ve loved the hubris that comes with labeling, describing, and attempting to capture the unseeable. It’s the place that made me comfortable as a young teacher to speak up when a teacher’s aide referred to my students with special needs as “TMR.” In her day, “Trainable Mentally Retarded” was an acceptable moniker for a certain type of student. I had the concept of “people first language” to fall back on and the rules of the framework to help me find the words to speak up and find a way through the awkwardness.

It’s the love for the language of teaching and learning that rears up occasionally and results in me offering unsolicited opinions. Before I offer up the claim I disagree with for semantic reasons, I’d like to lay out the evidence for my counter-claim.

  • Ever give the same test or task to a group of students?
  • Ever use an answer key to score students’ tests?
  • Ever ask students to hand in their work after a certain period of time?
  • Ever assign a grade on a scale of 0-100 with a pass/fail cutpoint (e.g., 65)?

If your answer to any of these questions is yes, then you’ve given a standardized test.

  • Ever talk with other teachers to reach consensus around what quality work looks like?
  • Ever set aside examples of student work to refer to later as an example of a “good” paper?
  • Ever review a task to make sure it’s fair, accessible to students with disabilities, free of bias?

If your answer is yes, then you’ve used standardization to help you do your job.

In response to Obama’s comment, Peter Greene wrote a post titled The Correct Number of Standardized Tests. His claim is that the correct number is zero. He says, “Students need standardized tests like a fish needs a bicycle.” If that’s the case, then it raises a whole slew of questions about grades, final exams, and what it means to fairly assess students. Meanwhile, the NYS Performance Consortium uses standardized processes.

These teachers in NH are using standardized tests (which are actually performance tasks) to assess their students. And more to the point, when you ask Kiana Hernandez about standardized tests, she talks about those created by the state, her district, and her teachers.

I recognize that his point isn’t about standardized tests per se but the large-scale, once-a-year tests. However, he ends his post with “the number of necessary standardized tests is zero.” I’m all about performance tasks, portfolios, and authentic assessment. I’m all about the maker movement and kids doing things in school that have meaning to them outside of school. I’m also all about the profession of teaching and having a robust public education system. I want there to be a standardized approach to how we collect large-scale evidence of learning in order to inform systemic decisions and policy. Right now, we’re using multiple-choice tests because they’re easy, familiar, and faster than the alternative. Hopefully, we’ll move to a system like what NH is cooking up, where the tasks are embedded within the curriculum and assessment is a part of learning, not an interruption. But even when we get there, there will still be a need for some degree of standardization. To suggest that standardization itself is a problem … well, that’s a whole nother can of worms about the purpose of public education and society.

The aide in my school way back when didn’t mean my students were “trainable.” She didn’t think less of them; she was used to “TMR” and rattled it off as a placeholder for “the students with mild to moderate disabilities in the 15:1 math class but in a general education class for Science and Social Studies.” I suspect Greene is using “standardized tests” as a placeholder for “the tests given once a year that cause an incredible amount of stress and do little to inform what happens in my classroom.” I go back to the questions I asked in my semantics post: What’s gained or lost by using or not using precise language? What does the profession gain or lose when using shorthand to refer to a given concept or idea?

My CCLS Changes and Recommendations

Earlier today, NYSED released the link to the tool they’re using to collect recommended changes to the Common Core Learning Standards. The CCLS are slightly different than the original CCSS in a few key ways: NYS added #11 to reading about literature, a handful of other standards around creativity, culture, and choice and one or two to the math standards.

The survey is clear about what it is and what it isn’t. It’s not a place to share general opinions about the giant ball of sticky wax referred to as “Common Core.” It’s a place to comment on individual standards. Each. Individual. Standard. Which, according to my Excel file, is 1115 literacy standards. I will be sharing feedback with SED based on my experiences around curriculum and assessment over the last four years. Some of my feedback will include:

* Creativity was added during the adoption process – in some grades, though, it appears under Reading Informational Texts, in others it’s in Speaking and Listening, and in some it’s both. I’d advocate for putting them all under Speaking and Listening like it is in 12th grade.

* Cultural connections in Kindergarten and First Grade (RL.9a) is worded oddly. I suspect it’s about inviting students to see connections between their own lives and the experiences of those in a book they’re reading but it should be cleaned up and clarified to ensure alignment in curriculum design.

* The “seek to understand” standard has always been one of my favorites but, like cultural connections, the wording seems a bit hasty. I’ve drafted a proposed re-write based on work from anti-racism/cultural competency educators.

* The study of dialects and accents appears only in 5th grade. Feels like a waste of an opportunity to invite students to engage with the English language and all its odd quirks. I have ideas for how to expand that into other grades.

* In some of the original CCSS, there’s a sense of writing by committee that becomes clear when you’ve stared at the standards many, many, many times during design sessions. Small things like a word being used in 3rd, dropped in 4th, then reappearing in 5th. Because that’s the kind of person I am, I have a running list of those odd quirks and will be passing them along.

Those who advocate opting out of state tests have reported that, in order to opt back in, they want the standards to be fixed so they are developmentally and age-appropriate. Let’s hope they pass along their feedback around what changes need to be made to ensure that happens.