QUANTIFYING STUDENT GROWTH:
ANALYSIS OF THE VALIDITY OF APPLYING GROWTH MODELING TO THE
CALIFORNIA STANDARDS TEST
by
Marvin Horner
A Dissertation Presented to the
FACULTY OF THE ROSSIER SCHOOL OF EDUCATION
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF EDUCATION
August 2009
Copyright 2009 Marvin Horner
DEDICATION
For my mother, who always showed me unconditional love;
my father, who always believed in all I do;
the United States Marine Corps who have fostered in me
the discipline needed to achieve my goals;
and especially my spouse and partner, C.J. Watrous, who is my greatest supporter.
All of you have made this possible.
ACKNOWLEDGEMENTS
To my dissertation chairperson, Dr. Dennis Hocevar, and my committee members, Dr.
Richard Brown, the late Dr. John Zimmer, and Dr. Robert Keim.
TABLE OF CONTENTS
DEDICATION ii
ACKNOWLEDGEMENTS iii
LIST OF TABLES vi
LIST OF FIGURES viii
ABSTRACT ix
Chapter
1. INTRODUCTION 1
Background of the Problem 1
Validity and Reliability of Accountability Models 4
Legal Background 6
Existing Models of Student Performance Assessment Analysis 11
Current Practice in California Schools 13
Purpose of the Study 14
Research Questions 16
Design Summary 20
2. REVIEW OF THE LITERATURE 22
Existing and Proposed Models of School Accountability 22
Validity in Student Assessment 29
3. RESEARCH METHODOLOGY 36
Purpose and Importance 36
Research Questions 36
Research Design Overview 41
Participants and Setting 42
Methods 44
4. RESULTS 53
Comparison of Evaluation by Varying Value Tables 53
Discriminant Validity and Fairness of the Model 65
Utility of the Model 77
Impact of Policy Changes 79
5. DISCUSSION 81
Comparison of Evaluation by Varying Value Tables 81
Discriminant Validity and Fairness of the Model 85
Utility of the Model 87
Impact of Policy Changes 89
Implication 90
Limitations 92
Conclusions 95
References 98
LIST OF TABLES
Table 1-1 A Model for Understanding Validity 5
Table 1-2 Performance Band Weights for California’s
Academic Performance Index 8
Table 1-3 NCLB Value Table 9
Table 1-4 API Value Table 18
Table 2-1 Example of a Value Table 28
Table 3-1 NCLB Value Table 38
Table 3-2 API Value Table 38
Table 3-3 Proposed Value Table for Analyzing Student Growth Data 39
Table 3-4 Value Table Consistent with NCLB Safe Harbor Calculation 40
Table 3-5 School Demographic Characteristics (2005) 42
Table 3-6 School Achievement Status 44
Table 4-1 Comparison of Cohort to State and District – Status Model 54
Table 4-2 Comparison of Cohort to State and District – Improvement Model 55
Table 4-3 NCLB and API Comparisons of Student Status
Performance on the ELA CST 56
Table 4-4 Cohort Transition Matrix, Grades 8-9 60
Table 4-5 Cohort Transition Matrix, Grades 9-10 61
Table 4-6 Cohort Growth Scores, Grades 8-9
Using the Proposed Value Table (3-3) 61
Table 4-7 Cohort Growth Scores, Grades 9-10
Using the Proposed Value Table (3-3) 62
Table 4-8 Safe Harbor Calculation Matrix, Grades 8-9
Using Hill et al. (2006) Safe Harbor Value Table 63
Table 4-9 Safe Harbor Calculation Matrix, Grades 9-10
Using Hill et al. (2006) Safe Harbor Value Table 63
Table 4-10 Analysis of Variance for Student Growth Scores Grade 8-9 66
Table 4-11 Analysis of Variance for Student Growth Scores Grade 9-10 67
Table 4-12 Ethnicity/Language Group Mean Growth Grade 8-9 68
Table 4-13 Ethnicity/Language Group Mean Growth Grade 9-10 68
Table 4-14 Base Cohort School Mean Growth Grade 8-9 69
Table 4-15 Base Cohort School Mean Growth Grade 9-10 69
Table 4-16 Tukey HSD Summary for Ethnicity/Language Group Differences;
Grades 8-9 70
Table 4-17 Tukey HSD Summary for Ethnicity/Language Group Differences;
Grades 9-10 70
Table 4-18 Tukey HSD Summary for School Pairwise Comparisons;
Grades 8-9 72
Table 4-19 Tukey HSD Summary for School Pairwise Comparisons;
Grades 9-10 75
Table 4-20 Pearson Correlation Coefficients of Growth Scores
and Other Indicators of Success 78
Table 4-21 Student t-Test of Students in Basic Compared to Students in Far Below
and Below Basic Performance Bands 79
LIST OF FIGURES
Figure 4-1 Comparison of Percent Proficient or Above, Grades 8-10 57
Figure 4-2 Comparison of API, Grades 8-10 58
Figure 4-3 Cohort Mean Growth Using Proposed Value Table 59
Figure 4-4 Cohort Mean Growth Using Hill et al. (2006) Safe Harbor Value Table 64
ABSTRACT
In this study, the validity of applying growth modeling to the California
Standards Test (CST) in assessing schools and academic programs is evaluated. The
2001-2002 freshman classes of 10 inner city high schools, 13,160 students, formed the
study cohort. Evaluations from four value tables used to analyze the cohort’s growth
were compared. The validity of using changes in CST scores to measure growth was
analyzed, comparing the ten schools and four groups of students: African American
students and Hispanic students in three language categories, Limited English Proficient (LEP),
Reclassified Functionally English Proficient (RFEP), and English Only/Initially
Functionally English Proficient (EO/IFEP). The utility of using CST growth scores
was investigated by comparing them to marks, changes in marks, enrollment,
attendance, and CST status scores. Evidence of a disparate impact of the No Child
Left Behind (NCLB) Act on lower performing students was sought.
Four value tables were analyzed: one matching California’s Academic
Performance Index, one matching NCLB, an author-generated growth model table,
and a growth model table matching the NCLB Safe Harbor provision. The valuations
derived from the author-generated value table assessed the cohort as drastically less
behind the benchmark than did the other tables.
Hispanic LEP students had higher growth than did other students. The lowest
performing schools showed the highest growth. School growth scores were not
consistent over time. Growth scores were negatively correlated with other measures of
student achievement and with each other. No evidence was found that NCLB was having
a disparate impact upon the lowest performing students or that schools were targeting
bubble students, those in the basic band. The lowest performing students showed more
growth than bubble students did.
The findings in this study are consistent with regression towards the mean.
Using raw growth scores that are not vertically equated may not be a valid way to
evaluate schools. However, this conclusion should be interpreted in light of the fact
that there was a restriction-of-range problem: only lower performing inner city
high schools were evaluated, and the dropout rates for these high schools were in the
40-60% range.
CHAPTER 1
INTRODUCTION
Background of the Problem
Public educators hold the growth of students as a common value. As Congress
begins to debate the reauthorization of the No Child Left Behind (NCLB) Act of 2001,
the executive branch has included growth modeling as an option within its proposal
for school and district accountability, albeit with serious restrictions on how it may be
utilized (United States Department of Education, 2007). One important limitation of
what the executive branch is proposing is that conditional models, which consider the
influences of known demographic covariates, are specifically forbidden in the
currently proposed plan. In evaluating education programs, such as a new reading or
math program, educators are essentially asking the same questions about measuring
and assessing student achievement and growth in trying to determine whether the program in
question will lead to improved student performance.
Measuring student growth has become a particular challenge. A plethora of
variables confounds most measurement efforts. With the advent of NCLB and
increased scrutiny of aggregated student performance data, more attention has been
paid to the current state of cohorts of individual students and to the differences among
successive cohorts of students than has been given to the growth that each individual
student may have shown over time. Current research literature is exploring
the possibility of using growth and value added modeling for a variety of purposes
including accountability. The Tennessee Value Added Assessment System has
been using value added modeling to measure individual student growth as part of its
school and teacher accountability system. However, to date, there is little consensus
among researchers as to the validity of such a high stakes use of value added modeling
(Goldschmidt et al., 2005; Martineau, 2006; McCaffrey, Lockwood, Koretz, Louis, &
Hamilton, 2004).
Measurement of student growth and of the value added by educational
programs, schools, or teachers is full of controversies. One major controversy
regarding the use of growth modeling for accountability purposes centers on the use of
conditional modeling. Conditional modeling refers to any value added model that uses
statistical controls to tease out the effects of known covariates, thereby adjusting the
expected outcomes for students. For example, the evaluator may choose to control for
poverty, race, gender, and so forth. Tekwe et al. (2006) point out that the use or disuse
of these mathematical adjustments in creating regression models represents an ethical
dilemma. By ignoring these known covariates, the institution that is being evaluated,
whether it is a school, a teacher, or a program, is held accountable for the entire
covariance, a substantial portion of which is outside of the institution’s control.
Conversely, by controlling for these factors, the institution is absolved of all responsibility it
may have in contributing to the observed covariance. This absolution of responsibility
is recognizably similar to what Secada et al. (1998) referred to as the trap of the
pobrecitos – the notion that certain students are poor little children to be pitied
because institutions such as schools are ineffective at supporting their
success.
Tekwe et al. (2006) assert that neither of these two options seems acceptable,
particularly in a high stakes environment such as school and teacher accountability.
While the first option, ignoring the known covariates, holds people accountable for
factors that are outside of their control, statistically controlling for these factors sets up
a separate and apparently lesser level of expectations for certain students. Thus, if high
stakes evaluations are performed at the school or teacher level, without controlling for
external factors, teachers who work in high poverty, high minority areas may be
disproportionately and unfairly rated lower. However, if we control for these factors,
those same teachers will have no responsibility whatsoever to raise the level of
expectations for their students.
In this study, I consider the practical validity of growth modeling in general. It
is essential in this effort to consider both conditional and unconditional models. In
California, the problem of developing a valid growth model while operating within the
current statewide testing practices is compounded by the fact that the California
Standards Test (CST) is not vertically aligned. Given the current political status of the
CST, it is unlikely that a vertically aligned test will be developed in California at this
time. Most of the current literature regards vertical alignment as an essential
component of most growth and value added models, particularly in the more
sophisticated regression models. The exceptions to this general case will need to be
explored as well (Lissitz, Doran, Schafer, & Willhoff, 2006).
The focus of this project will be to assess the validity of the use of growth and
value added models that may be used at school sites or within districts, particularly for
the purpose of evaluating the impact of educational programs. Although there is a
great deal of interest in developing these models for use in teacher evaluation or for
ranking teachers, the constraints inherent in the types of data that are generally
available to school leaders make it doubtful that a model sufficiently reliable for
such high stakes uses can be developed in the foreseeable future (Andrejko, 2004; Martineau,
2006; Tekwe et al., 2004). Much of the existing literature does argue that following
intra-student growth data is a better approach to assessing school and program quality
than is the tracking of student achievement status or comparing subsequent cohorts
(Amrein-Beardsley, 2008; Lissitz, Doran, Schafer & Willhoff, 2006). At the same
time, Linn (2006) has shown that this intra-student longitudinal approach can yield
very different results from the more familiar successive cohorts model that is currently
in use in California.
Validity and Reliability of Accountability Models
The issues of validity and reliability in testing have a long history of academic
discussion and debate. Traditional models of the concept of validity, which included a
long list of different, seemingly unrelated concepts of validity, have been essentially
replaced by a unitary concept of validity (Linn & Miller, 2005; Messick, 1994). The
unitary validity concept emphasizes that the validity of the use of a test for a given
purpose is an overall construct, rather than a series of unrelated evaluations. Whereas
this construct of a unitary validity has been criticized as unusable in practical terms for
evaluating test validity, and is perceived as being too close to the idea of a singular
notion encompassing the entirety of validity (Brennan, 1998), a more balanced
approach to test validity is desirable. Lissitz and Samuelsen (2007) propose a construct
of validity, which organizes the various aspects of the traditional understanding of
validity while acknowledging the all-encompassing nature of validity that Messick
puts forward. In their work, Lissitz and Samuelsen argue for interpreting validity by
considering both the perspective and the focus of the analysis. In this way, they devise a
simple matrix (Table 1-1) of validity.

Table 1-1
A Model for Understanding Validity

Focus        Theoretical Perspective    Practical Perspective
Internal     Latent process             Content validity; Reliability
External     Nomological networks       Utility; Impact

Source: Lissitz and Samuelsen (2007)

In this study, I have focused my analysis of
growth modeling using the practical perspective that Lissitz and Samuelsen put forth.
In their model, content validity considers whether the test measures the content that it
intends to measure. Utility relates to the usefulness of the data, and impact addresses
whether the actual use of the test is in line with its intended use and with the data that
the test generates. Reliability is a broader idea than the traditional notion of reliability
and deals with the consistency of the test results across various groups and across
time.
Legal Background
NCLB established the legal basis for the current system of school
accountability. In accordance with this law, states are required to establish and assess
students using state defined standards of student performance. Schools and districts
that receive Title I funds must demonstrate adequate yearly progress (AYP), as
defined by each of the individual states, towards the eventual goal that all students will
meet the state’s proficiency benchmark by the spring of 2014. In California, AYP for
elementary and middle schools is defined by a series of criteria to be met by annual
testing (California Department of Education, 2007a). The criteria include testing
participation, Annual Measurable Objectives – which are the percent of students who
meet each of the proficiency benchmarks, and either a minimal overall score or
sufficient improvement on the state’s Academic Performance Index (API). High
schools use the first administration of the California High School Exit Exam to meet
the participation rate and Annual Measurable Objectives requirements, and must meet
an additional graduation rate requirement. At all schools, the participation rate and
percent proficient requirements must be met both by the student body as a whole and
by each subgroup of the student body that the state deems to be numerically
significant, as defined by the California Department of Education (2007b). A numerically
significant subgroup is any subgroup that includes at least 100 students or, at smaller
schools, a subgroup that comprises both at least 50 students and at least 15% of the
student body as a whole.
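The subgroup rule above can be stated compactly in code. The sketch below is illustrative only; the function name and inputs are mine, and the thresholds (100 students, or 50 students making up 15% of enrollment) come directly from the rule described above.

```python
def is_numerically_significant(subgroup_size: int, school_size: int) -> bool:
    """Return True if a subgroup counts toward AYP under the rule described above."""
    if subgroup_size >= 100:
        return True
    # At smaller schools: at least 50 students AND at least 15% of the student body
    return subgroup_size >= 50 and subgroup_size >= 0.15 * school_size


# 60 students out of 350 is about 17% of enrollment, so the subgroup counts
print(is_numerically_significant(60, 350))   # True
# 60 students out of 500 is only 12% of enrollment, so it does not
print(is_numerically_significant(60, 500))   # False
```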
Prior to NCLB, in response to A Nation at Risk (The National Commission on
Excellence in Education, 2003) and Goals 2000 (1994), the California legislature
enacted the Public Schools Accountability Act of 1999, which established the state’s
API system. Under this legislation, each subject area test is weighted and student
scores are translated into five performance bands, namely advanced, proficient, basic,
below basic and far below basic. These student performance bands are also weighted
using a value Table. As Table 1-2 indicates, the difference in the weighting is vastly
greater at the lower end of the scale than the higher end. Because of this weighting, a
student that scores in the far below basic band is much more detrimental to a school’s
API than a student that scores in the below basic band. Under this system, schools
have little incentive to focus on improving instruction and learning for higher
performing students, or even for median performing students. Schools are thereby
encouraged by the API system to put resources into developing the skills of its lowest
performing students in order to avoid the heavy weight of students receiving far below
basic scores.
basic scores.

Table 1-2
Performance Band Weights for California’s Academic Performance Index

Band                     Weight   Proportional Difference
Advanced (A)             1000     14.3% higher than P
Proficient (P)           875      25% higher than B
Basic (B)                700      40% higher than BB
Below Basic (BB)         500      250% higher than FBB
Far Below Basic (FBB)    200

Source: California Department of Education (2007a)

By contrast, Annual Measurable Objectives, which are part of the federal AYP
accountability system under NCLB, use a value table that only considers whether
students in the target group meet the cut score that is established by the state. Hill,
Gong, Marion, DePascale, Dunn, and Simpson (2006) point out that, for this criterion,
all students who score in the proficient or advanced bands have met that cut score and
are treated equally as indicators of success, while all students who score in the basic,
below basic, or far below basic bands have not met the cut score and are treated
equally as indicators of school or district failure (Table 1-3).

Table 1-3
NCLB Value Table

                            Year 2
Year 1             Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic          0                0          0         1            1
Below Basic              0                0          0         1            1
Basic                    0                0          0         1            1
Proficient               0                0          0         1            1
Advanced                 0                0          0         1            1

Source: Hill et al. (2006)

Since the sanctions associated with school or district failure can be severe, ranging
from a reduction of local control over the use of federal funds to the imposition of
major restructuring of the school or district by the state, school leaders take these
indicators very seriously.
As a result, in order to address the cut score issue inherent in the federal AYP
accountability system, school leaders find themselves redirecting resources to address
the needs of a narrow band of students, specifically those who are near the cut score
(Hamilton et al., 2007; Linn, 2006). These may be either students whom school leaders
hope will be able to improve slightly, thus making it into the proficient band, or those
students whose performance is barely above the cut score and are in danger of not
being able to maintain performance above it in the upcoming round of testing.
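To make the contrast between the two valuation schemes concrete, the sketch below applies the API weights from Table 1-2 and the 0/1 values from Table 1-3 to the same hypothetical distribution of students. The band counts are invented for illustration and are not drawn from the study data.

```python
# Hypothetical distribution of 100 students across the five CST performance bands
counts = {"FBB": 15, "BB": 20, "B": 30, "P": 25, "A": 10}

# API-style weights (Table 1-2) and NCLB-style 0/1 values (Table 1-3)
api_weights = {"FBB": 200, "BB": 500, "B": 700, "P": 875, "A": 1000}
nclb_values = {"FBB": 0, "BB": 0, "B": 0, "P": 1, "A": 1}

n = sum(counts.values())
api_score = sum(counts[b] * api_weights[b] for b in counts) / n
pct_proficient = 100 * sum(counts[b] * nclb_values[b] for b in counts) / n

print(f"API-style weighted score: {api_score:.0f}")        # about 659 for these counts
print(f"Percent proficient (AYP): {pct_proficient:.0f}%")  # 35% meet the cut score
```

Under these made-up counts, moving one student from far below basic to below basic raises the API-style score by 3 points but leaves the percent proficient unchanged, while moving one basic student to proficient raises both, which is exactly the difference in incentives described above.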
While the federal AYP accountability system and the state API system run
simultaneously, they do have distinct foci and different consequences. The AYP
system has the effect of leading school leaders to attend to those students who are near
the bar that divides proficient and basic levels of performance. By doing this, schools
maximize the number of students that achieve above that established benchmark. The
API system punishes schools for having students at the lowest levels of the scale,
below basic and especially far below basic. This encourages school leaders to attend to
the lowest performing students. However, with the exception of schools which have
accepted specific state grants that are specifically tied to gains in the API, the
consequences for not meeting state established API expectations are far less tangible
in comparison to the relatively heavy handed sanctions that are part of the federal
AYP system. Therefore, school officials feel pressured to make decisions and allocate
resources with AYP requirements in mind, with little if any regard to API
requirements. This policy change translates into school leaders feeling compelled to
address only the federal AYP system. School leaders do this by allocating remedial
resources to attempt to meet the needs of bubble students, or median performing
students who are near the proficient cut score, rather than more needy students who
are highly unlikely to make the gains necessary to achieve at the proficient level by
this year’s testing cycle (Hamilton et al., 2007; Linn, 2006). As a result, schools are in
danger of failing to address the neediest of students in order to survive year to year
without falling into the punitive side of AYP, which is called Program Improvement.
This effort to avoid Program Improvement status also has the potential of clouding
identification for individualized education plans. School leaders may be reluctant to
identify the 100th student at a school or within a district as having special needs, which
would have the effect of creating another subgroup that must be held to the AYP
criteria. This situation creates what Dubner and Levitt (2005) would consider an
unfortunate perverse incentive, which has the potential of trumping the benefit of the
neediest children whom the education agency is supposed to serve. Rather than
encouraging school leaders to make special education identification decisions based solely
on the needs of a given child, this system prompts school leaders also to consider not
identifying too many students as having special needs, for the sake of self-
preservation.
Existing Models of Student Performance Assessment Analysis
Across the states, two major types of accountability systems are currently in
use (Goldschmidt et al., 2005). These models are referred to as status models and
growth models. Both of these models are based upon student testing data, but in each
model, the data are interpreted in very different manners. Broadly speaking, status
models are snapshots of performance, considering only current student performance.
Growth models consider the change, or growth a student shows over the course of
time.
Status Models
Status models measure student performance as a snapshot and report the
results for a single point in time. This type of model has been the basis for school
accountability under NCLB and is the model currently used in California for school
accountability. Although status model reports roll out in consecutive years, they do
not have a meaningful longitudinal interpretation. That a given school’s fourth grade
class performed better or worse than the previous year’s fourth grade class does not
necessarily equate to better or worse fourth grade instruction. The two groups of
students are two wholly separate and non-comparable cohorts. The change in score
may be a result of better instruction in prior years, greater access to outside resources,
or simply the differing demographics of the cohorts. One type of status model, which is
called an improvement model, does look for (hopefully positive) changes in
subsequent cohorts as an indicator of school improvement; however, this modeling
cannot attribute the source of these changes in student performance to any particular
factor at a given school. This improvement model, which simply compares the
performance of successive cohorts, is the design that currently underlies the NCLB
safe harbor provision.
Growth Models
Growth models, including value added models, differ from status models in
that they focus on longitudinal data in respect to the “same students from one year to
the next” (Goldschmidt et al., 2005, p. 4). These models are interested in change in
individual student knowledge. Value added models, which are a subset of growth
models, incorporate a regression equation, which may include previous performance
along with known demographic correlates to generate an expected level of growth.
Then, in the value added model, actual student growth is compared to the expected
student growth, thereby allowing the resultant residual to be analyzed. This allows for
over or underperformance to be tracked in comparison to the level of performance the
model predicted. In this study, growth is regarded as distinct from improvement.
Growth refers to the change in performance as measured from one year to the next
while following the same student in a growth model. Improvement in this study refers
to changes in test scores from one year to the next at the same level while considering
successive cohorts in a status model. In other words, growth is the change in
performance a student demonstrates from year to year and improvement compares this
year’s class with last year’s class at the same grade level.
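As a minimal illustration of the value added idea described above, the sketch below regresses second-year scores on first-year scores (an unconditional model with no demographic covariates) and treats the residual as the value added estimate. The data are simulated, and the model is far simpler than the layered and hierarchical models discussed in Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated CST-like scaled scores for the same 500 students in two consecutive years
year1 = rng.normal(330, 40, size=500)
year2 = 0.8 * year1 + 70 + rng.normal(0, 25, size=500)

# Unconditional value added: predict the year-2 score from the year-1 score alone
slope, intercept = np.polyfit(year1, year2, deg=1)
expected = intercept + slope * year1

# The residual (observed minus expected score) is what a value added model analyzes
residual = year2 - expected
print(f"Mean residual: {residual.mean():.2f}")         # ~0 by construction of the fit
print(f"SD of residuals: {residual.std(ddof=1):.2f}")  # spread of over/under-performance
```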
Current Practice in California Schools
In California, students are assessed on various state tests. The Public Schools
Accountability Act of 1999 requires that the state’s API include both a criterion
referenced and a norm-referenced component. English learner students are also
assessed in English language development using the California English Language
Development Test annually. Currently, both the federal and state accountability
systems require the use of status-model accountability systems and growth or value
added models are not likely to be considered in line with these legally mandated
accountability systems (Goldschmidt et al., 2005). At present, the focus on avoiding
Program Improvement status is a key issue for many school leaders. To this end, much
attention has been placed on meeting Annual Measurable Objective requirements in
terms of percent of students at a school meeting the proficiency target on the CST each
year.
The state’s system to address AYP requirements does incorporate one aspect of
an improvement model in its manner of addressing the Safe-Harbor Provision of
NCLB (California Department of Education, 2007a). This provision allows that if the
percentage of students in subsequent cohorts that perform below the established cut
score for proficiency decreases by 10%, then the school is determined to be in safe
harbor. In practical terms, that means that if a school has only 10% of its students
performing at or above the cut score, hence 90% are below it, then it must have 19%
of its students perform above the cut score the following year to be in safe harbor
[10% + (10% of 90%) = 19%]. The state also incorporates a 75% confidence interval
into calculating safe harbor, which allows some flexibility in this percentage;
therefore, a school can be in safe harbor by reducing its percentage below proficient
on the CST by approximately 10%. Since this improvement in student performance
takes place over the course of two separate cohorts of students, it does not represent
the use of a growth model, but rather an improvement model, which is a subset of
status models. A school that is in safe harbor is treated as having met that particular
AYP criterion for the year. As Annual Measurable Objectives continue to climb
towards 100% in 2014, one can anticipate that the safe harbor provision will
increasingly be perceived as a more important alternative to the standard AYP criteria.
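As a check on the arithmetic in the example above, a small sketch of the safe harbor target calculation follows; the 75% confidence interval adjustment mentioned in the text is omitted, and the function name is mine.

```python
def safe_harbor_target(pct_proficient_year1: float) -> float:
    """Percent proficient needed in year 2 to meet safe harbor.

    Safe harbor requires a 10% reduction in the share of students below the
    proficiency cut score: 10% proficient (90% below) in year 1 requires
    10% + 0.10 * 90% = 19% proficient in year 2.
    """
    pct_below = 100.0 - pct_proficient_year1
    return pct_proficient_year1 + 0.10 * pct_below


print(safe_harbor_target(10.0))   # 19.0
print(safe_harbor_target(40.0))   # 46.0
```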
Purpose of the Study
California schools rely on a status model accountability system. At present,
none of the existing status models that are in use in California provide instructional
leaders with information that is intelligible, equitable and useful. Current data
reporting systems in California do not provide adequate information for leaders to
truly assess programs and make informed decisions about allocation of future
resources. Therefore, my purpose in this study has been to analyze the validity of
school leaders using longitudinal models for evaluating programs that may be more
informative and equitable. To this end, in this study I have addressed whether a
growth model, which measures intra-student change on the CST, is a fairer manner to
assess school and program quality than is the status model that is presently used in
California. In this pursuit, I have attempted to explore a potentially more informative
model for evaluating educational programs in California. The current status and
subsequent cohort models for school and program accountability are gravely flawed
strategies due to their heavy reliance on known counterfactuals. For status models to be
valid would require that they are not strongly correlated to factors that are irrelevant to
teacher and instructional quality, such as student demographics. For subsequent cohort
improvement models to be valid would require that subsequent cohorts come to a
grade level with equivalent backgrounds and prior achievement. The violation of these
two assumptions alone is sufficient reason to discard the use of status modeling for
accountability purposes.
Although there is substantial interest in the use of value added accountability
to evaluate both teacher and school effects, Martineau (2006) points out that it is
unlikely that a satisfactory model for use in a high stakes manner, such as teacher
evaluation, can be developed in the foreseeable future. The models I have explored in
this study are not able to ameliorate Martineau’s valid concerns. However, at present
the alternatives to using growth or value added models are far less appealing. Using
achievement status as a school and program evaluation tool is riddled with serious
issues of validity, particularly as related to fairness and discriminant validity.
Improvement models are similarly flawed because they rely heavily on known
counterfactual assumptions regarding the consistency of student readiness in
subsequent cohorts.
The CST provided the outcome variable for this study. This statewide,
criterion-referenced test is given to students annually in grades 2-11. Although the
CST is tied directly to the state’s standards, and California has done substantial work
to provide for the content validity of the test, subsequent tests have not been
deliberately scaled vertically. Indeed, the difficulty levels of the tests demonstrably
vary as is acknowledged by the state (California Department of Education, 2008b).
Lissitz, Doran, Schafer, and Willhoft (2006) noted that there is precedent for using
non-vertically scaled, criterion referenced data for this type of evaluation, particularly
when using simpler models, as I have examined in this study. Although most value
added models use vertical scaling, this option is not available using the CST and it is
unlikely that a vertically scaled CST will be developed in the current political
environment, which is dominated by the requirements of NCLB. Although the currently
proposed revisions of NCLB allow for the use of growth models, the use of
conditional growth models, which adjust for demographic covariates, has been
categorically rejected by the United States Department of Education (2007).
Research Questions
In this study, I addressed four essential research questions related to the
practical validity of the use of growth in analyzing student performance on the English
language arts CST. The research questions consider three of the four aspects of
practical validity outlined by Lissitz and Samuelsen (2007). These questions focused
first on comparing how various value tables (Hill et al., 2006) evaluate this study
sample, then on what Lissitz and Samuelsen regard as an aspect of reliability in terms
of discriminant validity and fairness of the methods that I examined. I then considered
the utility of the growth scores that I analyzed in this study, and finally I investigated a
concern about the impact of NCLB and attempted to see if this policy change has had
a disparate effect on lower performing students.
In order to compare the evaluations derived from a variety of value tables, I
first used the currently existing status and improvement models and value tables to
describe this cohort and two larger sets in which it is nested, namely all high schools
in the same district and all high schools in the state. For this analysis, I compared the
cohort sample to the larger system populations using two value tables, the NCLB
Value Table (Table 1-3) and the API Value Table (Table 1-4). For these analyses, I
used an improvement model, as a necessary concession to unavailability of
longitudinal student data at the population levels. I followed this descriptive analysis
with an evaluation of the cohort using a growth model that utilizes transition matrices,
which is an approach proposed by Hill et al. (2006). In their work, Hill et al. advance
the idea of using value tables, essentially relatively simple matrices, to analyze and
communicate student growth. In this study, I used two distinct value tables to assess
student growth within this cohort. The first is a value table that I am proposing for this
type of analysis (Table 3-3), which would have a benchmark of 100 points
representing acceptable growth. The second value table (Table 3-4) is one that was
created by Hill et al. and is intended to be in line with the federal definition of safe
harbor under NCLB. This value table is intended to have a benchmark of 10 points
representing adequate growth to meet the federal safe harbor definition.
Table 1-4
API Value Table

                            Year 2
Year 1             Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic        200               500        700       875         1000
Below Basic            200               500        700       875         1000
Basic                  200               500        700       875         1000
Proficient             200               500        700       875         1000
Advanced               200               500        700       875         1000

Source: Hill et al. (2006)
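The sketch below illustrates how a value table such as Table 1-4 is applied to a cohort transition matrix to yield a single cohort score; the transition counts are invented for illustration, and swapping in a different value table (for example, the proposed table or the safe harbor table) only requires changing the matrix of values.

```python
import numpy as np

# API value table from Table 1-4: the year-2 band alone determines the value
api_value_table = np.array([[200, 500, 700, 875, 1000]] * 5)

# Hypothetical year-1 to year-2 transition counts for one cohort
# (rows = year-1 band, columns = year-2 band, ordered FBB, BB, B, P, A)
transitions = np.array([
    [40, 30, 20,  5,  0],
    [25, 45, 35, 10,  0],
    [10, 30, 60, 25,  5],
    [ 0, 10, 30, 50, 20],
    [ 0,  0,  5, 20, 25],
])

n_students = transitions.sum()
cohort_score = (transitions * api_value_table).sum() / n_students
print(f"Cohort mean score under the API value table: {cohort_score:.1f}")
```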
In the second question for this study, I sought to address one of the most
politically sensitive issues discussed within the value added and growth modeling
literature. To what extent are known correlates to status model achievement related to
growth scores? Traditionally, this would be called a question of discriminant validity.
In the model of validity proposed by Lissitz and Samuelsen (2007), this question
relates directly to the reliability and the core concept of fairness of this model. For the
sake of clarity, I refer to this as discriminant validity and fairness throughout this
study. This question relates to the fairness of using this growth model for the purpose
of evaluating schools or programs, not necessarily to the fairness of the standards, or
the fairness of the student scores in and of themselves. One consistent and warranted
criticism of status and improvement based models of interpreting the CST as a
measure of school and teacher quality is their strong correlation to students’ socio-
economic status. In this sense, if student achievement status based on scores achieved
on the CST were the only measure of teacher quality, it would appear that high quality
teachers work almost universally in high-income neighborhoods. Similarly, teachers in
low-income neighborhoods would seem to be consistently of low quality. In this
research question, I asked whether or not students in four groups – African American,
Hispanic who speak English only or were identified as functionally proficient in
English upon initial enrollment in school, Hispanic with limited English proficiency,
and Hispanic who have been reclassified as proficient in English – are demonstrating
varying levels of growth as measured in this study.
In order to address the dynamic of utility in the practical validity of growth
modeling in analyzing the CST, in my third research question, I sought to determine if
individual student growth scores are related to other established indicators of student
success inherent in status models. If the growth scores I analyzed in this study are to
be regarded as useful, they should be related to these other indicators. To address this
issue, I used three important measures of success. The first measure for this evaluation
was teacher issued student marks in English and composition classes. Herein, I sought
to determine if growth in CST scaled scores predicts both future student marks and
changes in student marks. Additionally, I explored the relationship of the CST growth
to both prior and future attendance.
My fourth research question examined the potential disparate impact that may
have resulted from the policy changes inherent in switching the focus of school
accountability from meeting the requirements of California’s Public
Schools Accountability Act to meeting the requirements of the federal NCLB Act. The
CST was originally created to address the policies and goals outlined in
California’s Public Schools Accountability Act and, at the federal level, Goals 2000.
The API value table (Table 1-4) was established by the original systems the CST was
designed to address and differs in important ways from the value table established
under the federal NCLB Act (Table 1-3), particularly in the type of achievement that is
valued. Therefore, I investigated whether there is evidence that school leaders and teachers in
this cohort group are targeting bubble students, or students in the basic proficiency
band, rather than lower performing students in the below and far below proficient
bands, and, if so, whether these efforts are having a differential effect on low performing
versus bubble students.
Design Summary
I addressed the four questions of this study with a series of quantitative
analyses. In order to address the first question I compared the cohort to all students in
the same age group at the district and the state level using status and improvement
models as tracked by various value tables. I also tracked the cohort using two growth
model value tables. To address the second question, regarding discriminant validity
and fairness, I attempted to determine if the changes in subsequent CST scaled scores
that various subgroups of the cohort achieved vary systematically by ethnicity, English
language status, and school of attendance. For this analysis, I used an ANOVA design
and the appropriate follow-up post hoc one-way ANOVA or Student t-tests as
indicated by significant findings. In order to address the third question, which is
concerned with the utility of the use of growth models for interpreting the CST, I used
a Pearson’s correlation coefficient design. In doing this, I sought out relationships
between the CST growth scores and the other indicators of student success, including
teacher issued marks in English language arts and composition classes, changes in
these marks, and attendance. In order to examine the fourth question, regarding the
potential disparate impact that the change from a focus on California’s Public Schools
Accountability Act to a focus on the requirements of NCLB may have had on
lower performing students, I used a Student t-test. In this, I sought to detect a
difference in the growth scores between lower performing students and those that are
in the basic performance band. For all statistical tests, appropriate measures of effect
size were also calculated for significant results. For the ANOVA, ω̂² was used for
effect size. Cohen’s d was used as the measure of effect size for the Student t-tests.
For the Pearson correlation coefficient, r was interpreted directly as the measure of
effect size. The standard level of significance (α = .05) and standard effect
size interpretations were used throughout this study.
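A compressed sketch of this analysis plan is shown below using simulated data; the group names and column labels are placeholders, and the actual study used the cohort’s CST records and teacher issued marks.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Simulated growth scores for three illustrative subgroups plus a success indicator
df = pd.DataFrame({
    "group": np.repeat(["African American", "Hispanic LEP", "Hispanic RFEP"], 200),
    "growth": np.concatenate([rng.normal(0, 25, 200),
                              rng.normal(8, 25, 200),
                              rng.normal(3, 25, 200)]),
    "marks_gpa": rng.normal(2.3, 0.7, 600),
})

# Question 2: one-way ANOVA across subgroups, with omega-squared as the effect size
groups = [g["growth"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_val = stats.f_oneway(*groups)
k, n = len(groups), len(df)
grand_mean = df["growth"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((df["growth"] - grand_mean) ** 2).sum()
ms_within = (ss_total - ss_between) / (n - k)
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

# Question 3: Pearson correlation between growth and another indicator of success
r, r_p = stats.pearsonr(df["growth"], df["marks_gpa"])

print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4f}, omega-squared = {omega_sq:.3f}")
print(f"Pearson r (growth vs. marks) = {r:.3f}, p = {r_p:.4f}")
```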
CHAPTER 2
REVIEW OF THE LITERATURE
Existing and Proposed Models of School Accountability
Currently, two basic models are in use for evaluating schools, namely status
models, which also include improvement models, and growth models, which include
the subset of value added models. Goldschmidt et al. (2005), in their survey of these
basic existing models for school site accountability, point out that most state level
systems for school accountability under the No Child Left Behind Act of 2001
(NCLB), including the one which is used in California, presently rely on status and
improvement models. Status and improvement accountability models assess schools
based upon the current student achievement of the students at the school. Improvement
models are based on the differences in achievement demonstrated by subsequent
cohorts within a school. Meyer (1997) points out that the main fatal flaw inherent in
using status and improvement models is that status scores are strongly related to
factors that are out of the control of school personnel. Since a school’s student body is
made up of students whose families live in their attendance areas, and parents choose
where to live at least in part in order to select the school that their children will attend,
selection bias taints the student achievement status data. This bias renders the data
from status and improvement models virtually meaningless for the purpose of
accountability.
In contrast to status and improvement models, growth and value added models
base their analysis upon changes in individual student achievement from one year to
the next. Goldschmidt et al. emphasize how important it is to distinguish growth from
status modeling by noting that evaluations that are based on these two different
types of models tend to be unrelated or, more precisely, moderately negatively related
to each other. According to Goldschmidt et al., schools that have high achievement
status are less likely than low achievement status schools to show more substantial
intra-student growth.
Value added models are an important subset of the broader group of growth
models. The term “Value Added” was popularized by the Tennessee Value Added
Assessment System (Sanders & Horn, 1994; Sanders, Saxton, & Horn, 1997). This
pioneering model introduced the large-scale use of regression analysis for predicting
student achievement. By comparing these predictions to observed student performance
levels, this system purports to provide more valid data to use in evaluating the success
of the instruction provided by individual teachers, using an unconditional layered
value added approach.
More recently, growth and value added models have received a great deal of
attention in both education policy and academic literature. These models have been the
topic of two recent major conferences, a special issue of a refereed journal, two
influential edited books and several articles in other peer reviewed publications. Linn,
Baker, and Betebenner (2002) discuss several issues in addressing the intended
purpose of the No Child Left Behind Act of 2001 (NCLB). Two issues they present
relate directly to my current study. The first is the instability, and therefore
unreliability, of successive cohorts as a measure of school progress as is done in a
school improvement model. Additionally, they note the lack of recognition for student
change that does not cross the dividing line between the basic and proficient bands. As
a potential remedy to this instability and to the failure to reward progress within lower
levels of performance, the authors put forth the use of longitudinal data to measure
student growth rather than relying on a successive cohort model. Recently, the
Secretary of Education (United States Department of Education, 2007) has agreed that
certain unconditional growth models may be accepted under NCLB.
Lissitz, Doran, Schafer, and Willhoft (2006) provide the basic definitions of
growth and value added modeling. Growth models are those that assess changes in
performance by individual students. Value added models are regarded as a subset of
growth models, which attempt to use statistical models to tease out other variables,
such as student experiences or demographics, and “allocate growth to causal factors
such as teacher ability, curricular innovation or even to student background variables”
(p. 3). Stated briefly, improvement models consider the changes between separate
cohorts of students, who subsequently go through the same grade level, while growth
models follow students longitudinally from grade to grade, and value added models
attempt to predict individual student growth and then compare the observed growth to
the predicted growth in order to make some causal inferences.
Several factors are thought to be of concern to value added modeling. Tekwe et
al. (2004) evaluated three important models that were designed to address different
concerns. The first model, the Hierarchical Linear Model, allows for users to adjust for
important school and student demographics, such as socioeconomic status. This type
of model is also referred to as a conditional model, as opposed to an unconditional
model, which ignores student demographic factors. Conditional models may use a
wide variety of known covariates, such as race, gender, poverty, English learner status,
and so forth, to correct the model and provide a more refined estimate of the best
linear unbiased predictors for the regression model. A second model, The Layered
Mixed Effects Model, measures gains of students and focuses on the embedded nature
of various levels of effects, such as students within classes, teachers within schools,
and schools within districts. Layered mixed effect models also allow for the
partitioning of effects to schools when students move part way through a school year;
however, it does not adjust these gains based upon important known covariate socio-
demographic factors. While both of these are relatively complex regression models,
which are difficult to explain to the base stakeholders at the school level, the third
model Tekwe et al. evaluated, the simple fixed effects model, is, as its name implies,
much simpler mathematically and logically. This simplicity makes it more appealing
to school stakeholders and policy makers as well. The simple model adjusts for neither
demographic covariates nor transient students. In their work, Tekwe et al. found that
the simpler model rendered results that did not differ significantly from the more
complex layered model. They also noted, however, that adjusting for socio-demographic factors
did result in substantive differences in school rankings from the simpler
model. The similarity of results between the simpler, fixed effects model and the
layered mixed model renders the more complex model seemingly unnecessary.
While status and improvement models are fatally flawed, longitudinal growth
models have their own issues. Missing data in these models create a serious problem,
since each growth score is essentially the difference between two status scores. The
importance of the violation of the missing at random assumption is critical
(McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004). This is a particular
concern since McCaffrey, Lockwood, Mariano, and Setodji (2005) found that the data
that tend to be missing do have a systematic pattern. Another concern that cannot
be ignored is the large error variance within classrooms. Lockwood, Louis, and
McCaffrey (2002) point out that the reliability of teacher rankings based upon a value
added model is dependent upon the ratio of the growth variance that is observed
between teachers, essentially effect variance, and growth variance among students
within the same classroom, which could be interpreted as error variance. This
interpretation is somewhat analogous to the concept that underlies the traditional F-
ratio if one considers the variance between teachers as similar to the construct of
SS_between and within-classroom variance as analogous to SS_within. Buddin, McCaffrey, Kirby, and
Xia (2007) illustrate how this becomes a major issue in the implementation of value
added modeling for individual teacher evaluation. In their work they analyzed the
merit pay system which is being implemented in the state of Florida. Although they
found that there were important differences in student gains between the highest
gaining quartile of teachers’ classes and the lowest gaining quartile, they were unable
to distinguish either pole from the median. Additionally, they found that alternative
measures of these gains were only moderately correlated to the results of the value
added model. A more serious concern is also raised in that teacher rankings derived
from the value added model were only moderately stable from one year to the next.
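A minimal sketch of the variance ratio idea described above, using simulated classroom growth data: the reliability of teacher rankings improves as the between-teacher variance grows relative to the within-classroom variance. The parameters are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n_teachers, class_size = 20, 30
teacher_effects = rng.normal(0, 5, n_teachers)      # between-teacher spread in mean growth
growth = np.array([
    rng.normal(effect, 25, class_size)              # within-classroom spread dominates here
    for effect in teacher_effects
])

class_means = growth.mean(axis=1)
between_var = class_means.var(ddof=1)               # roughly analogous to SS_between / df
within_var = growth.var(axis=1, ddof=1).mean()      # roughly analogous to SS_within / df

print(f"Between-teacher variance of mean growth: {between_var:.1f}")
print(f"Within-classroom variance of growth:     {within_var:.1f}")
print(f"Ratio (higher favors more reliable rankings): {between_var / within_var:.3f}")
```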
Braun (2005) evaluates value added modeling in terms of several distinct
assumptions that the various models require. An important strength of growth and
value added models is that growth scores and value added residuals are less correlated
with demographic characteristics than status model achievement scores (McCaffrey,
Lockwood, Koretz, & Hamilton, 2003; Stevens, 2005). Braun also points out that
selection bias confounds the data in that students are not randomly assigned to teachers
or even to schools. This non-random selection taints growth models similarly to status
models. To the degree to which the choices which place certain teachers with specific
students correlate to student growth scores, this non-random selection inserts a
statistical bias into the data. Furthermore, Braun raises the issue of missing data that
may not be missing at random. He asserts that the probable violation of the missing at
random assumption is a serious challenge for any growth model. Finally, regarding the
essential consequential validity, or impact of the use of the data, Braun explains that it
is necessary to be cautious about causal inferences. The difficulty of asserting causal
inferences stems from the lack of experimental control. Although it is possible to
control for certain, known covariates in the regression equation, in the end, it can only
be by assumption that the remainder of the variance is attributed to a single given
presumed cause, whether that is a curriculum, school or individual teacher.
One potential model which may alleviate some of the difficulties outlined by
Braun (2005) is the use of value tables (Hill, Gong, Marion, DePascale, Dunn, & Simpson,
2006) and the development of a transition matrix. The value table model has several
advantages. Braun notes that it is not dependent upon vertical scaling, or even interval
scaling. Additionally, the value table model has the advantage of transparency, being
less complex than hierarchical linear, layered mixed effects or even simple fixed
effects models (Buddin, McCaffrey, Kirby, & Xia, 2007; Florida Department of
Education, 2006). Hocevar, Brown, and Tate (2008) give an example of a value table
that might be used locally (Table 2-1) and note one of its great strengths is its
flexibility in that it can readily be manipulated to reflect the type of growth that is
valued by local stakeholders. However, since value tables collapse the data into score
categories, they tend to be statistically weaker.
Table 2-1
Example of a Value Table

                            Year 2
Year 1             Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic       -100               100        200       300          300
Below Basic           -100               100        200       300          300
Basic                 -200              -100        100       200          300
Proficient            -300              -200       -100       100          200
Advanced              -400              -300       -200       100          100

Source: Hocevar, Brown, and Tate (2008)
Validity in Student Assessment
The theoretical construction of validity in research has changed over time.
Traditionally, it had been divided into three smaller pieces. Messick (1994) asserted
that the traditional construct of validity, as three separate types, is incomplete and
unduly fragments the overall concept. Traditional understanding of validity had held
that there are three major types of validity: construct, content, and criterion. Messick,
in contrast, contends that validity should be understood to be a unified concept tied to
the appropriate, meaningful, and useful interpretation of a set of given scores. As
exemplified by Linn and Miller (2005), over time, this unitary concept has reached a
level of general acceptance. In their work, the authors state “validity involves an
overall evaluative judgment” (p. 71, emphasis as in the original). In this sense, validity
is conceived to be a broad concept that relates primarily to the use and interpretation
of scores.
Some researchers have expressed discomfort with this unitary concept of the
nature of validity. While generally praising Messick’s overall life work in the field of
test validity, particularly for drawing attention to the need to focus questions of
validity towards the use of test scores as opposed to a given test in and of itself,
Brennan (1998) criticized the unitary concept of validity. Brennan asserts that
Messick’s concept of a unitary validity is faulty in two important ways. First, he
argues that a unitary concept of validity runs too near to a concept of only one aspect
of validity being “THE validity” (p. 7, emphasis as in the original). In this, he
expresses concern that the validity of a test would be seen as being reduced too far
down to accurately express the complexity of the issues that go together to make the
use of a test more or less valid. Second, and more significantly from a practitioner
perspective, Brennan points out that the unitary concept of validity is too broad of a
concept to be of practical use. Evaluating the validity of how a test is used in
a reasonable manner requires a series of questions that address different dynamics of
the notion of validity. The construct of validity as a unified concept does not readily
lend itself to the development of such a series of questions for test evaluation.
In considering the validity issues specific to longitudinal designs and growth
modeling, Stevens and Zvoch (2006) find it useful to consider the issue in the
traditional sense. In their work, they list seven distinct threats to the validity of growth
and residual analysis. Of these, two serious threats that have yet to be adequately
addressed in the literature, regression toward the mean and nonrandom attrition, will
continue to pose grave issues for the validity of any longitudinal study.
In balancing these issues, Lissitz and Samuelsen (2007) suggest a model,
summarized in Table 1-1, which substantially changes the emphasis of validation
studies and recommends that there be a shift in the terms that are used to discuss
validity. In their model, which I have used to inform the design of the present study,
the authors recommend understanding validity as involving two foci, internal and
external. Internal validity “involves the development and analysis of the test itself” (p.
437). External validity is used to evaluate a test in regards to the test’s relationship to
other measures of the same theoretical idea. In evaluating a test’s external validity, a
researcher considers the test’s utility for a given purpose, and the appropriateness of
the impact of a given use of the data that results from the use of the test.
In the model that is proposed by Lissitz and Samuelsen (2007), the
understanding of a test’s validity is also dependent on the perspective of the
investigation. In their model, the validity of the use of a test can be considered from a
theoretical perspective, or a practical perspective. A researcher approaching the
subject of a test’s validity from a theoretical perspective would primarily focus on
latent process (internal focus) and the nomological network (external focus). When
coming from a practical perspective, Lissitz and Samuelsen assert, there are four
important aspects to be considered. The authors present content validity and reliability
as two aspects of internal validity from a practical perspective. Content validity seeks
to establish that the test in fact measures the underlying construct that it is used to
assess. In this sense, assessing the content validity of the CST would involve
determining the extent to which high performance on the test truly indicates mastery
of the standards for the given subject and grade level. Reliability assesses the
consistency of the test scores. Linn and Miller (2005) point out that, while a score can
be reliable without being valid overall, for a test score to be regarded as valid, it must
first be shown to be reliable. In assessing the reliability of student growth on the CST,
it would be essential to demonstrate that student growth is not overly correlated to
factors that should be irrelevant, such as race, poverty, and English language
proficiency. Although in traditional terms this would be considered discriminant
validity, in the model proposed by Lissitz and Samuelsen it is one dynamic of
reliability. These particular factors make growth scores especially worth examining:
because they are well-established correlates of achievement status on the CST, growth
scores that are not tied to them would have the potential to be more useful than
achievement status for the evaluation of schools and programs.
The state of California regards the content validity of the CST as the “degree to
which the content of the test is congruent with the purpose of testing, as determined by
subject matter experts” (California Department of Education, 2008b). Kane (2006)
warns against accepting operational definitions of complex constructs as definitive,
and, in order to support the state’s interest in ensuring the content validity of the CST,
employees of the California Department of Education have worked with Educational
Testing Services, the company that publishes the CST, in forming an Assessment
Review Panel. This panel is composed of about 100 members from the education
profession, including teachers, administrators, resource specialists, curriculum experts,
and other professionals, and every effort is made to ensure representation from the
various ethnic, gender, and geographic groups of California. Test item writers must
meet certain minimum educational, credentialing, and experiential requirements. Each
test item receives three separate reviews, once for content, once for editorial
considerations, and once for sensitivity. Additionally, the state and Educational
Testing Services compared the CST to the norm referenced California Achievement
Test-Sixth Edition (CAT-6). The CST English Language Arts was highly correlated to
the reading and language sections of the CAT-6, with r ranging from .75 to .80. It was
also moderately correlated to the spelling section of the CAT-6, with r ranging from
.62 to .70. Based on these points, the California Department of Education regards the
content validity of the CST to be reasonably well established.
When validity is assessed from an external focus, Lissitz and Samuelsen
(2007) assert that the practical perspective has two more aspects. From this angle, the
test needs to be investigated in terms of its utility and its impact. Utility is the ability
of the test to be put to use in some meaningful way. In assessing utility, investigators
attempt to establish whether a test predicts a construct that is more difficult to
measure, or could not otherwise be determined in time to be useful. As an example,
Lissitz and Samuelsen present the case of a college entrance exam that, if it has proper
utility, would predict a student’s success or failure in college level study before the
student has to go through the pain of dropping out of advanced studies. In terms
of utility, it is informative to determine if the CST is related to other measures of the
same constructs.
Generally speaking, impact refers to using the data from a given test in a
manner that meets the needs of the stakeholders for whom it is intended. With the
federal accountability system taking precedence over the state’s accountability system,
significant policy changes have occurred. The CST was originally conceived only to
address the values embedded in the state’s earlier system. These policy changes have
had a major impact in instructional practices. Some concerns may be raised as to the
disparate impact these changes may have on students who have had lower
achievement. In considering the impact of the CST in the context of these policy
changes, it is worthwhile to assess the disparity in achievement that may have resulted
from the instructional practice noted by Hamilton et al. (2007) of teaching to the
students near the cutoff point for the proficient band. If the instructional practice
which Linn (2006) refers to as teaching to the “bubble students” (p. 15) is shown to
exist, it would be expected to have a disparate effect on access to educational
opportunity for lower achieving students. This disparate effect would be an unintended
and certainly undesirable consequence of the current model and its use of the CST.
Evidence of this practice would come from a widening of the achievement gap between
higher performing and lower performing students. Observing this phenomenon, if it
indeed exists, requires pairing student scores in subsequent years and comparing the
growth demonstrated by students who are at different levels of achievement.
Although this test and its associated scales were originally established to
support the state accountability system, they now constitute a major factor for the
NCLB accountability requirements. For NCLB purposes, the only important break
point is the proficient/below proficient line, or a scaled score of 350. Under NCLB, a
student who received a scale score of 590, high in the advanced band of achievement,
on the previous year’s assessment and who receives a scaled score of 351 this year,
just barely over the cut score for the proficient band, is regarded as an indicator of
success. This student would be regarded as an example of a high quality school.
Meanwhile, a student who previously received
a score of 150, at the bottom of the far below basic band of achievement, and this year
receives a scaled score of 349 is considered to be a failure and is regarded as an
indicator of a low quality school. NCLB makes no distinction whatsoever between this
school and another that maintains the very high achievement of the first student, or
a school that fails to cause the very low performing student to make any progress at
all. In this study, I have endeavored to address this overly simplistic approach by using
the change in standardized scaled scores from one year to the next in a longitudinal
model rather than using the annual scaled scores in isolation from each other.
CHAPTER 3
RESEARCH METHODOLOGY
Purpose and Importance
The purpose of this study is to evaluate the practical validity of applying
growth modeling to the California Standards Test (CST) for local school site and
district level use in evaluating the effectiveness of schools and academic programs.
The bases I used for this study were three of the aspects of practical validity which
Lissitz and Samuelsen (2007) put forth. In their model, which partially rebuts the
conventional perspective of validity as a unitary construct (Linn & Miller, 2005;
Messick, 1989, 1994, 1995a, 1995b), the evaluation of tests is divided into internal
versus external foci and theoretical versus practical perspectives. Lissitz and
Samuelsen regard four major concepts as issues that relate to practical validity.
Content validity and reliability are the issues that constitute the internal perspective of
practical validity. Utility and impact comprise the aspects that constitute the external
perspective of practical validity. In this study, I focused on the reliability, utility, and
impact of the use of growth scores, or the differences between subsequent years’
scaled scores, as a measurement of growth on the English language arts portion of the
CST in evaluating schools and programs.
Research Questions
In this study, I addressed four major questions regarding the practical validity
of the use of longitudinal student data. In the first research question I compared how
various value tables (Hill et al., 2006) evaluate this study sample. Then, the following
three questions each related to different aspects of practical validity, namely the
reliability, utility, and impact of the proposed growth model.
In order to address the first research question regarding whether different value
tables result in substantial differences in how the cohort group is perceived, I first used
the currently existing status and improvement models and value tables to describe this
cohort and the two larger sets in which it is nested, namely all high schools in the
same district and all high schools in the state. For this analysis, I compared the cohort
sample to the larger system populations using two value tables, the NCLB Value
Table (Table 3-1) and the API Value Table (Table 3-2). For these analyses, I used an
improvement model, as a necessary concession to the unavailability of student-level
longitudinal data at the population levels. Although I focused on longitudinal
modeling for this study, this improvement model analysis will provide a more familiar
set of descriptors for practitioners. For this set of analyses only, all data points were
included, irrespective of whether or not students have scores in subsequent years. The
data used throughout this study is archival test score data based on California’s
statewide testing program.
Table 3-1
NCLB Value Table

                                      Year 2
Year 1               Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic             0               0          0         1            1
Below Basic                 0               0          0         1            1
Basic                       0               0          0         1            1
Proficient                  0               0          0         1            1
Advanced                    0               0          0         1            1
Source: Hill, Gong, Marion, DePescale, Dunn, and Simpson (2006)
Table 3-2
API Value Table

                                      Year 2
Year 1               Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic           200              500        700       875         1000
Below Basic               200              500        700       875         1000
Basic                     200              500        700       875         1000
Proficient                200              500        700       875         1000
Advanced                  200              500        700       875         1000
Source: California Department of Education (2008a)
I then tracked the cohort group using the value table proposed by this
study (Table 3-3), which is an adaptation of the value table (Table 2-1) that was
put forward by Hocevar, Brown, and Tate (2008) to analyze student growth. The
proposed table for this study differs from the example provided by Hocevar, Brown,
and Tate in the specific values assigned to the various cells. While both tables set a
benchmark of 100 points as indicating minimal acceptable growth, the table proposed
for this study utilizes a softer penalty for a student who declines in performance bands
or remains in the Far Below Basic band from one year to the next. It also provides a
bonus for students who remain in the advanced band that is not inherent in the table
proposed by Hocevar, Brown, and Tate.

Table 3-3
Proposed Value Table for Analyzing Student Growth Data

                                      Year 2
Year 1               Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic            50              200        300       400          500
Below Basic                 0              100        200       300          400
Basic                    -100                0        100       200          300
Proficient               -200             -100          0       100          200
Advanced                 -300             -200       -100         0          200
Adapted from Hocevar, Brown, and Tate (2008)
In addition to this proposed value table, I also tracked the cohort group using a
value table (Table 3-4) that is consistent with the NCLB safe harbor provision, as
developed by Hill et al. (2006). The results of this analysis were compared to the same
cohort using a status model. This was done to determine if these two methods generate
similar results. This also provides evidence regarding the degree of the effect attrition
may have on the interpretation of the data, since all test participants are included in the
first analysis, yet only those with subsequent scores are included in the second
analysis.
Table 3-4
Value Table Consistent with NCLB Safe Harbor Calculation

                                      Year 2
Year 1               Far Below Basic   Below Basic   Basic   Proficient   Advanced
Far Below Basic             0               0          0        100          100
Below Basic                 0               0          0        100          100
Basic                       0               0          0        100          100
Proficient                -90             -90        -90         10           10
Advanced                  -90             -90        -90         10           10
Source: Hill, Gong, Marion, DePescale, Dunn, and Simpson (2006)
In the second question for this study, I sought to address the extent to which
known correlates to status model achievement are related to growth scores. In the
model of validity that is proposed by Lissitz and Samuelsen (2007) this question
relates to the broad construct of reliability. In traditional terms, this question addresses
the core concepts of the fairness and discriminant validity of the proposed model. In
this research question, I asked whether or not students in four groups – African
American, Hispanic who speak English only or were identified as functionally
proficient in English upon initial enrollment in school, Hispanic with limited English
proficiency, and Hispanic who have been reclassified as proficient in English – are
demonstrating varying levels of growth as measured in this study.
In terms of practical utility, it is necessary to determine if the growth scores are
related to other measures of student achievement. To this end, the third question I
addressed was: what is the degree of the predictive value of growth scores to future
achievement in English language arts courses, changes in achievement, and school
attendance? Teacher issued marks in English language arts classes, along with changes
in these marks, and school enrollment and attendance formed the points of comparison
for this series of analyses.
Finally, it is noteworthy that the CST was originally created to address the
policies and goals outlined in California’s Public Schools Accountability Act and, at
the federal level, Goals 2000. The valuation established by the original systems this
test was designed to address (California Department of Education, 2008a) differs
substantially from the values assigned under the No Child Left Behind Act. Therefore,
it is important to investigate whether there is evidence that school leaders and teachers
of students in this cohort group are targeting bubble students, in the basic proficiency
band, over lower performing students and, if so, whether these efforts are having a
differential effect on low performing versus bubble students.
Research Design Overview
This study analyzed longitudinal student outcome data on the CST at the
secondary level. My primary purpose was to analyze the practical validity of using
individual student CST growth as an accountability measure. This included
determining the predictive value of earlier CST growth scores to later CST growth
scores, school enrollment and attendance, and changes in teacher marks as an
additional measure of achievement. These quantitative analyses examined existing
data from a series of 10 inner city high schools.
Participants and Setting
The data for this study included a single cohort of 13,160 students who entered
the 9th grade at 10 inner city senior high (SH) schools together in the 2001-2002
school year. Each student’s data was matched back to the 8th grade and followed
through the 2004-2005 school year. The schools selected represent a major, inner city
system.
Table 3-5
School Demographic Characteristics (2005)

School            N          N        N        N Study
Pseudonym         Students   EL-S     EL-O     Cohort     Latino   Black
Nicholas SH       2239       468        7      1276       1070     1365
Daly SH           3815       1741       0      1415       3511      286
Basilone SH       4227       1830     117 a    1937       3616      153
Boyington SH      2626       334        3      1147       1788      826
Butler SH         4839       2191       2      2313       4252      581
Crowe SH          2466       1054       0       888       1944      515
Dunham SH         1636       321       11       688        718      904
Lejeune SH        2803       1033     232 b    1311       2109      141
Glenn SH          3802       1441       8      1120       3020      767
Sousa SH          3410       1056       3      1065       2146     1255
a The majority of this group is classified as L1 Pilipino
b The majority of this group is classified as L1 Armenian
Table 3-5 summarizes the demographic data for each of the high schools. The vast
majority of students in these cohort schools fall into one of two major ethnic
categories, black and Hispanic. The students who are identified as Hispanic are further
divided into four groups in regards to their English language status: English only
(EO), initially identified as functionally English proficient (IFEP), reclassified as
functionally English proficient (RFEP), and limited English proficient (LEP). For the
purposes of the analysis in this study, the EO and IFEP categories have been
collapsed, thus providing four groups of students for analysis, African American,
Hispanic-EO/IFEP, Hispanic-LEP, and Hispanic-RFEP.
As can be surmised from Table 3-6, which summarizes the cohort’s
performance on the CST as reported by the state, the schools in this cohort have an
achievement level that can be characterized as pervasively low. All of the cohort
schools have been in Program Improvement (PI) status for multiple years, and have
achievement levels well below the state’s benchmark of an API of 800. Similarly, the
schools all lag well behind the state’s planned achievement level to meet the federal
NCLB requirements in terms of the percent of students who are proficient in English
language arts.
Table 3-6
School Achievement Status

                  API (a)            ELA Percent Proficient       PI Year
School Name       base     growth    2003      2004      2005     2004   2005
                  2004     2005
Nicholas SH       488      504       13.8 b    22.0 b    12.6     4      5
Daly SH           474      482       29.1 b    14.2 b    10.8     5      5 c
Basilone SH       599      619       25.4 b    29.2      26.4     3      4
Boyington SH      492      505       25.0 b    20.2 b    12.8     4      5
Butler SH         464      502       15.7 b    12.6 b    10.3     5      5 c
Crowe SH          465      519       15.3 b    14.3 b    11.8     4      5
Dunham SH         475      501       21.9 b    21.2 b    10.3     4      5
Lejeune SH        535      611       24.0      27.7 b    20.1     4      5
Glenn SH          508      525       25.4 b    19.7      12.7     4      5
Sousa SH          450      488       18.4 b     9.8 b     9.0     5      5 c
a Since the formula for API is modified annually, the only valid API comparison is the
  base/growth comparison for two subsequent years.
b These data are regarded by CDE as unreliable due to a participation rate lower than 95%.
c NCLB delineates only 5 years of Program Improvement status; these schools remained in
  Program Improvement beyond this 5 year limit.
Source: http://www.cde.ca.gov/ayp
Methods
Instruments
For my first research question, the primary variable analyzed is the average
student performance based upon Table 3-1, Table 3-2, Table 3-3, and Table 3-4.
For my second, third, and fourth research questions, the primary outcome variable
that I analyzed was the difference between standardized individual student scale scores
on subsequent CST exams in English language arts between grades 8 and 10. As an
additional measure of achievement, individual achievement marks issued by teachers
in English language arts and composition classes were also considered.
The state of California adopted the CST as a criterion referenced test to address
needs outlined in the Public Schools Accountability Act. These tests are administered
each spring in grades 2-11. The state reports the results in two formats, scale scores
and proficiency bands. The first format is a traditional scaled score, ranging from 150
– 600. In order to address, at least partially, the concern that arises from the CST not
being vertically aligned, in this study, prior to calculating differences, I standardized
the individual student scale scores using the following formula:

    Tg = 100g + 10[(x – Mg) / sg]

where Tg represents the standardized T-score for the CST at grade level g, Mg and sg
represent the mean and standard deviation of the scale scores received within the cohort
on the given grade level CST, and x represents the scale score the given student received
on the given CST. This procedure set the standard deviation of each set of scaled scores
at 10 and set the mean at 100 times the grade level of the test. Hence, one year of average
growth in this system is defined as a gain of 100 points when differences are calculated
as the variables T9 – T8 and T10 – T9.
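The following is a minimal sketch of how such a standardization and differencing could
be computed. It is illustrative only, not the code used for the original analysis, and the
column names (scale_8, scale_9) and sample values are hypothetical.

    # Illustrative sketch only; column names and values are invented.
    import pandas as pd

    def t_score(scale_scores: pd.Series, grade: int) -> pd.Series:
        """Rescale cohort scale scores so the mean is 100 * grade and the SD is 10."""
        return 100 * grade + 10 * (scale_scores - scale_scores.mean()) / scale_scores.std()

    df = pd.DataFrame({"scale_8": [320, 295, 410, 260], "scale_9": [330, 310, 405, 280]})
    df["T8"] = t_score(df["scale_8"], 8)
    df["T9"] = t_score(df["scale_9"], 9)
    df["growth_8_9"] = df["T9"] - df["T8"]  # about 100 points represents one year of average growth
    print(df.round(2))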
The state also reports the data in a more intuitive format by aggregating
various ranges of scaled scores into five proficiency bands, namely: advanced,
proficient, basic, below basic, and far below basic. The break points for the bands
basic and proficient are uniformly set for all grades and subject area tests at scale
scores of 300 and 350 respectively. The cut scores for the other bands vary slightly
from grade to grade and by subject matter. These cut scores are published by the state
(California Department of Education, 2007a).
Changes in student marks were another indicator of student growth that I used
in this study. To calculate student mark changes, letter grades that students received
were first converted to Grade Point Average (GPA) equivalents, using the standard
A = 4, B = 3, C = 2, D = 1, and F = 0 conversion (+/- marks were not issued within
this cohort). In this manner, the change in student marks for the 2004-2005 school
year was calculated as Mspring2005 – Mspring2004, where Msession is the individual
student’s mark received in English language arts during the given session. For this
analysis, the change between English language arts in spring of grade 9 and English
language arts in spring of grade 10 was used.
Procedures
This study focused on individual student growth rather than school
improvement. Growth in this study is the change in standardized scale scores that a
student achieved on consecutive years’ CST tests in English language arts. Thus, a
student’s growth from the spring of 8th grade to the spring of 9th grade is defined as
T9 – T8, where Tgrade is the individual student’s standardized scale score on the English
Language Arts section of the CST as calculated using the aforementioned procedure.
To address research question number one, I sought to compare the cohort to
the district and state level student bodies in which these students are embedded in the
familiar terms of school status and improvement. I did this by describing the change
the cohort demonstrates using two distinct valuations. I also described the cohort in
terms of a proposed value table for evaluating student growth. The first table I used is
one that is based on the valuation given under NCLB (Table 3-1), with credit given
only for students who achieve at the proficient and advanced levels. Secondly,
changes in the group’s performance were described using the value table established
by California to determine Academic Performance Index (API) (Table 3-2), which
was the original set of valuations intended for use with this particular test. To consider
the notion of growth rather than improvement, the cohort’s growth was described
using two separate longitudinal growth value tables. Table 3-3 was used as a proposed
value table in this study, which is intended to compare growth to an acceptable
benchmark of 100 points. Finally, Table 3-4 was used to examine the cohort’s
longitudinal growth in a manner consistent with the Safe Harbor provision of NCLB.
These growth model tables, distinct from the status model tables used to track the state
and the Local Educational Agency (LEA), only include students who have data for
two consecutive years. Additionally, this study’s proposed value table (Table 3-3)
places value on changes between each of the various bands. This particular table also
values continuance at the highest band (advanced) and devalues remaining in the
lowest band (far below basic). Each of these tables places a different value on
particular aspects of student achievement. While the NCLB Table values a minimum
bar of student achievement set at the proficient band, the API Table values
improvement in the lower bands over improvement in the higher bands, and the table
proposed herein values any change between bands. The growth model tables value
intra-student growth, disregarding the scores of transient students until they have been
in place at a given school long enough to take two subsequent CST tests while the
status model tables include all students irrespective of whether or not they have a
previous score on the CST.
In order to evaluate the student data, students were placed on each value table
according to their performance on the CST. For the status model tables (Table 3-1 and
Table 3-2), students were only placed in regard to the year 2 column, since these tables
do not assume knowledge of a previous year’s performance. In using the growth
model value tables (Table 3-3 and Table 3-4), students were placed as indicated by
two years’ worth of performance. Each cell was then tallied, and the number of
students in each cell was multiplied by the value assigned to that cell, and the values
within the matrix were summed and divided by the total count of students included in
the evaluation. This procedure provides essentially an average of student performance
as described by the given table. Each table also has a theoretical benchmark of
acceptable performance. In the case of the status model tables, the benchmarks are set
by NCLB (Table 3-1) or by state law (Table 3-2). For Table 3-3, it is intended that an
average student performance of 100 points on this table would represent minimal
acceptable growth. Hill et al. (2006) set the benchmark for Table 3-4 as an average
student performance of 10 points on this table to demonstrate adequate growth to meet
the NCLB Safe Harbor provision.
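As a sketch of this tallying procedure, the average value-table score can be computed as
shown below. This is not the original analysis code; the transition counts used here are
invented, and only the value table itself is taken from Table 3-3.

    # Minimal sketch of the value-table scoring procedure; counts are invented.
    import numpy as np

    PROPOSED_VALUES = np.array([
        [  50,  200,  300, 400, 500],   # year 1: far below basic
        [   0,  100,  200, 300, 400],   # year 1: below basic
        [-100,    0,  100, 200, 300],   # year 1: basic
        [-200, -100,    0, 100, 200],   # year 1: proficient
        [-300, -200, -100,   0, 200],   # year 1: advanced
    ])

    def value_table_score(counts: np.ndarray, values: np.ndarray) -> float:
        """Average per-student score: sum of (cell count * cell value) divided by total students."""
        return float((counts * values).sum() / counts.sum())

    counts = np.array([
        [40, 25,  5,  0, 0],
        [20, 50, 30,  2, 0],
        [ 5, 20, 60, 15, 1],
        [ 0,  1, 10, 25, 5],
        [ 0,  0,  1,  3, 4],
    ])
    print(value_table_score(counts, PROPOSED_VALUES))  # compare to the 100-point benchmark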
In order to address the second question, regarding practical validity of the
model in terms of discriminant validity and fairness, I sought to determine if the
inclusion of specific known covariates to status models of achievement would have a
similar relationship to growth scores. In this, I investigated the discriminant validity of
growth modeling and the relationship of growth to four well-established factors that
are regularly cited as undercutting the validity of status models: English language
status, ethnicity, poverty, and parent education level. I also included school of
attendance in this analysis, to see if the schools within this cohort were showing
different levels of success. The individual student growth scores were compared using
a general linear model ANOVA, along with post hoc t-tests for significant main
effects and a post hoc one-way ANOVA for significant interactions. For the ANOVA,
I calculated ω̂2 as the measure of effect size for all statistically significant findings.
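A minimal sketch of such an analysis is given below, assuming a data frame with one row
per student and columns named growth, eth_lang, and school (all names are assumptions,
and the data generated here are synthetic). The omega-squared formula shown is one
common estimate and is not necessarily the exact computation used in this study.

    # Hedged sketch of a two-factor ANOVA with omega-squared effect sizes.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    def omega_squared(aov: pd.DataFrame) -> pd.Series:
        """One common estimate: (SS_effect - df_effect * MS_error) / (SS_total + MS_error)."""
        ms_error = aov.loc["Residual", "sum_sq"] / aov.loc["Residual", "df"]
        ss_total = aov["sum_sq"].sum()
        return (aov["sum_sq"] - aov["df"] * ms_error) / (ss_total + ms_error)

    rng = np.random.default_rng(0)
    n = 300
    df = pd.DataFrame({
        "eth_lang": rng.choice(["AA", "H-EO/IFEP", "H-LEP", "H-RFEP"], n),
        "school": rng.choice([f"School{i}" for i in range(10)], n),
        "growth": rng.normal(100, 7, n),
    })
    model = smf.ols("growth ~ C(eth_lang) * C(school)", data=df).fit()
    aov = anova_lm(model, typ=2)
    aov["omega_sq"] = omega_squared(aov)
    print(aov.round(3))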
The status model that is currently used for school accountability has been
highly criticized in terms of validity as a measure of school quality because student
achievement status is highly correlated to these demographic factors (Meyer, 1997)
that are not within the control of those being evaluated. If growth scores are
uncorrelated with these factors, then an argument might be made that they could be a
fairer and therefore more valid measure of school or program quality.
In terms of practical utility, it is useful to determine if a link exists between a
student’s growth and her or his other indicators of growth in achievement. In this
study, I used two major indicators of student achievement to address this question.
First, I compared growth on the CST to marks issued by teachers and the change in
teacher issued marks that students received in English language arts and composition
classes. The CST growth scores were also compared against student attendance in
secondary school. To the extent that these measures reliably relate to a student’s
growth in understanding the material outlined in the state framework, these measures
should be highly correlated. The growth scores at grades 8-9 and 9-10 were correlated
with the marks which the student received in English language arts and composition
classes, and the changes in marks from the spring of grade 9 to the spring of grade 10
using Pearson’s coefficient of correlation. For these two correlation tests, as
recommended by Rosenthal, Rosnow and Rubin (2000), effect sizes were determined
directly from r.
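For illustration, scipy’s pearsonr returns both r, which is used directly here as the effect
size, and its p value; the values below are invented and stand in for matched student records.

    # Minimal sketch; the two series are invented illustrative data.
    import pandas as pd
    from scipy.stats import pearsonr

    growth_8_9 = pd.Series([102.0, 97.5, 95.0, 108.0, 99.0])   # growth scores
    mark_change = pd.Series([1, 0, -1, 1, 0])                  # Eng 9B to Eng 10B change

    r, p = pearsonr(growth_8_9, mark_change)
    print(f"r = {r:.3f} (effect size), p = {p:.3f}")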
The final question of this study considers the implications of policy changes
that took place after the CST was created. The CST was originally designed to comply
with a system in which improvement among a school’s lowest performing students
was explicitly favored over any other improvement, and all positive change was
rewarded. Thus, school leaders were reinforced for improvements at all levels in
subsequent cohorts’ achievement, but particularly reinforced for focusing on the
school’s neediest students. In contrast, the policy under NCLB only recognizes
changes between the proficient and basic bands. In 2002, California reported that 38%
of the state’s 9th grade cohort, who are the same age as this study’s cohort, performed
in the below basic and far below basic bands on the English language arts CST. These
students are at severe risk of being effectively ignored as ones that the school will not
be able to turn into successes by NCLB standards. If these schools are successful in
any efforts they may be implementing to address the bubble issue, then students in the
basic band should show greater growth than those in the below basic and far below
basic bands. These growth scores were compared using an independent groups t-test.
Although the primary question is interested in knowing if the students in the basic
band are showing more growth than lower-performing students, evidence to the
contrary is also important. In this sense, there is insufficient justification for a one-
tailed test. Therefore, the standard two-tailed design was utilized. For this evaluation, I
used Cohen’s d as the measurement of effect size.
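The comparison and effect size can be sketched as follows; the group values are invented,
and a pooled-standard-deviation form of Cohen’s d is assumed.

    # Sketch of a two-tailed independent-groups t test with Cohen's d (invented values).
    import numpy as np
    from scipy.stats import ttest_ind

    basic = np.array([98.0, 97.5, 99.2, 96.8, 98.9])       # growth scores, basic band
    fbb_bb = np.array([100.5, 101.2, 99.8, 102.0, 100.1])  # far below / below basic bands

    t, p = ttest_ind(basic, fbb_bb)  # standard two-tailed test
    pooled_sd = np.sqrt(((len(basic) - 1) * basic.std(ddof=1) ** 2 +
                         (len(fbb_bb) - 1) * fbb_bb.std(ddof=1) ** 2) /
                        (len(basic) + len(fbb_bb) - 2))
    d = (basic.mean() - fbb_bb.mean()) / pooled_sd
    print(f"t = {t:.3f}, p = {p:.3f}, d = {d:.2f}")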
For all statistical tests in this study, I set α at the standard p < .05 level. The
measures of effect size used in this study, ω̂2 and Cohen’s d, were interpreted in the
standard manner. For ω̂2, effect sizes of .01 were regarded as small effects, .06 as
medium effects, and .14 as large effects (King & Minium, 2003). For Cohen’s d,
effect sizes of .2 were regarded as small effects, .5 as medium effects, and .8 as large
effects (King & Minium). For the correlation test, r was used directly as the measure
of effect size, with .1 regarded as a small effect, .3 as a medium effect, and .5 as a
large effect (Cohen, 1992). By these standards, if the reliability of the growth scores
were perfect, which is an unsupportable presupposition, then statistical power (1 – β)
would be estimated to exceed .8 for any analysis with at least 250 participants’ scores.
For the tests in which N is greater than 500, power would, assuming perfect reliability,
be estimated to exceed .9. Since growth scores are regarded generally as less reliable
than status scores, the actual power of these analyses must be expected to be
substantially lower than these estimates.
CHAPTER 4
RESULTS
In Chapter Three, I outlined the methodology by which I evaluated the
practical validity of using the growth scores of subsequent years’ California Standards
Test (CST) standardized scaled scores for the purpose of evaluating schools and
programs. This chapter presents the statistical results of these analyses. First, the
cohort’s performance is described, along with the performance of the Local
Educational Agency (LEA) and the performance of the state within which it is embedded.
Then the practical validity of the method is addressed by three separate analyses which
in turn address the discriminant validity and fairness of the method, the utility of the
model, and the impact of changing policies as the federal No Child Left Behind Act of
2001 (2002) (NCLB) has superseded California’s Public Schools Accountability Act
of 1999 in affecting school leaders’ decisions.
Comparison of Evaluation by Varying Value Tables
The primary way in which California reports the results of the California
Standards Test (CST) is by proficiency bands. Table 4-1 summarizes the proportion of
students that scored in each of the state’s proficiency bands within this study cohort,
the LEA in which it is embedded and the state as a whole. The state’s system is a
status model, reporting only at a point in time, and not following students from year to
year. As students move, repeat grades, or otherwise change cohort groups, these
changes are effectively ignored by the state’s reporting system. Herein, the state and
LEA data are reported to the nearest full percentage point because these data are
reported in that format by the state.

Table 4-1
Comparison of Cohort to State and District – Status Model

Performance            Grade 8                     Grade 9                     Grade 10
Band                   Cohort   LEA     State      Cohort   LEA     State      Cohort   LEA     State
N                      9121     47237   435885     7100     59792   481597     6092     44943   452242
% Advanced             0.5      4       10         1.6      6       14         1.4      7       14
% Proficient           4.7      13      22         8.9      16      24         8.2      15      21
% Basic                23.9     32      34         31.1     32      31         28.8     32      30
% Below Basic          31.3     26      19         35.6     28      19         32.2     28      21
% Far Below Basic      39.7     25      14         22.7     18      12         29.4     18      14
Source for LEA and State Data: http://www.star.cde.ca.gov

As is evident from Table 4-1, the students in
the study cohort are underperforming in comparison to students in the LEA and across
the state. The standard under NCLB is to consider the percent of students at proficient
or above. By this standard, only 5.2% of the study cohort was at grade level in 8th
grade, compared to 17% of 8th grade students in the LEA and 32% of 8th grade students
in the state as a whole. Conversely, when considering lower achieving students by
combining the lowest two bands, in 8th grade 71% of the study cohort fell into these low
achieving bands, while 51% of the LEA’s students and only 33% of the state’s students
performed at similarly low levels. The cohort’s performance level remains low through
the 10th grade in comparison to both the LEA and the state.
Another way that is commonly used in analyzing these types of data is an
improvement model. In this model, the raw percentage change of each band is
considered. Table 4-2 summarizes the changes at each performance level for the
cohort, the LEA, and the state from the 8th through the 10th grade years.
Table 4-2
Comparison of Cohort to State and District – Improvement Model

Performance            Grades 8-9                          Grades 9-10
Band                   Cohort     LEA         State        Cohort     LEA         State
Δ N                    - 2021     + 12,555    + 45,712     - 1008     - 14,849    - 29,355
Δ Advanced             + 1.1      + 2         + 4          - 0.2      + 1         0
Δ Proficient           + 4.2      + 3         + 2          - 0.7      - 1         - 3
Δ Basic                + 7.2      0           - 3          - 2.3      0           - 1
Δ Below Basic          + 4.3      + 2         0            - 3.4      0           + 2
Δ Far Below Basic      - 17.0     - 7         - 2          + 6.7      0           + 2
Source for LEA and State Data: http://www.star.cde.ca.gov
Although improvement is apparent in the cohort’s performance from grade 8 to
grade 9, when the only band to shrink in proportion was the lowest, far below basic
band, some of these gains appear to be lost the following year, in which the far
below basic band is the only one that grows in proportion. Additionally, the number of
students who took the test in the study cohort shrank during this same time, while both
the LEA and the state student populations grew. This reduction in numbers, in spite of
simultaneous statewide and regional growth, makes interpretation of these data
difficult since the missing at random assumption cannot be established.
The federal system of school accountability under NCLB differs substantially
from California’s Academic Performance Index (API) system. As Table 4-3
summarizes, using both the NCLB system and the API system, students in the cohort
group and their contemporaries in the LEA and across the state have a similar pattern
of higher achievement during the 9th grade year, and either a slight loss or leveling off
of that higher score in the 10th grade year.

Table 4-3
NCLB and API Comparisons of Student Status Performance on the ELA CST

            NCLB Valuations:                    API Valuations:
            Percent proficient and above        Using Table 3-2
            8th       9th       10th            8th     9th     10th
Cohort      5.2%      10.5%     9.6%            449     535     507
LEA         17.0%     22.0%     22.0%           558     600     601
State       32.0%     38.0%     35.0%           654     686     667
Note: California reports the State and LEA data rounded to the nearest whole percent

However, in comparing the different groups to the stated goals of the Federal NCLB
requirements as opposed to the state’s API system, the groups are in very different
situations. As can be seen in Figure 4-1,
the study cohort is the only set that remains behind the federally established
benchmark, with the state being well ahead of the benchmark and the LEA being
ahead, or in the case of the last year of this analysis, about even with the benchmark. As
can also be seen, a smaller proportion of students in both the cohort group and students
across the state as a whole received scores in the proficient or advanced bands in the
cohort’s 10th grade year in comparison to its 9th grade year.
Figure 4-1
Comparison of Percent Proficient or Above on the ELA CST, Grades 8-10
[Figure: percent of students proficient or above at grades 8, 9, and 10 for the cohort, LEA, and state, shown against the NCLB benchmark.]
As evidenced in Figure 4-2, the state’s API system generates a similar
assessment. However, there is an important distinction in the evaluations that result
from these two systems. In the case of the API system, none of the groups considered
here, whether contemporaries across the state, the LEA, or the study cohort itself, has
shown achievement at the state’s target level of an API of 800. Similar to the NCLB
valuation, the API valuations show a similar peak in scores at the 9th grade year and a
subsequent drop off in the 10th grade year.
Figure 4-2
Comparison of API, Grades 8-10
[Figure: API values at grades 8, 9, and 10 for the cohort, LEA, and state, shown against the 800-point benchmark.]

In terms of growth modeling, the cohort was analyzed separately using
transition matrices as suggested by Hill, Gong, Marion, DePescale, Dunn, and
Simpson (2006). The results are summarized in Table 4-4 for the 8th to 9th grade
transition and in Table 4-5 for the 9th to 10th grade transition. This study’s proposed
value table (Table 3-3) was applied to the transition matrices. This value table is
intended to compare student growth to a benchmark of an average of 100 points
using this matrix. This average is calculated by dividing the sum of the scores by the
number of students that had scores in both grades. Using this method, the cohort
achieved an average growth score of 98.05 during the grade 8-9 transition and an
average growth score of 98.93 during the grade 9-10 transition (summarized in Figure
4-3). These results are summarized in Table 4-6 for the transition between grades 8
and 9, and in Table 4-7 for the transition between grades 9 and 10. It is noteworthy
that both of these scores are just slightly below the proposed expected mean growth
score of 100 points.

Figure 4-3
Cohort Mean Growth Using Proposed Value Table
[Figure: cohort mean growth scores for the grade 8-9 and grade 9-10 transitions, shown against the 100-point benchmark.]
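As an arithmetic check on the grade 8-9 figure, the weighted cell values in Table 4-6
sum to 654,550 points and the cell counts in Table 4-4 sum to 6,676 students with scores
in both years, so

    654,550 ÷ 6,676 ≈ 98.05,

which reproduces the reported mean growth score.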
Table 4-4
Cohort Transition Matrix, Grades 8-9

                                 Grade 9 Performance Band
Grade 8                 Far Below   Below
Performance Band        Basic       Basic    Basic   Proficient   Advanced
Far Below Basic         1617        635      98      3            0
Below Basic             559         981      557     17           0
Basic                   58          359      1059    322          8
Proficient              2           5        90      226          43
Advanced                0           0        1       14           22
Table 4-5
Cohort Transition Matrix, Grades 9-10

                                 Grade 10 Performance Band
Grade 9                 Far Below   Below
Performance Band        Basic       Basic    Basic   Proficient   Advanced
Far Below Basic         52          73       72      22           7
Below Basic             73          82       64      16           5
Basic                   72          58       53      21           4
Proficient              22          15       15      6            1
Advanced                7           2        6       2            0
Table 4-6
Cohort Growth Scores, Grades 8-9 Using the Proposed Value Table (Table 3-3)

                                 Grade 9 Performance Band
Grade 8                 Far Below   Below
Performance Band        Basic       Basic     Basic    Proficient   Advanced
Far Below Basic         80850       127000    29400    1200         0
Below Basic             0           98100     111400   5100         0
Basic                   -5800       0         105900   64400        2400
Proficient              -400        -500      0        22600        8600
Advanced                0           0         -100     0            4400
Table 4-7
Cohort Growth Scores, Grades 9-10 Using the Proposed Value Table (Table 3-3)

                                 Grade 10 Performance Band
Grade 9                 Far Below   Below
Performance Band        Basic       Basic     Basic    Proficient   Advanced
Far Below Basic         2600        14600     21600    8800         2600
Below Basic             0           8200      12800    4800         0
Basic                   -7200       0         5300     4200         -7200
Proficient              -4400       -1500     0        600          -4400
Advanced                -2100       -400      -600     0            -2100
Another value table (Table 3-4) suggested by Hill et al. (2006) is intended to
address the Safe Harbor provision of NCLB directly. Utilizing their method of
summing the cells and dividing by the number of students, with an acceptable
demonstration of growth being represented by a quotient of 10 or more, as
summarized in Table 4-8 and Figure 4-4, this cohort achieved a Safe Harbor Growth
Score of 4.38 in the transition from 8th grade to 9th grade. As summarized in Table 4-9,
the cohort had a Safe Harbor Growth Score of 2.08 for the transition from 9th grade to
10th grade.
Table 4-8
Safe Harbor Calculation Matrix, Grades 8-9 Using Hill et al. (2006) Safe Harbor Value Table

                                 Grade 9 Performance Band
Grade 8                 Far Below   Below
Performance Band        Basic       Basic     Basic    Proficient   Advanced
Far Below Basic         0           0         0        300          0
Below Basic             0           0         0        1700         0
Basic                   0           0         0        32200        800
Proficient              -180        -450      -8100    2260         430
Advanced                0           0         -90      140          220
Table 4-9
Safe Harbor Calculation Matrix, Grades 9-10 Using Hill et al. (2006) Safe Harbor Value Table

                                 Grade 10 Performance Band
Grade 9                 Far Below   Below
Performance Band        Basic       Basic     Basic    Proficient   Advanced
Far Below Basic         0           0         0        2200         700
Below Basic             0           0         0        1600         500
Basic                   0           0         0        2100         400
Proficient              -1980       -1350     -1350    60           10
Advanced                -630        -180      -540     20           0
In both of these transitions, the cohort group showed substantially less growth than
would be adequate to be placed in Safe Harbor under NCLB when using the model
proposed by Hill et al. (2006).
Figure 4-4
Cohort Mean Growth Using Hill et al. (2006) Safe Harbor Value Table
[Figure: cohort Safe Harbor growth scores for the grade 8-9 and grade 9-10 transitions, shown against the benchmark of 10.]
In considering these four different value tables, each paints a distinct picture of
the cohort. The first status model considered, namely the API model that California
implemented prior to the federal NCLB requirements, placed this cohort of schools
well below the state established benchmark of 800. However, the same was true for
the aggregate of all schools in the LEA and the aggregate of all schools across the
state. In this sense, the cohort lagged behind, but the benchmark has not been achieved
by the larger systems in which it is embedded. By contrast, using the NCLB model to
measure achievement status, the cohort has demonstrated substantially less
achievement than the established benchmark, which is generally described as the
percent of students who perform at or above the proficient mark.
By the value table proposed herein (Table 3-3), the cohort comes much closer to
an acceptable performance level at both transitions analyzed, demonstrating
only a scant one to two fewer average growth points than the proposed benchmark of
acceptable performance. This assessment remains quite different even from the other
growth model that was used to analyze the cohort, which determined the group to have
demonstrated less than half the growth necessary to establish eligibility for Safe
Harbor status. The difference in the assessment made by these two different value
tables is consistent with the different types of growth they value. While the proposed
value table for this study values continuance or advancement between all bands, the
Hill et al. (2006) Safe Harbor Value Table, as a necessary concession to the
requirements of NCLB, values growth that breaks over the demarcation between the
basic and proficient bands of student performance.
Discriminant Validity and Fairness of the Model
In order to address the second research question, regarding whether growth
scores had discriminant validity in relation to other established predictors of student
status achievement, the differences of standardized student achievement scores,
growth scores, from the transitions between 8th and 9th grade and the transition
between 9th and 10th grade, were analyzed using an ANOVA design. The cohort was
divided into two major ethnic groups, namely African American and Hispanic. The
Hispanic group was also divided into three major groupings by language
classification. These groups were limited English proficient (LEP), reclassified
functionally English proficient (RFEP) and one group that included both English only
(EO) and initially functionally English proficient (IFEP). The students’ base senior
high (SH) school was tracked as an additional source of variance. In the grade 8-9
transition, summarized in Table 4-10, I found small but significant main effects for
both ethnicity/language group, F (3, 4382) = 17.036, p = .001, ω̂2 = .011, and for
school, F (9, 4382) = 8.248, p = .001, ω̂2 = .015. As summarized in Table 4-11,
Table 4-10
Analysis of Variance for Student Growth Scores Grade 8-9

Source                               df      F           ω̂2      p
Ethnicity/Language Group             3       17.036 **   .011     .001
School                               9       8.248 **    .015     .001
Ethnicity/Language Group x School    27      1.184                .234
Error                                4382    (43.001)
Note: Values enclosed in parentheses represent mean square errors
* p < .05. ** p < .01
similar, small but significant main effects were evident for the grade 9-10 transition
for both factors, ethnicity/language group, F (3, 4382) = 13.589, p = .001, ω̂2 = .008,
and for school, F (9, 4382) = 5.979, p = .001, ω̂2 = .010. The interaction of these
sources was not found to be significant at either transition.
Table 4-11
Analysis of Variance for Student Growth Scores Grade 9-10

Source                               df      F           ω̂2      p
Ethnicity/Language Group             3       13.589 **   .008     .001
School                               9       5.979 **    .010     .001
Ethnicity/Language Group x School    27      0.864                .666
Error                                4382    (43.001)
Note: Values enclosed in parentheses represent mean square errors
* p < .05. ** p < .01
In order to examine the significant findings, Tukey’s HSD was used to analyze
the pair-wise comparisons. The mean growth for each of the ethnicity/language groups
is summarized in Table 4-12 for the transition from grade 8 to grade 9, and in
Table 4-12
Ethnicity/Language Group Mean Growth Grade 8-9
Ethnicity/Language Group Mean Growth Std. Error
African American 99.854 .293
Hispanic – EO/IFEP 99.755 .373
Hispanic – RFEP 99.232 .209
Hispanic – LEP 100.891 .267
Table 4-13 for the transition from grade 9-10. Additionally, the mean growth for each
of the cohort schools are summarized in Table 4-14 for the grade 8-9 transition and in
Table 4-15 for the transition from grade 9-10. The Tukey HSD for the
Table 4-13
Ethnicity/Language Group Mean Growth Grade 9-10
Ethnicity/Language Group Mean Growth Std. Error
African American 98.436 .322
Hispanic – EO/IFEP 98.420 .410
Hispanic – RFEP 99.236 .229
Hispanic – LEP 100.094 .293
Table 4-14
Base Cohort School Mean Growth Grade 8-9
Base Cohort School Mean Growth Std. Error
Boyington SH 98.108 .629
Dunham SH 99.532 .515
Butler SH 99.105 .340
Lejeune SH 100.898 .399
Daly SH 100.081 .523
Crowe SH 99.733 .491
Sousa SH 101.853 .400
Glenn SH 100.375 .403
Basilone SH 100.192 .439
Nicholas SH 99.453 .394
Table 4-15
Base Cohort School Mean Growth Grade 9-10
Base Cohort School Mean Growth Std. Error
Boyington SH 98.828 .692
Dunham SH 99.175 .566
Butler SH 100.027 .374
Lejeune SH 96.985 .439
Daly SH 99.828 .576
Crowe SH 99.564 .540
Sousa SH 97.578 .440
Glenn SH 98.920 .443
Basilone SH 100.071 .483
Nicholas SH 99.487 .433
Table 4-16
Tukey HSD Summary for Ethnicity/Language Group Differences; Grades 8-9
Groups Mean Difference Standard Error Cohen’s d p
African American
Hispanic - EO/IFEP
Hispanic - RFEP
Hispanic – LEP
-0.38
-0.03
-1.56
**
0.40
0.27
0.28
.148
.778
1.000
.001
Hispanic - EO/IFEP
Hispanic - RFEP
Hispanic - LEP
0.35
-1.17
*
0.37
0.38
.171
.774
.010
Hispanic - RFEP
Hispanic - LEP
-1.53
**
0.23
.246
.001
* p < .05. ** p < .01
Table 4-17
Tukey HSD Summary for Ethnicity/Language Group Differences; Grades 9-10
Groups Mean Difference Standard Error Cohen’s d p
African American
Hispanic - EO/IFEP
Hispanic - RFEP
Hispanic – LEP
0.15
-1.05
-1.73
**
**
0.44
0.30
0.31
.093
.201
.987
.003
.001
Hispanic - EO/IFEP
Hispanic - RFEP
Hispanic - LEP
-1.20
-1.88
**
0.41
0.41
.112
.017
.001
Hispanic - RFEP
Hispanic - LEP
-0.68
*
0.26
.246
.041
* p < .05. ** p < .01
ethnicity/language group comparisons are summarized in Table 4-16 for the grade 8-9
transition and in Table 4-17 for the grade 9-10 transition. In these analyses, it was
found that during the grade 8-9 transition, the students who were classified as
Hispanic – LEP demonstrated more growth than did any of the other groups, which
were not statistically distinguishable from each other. The same observation was made
during the grade 9-10 transition; however, in the latter transition, Hispanic – RFEP
students also showed more growth than African American students. Although
statistically significant, these differences were all small effects, with each of these
differences having a Cohen’s d of less than .25.
Two schools stand out in the Tukey HSD pair-wise analysis for the grade 8-9
transition, which is summarized in Table 4-18. The students in the cohort group who
attended Boyington SH demonstrated significantly less growth than did students at
any of the other schools except Dunham SH. At the same time, students in the cohort
group from Sousa SH demonstrated more growth than did students at the other
schools, except Lejeune SH and Basilone SH. This is particularly noteworthy when
this result is taken in conjunction with the status model data reported in Table 3-6.
Basilone SH and Lejeune SH stand out in this cohort as the only two schools with an
Table 4-18
Tukey HSD Summary for School Pair wise Comparisons; Grades 8-9
School Pair Mean Difference Standard Error Cohen’s d p
Boyington SH
Dunham SH -1.803 0.607 .087
Butler SH -1.965 ** 0.476 .24 .002
Lejeune SH -3.101 ** 0.521 .38 .000
Daly SH -2.319 ** 0.495 .29 .001
Crowe SH -2.027 * 0.557 .25 .010
Sousa SH -4.573 ** 0.558 .57 .001
Glenn SH -2.539 ** 0.500 .32 .001
Basilone SH -2.802 ** 0.486 .35 .001
Nicholas SH -2.069 ** 0.527 .26 .003
Dunham SH
Butler SH -0.162 0.505 1.000
Lejeune SH -1.299 0.547 .342
Daly SH -0.516 0.523 .993
Crowe SH -0.225 0.582 1.000
Sousa SH -2.770 ** 0.582 .41 .001
Glenn SH -0.736 0.527 .929
Basilone SH -0.999 0.514 .638
Nicholas SH -0.266 0.553 1.000
Butler SH
Lejeune SH -1.137 0.398 .118
Daly SH -0.354 0.364 .994
Crowe SH -0.063 0.444 1.000
Sousa SH -2.609 ** 0.445 .40 .001
Glenn SH -0.574 0.370 .870
Basilone SH -0.837 0.350 .332
Nicholas SH -0.104 0.406 1.000
Lejeune SH
Daly SH 0.783 0.420 .695
Crowe SH 1.074 0.492 .468
Sousa SH -1.472 0.492 .083
Glenn SH 0.562 0.426 .949
Basilone SH 0.299 0.409 .999
Nicholas SH 1.032 0.457 .417
Daly SH
Crowe SH 0.291 0.465 1.000
Sousa SH -2.254 ** 0.465 .32 .001
Glenn SH -0.220 0.394 1.000
Basilone SH -0.483 0.376 .957
Nicholas SH 0.250 0.428 1.000
Crowe SH
Sousa SH -2.546 ** 0.531 .38 .001
Glenn SH -0.512 0.470 .986
Basilone SH -0.775 0.454 .793
Nicholas SH -0.042 0.498 1.000
Sousa SH
Glenn SH 2.034 ** 0.470 .30 .001
Basilone SH 1.771 0.455 .004
Nicholas SH 2.504 ** 0.499 .39 .001
Glenn SH
Basilone SH -0.263 0.382 1.000
Nicholas SH 0.470 0.433 .986
Basilone SH
Nicholas SH 0.733 0.417 .761
* p < .05. ** p < .01
API that was greater than 600 in 2005, yet neither of these schools is differentiated
from the bulk of the rest of the cohort by this study’s growth model. Also worth
noting, Sousa SH, whose students showed more growth using this study’s method,
was also the lowest performing school in the cohort when measured by either the
state’s API or the NCLB status model. While the students at Boyington SH, which
showed less growth than the bulk of the other schools in this study cohort, have status
achievement that is above the median for this cohort sample, this school is not the
same type of outlier in terms of status model student achievement for this group as
are Basilone SH and Lejeune SH.
The results of the grade 9-10 pair wise comparisons of the cohort schools are
summarized in Table 4-19. The results for this latter transition differed substantively
from the results for the grade 8-9 transition. During the grade 9-10 transition, only one
school, namely Dunham SH, was not statistically distinguishable from any other
school. While students at Sousa SH and Lejeune SH had slightly above median
average growth during the grade 8-9 transition, student growth at both schools
dropped at the grade 9-10 transition. Thus, at Lejeune SH, whose students demonstrated
statistically significantly more growth during the grade 8-9 transition than the
students at only one other school (Boyington SH), the same cohort of students showed
Table 4-19
Tukey HSD Summary for School Pair wise Comparisons; Grades 9-10
School Pair Mean Difference Standard Error Cohen’s d p
Boyington SH
Dunham SH -1.166 .667 .769
Butler SH -2.326 ** .524 .29 .001
Lejeune SH 0.030 .573 1.000
Daly SH -2.085 ** .545 .26 .005
Crowe SH -1.683 .613 .156
Sousa SH 0.291 .613 1.000
Glenn SH -1.774 * .550 .22 .042
Basilone SH -2.266 ** .534 .28 .001
Nicholas SH -1.618 .580 .140
Dunham SH
Butler SH -1.161 0.556 .535
Lejeune SH 1.196 0.602 .609
Daly SH -0.919 0.575 .850
Crowe SH -0.517 0.640 .998
Sousa SH 1.457 0.641 .407
Glenn SH -0.608 0.580 .989
Basilone SH -1.101 0.565 .637
Nicholas SH -0.452 0.608 .999
Butler SH
Lejeune SH 2.357 ** 0.437 .30 .001
Daly SH 0.242 0.400 1.000
Crowe SH 0.644 0.488 .950
Sousa SH 2.617 ** 0.489 .38 .001
Glenn SH -0.553 0.407 .940
Basilone SH -0.060 0.385 1.000
Nicholas SH -0.708 0.446 .855
Lejeune SH
Daly SH -2.115 ** 0.462 .27 .001
Crowe SH -1.713 * 0.541 .22 .050
Sousa SH 0.261 0.541 1.000
Glenn SH -1.804 ** 0.468 .23 .005
Basilone SH -2.296 ** 0.450 .29 .001
Nicholas SH -1.648 * 0.503 .21 .035
Daly SH
Crowe SH 0.402 0.511 .999
Sousa SH 2.376 ** 0.512 .33 .001
Glenn SH 0.311 0.434 .999
Basilone SH -0.181 0.414 1.000
Nicholas SH 0.467 0.471 .993
Crowe SH
Sousa SH 1.974 * 0.584 .29 .025
Glenn SH -0.091 0.517 1.000
Basilone SH -0.583 0.500 .977
Nicholas SH 0.065 0.548 1.000
Sousa SH
Glenn SH -2.065 ** 0.517 .28 .003
Basilone SH -2.557 ** 0.500 .35 .001
Nicholas SH -1.909 * 0.549 .26 .018
Glenn SH
Basilone SH -0.492 0.420 .977
Nicholas SH 0.156 0.477 1.000
Basilone SH
Nicholas SH 0.648 0.458 .923
* p < .05. ** p < .01
significantly less growth during the grade 9-10 transition than did students at six other
schools in the cohort. During the same time period, the students at Boyington SH,
which had demonstrated significantly less growth during the grade 8-9 transition than
did students at every school in the cohort except Dunham SH, demonstrated growth at
the grade 9-10 transition that was statistically distinguishable as lower than that of only
four schools in the cohort. It is also noteworthy that no school demonstrated
statistically significantly higher growth than one school while having less growth than
another school.
Utility of the Model
In order to address the question of utility (Lissitz & Samuelsen, 2007), Pearson
correlation coefficients were calculated between the growth scores of the cohort’s 8th
and 9th grade years and the growth scores of the 9th and 10th grade years against CST
scale scores for those years and other predictors of student achievement, including
teacher issued marks, number of days the student was enrolled in classes, number of
days the student attended classes, and the change in teacher issued marks from the 9th
grade spring semester English class to the 10th grade spring English class. The results
are summarized in Table 4-20.
The largest single relationship found was the negative relationship between the
growth between the 9th and 10th grade CST exams and the growth between the 8th and
9th grade CST exams. Additionally, the scores generated for this growth model were
generally negatively correlated with most status measures of achievement. Growth
between the 8th and 9th grade CSTs was not shown to be associated with enrollment or
attendance, except for a small relationship to attendance in the cohort’s 10th grade year
(r = .03, p = .028). This growth score was negatively correlated with 8th grade status
scores on the CST, and 9th grade teacher issued marks. It had a small but significant
relationship to improvement in teacher issued marks from English 9B to English 10B.
Both sets of growth scores had a moderately negative correlation to the previous
year’s status score on the CST, and a moderate positive relationship to the second year
scale score; however this may be more definitional in nature, rather than of real
significance. Growth between grades 9 and 10 had a noteworthy small, but significant
negative relationship to enrollment and attendance at each grade level that was
tracked.
Table 4-20
Pearson Correlation Coefficients of Growth Scores and Other Indicators of Success

                                     Growth score 8th-9th       Growth score 9th-10th
Indicator                           r        p      N          r        p      N
Scale score, grade 8             -.347**   .001   6059       .084**   .001   4656
Scale score, grade 9              .324**   .001   6059      -.372**   .001   5171
Scale score, grade 10             .007     .652   4656       .353**   .001   5171
Days enrolled, grade 8           -.020     .123   5744      -.035*    .015   4902
Days enrolled, grade 9           -.006     .616   6051      -.042**   .003   5149
Days enrolled, grade 10          -.006     .623   6013      -.036**   .009   5139
Days enrolled, grade 11           .013     .337   5516      -.051**   .001   5060
Days attended, grade 8           -.023     .087   5744      -.041**   .004   4902
Days attended, grade 9            .020     .123   6013      -.040**   .004   5149
Days attended, grade 10           .022     .095   6013      -.052**   .001   5139
Days attended, grade 11           .030*    .028   5516      -.037**   .008   5060
Marks, English 9A                -.077*    .001   5892      -.041**   .004   4989
Marks, English 9B                 .058**   .001   5448      -.015     .311   4553
Marks, English 10A                .011     .421   5297      -.065**   .001   4442
Marks, English 10B               -.014     .285   5874      -.038**   .007   4996
Marks, composition                .014     .304   5431      -.003     .863   4661
Marks change, Eng 9B to Eng 10B   .038**   .006   5322      -.023     .121   4436
Growth, grades 8-9               1                6059      -.426**   .001   4691
Growth, grades 9-10              -.426**   .001   4691      1                6059
* p < .05, ** p < .01
Impact of Policy Changes
To determine if there is evidence of a disparate impact brought on by changing
policies with the advent of NCLB, the differences between standardized CST scale
scores were analyzed over two transitions. Once standardized, changes in individual
student performance were determined from 8th to 9th grade and from 9th to 10th
grade. The resultant growth scores were compared by the students' performance band
during the earlier year. For these analyses, the far below basic (FBB) band and the
below basic (BB) band were combined and compared to the basic (B) band.

Table 4-21
Student t Test of Students in Basic Compared to Students in Far Below and
Below Basic Performance Bands
Transition    N      M        SD      t              d     p
Grade 8-9
Basic         1675   98.04    6.619   -15.314 **     .24   .001
FBB/BB        3757   100.98   6.482
Grade 9-10
Basic         1362   98.75    7.506   -5.317 **      .18   .001
FBB/BB        2749   100.01   6.983
** p < .01

As summarized in Table 4-21, at both transitions, students who were in the B
proficiency band showed lower growth than the students who were in the combined
FBB and BB performance bands. For the transition from 8th grade to 9th grade,
students whose 8th grade status scale score fell in the lower bands demonstrated more
growth than students whose status scale score was in the B Band, t (5430) = -15.314,
p = .001. This was a small, but significant effect, d = .24. For the transition from 9th
grade to 10th grade, students whose 9th grade status scale score fell in the FBB and BB
bands also demonstrated more growth than students whose status scale score was in
the B Band, t (4109) = -5.317, p = .001. This was also a small, but significant effect,
d = .18. Although these are both small effects, the differences are in the opposite
direction from the stated hypotheses. These results are consistent with what would be
expected based upon an assumption of statistical regression towards the mean.
CHAPTER 5
DISCUSSION
In this study, I have examined the practical validity of using growth analysis
on standardized student scale scores on the California Standards Test (CST) in order to
evaluate schools and programs. Using the CST scale scores from a cohort of inner-city
students, after describing the cohort in relation to the Local Educational Agency
(LEA) and state in which it is embedded, I turned to three important issues of practical
validity, namely: the discriminant validity and inherent fairness of the model, the
utility of the growth model that was employed in this study, and the impact of
changing policies on practice. Although this type of growth modeling has serious
limitations, and cannot be reasonably used for evaluation at the teacher or class level,
the basic model may be an improvement over the current practices of using status and
improvement models for school and program evaluation. Although it cannot be
deemed to be a truly fair model, further research may be able to refine it into a fairer
model than the current status and subsequent cohort models.
Comparison of Evaluation by Varying Value Tables
Using a variety of value tables to describe the performance status of the cohort, the
LEA, and the state on the CST, I observed that the cohort group lagged substantially
behind the state and the LEA in terms of status and improvement model achievement
on the CST. This observation held irrespective of whether the analysis was based upon
the No Child Left Behind (NCLB) metric, which only considers the proportion of
students whose performance level is proficient and above, or upon a more lax
benchmark that considers student performance in the basic band and above.
The students in the cohort group showed substantially less success using a status
model than either students from the entire state or students from the LEA. In this
sense, the students in this cohort group are very different from the students in the
larger systems from which the group was extracted. Another important observation is
that over the course of the same years, the number of students in the cohort group
shrank dramatically over time, while the number of students in the state and in the
LEA did not have the same type of consistent change in numbers of students who took
the CST from one year to the next. This shrinking of the cohort may be indicative of
low performing students dropping out of school, or simply opting out of the state
mandated testing at a greater rate than in the larger systems, or shifting demographics
as families move about within the area due to external factors such as housing costs
and local job markets.
However, an important similarity in trends in the data was noteworthy. The
students in the cohort group, the LEA and the state experienced parallel changes in
performance, which were particularly evident in Figure 4-1 and Figure 4-2. These
parallel changes in performance demonstrate that the cohort group experienced similar
changes from year to year in the level of difficulty of the subsequent tests, as did the
students in the LEA and students across the state as a whole. This observation
illustrates the lack of vertical scaling of the CST as a test and shows that the changes
in difficulty level were pervasive for all students in the cohort, the LEA and across the
state. This consistent trend indicates a need to address the lack of vertical equating by
some method, such as standardizing the scores before calculating the differences used
as the scores for measuring growth. By standardizing the scale scores before
calculating the growth scores, I was able to establish a set score as average growth to
create a basis for comparison. Without standardizing the scale scores, any effort to
compare the growth of students at different grade levels is rendered meaningless, since
all students as a whole consistently find the CST to be simply more difficult at certain
grades in comparison to others. While the dissimilarity in performance noted above
reasonably calls into question the ability to generalize the rest of the findings in this
study, the parallel changes noted here speak to an important similarity that would
suggest that it may be reasonable to interpret these results beyond the cohort group
itself.
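
To make the standardization just described concrete, the following sketch shows one way such a calculation could be implemented in Python with pandas. The column names, the long-format layout, and the rescaled standard deviation of 10 points are illustrative assumptions rather than the study's exact specification; anchoring each grade's mean at 100 times the grade level follows the convention described later in this chapter.

import pandas as pd

# Illustrative long-format data: one row per student per tested grade,
# holding the raw CST scale score for that year (column names are hypothetical).
cst = pd.DataFrame({
    "student_id":  [1, 1, 2, 2],
    "grade":       [8, 9, 8, 9],
    "scale_score": [290, 320, 355, 348],
})

SD_RESCALED = 10  # illustrative choice; the study's actual scaling may differ

# Standardize within each grade and year, then anchor the mean at 100 x grade,
# so that average year-to-year growth works out to 100 points.
z = cst.groupby("grade")["scale_score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0))
cst["std_score"] = 100 * cst["grade"] + SD_RESCALED * z

# Growth score = difference between consecutive years' standardized scores.
wide = cst.pivot(index="student_id", columns="grade", values="std_score")
growth_8_9 = wide[9] - wide[8]
print(growth_8_9)

Because each grade is standardized against its own year's distribution, a year in which all students found the test harder no longer depresses every growth score, which is the point of the adjustment.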
Another valuable observation from these analyses of the cohort and the larger
systems relates to the method of analysis itself. The transition
matrices that I used herein have some very useful attributes for communicating the
data to the public at large. As Johnson (2002) points out, it is essential to make data
“audience friendly” (p. 49). The transition matrices put forward by Hill, Gong,
Marion, DePascale, Dunn, and Simpson (2006) have an elegant simplicity and are
much easier for the general public, or even most school professionals, to understand
than even the most basic regression equation that is used for value added modeling,
namely the Simple Fixed Effects Model. Indeed, for many they are more intuitive
even than the simple change scores I used in the rest of this analysis. These matrices
are a valuable tool that school leaders can use to communicate school performance or
program effectiveness to the general public, and to aid stakeholders at a given school
site in understanding this otherwise complex issue.
Another important value that is inherent in the use of transition matrices is
likewise tied directly to their simplicity. School leaders can, with only modest effort
and after minimal training, create a transition matrix using simple spreadsheet
software (e.g.: Microsoft Excel®) that is regularly available on virtually every
classroom computer. This is a particularly valuable attribute of the transition matrix
model since many school leaders lack the statistical sophistication necessary to
analyze the data with more advanced models, or lack access to the software that
would be needed for such analysis, such as SPSS®, SAS®, or even Minitab®. Even a
simple fixed effects model, or the simple standardized growth score model I use in this
analysis requires a degree of statistical knowledge that is not inherent in most teacher
preparation or school leadership programs. In a more extreme contrast, Hierarchical
Linear Models and Layered Mixed Effects Models require the use of complex
proprietary formulae, and these analyses are not easy for any practitioner in the field
to perform.
Another great strength of using value tables and transition matrices is their
flexibility. Local school stakeholders can adopt an existing table, or easily modify one
to reflect local values, goals, and an individual school’s mission. The value of each
cell can be modified to value growth at specific levels over growth at other levels,
such as awarding a bonus for growth in lower bands, or reducing the value of
remaining in the same band from year to year.
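
To make the computation behind these two points concrete, the sketch below builds a transition matrix and applies a value table to it in Python with pandas; a spreadsheet pivot table produces the same matrix. The band labels, column names, and cell weights are invented for the illustration and are not taken from the tables analyzed in this study.

import pandas as pd

# Hypothetical performance bands for the same students in two consecutive years.
bands = ["FBB", "BB", "B", "P", "A"]
df = pd.DataFrame({
    "band_year1": ["BB", "B", "B", "FBB", "P", "B"],
    "band_year2": ["B",  "B", "P", "BB",  "P", "BB"],
})

# Transition matrix: rows = starting band, columns = ending band, cells = counts.
matrix = pd.crosstab(df["band_year1"], df["band_year2"]).reindex(
    index=bands, columns=bands, fill_value=0)

# Illustrative value table: the weight a school community assigns to each
# transition. Any cell can be raised or lowered to reflect local goals, such as
# a bonus for growth out of the lowest bands.
value_table = pd.DataFrame(
    [[100, 150, 200, 250, 300],   # from FBB
     [ 50, 100, 150, 200, 250],   # from BB
     [  0,  50, 100, 150, 200],   # from B
     [  0,   0,  50, 100, 150],   # from P
     [  0,   0,   0,  50, 100]],  # from A
    index=bands, columns=bands)

# School or program score: weighted average of transitions across all students.
score = (matrix * value_table).to_numpy().sum() / matrix.to_numpy().sum()
print(matrix)
print(score)

With the diagonal of this particular table set to 100, a score of 100 corresponds to students, on average, simply holding their band from one year to the next; changing individual cells changes what the school chooses to reward.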
However, the more complex regression models do provide greater ability to
examine the data. They are able to account better for extraneous variables, and by
using standardized scores, some accounting for the lack of vertical equating may be
accomplished. These are advantages that the transition matrices lack. Nevertheless,
transition matrix models hold considerable promise as a way to analyze and
communicate student growth in the future and are worthy of further investigation.
Discriminant Validity and Fairness of the Model
In order to evaluate the discriminant validity and fairness of using the model I
have proposed in this study, two factors were analyzed to determine if they were
unduly related to the growth scores used in this model. As a necessary concession to
the demographics of the cohort of students considered in this study, four groups of
students were considered based on ethnicity and English language proficiency.
Notably, the students that were designated as having limited English proficiency
(LEP) demonstrated more growth in this model than did the other groups. While this
was a statistically significant finding, it was a small effect; there are multiple possible
explanations for this difference. Secondary schools in this cohort provide double-
block, homogeneously grouped instruction for LEP students. This approach may be
providing the additional support these students need to attain more growth during the
school year.
However, by definition, LEP students are expected to perform at lower levels on
tests of English language arts than any other category. Once a student who is classified
as LEP attains proficiency in English, the student is by definition no longer considered
to be LEP, but is now reclassified as functionally English proficient (RFEP). This is
important because the pairing of lower status model achievement with higher growth
model performance is a pattern that has recurred frequently in this study. It may be simply an
example of statistical regression towards the mean that is large enough to be measured
and may be indicative of a weakness in the proposed model that is primarily resultant
from the lack of reliability inherent in calculating a growth score as the difference
between two other scores.
Additionally, I evaluated the ten schools in this cohort against each other. The
differences in growth among the schools, while statistically significant, also amounted
to a small effect. Also worth noting, Sousa SH, whose students showed more growth
during the grade 8-9 transition, using this study’s method, was also the lowest
performing school in the cohort when measured by either the state’s API or the NCLB
status model. This finding again raises the question of statistical regression towards
the mean.
Another issue that was raised by the inter-school evaluation was the lack of
consistency from year to year. The fact that students at some schools showed more
growth at different years could be the result of school programs being implemented
differently at different grade levels at the various schools. It could also indicate a
corollary situation to the lack of stability from year to year that Buddin, McCaffrey,
Kirby, and Xia (2007) found in Florida’s value added model teacher evaluation
system.
Given this instability, and apparent lack of established reliability, when trying
to apply this model to smaller systems, such as small districts or to individual schools,
the problem of small numbers may further threaten the validity of this type of analysis.
In this study, I started with a cohort of 13,160 students from 10 large urban senior high
schools. Even with that large initial population, the small size of various groups can
create room for reasonable doubt regarding the interpretation of the results. In an
analogous situation, a school site leader, starting with a population of around 100 in a
grade level, who is interested in meeting the needs of the various ethnic and language
subgroups at her or his school, will soon be looking at very small groups, which
makes it difficult to bring out meaningful interpretations of the data.
Utility of the Model
In a broad sense, the differences between subsequent years’ standardized CST
scale scores, which were used to represent growth in this study, were negatively related
to the various measures of achievement status. The exception to this rule was the
standardized scale score from the later grade that was used as part of the equation that
created the growth score. This observation seems to be essentially definitional in
nature. Students with lower scores in grade 8 and higher scores in grade 9 will have a
larger difference when the scores are standardized and subtracted, as a simple function
of the arithmetic involved.
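
A brief worked example makes the arithmetic explicit. The figures below assume, purely for illustration, that each grade's scores are rescaled to a mean of 100 times the grade level with a standard deviation of 10 points; the exact rescaling used in this study may differ.

Student A: grade 8 score of 790 (one SD below the grade 8 mean of 800) and grade 9 score of 900 (at the grade 9 mean); growth = 900 - 790 = 110.
Student B: grade 8 score of 810 (one SD above the grade 8 mean) and grade 9 score of 900 (at the grade 9 mean); growth = 900 - 810 = 90.

Both students finish at the grade 9 mean, yet the student who started lower receives the larger growth score simply because a smaller number is subtracted.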
Another observation that is worth noting is the negative relationship between
growth in grades 8-9 and subsequent growth in grades 9-10. It seems that students who
made substantial growth in grades 8-9 did not continue that pattern in grades 9-10. In
addition, students who demonstrated little growth in the earlier year tended to show
more growth in the later year. This negative relationship shows that growth from one
year to the next is not stable. This type of instability over time leads to serious
concerns about the reliability of the growth scores used in this method to analyze the
data.
Growth between grades 9 and 10 was also negatively related to the various
measures of student attendance. Conversely, growth between grades 8 and 9 was
predictive of attendance in grade 10. It would seem that the more growth a student
showed on the CST during the first year of high school, the more that student would
attend sophomore classes. However, the more a student attended school in any year,
the less growth the student showed on the CST in the sophomore year. The flaw in this series of
analyses comes in the observation that the grade 9-10 growth score is also negatively
correlated with status model achievement. Thus, less growth is related to higher
achievement. Attendance is generally regarded as a predictor of higher performance in
status modeling. The more a student attends classes, the higher the student tends to
score on the CST. Since CST performance is negatively related to growth on the CST,
it is reasonable to expect that growth would also be negatively related to attendance.
This is an important caveat that will be discussed as part of the limitations of this
study at greater length in this chapter.
One of the more difficult to interpret relationships was the one between growth
on the CST and the change in grades between the teacher-issued marks in the spring
semesters of grades 9 and 10. If one assumes that both the teacher-issued marks and
the CST scale scores are reliable indicators of student achievement, it would be
expected that changes in marks as a measure of growth would be directly related to the
CST growth scores as calculated in this study over the same period of time. However,
this analysis did not bear out that expectation. The analysis failed to find a relationship
between these different measures of growth. However, growth on the CST from grade
8 to grade 9 was positively predictive of changes in teacher-issued marks between the
spring semester of grades 9 and 10.
The failure to find a relationship between these marks and the simultaneous
growth on the CST may be an indicator of the unreliability of the growth scores that
were used to measure CST growth, the unreliability of changes in teacher-issued
marks as a measure of student growth, or both. It is possible that growth on the CST in
a student’s first year, which predicts higher future attendance in grade 10, may lead
to higher marks in grade 10, but the data here is not sufficient to adequately support
the assertion of a causal link.
Impact of Policy Changes
In this study’s fourth research question, I attempted to apply the model to
determine if there is a disparate impact on lower achieving students because of
changes in educational policy brought on by the federal NCLB Act. Rather than
finding that lower achieving students were continuing to show lower growth, which
would be an indicator of such a disparate impact of the policy change, this study found
the reverse. Interpreting this result is not as simple as declaring that the lowest performing
students are outgrowing their peers who had status scores in the basic band, which is
often targeted for intervention as the bubble group. Throughout this study, I have
consistently observed a pattern of lower performing students, when measured by status
modeling, showing more growth than higher performing students do. This result is
also consistent with observations that Goldschmidt et al. (2005) noted. This
observation ties into a thread of statistical regression towards the mean that cannot
easily be dismissed. This statistical regression is an indicator of weak reliability and is
discussed as a limitation in this chapter.
Implications
One of the greatest strengths of the growth model that I examined here is that
the main effect found among the ethnic and language groups was only a small effect.
This is in stark contrast to the very powerful predictors that ethnicity and English
language fluency are in status models (Meyer, 1997). Additionally, the observed
trends in this study tend to favor students who are in lower proficiency bands, or
would be expected to have lower achievement in status models, similar to what was
noted by Goldschmidt et al. (2005). Noting the relative weakness of the observed
influence of these factors, using growth scores to measure student progress may be
argued to have greater discriminant validity than using subsequent cohort or other
status models. Therefore, programs may not be greatly advantaged or disadvantaged
by the fact that they provide services to groups of students that fit into one or another
combination of these demographic factors. This apparent improvement in discriminant
validity over status models would seem to indicate that measuring student growth with
this type of growth model would be considered a fairer method for evaluating school
and instructional programs than using status and improvement models. However, it is
worth further exploring this question by isolating ethnicity as a single factor to
determine the model’s ability to discriminate this factor from growth. It is also a
concern that since the model does favor lower achieving students, it is probably not
completely fair as an evaluation tool for programs unless the factor of initial
performance level is controlled for by using experimental, or at least quasi-
experimental research methods. Additionally, the ethnic groups that I examined in this
study are not generally cited as the ethnic groups that have the greatest disparity in
status model performance. It would be more informative to further investigate this
method with a study sample that compares across a broader ethnic spectrum, and
specifically including white, Asian, African American, and Hispanic participants in
the ANOVA.
Another advantage of using this growth model is its relative simplicity. The
growth scores can be calculated, standardized, and evaluated with relative ease using
any standard statistical software package. There are no complicated or proprietary
formulae involved. Although the statistical knowledge required to do this analysis may
be beyond the average education practitioner, a typical secondary school will have
personnel on staff who are capable of executing and interpreting this level of analysis.
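
As an illustration of how routine this step can be, the sketch below computes mean growth by program and compares one program's growth to the expected value of 100 with a one-sample t test. The data frame, program labels, and column names are hypothetical, and the same steps could be carried out in any standard statistical package.

import pandas as pd
from scipy import stats

# Hypothetical student-level growth scores tagged with a program label.
data = pd.DataFrame({
    "program": ["A", "A", "A", "B", "B", "B"],
    "growth":  [104.0, 99.0, 108.0, 96.0, 93.0, 101.0],
})

# Mean growth by program; values above 100 indicate more than a year's growth.
print(data.groupby("program")["growth"].mean())

# One-sample t test of a single program's growth against the expected 100.
program_a = data.loc[data["program"] == "A", "growth"]
t_stat, p_value = stats.ttest_1samp(program_a, popmean=100)
print(t_stat, p_value)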
As a corollary to its simplicity, this model’s transparency is one of its
strengths. By setting the standardized mean as 100 times the grade level, one year’s
growth is defined as an intuitively comfortable 100 points. A group of students whose
average growth score is greater than 100 has shown more than average growth, while a
group of students whose average growth score is below 100 has shown less than
average growth. These terms are easy for school leaders to explain to stakeholders at
the school. This level of transparency is essential to communicating results to the wide
variety of stakeholders at a school site, such as parents, benefactors, faculty, staff, and
leadership. By using a model that is simple and easy to explain, school leaders put
themselves in a position of being able to explain clearly whether or not a program is
leading to greater student growth in a manner that is intelligible to all stakeholders.
However, since I was not able to establish the discriminant validity of this method,
this could lead to misinterpretations being made.
Limitations
One of the more severe limitations to this study is the observation that simple
statistical regression towards the mean may be a sufficient explanation for many of the
key findings. This statistical phenomenon satisfactorily explains the finding of the
main effect relating to the ethnicity/language group of the student, since it is reasonable
to expect LEP students to have lower status achievement on the CST than native
English speaking students. It also would be a reasonable explanation for the
instability of the growth scores over time. Statistical regression towards the mean is
also a satisfactory explanation for many of the correlations I discussed in regard to
the utility of the model. Finally, regression towards the mean satisfactorily explains
why students whose CST scores were in the lowest bands showed more growth using
this model than did students whose status performance was in the basic band.
Inasmuch as regression towards the mean can be taken as an indicator of unreliability,
the growth scores, which constituted the major unit of measurement for this study,
may be unreliable. This may be true in many other growth and value added models as
well, as indicated by the observations of Goldschmidt et al. (2005).
Another important limitation to this study is the homogeneity of the study
cohort. All 10 schools were from the same LEA and were local to each other, and they
had strikingly similar geographic and demographic characteristics, being situated in a
large, urban environment with large minority populations and high poverty rates. Although the
cohort group experienced similar changes in the difficulty level of the CST at different
grades as compared to the LEA and the state, their achievement status was
consistently low in comparison to these larger systems irrespective of how that status
achievement was assessed. This may indicate a limitation to the ability to generalize
the conclusions from this study to other situations, such as districts that have less
poverty, or different ethnic combinations.
While interpreting the growth scores in general may appear to be easy, due to
the statistical regression, these scores may be considerably less meaningful at the
individual student level, especially for students whose status achievement is not near
the mean. A student who is achieving at the far below basic level will tend to have a
higher growth score than a student who is in the basic band. Similarly, the regression
towards the mean would indicate that a student who is achieving at the advanced level
would tend to have a lower growth score. These nuances may prove to be difficult if
not impossible to explain at the individual student level, especially to parents. It will
sound confusing to say to a parent that his or her child is demonstrating higher than
average growth, but has low achievement, and that should be a real concern to the
parent. Conversely, it would also be confounding to a parent to say that a high
achieving student has less than average growth, but then attempt to explain why a
parent should not be concerned about her or his child’s less than average growth.
Thus, interpreting these results at the individual level may prove to be more
problematic than it would be simply to continue to use the current status model scores.
Since in this study, growth is found to be generally negatively related to status
performance, the practice of reporting growth scores at the individual level would
be prone to conveying both false positive and false negative messages.
A similar concern exists in using this method for school site evaluation. In this
study I found that schools that tend to have higher achievement will tend to have lower
growth, while lower performing schools will tend to have higher growth. A more
sophisticated model, which accounts for this regression towards the mean, may
improve upon this problem, but this would be at the expense of losing both
transparency and simplicity. Another option would be to limit interpretations and
comparisons of this model to schools that begin with similar achievement using a
status model. In this manner, schools and programs could be compared in a quasi-
experimental fashion by matching schools before using the growth scores as calculated
in this study.
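
One minimal way such matching might be operationalized, sketched here under the assumption that a school-level table of baseline status scores and mean growth scores is available (the school names, numbers, and column names are hypothetical), is to sort schools by baseline status and compare growth within adjacent pairs:

import pandas as pd

# Hypothetical school-level summary: baseline status score and mean growth score.
schools = pd.DataFrame({
    "school":   ["School 1", "School 2", "School 3",
                 "School 4", "School 5", "School 6"],
    "baseline": [612.0, 598.0, 571.0, 604.0, 589.0, 575.0],
    "growth":   [99.1, 100.4, 102.0, 98.7, 100.9, 101.5],
}).sort_values("baseline").reset_index(drop=True)

# Pair adjacent schools after sorting by baseline status, so each comparison
# is between schools that started at similar achievement levels.
for i in range(0, len(schools) - 1, 2):
    low, high = schools.loc[i], schools.loc[i + 1]
    print(low["school"], "vs", high["school"],
          "growth difference:", round(high["growth"] - low["growth"], 2))

A fuller design would also check that paired schools are demographically comparable, but even this simple step keeps comparisons from rewarding schools merely for starting lower.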
Conclusions
In this study, I have raised more questions than I have answered. There are
myriad avenues opened here for further study. I have attempted here to
provide a simple measurement of student growth that is sensitive to cogent concerns
about the lack of vertical scaling of the CST. Given the valid doubts that this model
raises due to the observed statistical regression towards the mean, and the resultant
concern about the reliability of the growth scores that I used in this model, a more
refined model that takes this regression into account may be worth using instead. The
trade-off that is inherent in using a more complex model is the resultant loss of
simplicity and transparency.
In considering the use of value tables, one avenue that may be worthy of
further research would be to analyze the effect of parsing the lowest proficiency band,
far below basic into two sub groups. This band contains a large number of students,
and has a wider range of scaled scores included in it than do the other proficiency
bands except the advanced band. By parsing this band, hence adding an additional
very far below basic (or similarly named) band, more can be determined about
changes in performance among the group of students at this low end that would be of
practical importance to practitioners in education.
Depending on how the data are used, Goldschmidt et al. (2005) note that some
practitioners may be tempted to ignore the noted effect of higher growth being
associated with lower performance, in a sense allowing an advantage to programs and
schools that are used to serve lower performing students. This would not be a wise
tack in a leadership sense. By granting a statistical advantage to schools that tend to
have a large number of low performing students or to programs that tend to be
associated with lower performance, the leader gets results that are inaccurate and
therefore unhelpful in making decisions. In a political sense, this would be unwise,
since it will quite legitimately call into question the integrity of the interpretations
regarding any program that uses this method without exercising some deliberate
statistical control over the issue of regression towards the mean.
Given its weakness in reliability, it would not be warranted to use this model
for high-stakes purposes such as personnel evaluation or even school ranking. The
regression towards the mean would likely taint the results and lead to falsely strong
measures for teachers and schools who serve lower achieving students and conversely
low measures for those that serve higher achieving students. Additionally, in order to
obtain a growth score, a student must have a CST score in two consecutive years.
Since the CST is given only in grades 2-11 and only in core subjects, this model
cannot be applied to teachers in the primary grades, through second grade. It also has
no application to secondary teachers who teach non-core classes, such as arts, music,
or physical education. It is politically untenable to create appropriate exams for each
of these subjects and for the lower grades simply to provide a level teacher-evaluation
program. The fiscal cost and the loss of instructional time would be too prohibitive.
However, this model can have some value in providing summative data for the
analysis of the effectiveness of instructional programs and interventions, or for
determining whether a piloted curriculum is effective. Since valid doubts remain regarding the reliability
of growth scores, it may be best used as part of a concert of methods to help
triangulate with observations and other qualitative sources of data in a mixed-methods
approach to analyzing the programs in question. To be used effectively, this model
would be best used in an experimental, or at minimum, a matched groups quasi-
experimental design which takes status model performance into account in the
matching.
In spite of this model’s shortcomings, it may be developed with further
refinement into a model that would be an improvement upon the basic assumptions
required by successive cohort models, or snapshot status models. By following the
same students from year to year, the model examined in this study has the advantage
of providing a matched sample for analysis. Successive cohorts are notoriously
mismatched from each other. Matching to another similar school raises concerns of
research control and requires a practitioner to engage in extra work to obtain
permission to use another school’s data, and to ensure that the matching school
actually does match the school that is being studied. While I have not been able to
ameliorate the major concerns about the practical validity of this model, and the
growth scores used in this model seem to be less than reliable, the model I explored
here has the potential to provide an additional lens in analyzing the effectiveness of
schools and instructional programs.
References
Amrein-Beardsley, A. (2008). Methodological concerns about the educational value-
added assessment system. Educational Researcher, 37(2), 65-75.
Andrejko, L. (2004). Value-added assessment: A view from a practitioner. Journal of
Educational and Behavioral Statistics, 29, 7-9.
Braun, H. (2005). Value-added modeling: What does due diligence require? In R. W.
Lissitz (ed) Value Added Models in Education: Theory and Application (pp.
19-39). Maple Grove, MN: JAM Press.
Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and
practice. Educational Measurement: Issues and Practice, 17, 5-9.
Buddin, R., McCaffrey, D. F., Kirby, S. N., & Xia, N. (2007). Merit Pay for Florida
Teachers: Design and Implementation Issues. Santa Monica, CA: RAND
Corporation.
California Department of Education (2007a). 2006-2007 Adequate Yearly Progress
Report Information Guide. Retrieved February 2, 2008 from
http://www.cde.ca.gov/ayp
California Department of Education (2007b). 2006-07 APR Glossary – Growth API:
Glossary of terms for the Growth API section of the 2007 growth API report.
Retrieved February 4, 2008 from
http://www.cde.ca.gov/ta/ac/ap/glossary07e.asp#gg12
California Department of Education (2008a). 2007-2008 Academic Performance Index
Performance Reports Information Guide. Retrieved January 28, 2009 from
http://www.cde.ca.gov/ta/ac/ap/
California Department of Education (2008b). California Standards Tests (CSTs)
Technical Report: Spring 2007 Administration. Retrieved June 21, 2008
from http://www.cde.ca.gov/ta/tg/sr/documents/csttechrpt07.pdf
Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112, 155-159.
Dubner, S. J. & Levitt, S. D. (2005). Freakonomics: A Rogue Economist Explores the
Hidden Side of Everything. New York: Harper Collins Publishers.
Florida Department of Education (2006). Using Value Tables to Determine Teacher
Effectiveness in Florida. Retrieved June 22, 2008 from
http://www.fldoe.org/news/2006/2006_04_05/ValueTable.pdf
Goals 2000: Educate America Act of 1994, Pub. L. No. 103-227, 108 Stat. 125 (1994).
Goldschmidt, P., Roschewski, P., Choi, K., Auty, W., Hebbler, S., Blank, R. et al.
(2005). Policy Maker’s Guide to Growth Models for School Accountability:
How do Accountability Models Differ? Washington, D.C.: Council of Chief
State School Officers.
Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J.
L., Naftel, S., & Barney, H. (2007). Standards-based accountability under No
Child Left Behind: Experiences of teachers and administrators in three states.
Santa Monica, CA: RAND Corporation.
Hill, R., Gong, B., Marion, S., DePascale, C., Dunn, J., & Simpson, M. (2006). Using
value tables to explicitly value student growth. In R. W. Lissitz (Ed.)
Longitudinal and Value Added Models of Student Performance (pp. 255-291).
Maple Grove, MN: JAM Press.
Hocevar, D., Brown, R., & Tate, K. (2008). Leveled Assessment Modeling Project.
Unpublished manuscript.
Johnson, R. S. (2002). Using Data to Close the Achievement Gap: How to Measure
Equity in Our Schools. Thousand Oaks, CA: Corwin Press.
Kane, M. (2006). In praise of pluralism: A comment on Borsboom. Psychometrika, 71,
441-445.
King, B. M. & Minium, E. M. (2003). Statistical Reasoning in Psychology and
Education (4th ed.). Hoboken, NJ: John Wiley & Sons.
Linn, R. L. (2006). Educational accountability systems. CSE Technical Report 687.
Retrieved May 19, 2008 from
http://www.cse.ucla.edu/products/reports/R687.pdf
Linn, R. L., Baker, E. L., & Betenbenner, D. W. (2002). Accountability systems:
Implications of requirements of the No Child Left Behind Act of 2001. CSE
technical report 567. Retrieved May 19, 2008 from
http://www.cse.ucla.edu/products/reports/R567.pdf
Linn, R. L. & Miller, M. D. (2005). Measurement and Assessment in Teaching (9th
ed.). Upper Saddle River, NJ: Pearson Education.
Lissitz, R. W., Doran, H., Schafer, W. D., & Willhoft, J. (2006). Growth modeling,
value added modeling and linking: An introduction. In R. W. Lissitz (Ed.)
Longitudinal and Value Added Models of Student Performance (pp. 1-46).
Maple Grove, MN: JAM Press.
Lissitz, R. W. & Samuelsen, K. (2007). Dialogue on validity: A suggested change in
terminology and emphasis regarding validity in education. Educational
Researcher, 36, 437-448.
Lockwood, J. R., Louis, T., & McCaffrey, D. F. (2002). Uncertainty in rank
estimation: Implications for value-added modeling accountability systems.
Journal of Educational and Behavioral Statistics, 27, 255-270.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically
scaled student achievement data for growth-based, value-added accountability.
Journal of Educational and Behavioral Statistics, 31, 35-62.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L.S. (2003).
Evaluating value-added models for teacher accountability. Santa Monica, CA:
RAND Corporation.
McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., Louis, T. A., & Hamilton, L.S.
(2004). Models for value-added modeling of teacher effects. Journal of
Educational and Behavioral Statistics, 29, 67-101.
McCaffrey, D. F., Lockwood, J. R., Mariano, L. T., & Setodji, C. (2005). Challenges
for value-added assessment of teacher effects. In R. W. Lissitz (ed) Value
Added Models in Education: Theory and Application (pp. 111-144). Maple
Grove, MN: JAM Press.
Messick, S. (1989). Validity. In R. L. Linn (Ed.) Educational Measurement (3rd ed.,
pp. 13-103). New York: Macmillan.
Messick, S. (1994). Foundations of validity: meaning and consequences in
psychological assessment. European Journal of Psychological Assessment, 10,
1-9.
Messick, S. (1995a). Standards of validity and the validity of standards in performance
assessment. Educational Measurement: Issues and Practice, 14, 5-8.
Messick, S. (1995b). Validity of psychological assessment: Validation of inferences
from persons’ responses and performances as scientific inquiry into score
meaning. American Psychologist, 50, 741-749.
Meyer, R. (1997). Value-added indicators of school performance: A primer.
Economics of Education Review, 16, 283-301.
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425 (2002).
Public Schools Accountability Act of 1999, Ca. Educ. Code §§ 52051-52052.5.
Retrieved February 2, 2008 from http://www.leginfo.ca.gov/cgi-
bin/calawquery?codesection=edc&codebody=&hits=20
Robinson Kurpius, S. E. & Stafford, M. E. (2006). Testing and Measurement: A User-
Friendly Guide. Thousand Oaks, CA: Sage.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B.(2000). Contrasts and Effect Sizes in
Behavioral Research. United Kingdom: Cambridge University Press.
Sanders, W. L. & Horn, S. P. (1994). The Tennessee value-added assessment
system (TVAAS): Mixed-model methodology in educational assessment.
Journal of Personnel Evaluation in Education, 8, 299-311.
Sanders, W. L., Saxton, A. M., & Horn, S. P. (1997). The Tennessee Value-Added
Accountability System: A quantitative, outcomes-based approach to
educational assessment. In J. Millman (Ed.), Grading teachers, grading
schools: Is student achievement a valid evaluation measure? (pp. 137-162).
Thousand Oaks, CA: Corwin Press.
Secada, W. G., Chavez-Chavez, R., Garcia, E., Muñoz, C., Oakes, J., Santiago-
Santiago, I., et al. (1998). No More Excuses: The Final Report of the Hispanic
Dropout Project. Madison, WI: Wisconsin Center for Educational Research.
Stevens, J. (2005). The study of school effectiveness as a problem in research design.
In R. W. Lissitz (Ed.) Value Added Models in Education: Theory and
Application (pp. 166-208). Maple Grove, MN: JAM Press.
Stevens, J. & Zvoch, K. (2006). Issues for the implementation of longitudinal growth
models for student achievement. In R. W. Lissitz (Ed.) Longitudinal and Value
Added Models of Student Performance (pp. 170-209). Maple Grove, MN: JAM
Press.
Tekwe, C. D., Carter, R. L., Ma, C., Algina, J., Lucas, M. E., Roth, J. et al. (2004). An
empirical comparison of statistical models for value-added assessment of
school performance. Journal of Educational and Behavioral Statistics, 29, 11-
36.
The National Commission on Excellence in Education (1983). A Nation at Risk: The
Imperative for Educational Reform. Retrieved February 2, 2008 from
http://www.ed.gov/pubs/NatAtRisk/title.html
United States Department of Education (2007). Building on results: A blueprint for
strengthening the No Child Left Behind Act. Retrieved February 9, 2008 from
http://www.ed.gov/policy/elsec/leg/nclb/factsheets/blueprint.pdf
Abstract
In this study, the validity of applying growth modeling to the California Standards Test (CST) in assessing schools and academic programs is evaluated. The 2001-2002 freshman classes of 10 inner city high schools, 13,160 students, formed the study cohort. Evaluations from four value tables used to analyze the cohort’s growth were compared. The validity of using changes in CST scores to measure growth was analyzed, comparing the ten schools and four groups of students: African American students, and Hispanic students classified as Limited English Proficient (LEP), Reclassified Functionally English Proficient (RFEP), or English Only/Initially Functionally English Proficient (EO/IFEP). The utility of using CST growth scores was investigated by comparing them to marks, changes in marks, enrollment, attendance, and CST status scores. Evidence of a disparate impact of the No Child Left Behind (NCLB) Act on lower performing students was sought.