A COMPARISON OF VALUE-ADDED, ORDINARY LEAST SQUARES
REGRESSION, AND THE CALIFORNIA STAR ACCOUNTABILITY INDICATORS
by
Aime Black
A Dissertation Presented to the
FACULTY OF THE USC ROSSIER SCHOOL OF EDUCATION
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF EDUCATION
May 2012
Copyright 2012 Aime Black
DEDICATION
For my grandfather because he saw in me what took me until now to discover.
For my parents who have sacrificed in so many ways so that my siblings and I can live
out our dreams.
For my husband, John Black, who is the wind beneath my wings.
For our two beautiful and loving daughters, Sara and Shelby, who inspire me every day
and in every way.
This enormous accomplishment is a reflection of all of you, your love, and your unending
patience.
ACKNOWLEDGMENTS
To my dissertation committee members Dr. Richard Seder, Dr. Morgan Polikoff,
and my chairperson, Dr. Dennis Hocevar, and to my USC friends and family, I give my
sincere appreciation and deep-felt gratitude for the expertise, guidance, support, time, and
inspiration you have given me in order to bring this study to a successful completion.
TABLE OF CONTENTS
DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
ABSTRACT
CHAPTER ONE: THE PROBLEM
    Background of the Problem
    Statement of the Problem
    Purpose of the Study
    Significance of the Study
    Definition of Terms
CHAPTER TWO: REVIEW OF THE LITERATURE
    Evolution of Teacher Evaluation Practices
        Teacher Evaluation from 1900-1950
        Teacher Evaluation from 1950-1980
        Teacher Evaluation from the 1980s to the 21st Century
        Currently Operationalized Teacher Evaluation Model
    Major Historical Events Shaping Teacher Evaluation Practices
        The Coleman Report of 1966
        The National Assessment of Educational Progress (NAEP)
        A Nation at Risk of 1983
    Legislation Impacting Teacher Evaluation Practices
        No Child Left Behind (NCLB) Act of 2001
        Race to the Top (RTTT) Initiative of 2009
    Shift from Inputs to Outcomes in School Accountability Models
        Elementary and Secondary Education Act of 1965
        Improving America’s Schools Act of 1994
        No Child Left Behind Act – Accountability Provisions
        California’s Public Schools Accountability Act
    Test-Based Accountability and Evaluation Systems
        Status Model
        Improvement Model
        Growth Model
        Value-added Model
        Comparison of Different Accountability Models
    Socioeconomic Status (SES)
    Unique Contribution of Current Research
CHAPTER THREE: RESEARCH METHODOLOGY
    Research Design
    Population and Sample
    Instrumentation: School Characteristics Index (SCI)
    Instrumentation: Achievement Measures
        California Standards Tests (CSTs)
        California Alternate Performance Assessment (CAPA)
        California Modified Assessment (CMA)
        Accountability Reports
        Unadjusted Achievement Accountability Indicators
        Adjusted Achievement Accountability Indicators
    Procedures
    Data Analysis
        Reliability
        Validity
    Summary
CHAPTER FOUR: RESULTS
    Unadjusted Accountability Indicators
        Descriptive Statistics – Unadjusted Accountability Indicators
        Correlation of Unadjusted Variables and SCI
    Adjusted Accountability Indicators
        Descriptive Statistics of Adjusted Variables
        Reliability
        Validity
    Adjusted Grade Level Equivalent Scores
        Descriptive Statistics for AGLE Scores
        Reliability
        Validity
CHAPTER FIVE: DISCUSSION
    Summary of Findings
        Phase One: Unadjusted Achievement Indicators
        Phase Two: Adjusted Achievement Indicators
        Findings From the Second Phase of the Investigation
        Phase Three: Adjusted Grade Level Equivalent Scores
        Findings from the Third Phase of the Investigation: AGLEs
    Implications
    Future Research
    Conclusion
    Limitations and Delimitations
REFERENCES
LIST OF TABLES
Table 1: API Scores for Overall Student Group, Socioeconomically Disadvantaged (SED) Students, English Learners (ELs), and Students with Disabilities (SWDs) in LAUSD between 2006 and 2010
Table 2: State API Growth Targets
Table 3: California Content Standards Tests and Grade Levels at Which They are Administered
Table 4: School Content Area Weights for API Calculations for Grade Span K-8
Table 5: Performance Levels and Corresponding Weighting Factors
Table 6: Descriptive Statistics for Unadjusted Achievement Indicators for LAUSD Elementary Schools in 2010
Table 7: Skewness, Kurtosis, and z-Scores for the SCI and the Unadjusted Achievement Indicators for LAUSD K-5 Schools in 2010
Table 8: Pearson Product-Moment Correlations for API, AYP, and SCI
Table 9: Descriptive Statistics for Adjusted Accountability Indicators for LAUSD K-5 Elementary Schools in 2010
Table 10: Skewness, Kurtosis, and z-Scores for the Adjusted Achievement Accountability Indicators in 2010
Table 11: Internal Consistency Reliability of AGT Scores and ANCE Scores for 2010 (Correlation Between ELA and Math)
Table 12: Test-Retest Stability of Adjusted Accountability Indicators (Correlations Between 2009 and 2010)
Table 13: Discriminant Validity for Adjusted Accountability Variables for the 2010 School Year (Correlation with SCI)
Table 14: Convergent Validity for Adjusted Accountability Indicators for the 2010 School Year (Correlations Between Adjusted Indicators)
Table 15: Concurrent Validity for Adjusted Accountability Indicators for 2010 School Year (Correlation with API)
Table 16: Descriptive Statistics for Adjusted Grade Level Equivalent Scores in ELA and Math for K-5 Elementary Schools in the 2009 and 2010 School Years
Table 17: Internal Consistency Reliability of the AGLE Scores for K-5 LAUSD Schools (Correlation Between ELA and Math)
Table 18: Test-Retest Stability – Pearson Product-Moment Correlation Coefficients and Spearman-Brown Stability Coefficients for AGLE Scores in Grades 2 to 5
Table 19: Discriminant Validity – Pearson Product-Moment Correlation Coefficients for AGLE Scores for ELA and Math with the SCI for Grades 2 to 5
Table 20: Concurrent Validity – Pearson Product-Moment Correlation Coefficients for AGLE Scores in ELA and Math with API for Grades 2 to 5
ABSTRACT
Current models of evaluation and accountability utilize varying unadjusted
measures of student achievement to reward or sanction schools. These unadjusted
accountability indicators do not account for differences in student or school
characteristics that contribute to variations in assessment results. Since the Coleman
Report (1966), a guiding principle in accountability design has been that educational
outcomes data should be used only after the effects of institutional characteristics have
been statistically removed. Such indices are called adjusted indicators, where an
adjustment is either statistical or through aggregation.
The purpose of this study is to analyze the reliability (internal consistency and
test-retest) and validity (discriminant, convergent, and concurrent) of six available
accountability indicator systems: (a) API improvement scores, (b) API (SED) scores (SED = socioeconomically disadvantaged), (c) similar schools scores, (d) LA Times value-added
scores, (e) academic growth over time (AGT) value-added scores (ELA and math), and (f)
adjusted normal curve equivalent (ANCE) scores (ELA and math). Each system has been proposed as an adjunct to the currently operationalized school status (average achievement) scores.
The population included all K-5 elementary schools in LAUSD, specifically grades 2 through 5, for the 2009 and 2010 school years. Approximately 430 schools and their teachers were included in this research study.
An initial analysis indicated that unadjusted variables of student achievement were
highly correlated to each other and notably and significantly linked to poverty status. In
particular, the NCLB status model indicator, the Adequate Yearly Progress (AYP), and
the PSAA improvement model indicator, the Academic Performance Index (API), which
are used as primary indicators in the California accountability systems to hold schools
accountable, were substantially correlated to socioeconomic status.
Subsequent to the initial analysis, the reliability and validity of the six different
adjusted accountability indicator systems were investigated. A first conclusion is that the
only justifiable methods for evaluating schools are ANCE scores and similar school scores
that have been adjusted for institutional characteristics using ordinary least squares (OLS)
regression. Of the two justifiable methods, ANCE scores can be disaggregated into ELA
and math components and therefore are preferred.
The third phase of this study examined the reliability (internal consistency and
test-retest stability) and the validity (discriminant and concurrent) of adjusted grade level
equivalent (AGLE) scores. AGLE scores are grade level ANCE scores in ELA and math
that have been adjusted for institutional characteristics using ordinary least squares (OLS)
regression. Results from this study suggested that AGLE scores are reliable and valid for
use in holding grade-level teams, which are ignored under NCLB and PSAA, accountable
for enhanced learning outcomes. Furthermore, AGLE scores can be simply averaged to
generate reliable and valid school composite scores in an NCE metric.
CHAPTER 1
THE PROBLEM
The achievement gap among our nation’s students continues to capture the
attention of policymakers, researchers, and practitioners. In their analysis of the consequences of the achievement gap for graduation rates, adequate employment attainment, and competitiveness with students of other developed and developing nations, McKinsey and Company (2009) concluded that the
unyielding academic achievement gap inflicts upon the “United States the economic
equivalent of a permanent national recession” (p. 6). Certainly, the unrelenting disparity
in academic outcomes among our American children is cause for great concern.
Accordingly, ameliorating the performance gap is a top priority on the education reform
agenda in the United States.
In an attempt to close the achievement gap, researchers and policymakers have
begun to look more critically at the link between teacher effectiveness and student
outcomes. A large body of literature has concluded that teachers do matter (Ballou,
Sanders & Wright, 2004; Hanushek, 1992; Marzano, 2003; McCaffrey, Koretz,
Lockwood, & Hamilton, 2003; Mullens, Leighton, Laguarda & O’Brien, 1996; Sanders,
2000; Sanders & Horn, 1998; Sanders & Rivers, 1996; Sanders, Wright, & Horn, 1997).
Teachers play a significant role in student achievement – whether for good or bad, and in
varying amounts (Sanders & Horn, 1998; Goldhaber, Brewer, & Anderson, 1999; Rivkin,
Hanushek, & Kain, 2005). On this basis, the new generation of accountability is focused
squarely on teachers to close the performance gap plaguing our education production
process. The measure of each teacher’s effectiveness in improving student performance
has increasingly been the focus of policy debate. Aside from the current method of performance evaluation, other approaches have been suggested to generate a
performance evaluation design that is coherent and well aligned with the intended
purpose – increased student achievement. Further, the current trend in accountability, as
evidenced by the requirements of the No Child Left Behind (NCLB) Act of 2001 and the
Race to the Top (RTTT) initiative of 2009, has favored linking student assessment scores
to teacher performance. To this end, a sound comparison of the various performance
evaluation approaches can enhance the policy- and decision-making processes that can
ultimately exert a critical impact (positively or negatively) on the teaching and learning
that occurs in our classrooms.
Background of the Problem
The salient discourse on performance evaluation and the need for robust,
coherent, and well-aligned accountability models have evolved from several significant
historical events directly related to student achievement. The following sections will
briefly describe the events that provide background to the problem, such as the
Coleman Report (1966) and A Nation at Risk (1983). In addition to these reports, research
on teacher effectiveness and details of the federal and state accountability laws and
initiatives will be given to build the foundation for the focus of this current study.
The Coleman Report was the first large-scale commissioned report initiated by
Section 402 of the Civil Rights Act of 1964 to examine disparities in education
(Coleman, 1966). The Coleman Report represented an enormous research study aimed at
illuminating concerns over the lack of equal educational opportunities for
individuals as a result of race, color, religion, or national origin in public schools at all
levels and in all U.S. territories. Focused on equity in funding educational opportunities
for disadvantaged minority student groups, Coleman (1966) found that what mattered
most in influencing student achievement was the family and, to a certain extent, student
peers. In other words, school factors were not correlated to student outcomes; schools did
not make a difference. As a result of this finding, the remedy for improved student
outcomes focused on desegregation policies. The theory of action behind the design of
the desegregation policies was that if student performance could be enhanced by the background characteristics of peers attending the same school, then integration would consequently
improve student learning and thus close the achievement gap for disadvantaged minority
student groups.
Then, in 1983, A Nation at Risk, another commissioned report, placed a glaring
spotlight on the U.S. education production process and warned of “a rising tide of
mediocrity that threatens our very future as a Nation and a people” (The National Commission on Excellence in Education, 1983, p. 1). America’s once-secure preeminence in the world, the report warned, had eroded. According to the commissioned
report, student achievement scores were falling behind those from other developed and
developing nations. Compared to their global peers, students in the U.S. were also
lacking adequate opportunities to acquire essential basic skills that would edge them
ahead in the competitive international labor market.
As a result of the findings from A Nation at Risk (1983), student performance
once again came to the forefront of the policy debate agenda. The U.S. government was
catalyzed into action to increase individual student academic achievement and raise
expectations within its schools. Since the publication of A Nation at Risk in 1983, public
school education has been under unprecedented scrutiny by the national government.
Schools and their teachers became the critical units of analysis for the lack of appropriate
student achievement. Efforts ensued to hold teachers and schools accountable not only for increased quality in education, but also for measurable outcomes for all students.
Consequently, teacher evaluations significantly grew in importance.
Validating the federal and state governments’ scrutiny of teachers and schools as the primary agents of change in student achievement, research performed
during and since the time of the Coleman Report (1966) leading up to A Nation at Risk
(1983) suggested that school factors can influence student outcomes; schools can make a
difference, especially for disadvantaged students (Hanushek, 1992). In fact, a large body
of research (Ballou et al., 2004; Hanushek, 1992; Marzano, 2003; McCaffrey et al., 2003;
Mullens et al., 1996; Sanders, 2000; Sanders & Horn, 1998; Sanders & Rivers, 1996;
Sanders et al., 1997) found that teachers do have a direct impact on student achievement
whether negatively or positively. Of the elements under the purview of schooling
organizations – curriculum, class size, and other similar factors – teachers have been
identified as the vital component for closing the performance gap. Further, teacher impact
on student outcomes can be enduring and cumulative.
Fueled by (a) the Coleman Report’s (1966) call for equity in educational opportunities for disadvantaged student groups; (b) the lack of adequate progress in
student outcomes as reported by the 1983 commissioned report, A Nation at Risk; and (c)
the growing body of research evidence on the critical role teachers play in
enhancing student achievement, policymakers intensified the glaring spotlight on
practitioners of the education environment to ensure all students achieve at the highest
level. Accordingly, several federal laws were enacted. The first was the 1994 passage of the Goals 2000 legislation, along with the reauthorization of the Elementary and Secondary Education Act (ESEA), which established the standards-based framework mandating that states develop content standards and accountability systems for their K-12 schools (One Hundred Third Congress of the United States, 1999). Next, the NCLB Act
of 2001 was passed with significant changes to the previous 1994 laws. Then, in 2009,
RTTT was initiated as a strong inducement for states and school districts to develop
highly rigorous academic standards and to utilize measured student assessment data in
teacher evaluation practices. In addition to these education initiatives taken by the
national government, states and local governments have also paved their own paths
toward improving student outcomes.
The NCLB Act of 2001 – Public Law No. 107-110, 115 Stat. 1425 – was passed
in an attempt to close the achievement gap among the majority and minority races,
among students of various abilities, and among students from different poverty levels. Holding
states and districts accountable, NCLB mandated that 100% of the student population,
regardless of ability, race, English proficiency, migrant status, poverty level, or gender,
demonstrate mastery at or above the proficiency level on state standards as measured by
state assessments by the end of the 2013-2014 academic year. Therefore, in an effort to
leave no child behind, a single statewide accountability system was applied to all students
regardless of their initial achievement level. Moreover, schools or local education
agencies (LEAs) of the same type (elementary, middle, and high schools) must meet the
same academic targets throughout the state, regardless of their baseline levels of
achievement. To this end, all K-12 learning institutions must satisfy the state’s
annually projected Adequate Yearly Progress (AYP) toward reaching the 100%
proficiency goal for all student groups within 12 years to meet the federal NCLB
mandate.
Under the NCLB accountability model, schools are held accountable for the
overall performance (status) of all students each year. School progress is measured by the
percentage of students who scored at the proficient level or above on the yearly
summative assessments in English language arts and math. In addition, the difference in
the Academic Performance Index (API) – gain or loss – made by a school from one year
to the next, thus from one cohort of students to a different cohort of students in the next
year, is another measure of school performance included in the AYP. Further,
participation rate and high school graduation rate are also factored into a school’s overall
AYP. Rewards and sanctions are tied to a school’s success or failure in attaining appropriate AYP progress.
In addition to the standards-based accountability measures, NCLB mandated that
a “highly qualified” teacher be present in each classroom. To meet the NCLB standard of
a “highly qualified” teacher, individuals must have a bachelor’s degree, be state-certified,
and prove they know the subject(s) they teach, either by satisfying minimum course-
taking requirements or passing a test in the subject(s) they teach (No Child Left Behind
[NCLB] Act of 2001, 2002). Furthermore, policymakers concerned with equitable
distribution of teachers also made sure that the NCLB reauthorization of the ESEA
specifically called for states to identify and address the inequitable distribution of highly
qualified, experienced teachers.
Research has shown that the distribution of highly qualified teachers has not
favored disadvantaged students (Murnane & Steele, 2007; Weisberg, Sexton, Mulhern, &
Keeling, 2009). Murnane and Steele (2007) and Weisberg et al. (2009) also concluded
that low-income and minority students have been disproportionately taught by under-
qualified teachers, including teachers who are out-of-field, inexperienced, or fail to meet
their state’s teacher licensing and certification standards. By comparison, poor and
minority students have been more likely to be taught by teachers who are not as well
qualified as teachers in more affluent areas with fewer minority students. Title I, Part A,
Section 1111(b)(8)(C) of ESEA required that states “ensure that poor and minority
children are not taught at higher rates than other children by inexperienced, unqualified,
out-of-field teachers.” The federal government emphasized its view on the importance of
equitable teacher distribution by making funds available through the American Recovery
and Reinvestment Act (ARRA) of 2009, with the requirement that states make progress
on key education reforms, including equitable distribution of qualified teachers (U.S.
Department of Education, 2009).
Even prior to NCLB, California had to tackle the “mounting concerns about the
quality of California’s public schools” (Woody, Buttles, Kafka, Park, & Russell, 2004, p.
11). Low test scores caused California to tie for last place on the National Assessment of
Educational Progress (NAEP), which catalyzed state policymakers to respond to its
educational mediocrity. As a result, major elements of the state’s educational
accountability system were passed with the implementation of the Public Schools
Accountability Act (PSAA) of 1999. As is evident, in addition to the national
accountability efforts to ensure that each child is provided with a quality education that is
capable of producing measurable outcomes, the state of California also institutionalized
its own accountability measures.
The PSAA was a major accountability initiative operationalized in California to
re-prioritize the state’s education reform agenda “to measure and help improve the
academic achievement of California’s 6.3 million public school students enrolled in
nearly 10,000 schools in over 1,000 local educational agencies” (California Department
of Education, 2010a, p. 1). A statewide testing program was instituted with corresponding
rewards and sanctions attached to incentivize appropriate levels of accountability for
student (and school) progress. Currently, the state is required by law to publish before the
start of each school year an accountability progress report illustrating school progress as
measured by state and federal accountability indicators, which are based on student
performance from the preceding year. The current accountability model in California
utilizes state testing results from different groups of students from year to year to make judgments about student (and school) status and progress.
Subsequent to the NCLB federal act and the PSAA state law, the RTTT initiative
was put in place in 2009. Under the Barack Obama administration, RTTT was
implemented as a competition for school districts and states to rethink and redesign their
education reform agenda in an effort to garner needed financial support for their
education production system. School districts and states were asked to design and present
an educational reform agenda, which focuses on, among other components, a viable
approach to attracting and keeping great teachers and leaders in the education landscape.
In addition, the plan must comprise an appropriate strategy to revise teacher evaluation,
compensation, and retention to encourage and reward effectiveness. A competitive design
must also ensure that the most talented teachers are placed in the schools and subjects
where they are needed the most. Most importantly, the RTTT’s $4.35 billion reward
package funded from the ARRA is presented as an inducement for school districts and
states to commit to not having any legal, statutory, or regulatory barriers to linking data
from student achievement or student growth to teachers and principals for evaluation
purposes (U.S. Department of Education, 2009).
In light of these laws and initiatives, states like California and school districts like
Los Angeles Unified School District (LAUSD) that have been afflicted by economic downturn have had to reconsider their education reform agendas. Compounding the need
for financial support, urban districts like LAUSD are also faced with a large diverse
student population, which has demonstrated a persistent student achievement gap (see
Table 1). Taken together, with the emphasis on holding educators and schools more
accountable for student outcomes, alternate forms of accountability have been vigorously
pursued by many school districts and certainly by LAUSD to increase student academic
achievement. Consequently, student assessment data from LAUSD’s annual state testing have been analyzed in various ways to generate a robust, coherent, and meaningful method of performance evaluation that can support increased student academic attainment. One of these approaches is the value-added
approach – a type of growth model, which controls for student background
characteristics, prior achievement and/or other factors not under the purview of the
schooling systems. The premise of value-added models is that estimates of teacher and school effectiveness as they relate to student learning outcomes can be properly assessed only after student, teacher, and school characteristics known to confound performance evaluation have been mitigated and adjusted for. This is extremely critical and warranted
when high stakes are involved.
Table 1
API Scores for Overall Student Group, Socioeconomically Disadvantaged (SED) Students, English
Learners (ELs), and Students with Disabilities (SWDs) in LAUSD between 2006 and 2010
Reporting Year All Students SED Students ELs SWDs
2006 658 638 611 461
2007 664 644 611 468
2008 683 664 628 485
2009 694 676 634 485
2010 709 691 644 501
Source: California Department of Education (n.d.).
Statement of the Problem
Clearly, the government has taken on an unprecedented role in shaping public
school education since the release of the Coleman Report (1966) and A Nation at Risk
(1983). Federal and state accountability systems have generated various models for
evaluating performance in an effort to improve student achievement and boost teaching
quality. Undoubtedly, the government’s increased scrutiny and the intended purposes of
the high-stakes accountability system of the 21st century have enormous potential in
enhancing the quality of education for all students in our nation. While there are certain
benefits associated with the vision of high expectations, there are barriers impeding the
successful attainment of the intended goals – improving teaching and increasing student
learning.
Research indicates that there is a strong correlation between teacher effectiveness
and student achievement (Kane, Rockoff, & Staiger, 2008; Rivkin et al., 2005; Nye,
Hedges, & Konstantopoulos, 2004). That is, of all factors under the purview of schools,
those closest to the instructional core – the teachers – play the most critical role in
impacting student outcomes. The problem is that the current systems of evaluation are not fully coherent or well aligned with the intended purposes of accountability. The information
value generated from the current evaluation models fails to provide cogent educative
feedback to improve teaching and thus enhance student learning.
For one, the currently operationalized model of classroom observations by school
site administrators is a process that is at best perfunctory (Weisberg et al., 2009). The
frequency and duration of these observational models of performance evaluation vary
depending on district and union negotiations. Under the binary rating system inherent in the observational models, utilized widely among the states, in which teachers are categorized as “satisfactory” or “unsatisfactory,” virtually all teachers (99%) receive the
satisfactory rating (Weisberg et al., 2009). When a more detailed set of ratings is in place,
still, about 94% of the teachers, on average, receive one of the top two ratings (Weisberg
et al., 2009). When these results are examined in light of the enormous achievement gap,
especially in districts like LAUSD where 99.3% of the teachers receive the higher rating
in the binary rating system (Teacher Effectiveness Task Force: Los Angeles Unified
School District Final Report, 2010), there is an obvious misalignment and a clear lack of coherence.
Secondly, the trend in the test-based accountability systems has shifted to linking
teacher performance to student test scores. States are mandated by federal legislation like
the NCLB Act of 2001 to utilize the status model of accountability to estimate and report
student and school progress. In California, because policymakers recognize the vastly
distinctive characteristics of the student groups presently being taught in the schools in
the state, the NCLB status model is supplemented with another accountability model –
the improvement (or successive groups) model – under the state’s PSAA. Although the
status model is able to communicate to stakeholders information on student and school
status in an easily interpretable manner, it has been found to produce results that are more
favorable towards more affluent schools, leaving schools with more diverse student populations at a disadvantage (Goldschmidt & Choi, 2007). Moreover, both the
status and the improvement models of accountability use achievement indicators that do
not take into consideration many factors – such as student background, previous level of
educational attainment, language ability, or socioeconomic status – which may cause
variations in comparing one cohort of students from one year to another cohort of
students in another year (Goldschmidt et al., 2005). Furthermore, as Carlson (2000) has
discovered, the varying estimates of performance progress produced by these two models, and by accountability systems more generally, can generate more confusion than clarity for teachers and schools working to improve their practice for the purposes of enhancing student achievement.
Another problematic issue in the current accountability model is the lack of
connection between teacher impact and student achievement. The correlation between
teacher effectiveness and student learning is not directly accounted for in the current
accountability model as it places emphasis only on school and district level status (and
yearly improvement) for sanctions or rewards (California Department of Education,
2010a). Annual student assessment scores have not been used to hold teachers and grade-
level teams accountable in the California schooling system; nor have they been
adequately informative to incentivize teachers and teacher teams to make adjustments to
their learning environment in order to enhance student achievement.
Value-added models have been introduced into the accountability systems in
California school districts to address the challenges of the status and improvement
models and to also bring the level of analysis to the teacher level. The intent of policymakers is to make the estimates of teacher and school effects generated by value-added models a critical component of evaluation and accountability processes.
However, value-added models come with several limitations. They require
longitudinal data, some require vertical scaling, and the estimates produced by the
various value-added models are variable from year to year (Corcoran, 2010; Drury &
Doran, 2003; Xu, 2000; Shepard, Kupermintz, & Linn, 2000).
At the heart of all accountability is the desire “that accountability results are
accurate and that valid inferences and good decisions can be made based on those results”
(Goldschmidt & Choi, 2007, p. 2). Despite the potential benefits of the current evaluation
models to heighten practices for increased student outcomes, the design and the
information generated may not accurately capture the impacts of teacher effectiveness.
Additionally, there may be unintended consequences imposing threats to teacher impact
as it correlates to student learning. In short, there is a lack of robust performance evaluation models, linked to student achievement, for measuring the effectiveness of teachers, grade-level teams, and schools.
Purpose of the Study
The purpose of the current study is to build upon past research on comparing
evaluation and accountability models as they relate to linking teacher and school
performance to student assessment results. Specifically, the goal of this study is to
critically evaluate the following research questions in order to produce relevant
information value for policy and practical decision-making processes as they relate to
enhanced teaching and improved learning:
1. Within the constraints of K-5 elementary schools in LAUSD, to what extent are the NCLB status indicator – Adequate Yearly Progress (AYP) – and California’s PSAA improvement indicator – Academic Performance Index (API) – correlated to the school characteristics index (SCI), and to what extent are these two unadjusted achievement indicators inter-correlated?
2. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the adjusted indicators named below reliable (internal consistency
reliability and test-retest stability) and valid (discriminant, convergent, and
concurrent) for use as they relate to performance evaluation and
accountability:
a. API improvement scores
b. API (SED) scores (SED = socioeconomically disadvantaged)
c. Similar schools scores
d. LA Times value-added scores
e. Academic growth over time (AGT) value-added scores – ELA and
Math
f. Adjusted normal curve equivalent (ANCE) scores – ELA and Math
3. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are adjusted grade-level equivalent (AGLE) scores reliable (internal
consistency reliability and test-retest stability) and valid (discriminant and
concurrent) for use as they relate to performance evaluation and
accountability.
Significance of the Study
A large body of empirical research has demonstrated the positive correlation between teacher effectiveness and student achievement. Education reform efforts such as
the current high-stakes accountability system are institutionalized as a result of a common
goal – improve student achievement and ultimately close the persistent achievement gap.
As such, in order to ensure that each student is afforded an appropriate opportunity to
learn to meet adequate progress, it is critical that each classroom be staffed with an
effective teacher. There is a pressing need to operationalize performance evaluation
systems that adequately discern between teachers who are effective at enhancing student
learning gains and those who are not. Moreover, the information value generated by
performance evaluation systems to inform key decisions such as teacher assignment,
professional development, compensation, retention, and dismissal is extremely critical for
current education reform efforts. The generation of reliable information that can be used
in a valid manner is essential for human capital decision-making processes for schools
and districts. In other words, in order to design and implement an enduring approach, it is
imperative that the indicators used are judged as reliable and valid for their intended use.
A robust performance evaluation system can enhance the ability of schools and districts
to modify teacher compensation systems in order to retain highly effective teachers,
displace those who are consistently not producing adequate learning gains in students,
and target professional development for underperforming teachers.
This area of research is growing for large urban districts like LAUSD. As
policymakers contemplate more vigorously implementing performance evaluation processes for estimating teacher effectiveness and the possibility of moving
from a single-salary schedule to a merit pay system, it is critical that the value of each
performance evaluation system be critically analyzed. In an effort to improve student
learning, districts like LAUSD are turning to researchers to provide theoretical and
empirical evidence needed to make high-stakes decisions that directly involve those
closest to the instructional core – the teachers – who have the greatest impact on student
outcomes.
Prior studies have examined various aspects of performance evaluation
systems. Some have looked into the use of portfolios, various forms of rating scales,
variations in observations by site administrators (Weisberg et al., 2009), value-added
models for evaluating effectiveness (at the teacher and school levels) (Buddin, 2010;
Meyer, 1995; Sanders & Horn, 1994), and variations in the use of achievement indicators
in assessing student and school performance (Choi, Seltzer, Herman, & Yamashiro, 2007;
Webster, Mendro, Orsak, & Weerasinghe, 1998).
The present study stands in a unique position, as it is the goal of the researcher to
illuminate the information value that can be generated for grade-level teams using student
assessment data. The performance of grade-level teacher teams has been found to exert a positive impact on student gains, but research on performance evaluation to this point has focused mainly on teachers and/or schools as the unit of analysis. There is a lack of completed research analyzing the contributions of grade-level teams, a numerically significant subgroup in the education production process.
The impetus of this study is to place the spotlight on the correlation among the
achievement indicators that are in use and those available for use in an attempt to
evaluate their comparative information value. In analyzing the information generated
from the current and alternative performance evaluation processes, this study attempts to
answer an over-arching question currently clogging the education reform policy design
process related to performance evaluation and accountability: how do the up-and-coming methods used for estimating teacher effectiveness (and that of grade-level teams) compare to currently operationalized evaluation practices? Overall, this study seeks to add
to the growing area of research by examining the extent to which the approaches to
performance evaluation and accountability models in large urban districts, where an overwhelming amount of the achievement disparity is evidenced, are reliable and, thus, whether valid inferences can be made for the purposes of improving teaching practices, enhancing student learning, and making high-stakes decisions.
Definition of Terms
The following are operational definitions of terms that are used extensively
throughout this study. More context will be provided in the following chapters.
Academic Performance Index (API). The API is a single number, ranging from a
low of 200 to a high of 1000, which reflects a school’s, an LEA’s, or a subgroup’s
performance level, based on the results of statewide testing. Its purpose is to measure the
academic performance and growth of schools. The API was established by the PSAA, a
landmark state law passed in 1999 that created a new academic accountability system for
K-12 public education in California.
The API is calculated by converting a student’s performance on statewide
assessments across multiple content areas into points on the API scale. These points are
then averaged across all students and all tests. The result is the API. An API is calculated
for schools, LEAs, and for each numerically significant subgroup of students at a school
or an LEA (California Department of Education, 2010c).
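To make the computation concrete, here is a minimal sketch of the averaging logic described above. It is an illustration, not the CDE's official implementation: the performance-level point values follow the API weighting factors published for the 200-1000 scale (see Table 5), while the content-area weights shown are hypothetical placeholders for the official K-8 weights in Table 4.

```python
# Minimal sketch of the API averaging logic (not the CDE's official
# implementation). Performance levels map to point values on the 200-1000
# scale; per-area means are combined with content-area weights.
LEVEL_POINTS = {
    "far below basic": 200,
    "below basic": 500,
    "basic": 700,
    "proficient": 875,
    "advanced": 1000,
}

# Hypothetical content-area weights; the official K-8 weights are in Table 4.
CONTENT_WEIGHTS = {"ela": 0.6, "math": 0.4}

def api(results):
    """results: content area -> list of per-student performance levels.
    Returns an API-style school score between 200 and 1000."""
    total = 0.0
    for area, levels in results.items():
        points = [LEVEL_POINTS[level] for level in levels]
        total += CONTENT_WEIGHTS[area] * (sum(points) / len(points))
    return round(total)

print(api({
    "ela": ["basic", "proficient", "advanced", "below basic"],
    "math": ["proficient", "proficient", "basic", "far below basic"],
}))  # -> 726 for this four-student example
```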
Academic Growth over Time (AGT) Value-added Scores. Academic growth over
time is a statistical model used to identify the individual impact of teachers (or school
leaders or entire schools) in LAUSD on student learning. The AGT model compares the
performance of each teacher’s students to that of teachers with similar students. The AGT
model uses students’ standardized test scores in combination with student demographics
to create growth predictions. AGT measures longitudinal student growth through the use
of a multivariate regression model based on test scores (prior and current) as well as
student characteristics controlling for non-school factors such as English language
proficiency, cognitive abilities, access to books/computers at home, parental support with
homework, and so on. Currently, AGT scores are only generated in the areas of English
language arts and math for the evaluation of teachers, site leaders, and schools. (AGT
scores were generated for LAUSD by the Value-added Research Center – VARC)
(Battelle for Kids, 2011).
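As a rough illustration of the general form of such a covariate-adjusted growth model (a sketch only, not VARC's actual AGT specification; the data file and column names are hypothetical), one could regress current scores on prior scores and student characteristics and average each teacher's residuals:

```python
# Sketch of a covariate-adjusted growth model in the general spirit of AGT:
# regress current-year scores on prior-year scores plus student
# characteristics, then average each teacher's residuals. This is not
# VARC's actual specification; the file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # one row per student

# Predicted 2010 score given 2009 score and non-school factors.
model = smf.ols(
    "score_2010 ~ score_2009 + C(el_status) + C(sed_status) + C(swd_status)",
    data=df,
).fit()

# Value-added estimate per teacher: the mean residual of that teacher's
# students (scores above prediction pull the estimate up, and vice versa).
df["residual"] = model.resid
teacher_effects = df.groupby("teacher_id")["residual"].mean()
print(teacher_effects.sort_values(ascending=False).head())
```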
Adjusted Achievement Indicators. Adjusted achievement (assessment) indicators
are variables used for evaluating performance, which account for various factors
contributing to differences in student test scores that may or may not be under the control
of the teacher, school, district, or LEA.
20
Adjusted Grade Level Equivalent (AGLE) Scores. Grade level equivalent scores
are equal interval standard scores that indicate a grade standing relative to 99 “grade level
tiers.” Grade level tiers range from one to 99 (e.g., 2.01 to 2.99 for second graders), and
are computed and compared relative to all students who are in the respective statistically
matched subgroup at each level. AGLE scores are computed similarly to adjusted normal
curve equivalent (ANCE) scores (defined below). In this study, AGLE scores are used for
grade level analysis (Hocevar, 2010; Hocevar, Brown, & Tate, 2008).
Adjusted Normal Curve Equivalent Scores. Adjusted normal curve equivalent
(ANCE) scores are scores that have been scaled in such a way that they have a normal
distribution, with a mean of 50 and a standard deviation of 21.06 in the normative sample
for a specific grade. ANCE scores range from one to 99. They appear similar to
percentile scores, but they have the advantage of being based on an equal interval scale.
That is, the difference between two successive scores has the same meaning throughout
the scale. They are useful for making meaningful comparisons between different
achievement tests and for statistical computations, such as determining an average score
for a group of students. In this study, ANCE scores are used for school level comparison
(Hocevar, 2010; Hocevar et al., 2008).
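Concretely, an NCE is defined by NCE = 50 + 21.06z, where z is the normal deviate of the percentile rank, so percentiles 1, 50, and 99 map onto NCEs 1, 50, and 99. The sketch below shows that conversion, plus one plausible way of residualizing NCE scores on the SCI with OLS regression as the abstract describes; the re-centering at 50 is an assumption, not the study's documented procedure.

```python
# Percentile -> NCE conversion, plus a plausible OLS "adjustment" on the
# SCI (the re-centering at 50 is an assumption, not the study's documented
# procedure).
import numpy as np
from scipy import stats

def nce(percentile):
    """NCE = 50 + 21.06 * z; percentiles 1, 50, 99 map onto NCEs 1, 50, 99."""
    return 50 + 21.06 * stats.norm.ppf(percentile / 100.0)

print(round(nce(75), 1))  # ~64.2: the 75th percentile on an equal-interval scale

def adjusted_nce(nce_scores, sci):
    """Residualize school NCE scores on the SCI, then re-center at 50."""
    slope, intercept, *_ = stats.linregress(sci, nce_scores)
    residuals = np.asarray(nce_scores) - (intercept + slope * np.asarray(sci))
    return residuals + 50
```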
API Improvement Scores. The API improvement scores reflect the value of the
difference between the Base API and the Growth API for a school, district, or LEA
(California Department of Education, 2010c).
API (SED) Scores (SED = Socioeconomically Disadvantaged). The API (SED) score
is the target API for the socioeconomically disadvantaged student group for a school,
district, or LEA (California Department of Education, 2010c).
Adequate Yearly Progress (AYP) Proficiency Scores. AYP proficiency scores
represent the ratio of students scoring proficient or advanced on the state standardized
tests in ELA and math to the total number of students tested in a given year (California
Department of Education, 2010a).
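For example, under this definition a hypothetical school at which 300 of its 420 tested students score proficient or advanced in ELA would have an ELA AYP proficiency score of 300/420, or roughly 71%.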
Concurrent Validity. Concurrent validity means that a particular measure or set of
assessment results varies directly with another measure or test of the construct. It can also be indicated by an inverse correlation with a measure of the opposite of the construct (Lissitz & Samuelson, 2007).
Convergent Validity. Convergent validity is defined as the degree to which an
achievement indicator is similar to other achievement indicators (American
Psychological Association, 1999; Lissitz & Samuelson, 2007).
Discriminant Validity. Discriminant validity is defined as the extent to which an
achievement indicator is not similar to (or diverges from) another variable to which it
theoretically should not be similar (Lissitz & Samuelson, 2007).
Growth Models. Growth models generally refer to models of education
accountability that measure progress by tracking the achievement scores of the same
students from one year to the next with the intent of determining whether or not, on
average, the students made progress (Goldschmidt et al., 2005).
Improvement Models. Improvement models of accountability measure change
between different groups of students (e.g., fourth grade students in 2004 versus fourth
grade students in 2005) (Goldschmidt et al., 2005).
Internal Consistency. Internal consistency is the extent of inter-item correlation
(American Psychological Association, 1999; Lissitz & Samuelson, 2007).
Los Angeles (LA) Times Value-added Scores. The LA Times value-added scores
generated for LAUSD elementary school teachers were calculated using a lagged
achievement model. This longitudinal analysis of student academic growth controlled for
prior achievement in ELA and math and other moderating variables like those that
inherently impede or accelerate academic gain regardless of teacher performance and
peer characteristics in order to isolate teacher effects on student outcomes (Buddin,
2010).
Reliability. Reliability is a measure’s ability to repeatedly yield consistent results
(American Psychological Association, 1999; Choi, Goldschmidt, & Yamashiro, 2005).
Similar Schools Scores. Using the school characteristics index (SCI), a composite
of the school’s demographic characteristics, and the Base API, the similar schools scores
are calculated among a comparison group of 100 similar schools for the varying school
types – elementary, middle, and high school (California Department of Education,
2010d).
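A minimal sketch of this kind of comparison follows; it assumes the comparison group is the 100 schools with SCI values nearest the focal school's and that ranks are reported in deciles, which may not match the CDE's exact procedure.

```python
# Sketch of a similar-schools comparison: rank a school's Base API among
# the 100 schools with the nearest SCI values, reported as a decile. The
# decile convention is an assumption, not the CDE's documented procedure.
import numpy as np

def similar_schools_rank(i, sci, base_api, n_similar=100):
    """Decile (1-10) of school i's Base API among its SCI-nearest peers."""
    sci, base_api = np.asarray(sci), np.asarray(base_api)
    peers = np.argsort(np.abs(sci - sci[i]))[:n_similar]  # includes school i
    frac_below = (base_api[peers] < base_api[i]).mean()
    return min(int(frac_below * 10) + 1, 10)
```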
Stability. Stability is consistency over time (Lissitz & Samuelson, 2007; Pedhazur
& Schmelkin, 1991). In other words, it is the extent to which the same results are readily
reproduced by an accountability measure. For this study, stability is the measure of
consistency among scores as correlated among the same variable type over a two-year
period.
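As a small worked illustration of the two stability statistics this study reports (see Table 18), the sketch below computes the Pearson correlation between hypothetical 2009 and 2010 school scores and the Spearman-Brown coefficient for the two-year composite, 2r / (1 + r).

```python
# Test-retest stability (Pearson r between years) and the Spearman-Brown
# coefficient for a two-year composite, 2r / (1 + r). Scores are hypothetical.
from scipy import stats

scores_2009 = [48.2, 55.1, 41.7, 60.3, 52.8]
scores_2010 = [50.0, 53.9, 44.1, 58.7, 51.2]

r, _ = stats.pearsonr(scores_2009, scores_2010)
print(round(r, 2), round(2 * r / (1 + r), 2))
```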
Status Models. A status model (such as AYP under NCLB) takes a snapshot of a
subgroup or school’s level of student proficiency at one point in time (or an average of
two or more points in time) and often compares that proficiency level with an established
target (Goldschmidt et al., 2005).
Unadjusted Achievement (Assessment) Indicators. Unadjusted achievement
indicators are those used in the accountability systems, which do not take into account or
control for variations in student, teacher, or school characteristics that may impede or
accelerate academic gain.
Valid. A valid measure is one that measures what it was purported to measure
(American Psychological Association, 1999; Choi et al., 2005). In this study, the
researcher examined the information value that is generated from varying accountability
indicators and the extent to which these measures lead to valid conclusions of teacher,
grade-level team, and school effectiveness.
Value-added Models. Value-added models are one type of growth model in which
states or districts use student background characteristics and/or prior achievement and
other data as statistical controls in order to isolate the specific effects of a particular
school, program, or teacher on student academic progress (Goldschmidt et al., 2005).
Organization of the Study
Chapter 1 provided an introduction, the background of the problem, the statement
of the problem, the purpose of the study, the significance of the study, and the key
operational terminologies used throughout this study. Chapter 2 will begin with a
discussion on the evolution of teacher evaluation practices. Further, chapter 2 will
examine the major factors that have influenced the trend in teacher evaluation processes
and school accountability models from input-focused to outcome-focused. This will be
followed by an in-depth discussion of the varying test-based accountability models
currently in use or available for use. Chapter 2 will close with a discussion of why it is
critical to examine socioeconomic status in accountability. Chapter 3 follows and will
explicate the methodologies used to address the research questions of this study. Chapter
4 will detail the results of the analysis that was guided by the research questions of this
study. Lastly, chapter 5 concludes the study with a discussion of the findings and the
practical and policy implications supported by the analysis of this current research on the
approaches to evaluating teachers, grade-level teams, and schools as they relate to
student assessment data. At the conclusion of that chapter, which closes this study, a future research agenda will be provided.
CHAPTER 2
REVIEW OF THE LITERATURE
In recent years, evaluating teacher effectiveness has become a dominant topic in
education reform efforts in the U.S., an emphasis motivated in part by large variation in
teacher productivity as indicated by the ability to enhance student achievement
(Danielson & Ebrary, 2007). Of all factors under the purview of the schooling system,
“no other resource is so directly and intensely focused on student learning” as the
classroom teachers (Corcoran, 2010, p. 1), and research studies have found that teachers
matter, but can vary greatly in their effectiveness (Kane et al., 2008; Rivkin et al., 2005;
Nye et al., 2004). Additionally, as teachers are the primary resource of public education
and constitute the largest share of the K-12 education budgets, increased focus on
teachers (and their impact on student achievement) is warranted (Wayne & Youngs,
2003).
There is no denying that there is a persistent achievement gap that continues to
afflict our school systems. Schools are failing their students, especially the most
disadvantaged students (Martinez-Garcia, LaPrairie, & Slate, 2011). Research has found
that teacher quality varies across schools in a manner that systematically places poor, low-
achieving, and racially isolated schools at a disadvantage (Boyd, Lankford, Loeb,
Rockoff, & Wyckoff, 2008; Clotfelter, Ladd, & Vidgor, 2006; Lankford, Loeb, &
Wyckoff, 2002).
The educational mediocrity of the American K-12 schooling system was
illuminated in two large-scale commissioned reports, The Coleman Report (1966) and A
Nation at Risk (1983). The lag in educational progress in student outcomes and decreased
competitive edge in the global market have forced policymakers and researchers to look
into various methods of evaluation and accountability, their effectiveness in assessing
what they purport to do, and the concern for better evaluation approaches that can lead to
improved teaching and increased student gains. Accordingly, teacher effects and
achievement gap effects on the local and worldwide labor market make performance
evaluation and accountability a more salient discourse at this juncture in education reform
(Hanushek & Woessmann, 2011).
Furthermore, the increased scrutiny from the national government on the K-12
education landscape was evidenced with the implementation of a) the No Child Left
Behind (NCLB) Act of 2001, which mandated a “highly qualified teacher” in each
classroom; and b) the Race to the Top (RTTT) initiative of 2009, which strongly
encouraged, through the use of monetary incentives, states to develop rigorous student
achievement standards and to use student achievement test scores in teacher evaluation.
Although these steps taken by the federal government to ensure that all students succeed
are justifiable as the U.S. faces a downward trend in its competitive edge over other
developed and developing nations, thoughtful consideration of performance evaluation
processes and the varying accountability models is critical and necessary if their
generated results could cause job termination and other detrimental sanctions against the
most critical components of the education process – school teachers. Hence, the need to
ensure that evaluation methods used for assessing performance are measuring what they
are assumed to do is paramount.
In light of the purpose of this current research, which is to assess through statistical analysis the information value of the varying measures of evaluation and accountability, a more critical and in-depth examination of research literature related to performance evaluation and accountability, at both the teacher and school level, is necessary.
The following review of literature is divided into six major sections pertaining to teacher and school evaluation and accountability. The topics of the discussion include: (a) evolution of teacher evaluation practices, (b) major historical events shaping the evolution of teacher evaluation practices, (c) legislation impacting teacher evaluation practices, (d) shift from inputs to outcomes in
school-level accountability models, (e) test-based evaluation and accountability models,
and (f) socioeconomic status (SES). The last section of this chapter illuminates the unique
role of this current study within the larger body of literature on performance evaluation
and accountability models.
Evolution of Teacher Evaluation Practices
Teacher performance can be evaluated through three different approaches –
measurement of inputs, processes, and outputs (Goe, Bell, & Little, 2008). According to
Goe et al. (2008), inputs are the characteristics that the teacher brings into the education
production process. These factors include: background, beliefs, expectations, prior
experience, pedagogical and content knowledge, education level, and certification and
licensure. These indicators are used in literature and subsequently by NCLB to represent
“teacher quality.” Goe and colleagues (2008) went further to explain that processes refer
to what goes on in the classroom. The strategies teachers use to interact with their
students in the classroom to produce learning fall in the processes category. Outputs, on
the other hand, represent the learning outcomes (Goe et al., 2008). These elements
include: student achievement, graduation rates, student behavior, engagement, attitudes,
and social-emotional well-being. Typically, the term teacher effectiveness is directly
linked to student outputs and research literature has often defined teacher effectiveness as
the amount of impact a teacher has on student learning as measured by the varying outcome indicators mentioned above, including student assessment scores.
The practice of evaluating teachers is as old as the education system in the U.S.
(Clark, 1993) and it has undergone many transformations as a result of the changes in (a)
teacher roles, (b) values and beliefs about effective teaching and teacher
responsibilities, (c) perceptions of how students learn best, and (d) societal demographics
and teaching contexts (Ellet & Teddlie, 2003).
Teacher Evaluation from 1900 – 1950
Teacher evaluation was based on a moralistic and ethical perspective between the
1900’s and 1950’s (Ellet & Teddlie, 2003). Teachers, during this time, were evaluated on
the basis of their personal characteristics. A “good” teacher rating meant someone with
high moral and ethical standards, someone with basic reading skills (at or above the high
school level), and someone who was regarded as a good role model. The incorporation of
these traits in evaluation processes was influenced by research in personality and
personality trait characteristics in psychology, which took place during the 1920’s and
1940’s (Ellet & Teddlie, 2003).
Teacher Evaluation from 1950 – 1980
Between 1950 and 1980, teacher evaluation processes turned to linkages between observable teaching practices, meaning teacher behaviors, and student outcomes. Spurred by the space race, which was initiated by the Soviet Union's launch of the Sputnik satellite, and by philosophies of scientific management and behaviorism in psychology and education, classroom-based research specifically focused on effective teaching methods ensued (Clark, 1993). The use of paper-and-pencil tests as a means of state licensure of teachers was implemented. Additionally, methods for classroom
observations and evaluation of teachers began to take shape (Medley & Mitzel, 1963).
Furthermore, the industrial revolution brought about changes to the evaluation
process as a result of schools growing in size and as unions started to exert their influence
to protect the rights of educators (Ellet & Teddlie, 2003). In addition, the cold war era
brought about increased desire to find better teachers in order to compete with the
Soviets. As a result, more men joined the profession and union influence rose due to increased membership. Correspondingly, unions established specific
evaluative criteria for teachers and rules for dismissal and advancement (Clark, 1993;
Ellet & Teddlie, 2003). However, Clark (1993) contended that these criteria tended to be
minimal and were still dominated by local boards of education.
Teacher Evaluation from the 1980's to the 21st Century
Supported by a growing body of research, which established connections between
teacher evaluation criteria used during this time with student outcomes, data on teacher
evaluation expanded dramatically. This growing knowledge contributed to a larger
understanding of the educational, social, and political landscape in which teacher
evaluation was situated (Ellet & Teddlie, 2003). Consequently, increased emphasis on holding teachers responsible for enhancing student learning came about during this time, as it was viewed that teachers had more direct contact with students than
any other component of the schooling system. In addition, a significant shift toward state-
mandated, on-site assessments and evaluations of teaching for licensure purposes
replaced local, district policies, which evaluated teachers as employees. Furthermore, as
supported by educational research, a critical development in teacher evaluation at the end
of this period changed the focus of classroom-based evaluation systems from teaching to
learning. Student learning outcomes and level of engagement in classroom activities
signaled the effectiveness of a teacher.
Currently Operationalized Teacher Evaluation Model – Classroom Observations
Either through state law or federal mandates, each state has in operation some
form of teacher evaluation approach. Accordingly, there are various models of teacher
performance evaluation in practice. Although there are varying approaches to
performance evaluation of teachers, the most common approach among different states
and districts is principal classroom observation (Weisberg et al., 2009).
Overall, the process typically requires the site administrator to observe the teacher
within a class period (usually about 45 to 55 minutes). According to Toch and Rothman
(2008), often the evaluator has no specific training in the evaluation process and the
rubric focuses primarily on superficial aspects of teaching, such as professional dress
code standards and attendance, rather than teaching and learning. The results of the
principal observations consist of a rating of the observed teacher on a binary
“satisfactory” or “unsatisfactory” scale. A number of school districts have a more
stratified scale, which contains such ratings as “unsatisfactory,” “basic,” “proficient,” and
“advanced.” Researchers have come to refer to these principal observation processes as
"drive-bys" (Toch & Rothman, 2008; Weisberg et al., 2009). Murnane (1975) concluded in his research on the effects of principal evaluations on teacher quality that principals can correctly identify effective and ineffective teachers through brief observation of their practice. However, this ability is not reflected in the actual ratings principals give their teachers. That is, when K-12 learning organizations utilize the binary (satisfactory/unsatisfactory) scale, less than one percent of teachers are given unsatisfactory ratings; and when a more detailed scale is used, similar results are garnered (Weisberg et al., 2009). Further, 94% of teachers are rated in one of the top two categories when a more layered rating scale is used. As a result of the small number of teachers
rated as unsatisfactory, it was posited by Weisberg and colleagues (2009), that the current
evaluation practice is not adequately differentiating teacher effectiveness. Substantiating
this point, according to Donaldson (2009), the criteria in teacher evaluations are often
constructed to minimize differentiation. Consequently, neither the principal nor the
observed teachers see the evaluation process as meaningful. Notably, connections
between teacher performance evaluation practices such as principal observations and
increased student learning have not been substantiated in research (Donaldson, 2009;
Milanowski, Kimball, & White, 2004; Pianta, 2005; Tyler, Taylor, Kane, & Wooten,
2010).
Since the inception of the U.S. schooling system, performance evaluation
processes have existed. The initial intent of teacher evaluation approaches was to have
information from which to determine job continuation and pay increase (Clark, 1993).
Although this may have always been true to some extent, the means by which the
performance of teachers is measured have transformed dramatically as more and more
focus is placed on adequate achievement outcomes for all students. Increasingly, large-
scale test results have become the primary indicator for measuring the condition of public
education and have become the symbol of the new accountability era (Elmore, Abelman,
& Fuhrman, 1996; Linn, 2000).
Major Historical Events Shaping the Evolution of Teacher Evaluation Practices
Three major events shaped the evolution of teacher evaluation practices in the
United States. These events include: (a) the publication of the Equality of Educational
Opportunity of 1966, also known as the Coleman Report (Coleman, 1966); (b) the
implementation of the National Assessment of Education Progress (NAEP); and (c) the
publication of A Nation at Risk (National Commission on Excellence in Education) of 1983. The following
sections discuss these events in chronological order.
The Coleman Report – Equality of Educational Opportunity of 1966
The publication of the landmark commissioned report, Equality of Educational
Opportunity of 1966, also known as the Coleman Report (Coleman, 1966), was the first
step to moving the accountability and evaluation systems from a focus on inputs to a
focus on outputs. Commissioned by the United States Department of Health, Education,
and Welfare, Coleman (1966) set out to conduct the large-scale survey in an effort to
“assess the availability of equal educational opportunities to children of different race,
color, region, and national origin” (Coleman, 1966, p.1). The analysis of test scores and
questionnaires developed for the purposes of the study suggested that inputs into the
education production system such as school resources to include teacher characteristics
did not have an impact on student learning as compared to student background and peer
influence.
The Coleman Report (Coleman, 1966) was significant for many reasons. One of
its many critical impacts was that, for the first time in U.S. education reform, the focus shifted from inputs placed into the education production process, such as teachers' salaries, facilities, or compliance issues, to outcomes such as student achievement
(Hanushek, 1992). As Ravitch (2002) stated:
Before the Coleman Report, educational reform had focused solely on the issue of
resources, on the assumption that more generous provisions for teachers’ salaries,
facilities, textbooks, and supplies would fix whatever ailed the nation’s schools.
After the Coleman Report, reformers advanced a broader array of proposals,
many of which sought changes in performance rather than (or in addition to)
increases in resources. (p. 14)
In addition, the Coleman Report (Coleman, 1966) revealed that there was inadequate
collection and analysis of achievement data in the U.S. As a result, Congress authorized
the creation of the first educational indicator – The National Assessment of Educational
Progress – to gain more information on educational outcomes like student achievement
(Coleman, 1966).
The National Assessment of Educational Progress (NAEP)
The NAEP was first implemented in 1969, and it has since been a nationally representative assessment that monitors trends in the knowledge and skills of American
students. Student achievement at grades 4, 8, and 12 in various content areas (reading,
mathematics, science, writing, history, etc.) is reported accordingly. In 1990, NAEP made
some changes to its original reporting procedures. Up to that time, NAEP reported to the public only results at the grade level and for student groups within grade levels, and not the performance of individual students, schools, school districts, or states. Starting in
1990, the state-level scores were reported, which enabled education stakeholders to
compare and benchmark academic achievement among the states (National Education
Association, 2006). The NAEP is generally regarded as a fair measure of what students
know and can do since test results are reported only at the state level, which seems to have mitigated contamination (like teaching to the test) by education systems seeking better test scores.
A Nation at Risk (National Commission on Excellence in Education, 1983)
A second large-scale commissioned report, A Nation at Risk – National
Commission on Excellence in Education – was published in 1983. The authors of the
report declared, “The educational foundations of our society are presently being eroded
by a rising tide of mediocrity that threatens our very future as a nation and a people” (The
National Commission on Excellence in Education, 1983). The mediocre U.S. performance in the global market was attributed to mediocre student performance on
national and international tests (Sirotnik, 2004). As a consequence of this report, scrutiny
from business leaders, the general public, and the national government intensified at an
unprecedented rate.
Decreased competitiveness of the U.S. in the international market drew attention
from the business community to the education landscape. Under-performance of
American businesses as compared to other developed and developing nations caused
business leaders to scrutinize the education production process. In many ways, the
business community blamed the American schooling system for inadequately preparing
students to succeed in the competitive world economy. The NAEP substantiated this
inadequate progress in education attainment on the part of American students. Taken
together, it was imperative that the U.S. work at improving education to gain momentum on productivity (Fuhrman, 2004). In effect, corporate leaders demanded that the schooling systems focus more heavily on the outcomes garnered from the
education production process.
Compounded by the lagging progress of the economy, the report essentially triggered a nationwide education reform movement, which included increasing high school graduation requirements, lengthening the school year, and, notably, adding more tests for students to take in an effort to monitor adequate student progress (Cuban, 2004). Whether intended or not, A Nation at Risk sparked a decade-long wave of educational reform efforts centered on the design and collection of more detailed performance data for improving the learning of all students. Thus, the shift from input-focused to outcome-focused accountability greatly intensified.
Legislation Impacting Teacher Evaluation Practices
Aside from the above-mentioned three events, there were also two major pieces of federal legislation that have critically framed the landscape of performance evaluation and
accountability. They are: (a) the No Child Left Behind (NCLB) Act of 2001 and (b) the
Race to the Top (RTTT) Initiative of 2009.
No Child Left Behind (NCLB) Act of 2001 – Teacher Quality Component
In 2002, NCLB was operationalized covering three broad areas in the education
environment: (a) accountability provisions, (b) teacher quality provisions, and (c)
provisions on state flexibility in the use of federal funds (Linn, Baker, & Betebenner,
2002). This section will focus on the teacher quality provision.
NCLB attempted to address learning outcomes at the instructional core with its
mandate that each class be staffed with a “highly qualified teacher.” As a result of NCLB,
teacher quality was formalized as a set of minimum qualifications. Under the NCLB
mandate, a “highly qualified” teacher is an individual who has (a) a full state teacher
certification; (b) a minimum of a bachelor’s degree obtained from an accredited
institution of higher education; and (c) subject-matter and teaching skills competency in
each of the academic subjects taught; for elementary teachers, this includes “reading,
writing, mathematics, and other areas of the basic elementary curriculum” (United States
Congress, 2001). Veteran teachers can meet NCLB’s “highly qualified” teacher standard
by passing subject matter exams or through a process known as the High Objective
Uniform State Standard of Evaluation (HOUSSE), defined separately within each state
(Boyd, Goldhaber, Lankford, & Wyckoff, 2007). One reason for the “highly qualified”
teacher mandate was to ensure that disadvantaged students “are not taught at higher rates
by unqualified, out of field, or inexperienced teachers” (U.S. Congress, 2001).
Although some researchers have suggested that the NCLB’s “highly qualified”
teacher legislation was well intended and was perhaps successful in illuminating
education inequalities (Darling-Hammond, 2007), other existing research has generated
mixed results regarding the relationship between student achievement and teacher
certification, education, and competency (Phillips, 2010).
Teacher education research. The effects of teacher education on student
outcomes are inconclusive. Ferguson and Ladd (1996) found that the relationship between student learning and teachers with subject-specific degrees was positive. However, Elberts and Stone (1984), Ehrenberg and Brewer (1994), and Kiesling (1984) found a negative correlation. Data improvements that allow researchers to examine both the degree earned and the content focus of that degree have helped disentangle the inconclusive results of earlier studies. Goldhaber and Brewer (1997,
2000) posited that mathematics students whose teachers had master’s degrees in
mathematics (and not in mathematics teaching) gained more on achievement tests.
However, these findings were not generalizable to other subject areas or grade levels. At
the elementary level, one study found no evidence that teacher education improved
student achievement in kindergarten (Guarino, Hamilton, Lockwood, & Rathbun, 2006).
However, other researchers have found that teachers with bachelor’s degrees for
elementary education (a subject-specific degree for elementary teachers) significantly
improved elementary students’ achievement gains in reading, but not in mathematics
(Croninger, Rice, Rathbun, & Nishio, 2007).
Teacher certification research. Similar to the above discussion on teacher
education attainment, literature provides mixed results on the correlation between teacher
certification and student achievement. Hawk, Coble, and Swanson (1985) concluded that
students learn more from fully certified teachers regardless of certification type or subject
specific certification. However, their study did not account for SES and therefore
overestimated effects of teachers and teacher characteristics. Goldhaber and Brewer
(1997) found that having a college degree and full certification made a difference in tenth
grade students’ mathematics achievement, but only if the degree and certification were
subject specific (i.e., a math teacher received a degree in mathematics as well as
certification in mathematics). Subject specific certification in content areas other than
math (such as English and science) did not yield significant findings (Goldhaber &
Brewer, 1997, 2000). Guarino and colleagues (2006) found in their study that full
certification did not predict student achievement gains for kindergartners.
Teacher competency research. Of all the teacher characteristics that have been
addressed in the teacher quality literature, subject matter competency is most difficult to
measure and define because of the varying philosophical viewpoints among states. As a
result, researchers have examined an array of characteristics that act as reasonable
measures of teacher competency, such as subject-specific degrees (which have been
addressed in the preceding section), teachers’ scores on licensure and competency tests,
and subject specific course-taking patterns. Research on these indicators has been
inconclusive (Croninger et al., 2007; Phillips, 2010; Wayne & Youngs, 2003).
In sum, the minimum requirements of an NCLB “highly qualified” teacher have
not been found by research to be strongly predictive of student outcomes as they relate to
student assessment scores (Goldhaber, 2002; Kane et al., 2008; Wayne & Youngs, 2003).
Furthermore, lack of consistency in the results produced across research on the NCLB
“highly qualified” teacher characteristics as they relate to student achievement has
continued to confound proper assessment of their predictive nature (Phillips, 2010).
Race to the Top (RTTT) Initiative of 2009
Backed by the Barack Obama administration, RTTT, which was funded with
$4.35 billion from the American Recovery and Reinvestment Act (ARRA), was instituted
to accelerate education reform in America’s public schools. In promoting RTTT,
President Obama (2009) stated, “success should be measured by results…any state that
makes it unlawful to link student progress to teacher evaluation will have to change its
ways." On this basis, the RTTT initiative continues the demand for better student outcomes and to a large extent solidifies the federal government's perspective as to what is most
valued when it comes to performance evaluation and accountability. Accordingly, RTTT
defines “highly effective teachers” as those whose students achieve high rates of
academic growth as defined by the change in test scores between two or more points in
time (U.S. Department of Education, 2010).
In their applications for monetary rewards from the RTTT’s $4.35 billion
competitive grant program, states are judged on the extent to which they or their districts
will: (a) measure individual student growth; (b) implement evaluation systems that use
student growth as a significant factor in evaluating teachers and principals; (c) include
student growth in annual evaluations; (d) use these evaluations to inform professional
support, compensation, promotion, retention, tenure, and dismissal; (e) link student
growth to in-state teacher preparation and credentialing programs, for public reporting
purposes and the expansion of effective programs; and (f) incorporate data on student
growth into professional development, coaching, and planning (U.S. Department of
Education, 2010).
States and communities across the nation have recently undertaken efforts
designed to promote education reforms that are consistent with the principles reflected by
the RTTT initiative. California recently enacted legislation to enable student achievement
data to be linked to teacher and principal performance.
Shift from Inputs to Outcomes in School-Level Accountability
The second component of this current research is on school accountability. Thus,
to frame the current study, an examination of the evolution of school accountability and
the increasing focus on student assessment outcomes as a measure of performance is
warranted. The following will include discussions on: (a) the Elementary and Secondary
Education Act (ESEA) of 1965, (b) Improving America’s Schools Act (IASA) of 1994,
(c) the No Child Left Behind (NCLB) Act of 2001 – accountability provisions, and (d)
California’s Public Schools Accountability Act (PSAA) of 1999.
Elementary and Secondary Education Act (ESEA) of 1965
The ESEA was the first and most extensive federal education law passed by the
U.S. Congress. As part of President Lyndon Johnson’s War on Poverty plan, ESEA
allocated through Title I, a compensatory education program, substantial federal funds in
an effort to ameliorate the disparity in academic achievement of low-income families
(Archuleta, 2002). To prevent districts from misusing funds for purposes other than those intended by Title I, annual evaluation of the program was implemented
accordingly. Further, as Hamilton and Koretz (2002) stated, “standardized achievement
tests became a central means of evaluating Title I programs. Some observers viewed this
as a key step toward the use of tests as monitoring and accountability devices" (p. 16). On this basis, the success of ESEA was judged by student academic achievement (Cuban,
2004). Moreover, according to Cuban (2004), the next two decades saw many states
implementing statewide testing programs. Approximately 60% of the states had adopted
statewide testing programs by the 1970’s and most states had such programs by the
1990’s.
Improving America’s Schools Act (IASA) of 1994
During the Clinton administration, specifically in 1994, the ESEA was
reauthorized into the IASA. In comparison to the ESEA and the previous
reauthorizations, IASA was characterized as being increasingly more outcome-focused
and as having narrower and more specific requirements regarding standards and
assessment. These narrower and more specific provisions became the foundation for
NCLB of 2001 (Riddle, 2004).
Notably, IASA mandated that states develop and implement standards for Adequate
Yearly Progress (AYP), which became the keystone of NCLB. The difference was under
IASA, standards for the AYP were “transitional,” which meant the legislation provided
several years (1994 to 2002) to phase in the requirements of creating and putting into
practice curriculum content standards, pupil performance standards, and assessments
linked to these standards. During the phasing in period, performance standards,
assessments, and AYP standards were all “transitional” (Riddle, 2004). Further, although
IASA did not specify the timeline for meeting 100% proficiency, it did require states to
ensure that all students master the content standards at the proficient level or above
(Riddle, 2004).
Along with the passage of the IASA in 1994, state-wide accountability systems emerged and were implemented throughout the 1990's. According to Kane and Staiger (2002), approximately 90% of all states published report cards and more than a third of all states included financial rewards and/or sanctions as part of
their accountability systems by the 2001-2002 school year.
No Child Left Behind Act (NCLB) of 2001 – Accountability Provisions
NCLB (2002) was designed and implemented in an effort to “close the
achievement gap with accountability, flexibility, and choice, so no child is left behind" (p. 1). The 107th Congress reauthorized the ESEA of 1965, and President George W. Bush signed into law a policy with a new name – No Child Left Behind.
The accountability provisions under NCLB are very specific. NCLB required all
states to design and submit a plan outlining how all students would progress toward
proficiency in reading and math by the end of the 2013-2014 academic school year.
Discretion was given to the states on how to design the plan. The key requirements under
the accountability provisions were that the plan had to include challenging state standards
on math and reading, an outline of how students received instructional programs that
focused on these standards, and appropriate annual assessments of student mastery of
these standards in grades three through eight. Further, each state needed to determine and
to define the calculation of AYP, an annual measure of academic progress toward
achieving the 2013-2014 proficiency goal based on all students’ scores obtained from the
statewide achievement tests. Progress towards AYP must be included in an annual report
by the states to the United States Department of Education, which must indicate how
each student group demonstrated proficiency within 12 years of the law’s enactment. To
that end, NCLB required states to provide disaggregated data related to the adequate
yearly progress of four student groups – economically disadvantaged students, students
with disabilities, limited English proficient students, and students from major racial and
ethnic groups.
California’s Public Schools Accountability Act (PSAA) of 1999
In addition to the national initiatives discussed in the preceding sections,
California has also enacted its own educational policy. The accountability system in
California is based on the federal requirements of NCLB and state requirements under the PSAA of 1999. Even before the requirements of NCLB, state policymakers had to
address the “mounting concerns about the quality of California’s public school” (Woody
et al., 2004, p. 11). The state was tied for last place on the NAEP.
California instituted the PSAA in 1999 and as a result, it also developed a
statewide testing and reporting system – Standardized Testing and Reporting (STAR)
program – with rewards and sanctions attached as consequences (California Department
of Education, 2010b). The impetus was to ensure that schools were held accountable for
adequate student progress. It was the initial step for the state government of California in
developing an accountability system to measure and improve the learning of California
students. To this end, California’s accountability system required schools and districts to
work towards meeting the statewide target of 800 on the Academic Performance Index
(API) (California Department of Education, 2010b). A report is published annually to
illustrate student achievement from the previous school year. Schools that have met or
exceeded the targeted goal must continuously work to improve the academic performance
for all their students.
The API is a numeric index, which ranges from a low of 200 to a high of 1000. It
represents a school’s progress as measured by student test scores. Tests that are used to
measure a school’s API are: (a) the California Standards Test (CSTs) that include
English-language arts in grades 2 through 11, a writing assessment in grades 4 and 7,
mathematics in grades 2 through 11, history-social science in grades 8, 10, and 11, and
science in grades 5, and 8 through 11; (b) the California Alternate Performance
Assessment (CAPA) in English-language arts and mathematics, in grades 2 through 11,
and science in grades 5, 8, and 10, for students with extreme cognitive disabilities; (c) the
California Modified Assessment (CMA) in English language arts (grades 3 through 9)
and math (grades 3 through 11), and science in grades 5, 8, and 10, for students for whom
both the CST and the CAPA are not appropriate; and (d) the California High School Exit
Examination (CAHSEE) in English-language arts and mathematics in grades 10 and 11
(California Department of Education, 2010a).
All California public schools receive two APIs during a single school year – a
Base and Growth API. The API Base Report provides the school’s current level of
academic performance and establishes growth targets for the school year (see Table 2).
The API Growth Report then presents data on whether the growth targets were met. In
addition, all significantly represented subgroups including, if any, the socially
disadvantaged group must meet their projected growth target (see Table 2). A significant
subgroup is defined as one that has “at least 50 students enrolled or with valid tests scores
who make up at least 15% of the school’s enrollment of total valid test scores” or have an
enrollment of at least 100 students with valid test scores (California Department of
Education, 2010a, p. 25). The State Board of Education added English learners and
students with disabilities beginning with the 2005 Base Report. Under state regulations, if
a school meets the growth targets set in the API Base Report, it may be eligible for
monetary compensation or be recognized as a California Distinguished school or a
National Blue Ribbon school. However, if it fails to meet performance targets, the state
can intervene to improve the school’s academic performance.
Table 2
State API Growth Targets
Annual School-wide Target:
1. Growth of 5% of the distance from the school's API to 800, OR
2. API of 800 or above
Annual Subgroup Target:
1. 80% of the School-wide Target, OR
2. API of 800 or above
Source: California Department of Education (2010a).
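To make the computation behind Table 2 concrete, the following minimal sketch (in Python) applies the published formula to a hypothetical Base API; the rounding and minimum-target provisions of the official CDE calculation are omitted.

def api_growth_targets(base_api, statewide_goal=800):
    """Annual API growth targets per Table 2 (simplified).

    School-wide target: 5% of the distance from the Base API to 800.
    Subgroup target: 80% of the school-wide target.
    Schools at or above 800 must maintain that level.
    """
    if base_api >= statewide_goal:
        return 0.0, 0.0  # no growth required beyond maintaining 800
    schoolwide = 0.05 * (statewide_goal - base_api)
    subgroup = 0.80 * schoolwide
    return schoolwide, subgroup

# Hypothetical school with a Base API of 600:
# school-wide target = 0.05 * (800 - 600) = 10 points;
# subgroup target = 0.80 * 10 = 8 points.
print(api_growth_targets(600))  # (10.0, 8.0)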
The API Growth Report also includes the school’s decile ranking and a similar-
schools rank. A school’s decile rank can range from a low of 1 to a high of 10, with 10
being the best in comparison to other schools in the state of the same type – elementary, middle, or high school. To calculate the school's similar-schools rank, the school's API is
compared to 100 other schools that are considered similar in demographics. The purpose
of the statewide decile ranking is to provide information as to where a school stands in
relation to all the schools of the same type in the state. The similar schools rank allows a
school to gauge its academic progress and achievement compared to other schools that
have similar opportunities and challenges (California Department of Education, 2010d).
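As an illustration of the decile ranking just described, the sketch below sorts schools of one type by API and assigns ranks from 1 to 10; the API values are hypothetical, and the CDE's exact grouping and tie-breaking rules are not reproduced.

def decile_ranks(api_scores):
    """Assign each school a decile rank from 1 (lowest) to 10 (highest),
    based on its position among all schools of the same type."""
    order = sorted(range(len(api_scores)), key=lambda i: api_scores[i])
    n = len(api_scores)
    ranks = [0] * n
    for position, school in enumerate(order):
        ranks[school] = min(10, position * 10 // n + 1)
    return ranks

# Hypothetical APIs for five elementary schools:
print(decile_ranks([512, 788, 645, 903, 701]))  # [1, 7, 3, 9, 5]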
States like California, with greatly diverse student needs, have utilized the similar schools ranking especially for the purpose of improving the performance of at-risk schools and districts. Understanding that schools and students differ greatly in student characteristics and resources, the system employs relative standards
as a way for schools to examine their performance and to benchmark the practices of
schools that face similar educational challenges. The use of relative rankings attempts to
make the system more educative, effective, and equitable for schools with the greatest
academic needs (Flanagan & Grissmer, 2006). In addition, while similar schools ranking
is not currently attached to any rewards or sanctions, it is implicitly linked to public and
administrative scrutiny as a result of comparative analysis (Fuhrman, 2004, p. 23).
Nevertheless, similar schools ranks remain a vital part of California’s accountability
system.
Test-Based Evaluation and Accountability Models
Although the idea of measuring teacher effectiveness on the basis of student
outcomes has been circulating through the education reform agenda (Murnane & Cohen,
1986; Odden & Kelley, 2002), four factors have come together to forge the shift towards
looking at teacher impact through the lens of student achievement results: (a) High-stakes
accountability legislation has not only produced a plethora of assessment data, but has also increased the pressure on schools to account for results in a measurable way; (b) The
development of longitudinal data systems allows student test scores to be tracked over
time and matched to classroom teachers; (c) Advancements in data capacity and
statistical modeling have generated varying techniques with potential for estimating a
teacher’s individual impact on student achievement; and (d) Major businesses and
education foundations (such as the Bill and Melinda Gates Foundation) have invested
large sums of money to assist in the outcome-oriented shift in accountability and
evaluation processes (Corcoran, 2010).
Test-based accountability is the current form of performance-based accountability in the U.S. "Nowadays, one thinks of testing and
accountability as twins in education; tests, it is assumed, produce the data on which
accountability for results are based” (Ravitch, 2002, p. 1). In the 1980’s, testing was
viewed as a useful tool to both monitor and stimulate educational reform efforts. As a
consequence, policymakers turned to test-based accountability as a critical component of
educational reform, reasoning that sanctions and rewards to schools will induce teachers
and school administrators to be more effective (McDonnell, 2004; Rouse, 2005).
Not everyone has viewed the shift from inputs to outcomes in the same manner. A
large number of educators, if not all, believe that reducing their efforts to a single indicator is not a true reflection of what they bring to the classroom. They argue that standardized tests are inadequate measures and believe that multiple outcome measures are necessary to make fair and accurate judgments of their performance (Choi et al., 2005). Some researchers agree with school educators and strongly advocate that a broader array of indicators beyond large-scale assessment results be included in accountability models (Oakes, 1989; Porter, 1988): some see a need to monitor the quality and rigor of the academic curriculum (Oakes, 2003), others favor attention to the safety and cleanliness of school facilities (Oakes, 1989), and still others hold that the equitable distribution of other resources, such as qualified teachers, is most critical and pertinent (Darling-Hammond, 2000). Despite the lack of consensus, current legislation aims at accountability models that focus squarely on student outcomes (Choi et al., 2005).
Currently, there are several models of evaluation and accountability based on
student test scores that have been put into practice or are being considered for
implementation. U.S. states are using one or a combination of several models in their
accountability systems to measure the performance of students, teachers, and schools.
Although the manner in which the computations are carried out varies and the
assumptions inherent in each model type are different, the varying approaches to
evaluation and accountability utilize the assessment results generated from the mandates
placed upon each state by NCLB. The following sections will describe in detail the four
generic test-based accountability models: status, improvement, growth, and value-added.
Status Model
The status model takes “a snapshot of a subgroup’s or school’s level of student
proficiency at one point in time (or an average of two or more points in time) and often
compares that proficiency level with an established target” (Goldschmidt et al., 2005, p.
3). A status model evaluates school level performance against an established achievement
target generally for one particular school year. As a result, it is also referred to as a
“school-mean performance approach” (Raudenbush, 2004).
Mandated by NCLB, all states and all school districts use the status model to
measure student and school progress. The AYP under NCLB is an example of a status
model indicator. School progress is based on the comparison of the AYP attained by a
school to the established target also known as the Annual Measurable Objective (AMO),
which is the state-set annual goal for schools and their students. Without accounting for varying student-, teacher-, or school-level characteristics, academic progress in terms of AYP is determined by the percentage of students who score at or above the proficient level, indicating proficient or advanced mastery of the state content standards for a particular year. In addition, schools are evaluated based on whether the various student groups met or did not meet the targeted goal. On this basis, the model's basic
question is “on average how are students performing this year” (Goldschmidt et al., 2005,
p. 3).
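In code, the status-model question reduces to a single-year proficiency rate compared against a target. The sketch below is a minimal illustration with hypothetical scores, a hypothetical proficiency cut score, and a hypothetical AMO; it does not reproduce any state's actual rules.

def status_model(scores, proficient_cut, amo):
    """Percent of students scoring at or above the proficiency cut
    score in one year, compared against the AMO target."""
    pct = 100.0 * sum(s >= proficient_cut for s in scores) / len(scores)
    return pct, pct >= amo

# Hypothetical scaled scores, cut score of 350, and AMO of 60%:
scores = [310, 355, 372, 340, 390, 365, 330, 410]
print(status_model(scores, proficient_cut=350, amo=60.0))  # (62.5, True)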
The status model has appeal because it sets the same performance expectations for
all students and schools regardless of where students start. However, status approaches to
accountability "pose the greatest challenges to high-poverty schools, which enroll a large percentage of students who have traditionally scored poorly on standardized achievement tests" (Kim & Sunderman, 2005, p. 4). Moreover, school accountability based on attaining
a certain AYP in order to meet AMOs as defined by the status model may not correctly
classify school performance (Goldschmidt et al., 2005). There are several reasons for the
potential misclassification: (a) Schools with more student groups represented are more
likely to miss meeting the AYP due to the greater number of AMOs they need to meet
(Novak & Fuller, 2003); (b) Classification based on a cut score captures only a small
proportion of student performance, especially when scores are close to the cut score
(Thum, 2003); and (c) AYP does not recognize that each student has an educational
history and performs according to current and past opportunities to learn skills and build
knowledge. As a result, school performance is heavily influenced by the characteristics of
the students who enroll in the school rather than how well the school instructs its students
(Goldschmidt et al., 2005). An accountability model that classifies performance on a cut
score will not provide a good indicator of school quality (Choi et al., 2005).
Improvement Model
An improvement model is a variation of the status model and is also referred to as
a successive groups model or a cross-sectional model. An improvement model’s
underlying question of analysis is “on average, are students doing better this year as
compared to students in the same grade last year” (Goldschmidt et al., 2005, p. 4).
Correspondingly, an improvement model of accountability indicates school progress by
examining the change in assessment scores between different groups of students. For
example, the percent of grade 4 students scoring proficient or above in 2000 is compared to
the corresponding percent for grade 4 students in 1999.
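A minimal sketch of this successive-groups comparison, using hypothetical proficiency rates for the grade 4 example above, follows; note that the two rates describe different groups of students.

def improvement(pct_proficient_now, pct_proficient_prior):
    """Change in percent proficient between successive cohorts at the
    same grade level; the cohorts contain different students, which is
    the source of the reliability problems discussed below."""
    return pct_proficient_now - pct_proficient_prior

# Hypothetical: 58% of the 2000 grade 4 cohort proficient vs. 53% in 1999.
print(improvement(58.0, 53.0))  # +5.0 percentage points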
In addition to systems in states like Colorado, Maryland, and Washington, Kentucky's accountability system and California's Academic Performance Index (API) are examples of the successive cohorts approach, which measures improvement from one cohort of students to the next. A
reason for using the improvement model is to account for the varying starting points in
academic attainment of students served by the school. However, comparing the
performance of different groups of students at a grade level in different school years rests
upon the implicit assumption that the characteristics of the student population at any one school remain stable year after year (Rouse, 2005). Although this may be a reasonable
estimate of performance for most schools, it is not as reliable or valid for large urban
school districts serving families whose demographic characteristics change rapidly (Linn,
2001a; Linn, 2001b; Hanushek, Raymond, & Rivkin, 2004).
According to Linn (2001a) and Kane and Staiger (2002), change scores that are
generated in the comparison between successive groups in the improvement model can be
quite unreliable for several reasons. They include: (a) Measurement and sampling error
can cause variability in school summary scores (Linn & Haug, 2002; Linn et al., 2002);
(b) Change scores are less reliable than the scores used to compute change because the
scores for year one have a strong positive correlation with the results of year two (Kane
& Staiger, 2002); (c) It is common that the between-school variability of the change
scores is considerably smaller than the between-school variability of the scores for a
given year (Kane & Staiger, 2002); (d) A large part of the variability found in change
scores for schools is due to non-persistent factors that influence scores in one year but not
the next (Kane & Staiger, 2002; Linn & Haug, 2002). Such factors include illness or a
traumatic school event such as a student death – factors over which schools have little or
no control; and (e) Hanushek et al. (2004) suggested that the successive groups model
may exhibit variability in results purely from design error.
In their work on data from North Carolina, Kane and Staiger (2002) found that persistent factors having to do with the school environment accounted for only about a fifth to a fourth of the variability in school change scores. For schools in the smallest quintile, they estimated that 58% of the between-school variability in year-to-year changes in fourth grade reading and math scores was a result of a combination of sampling variability and other non-persistent factors; the corresponding percentage for the largest 20% of the schools was 73%. As a
result, Kane and Staiger (2002) corroborated their assertion that the variability in change
scores can result from non-persistent factors and sampling variability. Taken as a whole,
because change scores may be reflective of random fluctuations attributable to non-
persistent factors that may or may not repeat themselves in the following testing cycles,
basing high-stakes decisions on them may do more harm than good (Linn, 2001b).
Further, the volatility due to sampling error and non-persistent factors is so great that
schools identified in a given year are unlikely to be similarly identified the following year
(Kane & Staiger, 2001; Linn & Haug, 2002).
Linn et al. (2002) used test data from the Colorado Student Assessment Program (CSAP) to illustrate the instability of school-building results inherent in the successive groups model. Specifically, the results of the fourth grade CSAP test in reading for 734 schools over a four-year span were examined in the analysis. Despite the fact that, on average, schools had 4.7% more students at the proficient level or higher in 2000 than in 1997, only one school in 20 would have met the target increase of one point three years in a row. As a result, many schools that meet the target in one year will fail to do so the next year. Given the challenges inherent in successive groups approaches, Linn et al. (2002) suggested four alternatives for defining adequate yearly progress that may ameliorate the problem of school-level
score instability: (a) the use of longitudinal tracking of students from year to year, (b) the
use of rolling averages of two or more years of achievement results, (c) the use of
composite scores across subject areas and grades, and (d) the use of separate grade-by-
subject-area results but setting the targets other than all combinations showing
improvement, such as five out of eight or seven out of ten possible grade-by-subject
combinations. Each of these alternative approaches would reduce the magnitude of year-
to-year fluctuations of results due to differences in cohorts of students attending a school.
Growth Model
In education, growth models are accountability approaches that measure progress
by examining test scores of the same students from one year to the next with the purpose
of determining whether or not the students have made progress (Goldschmidt et al.,
2005). Usually, the cohorts of students are tracked (typically through unique
identification numbers) over time as they advance through the K-12 schooling systems
(Riddle, 2005). Comparing data for the same students over time allows for the estimate of
improvement made by each individual student as compared to a statewide or local target.
To this end, the basic question related to the growth model is “how much, on average, did
students' performance change" (Goldschmidt et al., 2005, p. 4). Further, the aggregate of individual students' growth makes up a school's achievement growth over time. Growth models can be delineated into two separate but similar models
of accountability: longitudinal and quasi-longitudinal.
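A minimal sketch of the matching step behind a growth model is given below: scores from two years are joined on a hypothetical unique student identifier, and the average gain is computed over students tested in both years, as the longitudinal variant described next requires. The scores are assumed to lie on a vertically equated scale.

def mean_longitudinal_gain(year1, year2):
    """Average gain for students with valid scores in both years.

    year1, year2: dicts mapping a unique student ID to a vertically
    equated scale score, so gains are meaningful across grades."""
    matched = set(year1) & set(year2)  # students tested in both years
    gains = [year2[sid] - year1[sid] for sid in matched]
    return sum(gains) / len(gains)

# Hypothetical scores keyed by student ID; s03 left and s05 arrived,
# so both are excluded from the longitudinal computation.
grade3 = {"s01": 410, "s02": 435, "s03": 398, "s04": 450}
grade4 = {"s01": 428, "s02": 451, "s04": 459, "s05": 440}
print(mean_longitudinal_gain(grade3, grade4))  # about 14.3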
As in other models of accountability, there are also benefits and challenges to the
growth model. A system based on growth in performance is attractive as it attends to
student learning and thereby may be seen as fairer to students and schools because it
takes into account previous performance (Linn, 2001b). On the other hand, advocates of
poor and minority children worry that a growth model will result in lower expectations
for “the nation’s most disadvantaged young people” (Olson & Hoff, 2005, p. 16).
Moreover, modeling growth requires at least three years of student level data and it may
require vertical equating, which involves developing a continuous scale that has the same
meaning across grades.
Longitudinal model. In this version of the growth model, students with valid test
scores for the two years (or more) in the comparison are accounted for. According to
Linn (2001a), the longitudinal model is favorable in several ways: (a) It holds schools
accountable only for gains made by students who have been in the school for the full
year; (b) Taking the prior year’s achievement directly into account minimizes concerns
about unfair comparisons among schools that serve student bodies with substantially
different socioeconomic backgrounds, since most of those differences are accounted for
by controlling for prior achievement. This is a factor to take into consideration as
research has indicated a strong correlation between achievement and SES, particularly
when considering data aggregated to the school level. The correlation between SES and student achievement will be discussed further in a later section; and (c)
Longitudinal models produce more dependable estimates of gains for schools than those
produced by the successive groups model.
While there are some benefits to using the longitudinal model, there are some
drawbacks. First, the longitudinal model requires that scores be reported on a scale that is
comparable across grades. This requirement means that reports in terms of number of
students meeting standards cannot be used in the analyses because the performance levels
at one grade are not comparable to those at another. Second, there is still a substantial
degree of uncertainty in the estimated gains due to a combination of sampling variability
and other non-persistent factors that affect the scores in one year but not another. Using
data from North Carolina, Kane and Staiger (2002) estimated that 29% of the between-
school variance in gains from grade 3 to grade 4 for the schools in the largest quintile was
due to a combination of sampling variability and other non-persistent factors. For schools
in the smallest quintile, the corresponding figure was 58%. These results indicate that
longitudinal growth models may work well for larger schools but may be unreliable for schools with smaller enrollments. Lastly, the requirement that all students
have test scores in both years (or all years if more than two are used) generally means
that mobile students are more likely to be excluded from the accountability calculations.
Although this may be a strong positive for teachers, focusing only on students who are
stable in the schools they attend can distort the overall accountability results and may
paint an overly optimistic picture of the gains.
Quasi-longitudinal model. As another version of the growth model and similar
to the longitudinal model, the quasi-longitudinal model accounts for student growth
rather than year-to-year changes in achievement, as in the successive groups model. On
this basis, a quasi-longitudinal model, like the longitudinal model, requires vertical
equating of test scores across grade levels. However, different from the longitudinal
model, the quasi-longitudinal model estimates gains for all students in a school who were
tested in either the first or second year. In other words, both mobile and non-mobile
students are part of the computation. As a result, there is little variation in estimates of
performance between the longitudinal and quasi-longitudinal models for schools that have relatively stable student populations. However, for schools with higher transience rates,
the two models can yield very different results.
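One simple way to read this contrast in code: using the same hypothetical data as the earlier growth-model sketch, a quasi-longitudinal computation includes every student tested in either year, so mobile students affect the estimate. This is an illustrative simplification of the model, not an operational implementation.

def quasi_longitudinal_gain(year1, year2):
    """Gain as the difference of school means over ALL tested students,
    including students tested in only one of the two years, on a
    vertically equated scale."""
    mean1 = sum(year1.values()) / len(year1)
    mean2 = sum(year2.values()) / len(year2)
    return mean2 - mean1

# With the hypothetical grade3/grade4 data above, the quasi-longitudinal
# estimate (about 21.3) differs from the longitudinal one (about 14.3)
# because of the mobile students s03 and s05.
grade3 = {"s01": 410, "s02": 435, "s03": 398, "s04": 450}
grade4 = {"s01": 428, "s02": 451, "s04": 459, "s05": 440}
print(quasi_longitudinal_gain(grade3, grade4))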
Value-Added Model
Value-added models (VAMs) are a special type of growth model (Lissitz, Doran,
Shafer, & Willhoft, 2006). Although VAMs also require vertical equating and
longitudinal student data like the other growth models, what sets VAMs apart is that
VAMs use student background characteristics and/or prior achievement and other data as
statistical controls in order to isolate the specific effects of a particular school, program,
or teacher on student academic progress (Goldschmidt et al., 2005; Lissitz et al., 2006).
The main purpose of VAMs is to statistically separate the effects of non-school-related
factors (such as family, peer, and individual influence) from a teacher or school’s
performance at any point in time so that student performance can be attributed
appropriately. A value-added estimate of a school or teacher is simply the difference
between the actual growth and the expected growth. Schools may demonstrate positive achievement, but still have a negative value-added estimate. The basic question for
VAMs is “on average, did the students’ change in performance meet the growth
expectation” or in other words, “by how much did the average change in student
performance miss or exceed the growth expectation” (Goldschmidt et al., 2005, p. 5).
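The core arithmetic, value-added as actual minus expected growth, can be sketched as follows. Here the expected score is predicted from prior achievement with a simple one-predictor ordinary least squares fit, and a school's estimate is the mean residual of its students; this is a deliberately stripped-down illustration with hypothetical data, not any operational VAM, which would add further controls and statistical refinements.

def value_added(prior, current, school_ids):
    """Toy value-added estimate: mean (actual - expected) score by
    school, where expected scores come from a one-predictor OLS fit
    of current score on prior score, pooled over all students."""
    n = len(prior)
    mx, my = sum(prior) / n, sum(current) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(prior, current))
            / sum((x - mx) ** 2 for x in prior))
    alpha = my - beta * mx
    residuals = {}
    for x, y, s in zip(prior, current, school_ids):
        residuals.setdefault(s, []).append(y - (alpha + beta * x))
    return {s: sum(r) / len(r) for s, r in residuals.items()}

# Hypothetical students: prior-year score, current score, and school.
prior = [400, 420, 380, 450, 410, 390]
current = [425, 452, 398, 470, 447, 404]
schools = ["A", "A", "A", "B", "B", "B"]
print(value_added(prior, current, schools))  # mean residual per school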
Aside from possessing the benefits of what a growth model can offer to the
accountability and evaluation landscape, VAMs are also appealing as their use can
potentially isolate a teacher’s unique contribution to students’ academic outcomes.
However, there are some critical limitations: (a) Value-added measurement works best
when students receive a single objective numeric test score on a continuous
developmental scale (Corcoran, 2010). This may not be possible for all content areas,
especially like history, English, or music; (b) The separation between teachers and
schools may prove difficult to make because school-level factors can and do affect
teacher’s value-added scores (Jackson & Bruegmann, 2009). Other studies have found
effects of principal leadership on student outcomes (Clark, Martorell, & Rockoff, 2009).
Gordon, Kane, and Staiger (2006) posited that teacher effectiveness varies across schools
within a district and to focus only on variation within schools would ignore important
variation in teacher quality across schools; (c) Teacher estimates can vary greatly
depending on which test is used (Corcoran, Jennings, & Beveridge, 2010) and when tests
are administered (Alexander, Entwisle, & Olson, 2001); (d) Missing data can negatively
affect a teacher’s value-added score; (e) Annual value-added estimates are highly
variable from year to year, and in practice, many teachers cannot be statistically
distinguished from the majority of their peers (Corcoran, 2010). Using value-added data
in reading from grades four and five from 2000 to 2006 generated by the Houston Independent School District (HISD), Corcoran (2010) found that 23% of the previous
year’s lowest performers were in the top two quintiles in the following year. Other
research has reported similar findings (Drury & Doran, 2003); (f) Value-added ratings
are on a relative scale. Therefore, there will always be teachers at the very bottom tier;
and (g) Some critics argue that the analyses underlying VAMs are beyond the comprehension of most educators and thus, for the vast majority, not interpretable for practical purposes.
Tennessee value-added assessment system (TVAAS). The Tennessee Value-
Added Assessment System (TVAAS) was the first value-added model operationalized
statewide in the U.S. Developed by William Sanders (Sanders & Horn, 1994), TVAAS,
which calculated growth by subtracting the expected growth from the actual growth, set
different goals for different students, student groups, and schools based on their
respective previous scores. Sanders and Horn (1994) purported that this model was able
to estimate the unique contribution of the teacher and the school to a child’s growth in
scores over time through vertical scaling of test results. In effect, the model was also able
to determine whether the student, student group, or school was below, at, or above their
respective expected gain in achievement.
Despite Sanders and Horn (1994) claiming that the methods used for TVAAS
were free from confounding factors like SES or ethnic background, Xu (2000) and
Shepard et al. (2000) discovered findings to the contrary. That is, from his own
quantitative study in which he used information from 58 elementary schools, Xu (2000)
found that per pupil expenditure and average achievement in reading and mathematics
correlated at r = .39; percent minority and mathematics achievement correlated at r = .24;
percent minority and reading scores correlated at r = .28; and percent of students on the
free or reduced lunch programs correlated with reading and math achievement at r = .27
and .49, respectively. Essentially, these background characteristics were able to explain
between 6% (percent minority) and 24% (percent of students on free or reduced lunch
programs) of the variance in student achievement in either reading or math. Shepard et al.
(2000) reported findings similar to those of Xu (2000). On this basis, Xu (2000)
and Shepard et al. (2000) identified a critical gap in the TVAAS approach in that estimates of
school performance from the model were confounded by other student background
factors.
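The variance figures follow directly from squaring the reported correlations, since r squared is the proportion of variance explained. A one-line check of that arithmetic:

```python
# Squaring the reported correlations gives the shares of variance explained:
# r = .24 (percent minority) and r = .49 (free/reduced lunch participation).
for label, r in [("percent minority", 0.24), ("free/reduced lunch", 0.49)]:
    print(f"{label}: r = {r}, r^2 = {r**2:.2f}")  # prints 0.06 and 0.24
```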
In addition to the TVAAS, there are several other value-added models, such as the
Chicago Public School Productivity Model (Bryk, Thum, Easton, & Luppescu, 1998), the
CRESST Student Growth Distribution Model (Choi et al., 2007), the REACH Model
(Doran & Izumi, 2004), and the RAND Model (McCaffrey et al., 2003). Although these
models are different in their design, Goldschmidt et al. (2005) stated, “the difference in
inferences based on different VAMs will be much less than the difference in inferences
between a VAM and a status model such as AYP” (p. 16).
Comparison of Different Accountability Models
There are critical differences in requirements and assumptions made among the
varying accountability models that influence the results they generate. A number of
studies have been conducted to investigate the appropriateness of the varying models of
accountability (Carlson, 2000; Choi et al., 2005; Choi et al., 2007; Linn & Haug, 2002;
Meyer, 1995; Raudenbush, 2004; Webster et al., 1998).
In a study conducted by Carlson (2000) where he computed correlations between
the estimates of school gain scores from both the quasi-longitudinal and longitudinal
models and change scores from the successive groups model, it was discovered that the
choice of accountability model clearly mattered. Carlson (2000) concluded that the quasi-
longitudinal and the longitudinal models yielded more reliable and dependable estimates
than the successive groups model. Linn (2004) substantiated Carlson’s (2000) claim and
stated that longitudinal models are “less influenced by measurement error, sampling
variability, and variability due to non-persistent factors” (p. 90). The study completed by
Carlson (2000) indicated, essentially, that a school identified as outstanding by one
model would not necessarily be so by another model. The varying models did not
consistently produce the same performance estimates.
In a separate study, Webster et al. (1998) evaluated the variations between value-
added models and several other statistical models in estimating school and teacher effects
as they relate to student learning and other educational outcomes. A comparative analysis
of school effectiveness indices generated by (a) unadjusted student test scores, (b) gain
scores, and (c) various ordinary least squares (OLS) models were examined against those
produced by varying hierarchical linear models (HLM). The basis of their research was to
investigate the “fairness” of the varying models. A model was judged as appropriate or
“fair” if it (a) correlated highly with school or teacher effectiveness indices generated by
HLM (the standard method for this study) and (b) had a low correlation with individual
student background variables and aggregate school factors. On this basis, the three
models named above were deemed appropriate. In addition, the researchers posited that
the two-stage, two-level (students nested within schools) model was the most appropriate
for estimating school effects, but the two-stage, two-level (students nested within teachers)
model was more appropriate for teacher effect estimates.
Similar to the preceding two studies, Choi et al. (2007) also performed a
comparative analysis of varying accountability models. These researchers examined the
performance classifications that were produced from AYP results and those from the
results of various value-added models. The research was performed on a longitudinal
dataset obtained from an urban school district in the Pacific Northwest. Outcomes on
reading scores for third graders in 2001 and the corresponding scores for the same set of
students when they were in the fifth grade in the year 2003 were examined. When Choi et
al. (2007) classified the schools under investigation into AYP schools (schools with a high
enough proficiency rate) and non-AYP schools, they discovered that, while 12 of the 51
schools had an estimated gain that was statistically greater than the district mean gain,
almost half of the non-AYP schools had gains larger than the district average.
Furthermore, when Choi and his team of researchers (2007) applied the same data
to the various value-added models included in their study, which differed in the
number of background characteristics controlled, they discovered that the value-added
models produced more information and the most valid evaluation of school
performance.
Meyer (1995) also contributed to this line of research. In a simulation study,
Meyer (1995) compared results derived from mean test scores and those from value-
added models. It was concluded that value-added models provided a more promising
alternative as the results of the mean test scores were highly flawed and of limited value.
However, although the value-added measures produced more statistically favorable
results, the validity of the results was dependent on a number of variables, such as the
quality and appropriateness of the test, the adequacy of the control variables included,
and the technical validity of the statistical models used to design the indicators.
Socioeconomic Status (SES)
There is a growing body of research suggesting that SES affects learning, both
positively and negatively (Bradley & Corwyn, 2002; Grinion, 1999; Meece, 2002;
Pintrich & Schunk, 2002). Accordingly, policymakers and researchers are asking whether
SES should be considered in accountability and evaluation models.
One side of this issue … argues that schools can fairly be held accountable only
for factors they control, and, therefore, performance accountability systems
should control for or equalize student socioeconomic status before they dispense
rewards and penalties…The other side of the issue argues controlling for students’
background or prior achievement institutionalizes low expectations for poor,
minority, and low-achieving students. (Elmore et al., 1996, pp. 93-94)
Despite the opposing views, and despite the many varied definitions of SES, which
generally include parent education level and poverty level (Schunk, 2000), SES has been
found to have a strong relationship with achievement, specifically with high school grade-
point average (Grinion, 1999). In a study performed by Ferguson (2002), the correlation
between SES and levels of achievement was investigated. SES variables, according to
Ferguson (2002), were based on home intellectual resources such as books, computers,
parent educational level, and the ratio of parents per child. The SES scale was
standardized into four categories: lowest SES, lower-middle SES, upper-middle SES, and
highest SES. Using regression coefficients for students’ most current grade-point average
estimated separately by race/ethnicity, simulations were completed in order to generate
achievement predictions. The analysis showed that high SES students performed at
higher levels than the middle- and low-SES students. Although no explanations were
given, African-American and Hispanic students displayed the smallest differences in
achievement between the highest- and lowest-SES students.
In a longitudinal study using 800 districts in Illinois, Flannagan and Grissmer
(2006) examined achievement using third-grade reading and eighth-grade math scores
from 1993 to 1998. In addition, data on a number of student-related SES characteristics
were manipulated to examine the effect of different indicator sets – those not adjusting
for SES and others that made the adjustments – on a school's ranking relative to schools
similar in characteristics. The findings were alarming: two thirds of the districts
that were lowest performing under one set of SES variables were not failing under
another set of predetermined SES variables, revealing the inconsistency of the rankings.
Overwhelmingly, research does suggest that SES strongly correlates with student
achievement. The analysis presented offers useful information that is directly applicable
to the design of accountability and evaluation approaches. Although states like California
and Pennsylvania have included elements of adjusted indicators in their accountability
systems, these are currently secondary considerations and are not used in accountability
decisions involving rewards and sanctions. While it may be unclear which aspects of SES
contribute to learning (Schunk, 2000), an accountability design that accounts for
different SES variables is sure to affect the evaluation of schools. This warrants careful
attention to the inclusion or exclusion of SES in the design of accountability and evaluation
models.
Unique Contribution of This Current Research Agenda
Current accountability focuses squarely on student outcomes, specifically student
test scores, as a measure of teacher and school effectiveness. There has been a noticeable
paradigm shift in accountability from input-focused to outcome-focused. Although it is
presently only a strong inducement and not yet a mandate, the RTTT initiative of 2009
sets the tone for the K-12 education landscape, specifically in the performance evaluation
spectrum. Accordingly, accountability and performance evaluation processes are
undergoing a dramatic change in the U.S. schooling system.
Research has examined the relationship between teachers and student
achievement and has found that teachers do make a difference, in varying degrees
(Kane et al., 2008; Marzano, 2003; Nye et al., 2004; Rivkin et al., 2005). Additionally,
the largest portion of K-12 budgets is devoted to teacher salary (Phillips, 2010). As a
bottom-line measure to close the persistent achievement gap, policymakers have aggressively
turned their attention to test-based accountability and have placed teachers at the center
of the target.
The current research study will add to the line of literature that compares
varying accountability and evaluation approaches. Using data from a large urban school
district (Los Angeles Unified School District), this present study will generate a
comparative analysis of the varying accountability indicators currently in use or available
for use to evaluate teacher, grade-level team, and school performance as they relate to
student assessment data. Critically and uniquely significant, this study will also analyze
assessment data for performance evaluation for grade-level teams (a significant subgroup
ignored by both NCLB and PSAA) through the use of the ordinary least squares
approach.
CHAPTER THREE
RESEARCH METHODOLOGY
The purpose of this study was to illuminate the information value of current and
alternative performance evaluation systems as they relate to student achievement. Student
achievement in the United States has lagged behind that of students from other developed
and developing nations, particularly in science, technology, engineering, and
mathematics (Friedman, 2005; Gonzales et al., 2008). In addition, low-income and
minority students have evidenced a large and persistent gap in academic achievement as
compared to their counterparts (Hanushek, 1992). Consequently, due to enormous
diversity in student demographics and overrepresentation of disadvantaged students, large
urban school districts like Los Angeles Unified School District (LAUSD) have had to
address the issue of the unrelenting achievement gap and the lack of adequate yearly
progress more so than other districts (McKinsey & Company, 2009). Many districts
like LAUSD have vigorously pursued education reform efforts with the intent to
drastically improve student performance, make gains in human capital, and close the
persistent achievement gap (Teacher Effectiveness Task Force: Los Angeles Unified
School District – Final Report, 2010).
Different models of performance evaluation as they relate to student test scores
have been examined in order to deduce a robust accountability system that can enhance
both teaching and student learning. The manner in which these various accountability
approaches have manipulated student assessment data and the high stakes that are
attached to their estimates of performance effectiveness have become the focal point of
interest for policy designers and education practitioners.
Accountability models that link performance evaluation to student achievement
measures should provide reliable results that are valid for use to effectively encourage
improved learning, generate quality high stakes decision-making, and produce and
maintain confidence in the entire accountability system. To that end, the intent of this
current research study is to illuminate the information value of the varying performance
evaluation and accountability models that are in use or are available for use as they relate
to student assessment scores. Specifically, the following research questions were
critically examined:
1. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the NCLB status indicator – Adequate Yearly Progress (AYP) – and
California’s PSAA improvement indicator – Academic Performance Index
(API) – correlated to the school characteristics index (SCI), and to what extent
are these two unadjusted achievement indicators inter-correlated?
2. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the adjusted indicators named below reliable (internal consistency
reliability and test-retest stability) and valid (discriminant, convergent, and
concurrent) for use as they relate to performance evaluation and
accountability:
a. API improvement scores
b. API(SED) scores (SED = socioeconomically disadvantaged)
c. Similar schools scores
d. LA Times value-added scores
e. Academic growth over time (AGT) value-added scores – ELA and
Math
f. Adjusted normal curve equivalent (ANCE) scores – ELA and Math
3. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are adjusted grade-level equivalent (AGLE) scores reliable (internal
consistency reliability and test-retest stability) and valid (discriminant and
concurrent) for use as they relate to performance evaluation and
accountability?
Research Design
A quantitative approach was used to bring this study to fruition. To critically
examine the reliability and validity of the varying achievement indicators, various
correlation analyses were completed on student assessment data for all K-5 elementary
schools in LAUSD. In addition, to set the context for this research, descriptive statistics
of cogent variables were computed accordingly. The design of this correlational study will
enable a rank order comparison of the information value of the various performance
evaluation models as analyzed through the above-mentioned statistical manipulations and
analyses, which were guided by the research questions of this research study.
The independent (causal) variable of this present research agenda was determined
to be the school characteristics index (SCI) created as a result of California’s Public
Schools Accountability Act (PSAA) of 1999 (details of which will be discussed under the
instrumentation section below).
The dependent (affected) variables include both adjusted and unadjusted
accountability indicators currently in use or available for use in performance
evaluation and accountability systems. Unadjusted indicators consisted of the PSAA
Academic Performance Index (API) and the No Child Left Behind (NCLB) Act of 2001
Adequate Yearly Progress (AYP) proficiency scores. Adjusted indicators comprised
API improvement scores, API(SED) scores, similar school scores, LA Times value-added
scores, LAUSD academic growth over time (AGT) value-added scores, adjusted normal
curve equivalent scores, and grade-level equivalent scores.
Population and Sample
This study is set within the context of a large urban school district in the state of
California – LAUSD. The district’s online database provided the following statistics for
LAUSD. The overall student population in LAUSD from 2005 to 2010 ranged from a
low of 676,420 students to a high of 723,964 students. Of all the students in attendance,
these were the significant racial groups: Alaskan, Asian, Filipino, Pacific Islander,
African American, and Caucasian. LAUSD comprises large proportions of English
language learners (39%), low-income students (84%), and Hispanic students (76%).
Further, 37% of the student population is from homes where neither parent graduated
from high school. Student attendance rate in LAUSD from 2005 to 2009 averaged
93.83% (with an average stability rate of 84.09% and an average transience rate of
26.47%).
In order to conduct a sound comparative analysis of the information value of
various performance evaluation approaches, this correlational research study required
strategic consideration of the sampling selection process. The sampling used in previous
studies and the sampling for the current research must align significantly. Consequently,
this study examined all public elementary schools in LAUSD with particular attention
given to grades 2 to 5 from each school. Thus, the number of schools under analysis was
approximately 430. Further, accountability assessment data from the 2009 and 2010
school years were collected and analyzed. In addition, the socioeconomically
disadvantaged (SED) student group was also examined to evaluate the impact of the
accountability system on students of low- and high-socioeconomic status (SES) and the
performance of their respective teachers and schools.
Instrumentation: School Characteristics Index (SCI)
The school characteristics index (SCI), established in April 2000 (California
Department of Education, 2011), is used within the California PSAA for the creation of
similar school scores. The current comprehensive list of characteristics used to group
varying schools into similar groups comprises: student mobility, student ethnicity,
student SES, percent of teachers who are fully credentialed, percent of teachers who hold
emergency credentials, percent of students who are English learners (ELs), average class
size per grade level, whether the school operates in a multi-track year-round educational
program, percent of grade span enrollments (grade 2, 3 to 5, 6, 7 to 8, and 9 to 11),
percent of students in the gifted and talented education program, percent of students with
disabilities (SWDs), percent of reclassified fluent English-proficient (RFEP) students,
and percent of migrant education students.
Of the characteristics included in the calculation for the SCI, data generated by
the California Department of Education (2011) indicated that the highest correlations
with the PSAA API accountability indicator were the two SES components – average
parent education and percent of students participating in free or reduced price lunch
programs – with correlation coefficients of 0.82 and -0.81, respectively (p <= 0.05). The
next highest correlation with API was the percent of Hispanic or Latino students, r = -0.69, p <=
0.05. Then, the correlation between API and the percent of students classified as ELs and
reclassified fluent-English-proficient (RFEP) was r = -0.61, p <= 0.05. Aside from the
correlation between API and the percent of Caucasian students (r = 0.59, p <= 0.05), the
other variables included in the SCI were correlated to API at or below r = 0.45 on the
positive (percent of Asian) and r = -0.46 on the negative (school mobility) with p <= 0.05
for both.
Instrumentation: Achievement Measures
California’s comprehensive accountability system mandates all public schools, all
charter schools, and all local education agencies (LEAs) serving K-12 students to
measure and publicly report their performance and progress each year. California’s
accountability system was instituted as a result of state requirements established by the
PSAA of 1999 and the federal requirements developed by the Elementary and Secondary
Education Act (ESEA) of 1994 and the NCLB Act of 2001. Accordingly, California
public schools are obligated to administer various summative assessments each spring to
their different student groups at various grade levels to account for the academic
achievement of all students.
The Standardized Testing and Reporting (STAR) program was enacted to ensure
that the above-mentioned goal was brought to fruition. Under the purview of the STAR
program, California students in different grade levels are evaluated each spring in the
manner outlined as follows: (a) All students take the California Standards Tests (CSTs)
according to their grade level (see Table 3); (b) Spanish-speaking students in grades 2
through 11 who have been enrolled in schools for less than a year are administered the
Spanish Assessment of Basic Education, Second Edition; (c) Students with severe
cognitive disabilities who are unable to take the CSTs are given the California Alternate
Performance Assessment (CAPA); (d) In line with federal regulations, students
in grades 3 through 8, for whom both the CSTs and the CAPA are not appropriate, are
given the California Modified Assessment (CMA); and (e) Since 2006, all high school
students have been required to take and pass the California High School Exit Exam
(CAHSEE) to receive a diploma. Up to the 2008-2009 school year, the California
Achievement Test, Sixth Edition Survey (CAT/6) was given to students in the third grade.
Further, the Aprenda, La prueba de logros en español, Tercera edición (Aprenda 3), a
nationally norm-referenced achievement test of general academic knowledge in Spanish
for Spanish-speaking English learners, was given to students in grades 8 through 11 until
it was replaced by the Standards-based Test in Spanish (STS) for students in grades 2
through 11 in 2009.
Table 3
California Content Standards Tests and Grade Levels at Which They Are Administered

Subject Area                                                      Grades Tested
English language arts                                             2-11
Writing                                                           4 and 7
General Math                                                      2-8 and 11
Subject-specific math (including Algebra I, Geometry,
  and Algebra II; or Integrated Math)                             8-11
History/Social Science                                            8, 10, and 11
General Science                                                   5
Subject-specific science (including Biology, Chemistry,
  Physics, and Earth Science; or Integrated Science)              9-11

Source: California Department of Education (2010a).
As the primary unit of analysis of this study is student assessment data at the
elementary level, the discussion on instrumentation as it relates to achievement will be
limited to the CSTs, CMA, and the CAPA, which are administered to elementary school
students and are included in the computation of the API and AYP. The following sections
will describe each component of these assessments in detail. After the detailed
descriptions of the various assessments mentioned above, there will be a discussion on
the Base and Growth API reports that are generated and published for additional public
accountability. The last portion of the instrumentation section will detail the respective
unadjusted and adjusted achievement variables that were examined in this study.
California Standards Tests (CSTs)
As in other states, California schools are judged by the performance of their
students on state accountability measures. At the elementary level, students are primarily
assessed through the use of the CSTs. CSTs are criterion-referenced exams aligned to
state adopted content standards. On that basis, the assessments are specific to each grade
level and are designed to assess each student’s level of mastery with respect to state
content standards, not how student scores compare with those of other students taking
the same tests. CSTs in English language arts (ELA) and math are administered to
students in grades 2 to 6 each spring. In addition to these two tested areas, fourth grade
students take a writing assessment and fifth grade students take a general science exam.
Possible scores on the CSTs range from a low of 150 to a high of 600.
California Alternate Performance Assessment (CAPA)
In response to federal requirements of the Individuals with Disabilities Education
Act (IDEA), Amendments of 1997, and the ESEA, California instituted the CAPA. The
national mandates required that each state provide evidence that all students are included
in the assessment and accountability system. For that reason, students with severe
cognitive disabilities who are unable to take the CSTs (and the CAT/6, which was
eliminated in 2008) even with accommodations and modifications, take the CAPA. The
CAPA is a criterion-referenced test that is linked directly to the selected portions of the
state academic standards in ELA and mathematics, which are accessible to students with
significant cognitive disabilities.
The CAPA is given at five different levels in ELA and math. Students are given
one of the five different levels of the CAPA based on their grade level. An exception to this
rule is provided for students with extreme disabilities; regardless of their grade level
(grades 2 to 11), these students take the CAPA level I as designated in their
individualized education program (IEP) (California Department of Education, 2010a).
Each level of the CAPA is comprised of eight performance tasks. Test results range in
scale from a low of 15 to a high of 60 and are used as an indicator of a student’s
level of mastery of state content standards, just as in the CSTs. The cutoff for the basic
level is 30 and for the proficient level is 35.
California Modified Assessment (CMA)
To ensure that the assessment and accountability system included all students, the
United States Department of Education enacted regulations for an alternate assessment in
April 2007. The new federal requirements were focused on students who were not able to
take the CSTs or the CAPA due to the disabilities outlined in their respective IEPs. In
2008, the first administration of the CMA was conducted. The CMA was first given only
to a small percentage of students in grades 3 through 5. As of 2009, the CMA is given in
ELA to students in grades 3 to 8 and in math to students in grades 3 to 7. The same year,
the State Board of Education (SBE) adopted performance levels for the CMA. The raw
score cutoff points for the five performance levels differ by grade level and content
area; however, they range between 17 and 30 (California Department of Education,
2010a). Scores below a raw score of 17 were set to the far below basic level and scores at
or above 30 were determined to be in the proficient or advanced level.
Accountability Reports
Under the mandates of the PSAA, all California schools receive state-required
accountability information in the form of API reports – the API Base report and API
Growth report. These two reports allow for adequate phasing in of new achievement
assessment indicators and provide information to determine gain or loss in performance.
The Base API sets the theoretical expectations based on the results of statewide
testing released a year after test administration. The Growth API, produced after the Base
API, illustrates the difference between the expected and the observed growth as measured
by student assessment results each year. This is an effort by the state and national
governments to hold all schools accountable for the academic progress of all their
students. In effect, the API Base report sets the bar and, thus, contains information on
how a school performed in a preceding year and the projected target API for the current
year. Although one contains scores from the preceding year and the other from the
current year, both the Base API and the Growth API are calculated in the same manner
and with the same indicators.
In addition to the API score and the projected growth target, the API Base report
also provides: schools’ statewide ranking, similar schools rank, subgroup information,
demographic characteristics, and content area weights. Beyond these data, the API
Growth report also includes the actual growth and API improvement score – the
difference between the preceding and current year’s API scores. Moreover, the API
Growth report also documents whether a school has or has not met set performance goals
for each school year (California Department of Education, 2010c). This last piece of
information assists state and federal agencies to determine appropriate rewards, sanctions,
and interventions.
The scores on the above-mentioned tests, which are included in the accountability
reports, are manipulated in a variety of ways by the accountability system in California to
generate information on the performance of students, teachers, and schools. The
following sections comprise information directly related to the unadjusted and adjusted
accountability variables analyzed in this study.
Unadjusted Achievement Accountability Indicators
Unadjusted achievement accountability indicators are those that measure
performance without controlling for factors like student background and language ability,
which may skew results in one direction or another. In other words, whether a student or
school is considered at an economic disadvantage or advantage, the manner in which the
calculations for the unadjusted variables are applied is equal across all student groups and
schools. Discussed in more detail below are the two unadjusted indicators examined in
this study.
Academic performance index (API). Under the purview of the PSAA, the API
is the cornerstone of California’s accountability system (California Department of
Education, 2010c). The API is an overarching measure of a school’s effectiveness in
producing student outcomes. It is a composite score derived from the results of the
statewide tests given to students in grades 2 to 11 as per the guidelines enacted by
California’s accountability system (see Table 4). The API is a numerical index that
ranges from 200 to 1,000 with 800 as the target for schools statewide. All schools, with
the exception of a small number of schools, receive an API score that is calculated from
student performance on the CSTs, the CAPA, the CMA, and for high schools, the
CAHSEE. In effect, the results of several performance indicators make up the overall
API score for each school.
Table 4
School Content Area Weights for API Calculation for Grade Span K-8

                                                  2009-2010 API Test Weights
Content Area                                      K-5      6-8      K-8
CST/CMA/CAPA in English Language Arts             0.565    0.514    0.542
CST/CMA/CAPA in Mathematics                       0.376    0.343    0.361
CST/CMA/CAPA in Science, Grades 5 and 8           0.059    0.071    0.065
CST in History-Social Science, Grade 8            -------  0.071    0.032

Note: CST = California Standards Test; CMA = California Modified Assessment;
CAPA = California Alternate Performance Assessment.
Source: California Department of Education (2010c).
At the elementary level, student performance on the CSTs, the CAPA, and the
CMA are used to derive the API Base score and are applied to the API Growth within the
API reporting cycle (California Department of Education, 2010c). Only test scores from
numerically significant student groups are included in the calculation of the API. Each
test is given a certain weight in the calculation of the API as determined by the state
accountability system. Test weights vary according to content area with the largest
emphasis placed on the ELA portion of the criterion-referenced CST (see Table 4).
Overall, the test weights used to evaluate a school’s overall API score are fixed and are
the same across schools and student groups.
Furthermore, the API is a cross-sectional look at student achievement. It does not
track individual student progress across years but rather compares snapshots of school- or
LEA-level achievement results from one year to the next. The API is currently a school-
based requirement only under state law. So, although the API is only calculated for the
school-level to hold schools accountable for student learning, API reports are provided
for all LEAs in order to meet federal requirements under the ESEA (California
Department of Education, 2010c, p. 6).
API calculation method. The API is a weighted average of student scores
among all tested content areas and grade levels (see Table 4). The calculation of the API
takes into consideration that some grade levels are evaluated in more subject areas and/or
different tests (California Department of Education, 2010c). In addition, in order to
“maintain consistency in the statewide API scale from one reporting cycle to the next,” a
scale calibration factor (SCF) is used to adjust a school’s API (California Department of
Education, 2010c, p. 55). The SCF provides a positive or negative adjustment to each
API for each year. The following steps detail the API calculation method as provided
by the California Department of Education (2010c):
1) Apply inclusion/exclusion and adjustment rules to each student test score.
2) Apply API validity criteria (5 CCR and EC Requirements).
3) Convert each test result into a score on the API scale using statewide
performance level weighting factors (see Table 5).
4) Calculate a weighted average of the scores using statewide test weights (see
Table 4).
5) Add in the SCF.
6) Sum the weighted average of the scores and the SCF to produce the API.
7) For schools or LEAs with grade spans that overlap the SCF categories, a
weighted average of the APIs of the grade span/disability segments is used to
produce the final API. (p. 55)
Table 5
Performance Levels and Corresponding Weighting Factors
CST Performance Levels Weighting Factors
(5) Advanced 1000
(4) Proficient 875
(3) Basic 700
(2) Below Basic 500
(1) Far Below Basic 200
Note: CST = California Standards Test.
Source: California Department of Education (2010c).
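As an illustration of steps 3 through 6, the sketch below combines the performance-level weighting factors from Table 5 with the K-5 content-area weights from Table 4 into a single API figure. The student counts are hypothetical, and the inclusion rules, validity criteria, and grade-span logic of steps 1, 2, and 7 are omitted.

```python
# Sketch of steps 3-6: convert each result to the API scale using the
# performance-level weighting factors (Table 5), average within each
# content area, then take the weighted average using the K-5 content
# weights (Table 4) and add the scale calibration factor (SCF).

PERFORMANCE_WEIGHTS = {5: 1000, 4: 875, 3: 700, 2: 500, 1: 200}
CONTENT_WEIGHTS_K5 = {"ela": 0.565, "math": 0.376, "science": 0.059}

def content_area_score(level_counts):
    """Average API-scale score for one content area."""
    total = sum(level_counts.values())
    points = sum(PERFORMANCE_WEIGHTS[lvl] * n for lvl, n in level_counts.items())
    return points / total

def api_k5(results_by_area, scf=0.0):
    """Weighted average across content areas, plus the SCF."""
    weighted = sum(CONTENT_WEIGHTS_K5[area] * content_area_score(counts)
                   for area, counts in results_by_area.items())
    return round(weighted + scf)

# Hypothetical counts of students at each performance level (5 = advanced)
results = {
    "ela":     {5: 30, 4: 40, 3: 50, 2: 20, 1: 10},
    "math":    {5: 35, 4: 45, 3: 40, 2: 20, 1: 10},
    "science": {5: 10, 4: 12, 3: 8,  2: 3,  1: 2},
}
print(api_k5(results))  # roughly 756 for these made-up counts
```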
Adequate yearly progress (AYP) proficiency scores. State and federal laws
require the annual review of school performance to determine if student academic
achievement and progress is adequate. AYP is the federal government’s index of
accountability for overall school performance. Each year under NCLB, all schools and
districts must meet the state’s four AYP objectives. Flexibility was provided to each state
by the federal policymakers to develop and implement its respective criteria for meeting
AYP. Consequently, California schools are required to meet or exceed requirements
within each of the following four areas to make AYP annually: (a) participation rate,
(b) percent proficient – annual measurable objectives (AMOs), (c) API as an additional
indicator, and (d) graduation rate (for high schools) (California Department of Education,
2010a). These requirements are applied to all students in the tested grades and areas and
to student groups of sufficient size.
For elementary schools, results in grades 2 through 8 from the CST in ELA and
math are used to determine the percentage of students scoring at the proficient level or
above. The AYP proficiency score is calculated by assigning one point for each full
academic year student who scores in the proficient or advanced categories on the state
standardized tests. The total number of points is divided by the total number of students
tested to calculate the proficiency index.
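A minimal sketch of that calculation, assuming the full-academic-year filter has already been applied and that performance levels are coded 1 (far below basic) through 5 (advanced):

```python
# Sketch of the AYP proficiency index: one point per full-academic-year
# student scoring proficient or advanced, divided by the number tested.

def proficiency_index(levels, proficient_cut=4):
    """Fraction of tested students at proficient (4) or advanced (5)."""
    points = sum(1 for level in levels if level >= proficient_cut)
    return points / len(levels)

levels = [5, 4, 3, 4, 2, 5, 3, 4, 1, 4]  # hypothetical performance levels
print(f"{proficiency_index(levels):.0%}")  # 60%
```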
Adjusted Achievement Accountability Indicators
Aside from the unadjusted achievement variables, there are also various adjusted
achievement accountability indicators in use or available for use by the accountability
system in order to evaluate the performance of students, teachers, and schools. Six
adjusted accountability measures, which were explored and analyzed in this study, are
detailed below.
API improvement scores. The API is based on an improvement model. The
Growth API, which is calculated using the test results of the current year, is compared
against the Base API. Essentially, the difference between the preceding and current year
is the API improvement score for any particular school. API improvement scores are used
to measure the academic growth of a school.
The current accountability system sets annual targets for each school and all
numerically significant subgroups at a school also have corresponding targets. A school
or subgroup that has achieved a Base API of: (a) 200-690 must make gains of 5% of the
difference between the Base API and the statewide performance target of 800, (b) 691-
795 must make a gain of five points, (c) 796 must make a gain of 4 points, (d) 797 must
make a gain of 3 points, (e) 798 must make a gain of 2 points, (f) 799 must make a gain
of one point, and (g) 800 or more must maintain an API of at least 800.
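The target schedule above reduces to a small piece of arithmetic; the sketch below mirrors it directly (the function name and the rounding conventions, which are omitted here, are illustrative only).

```python
# Sketch of the annual API growth-target rules listed above.

def api_growth_target(base_api):
    """Points a school or subgroup must gain given its Base API."""
    if base_api < 200 or base_api > 1000:
        raise ValueError("Base API is reported on a 200-1000 scale")
    if base_api <= 690:
        return 0.05 * (800 - base_api)  # 5% of the gap to the 800 target
    if base_api <= 795:
        return 5
    if base_api <= 799:
        return 800 - base_api           # 796 -> 4, 797 -> 3, 798 -> 2, 799 -> 1
    return 0                            # at or above 800: maintain

print(api_growth_target(650))  # 7.5 points
print(api_growth_target(797))  # 3 points
```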
API (SED). The API (SED) is the targeted API given to the socioeconomically
disadvantaged (SED) student group. This is a goal set for schools to achieve in an effort
to hold schools accountable and to bring all students to the proficient level of mastery or
better on the content standards assessments. The API requires subgroup accountability to
address the achievement gap that exists between traditionally higher- and lower-scoring
student subgroups. For this particular study, the API for schools with a higher number of
SED students – API (SED) – is compared to those with fewer SED students. As part of
the Closing the Achievement Gap Initiative in California, growth targets for lower
achieving student groups are greater than for each school as a whole.
Subgroups. Student groups (subgroups) for API reporting must meet
certain guidelines. A “numerically significant subgroup” for the purposes of the API is
defined as: (a) 100 or more students with valid STAR program scores; or (b) 50 or more
students with valid STAR program scores who make up at least 15 percent of the total
valid STAR program scores (California Department of Education, 2010a, p. 25).
Subgroups used in the calculation of the API include: Black or African American,
American Indian/Alaska Native, Asian, Filipino, Hispanic or Latino, Native
Hawaiian/Pacific Islander, Caucasian, Two or More Races, SED students, ELs, and
SWDs.
In addition to a projected API goal for a school as a whole, each subgroup within
a school also receives a targeted API for each year. Each school is held accountable for
bringing both API targets to fruition and for showing adequate improvement.
Similar schools scores. Although it is critical to hold schools accountable for
student performance, the state accountability system recognizes that schools are made up
of different types of students whose backgrounds, characteristics, and other factors
external to the schooling environment can influence test performance. As a result of this
viewpoint, a component of the PSAA requires schools to be compared with others that
are similar in school, student, and teacher characteristics. The intent of the similar schools
score is to demonstrate and benchmark a school’s measure of progress as compared to
itself and to other successful similar schools.
Similar schools scores are based on the Base API score of a school as compared
to 100 other schools of the same level (elementary, middle, and high school) and of
similar student-, teacher-, and school-level characteristics. After a school is categorized
according to its school type, a composite score – the SCI – is derived from the
demographic characteristics of that school.
The PSAA specifies the demographic characteristics to include in similar schools
scores calculations. In 2006 the State Board of Education (SBE) amended the list of
characteristics originally instituted by the PSAA in 1999. The current comprehensive list
includes: student mobility, student ethnicity, student SES, percent of teachers who are
fully credentialed, percent of teachers who hold emergency credentials, percent of
students who are ELs, average class size per grade level, whether the school operates a
multi-track year-round educational program, percent of grade span enrollments (grades 2,
3 to 5, 6, 7 to 8, and 9 to 11), percent of students in the gifted and talented education
program, percent of SWDs, percent of reclassified fluent-English-proficient (RFEP)
students, and percent of migrant education students.
LA Times value-added scores. Buddin (2009) performed, and the Los Angeles
Times published, a value-added analysis of the effectiveness of 11,500 LAUSD grades 3
through 5 teachers and about 470 LAUSD elementary schools. Value-added is a
statistical approach to estimating a teacher’s or school’s effectiveness at enhancing
student performance as indicated by student standardized test scores. In effect, value-
added estimates project a student’s future performance by using previous test scores (in
this case, math and English tests). The difference between the predicted and the actual
results represents the “value” that the teacher added or subtracted.
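In outline, the prediction step is a regression of current scores on prior scores (and, in the published models, the controls described in the surrounding paragraphs); a teacher’s value-added estimate is the average residual over that teacher’s students. A minimal sketch with simulated data, standing in for the far richer specification the Times actually used:

```python
# Simulated sketch: predict current scores from prior scores by OLS,
# then average the residuals over one teacher's students.
import numpy as np

rng = np.random.default_rng(0)
prior = rng.normal(350, 50, 200)                     # prior-year scaled scores
current = 50 + 0.9 * prior + rng.normal(0, 20, 200)  # current-year scores

X = np.column_stack([np.ones_like(prior), prior])    # intercept + prior score
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
predicted = X @ beta

roster = slice(0, 25)                                # one hypothetical classroom
value_added = float(np.mean(current[roster] - predicted[roster]))
print(round(value_added, 2))
```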
Aside from student past test scores, value-added approaches can also control for a
variety of variables in order to single out the effects of a teacher on a student. In the first
version of the LA Times value-added analysis, four factors were included in the
adjustments to measure a teacher’s effectiveness: (a) gender, (b) poverty level, (c)
number of years in the district, and (d) whether a student was classified as an English-
language learner. In the newest version, further adjustments were made with the addition
of such factors as: (a) parent educational attainment level, (b) class size, (c) student
mobility, (d) five levels of English proficiency, and (e) peer effect – the overall
characteristics of a class.
The LA Times used up to seven years of student assessment data, which were
obtained through the California Public Records Act from LAUSD, for its analysis of the
effectiveness of the 11,500 LAUSD grades 3 to 5 teachers. The grade level of a teacher
was determined from the annual state tests. A five-category rating system that included
least effective, less effective, average effectiveness, more effective, and most effective,
was used to delineate the varying average value-added estimates of each teacher. On
average, teachers in the least effective category lowered student scores by more
than seven percentile points in math and more than four percentile points in English,
whereas teachers in the most effective category raised student scores by more than
eleven percentile points in math and six percentile points in English. All grades 3 through
5 teachers whose value-added scores were calculated were ranked relative to one another
on the five-category rating scale.
For this current study, the results of teacher effectiveness in ELA and math were
transformed onto a 100-point scale in which one was the lowest rating and 100 was
considered to be the most effective rating. Only the overall rating published in 2010 was
used for the completion of this research.
Academic growth over time (AGT) value-added scores. The Value-added
Research Center (VARC) on behalf of LAUSD computed the AGT scores in ELA and
math for grades 3 through 9 teachers in LAUSD in order to assess the effectiveness of
teachers, principals, and schools on student learning as indicated by student assessment
scores. At the time of this study, AGT scores were computed from up to three years of
test score data from the CSTs in mathematics, Algebra, and ELA in the following grade
spans: (a) ELA in grades 2-9, (b) General mathematics in grades 2-8, and (c) Algebra in
grade 8.
As in other value-added models, LAUSD also used certain variables to adjust for
factors that may confound performance/effectiveness rating. The categories of variables
that were used as controls in measuring teacher and school effectiveness were: student-
level and classroom-level. The student-level factors that were controlled in the AGT
model included: gender, race (African American, Caucasian, and Asian), English
language learner status (English origin, English as a second language (IFEP), English
language learner (LEP), and reclassified (RFEP)), free and reduced price lunch
participation, disability status (severe and mild), and homelessness. The classroom-level
indicators included: classroom averages of pretests in both math and ELA and the
student-level variables. Additionally, similar to LA Times value-added scores, LAUSD
used prior test scores in ELA and math and the controlling factors named above to predict
future student assessment results. The difference between the predicted and the actual
scores in achievement was consequently attributed to the effectiveness of the teacher on
student achievement.
Individual student scores were aggregated for an overall AGT estimate for:
schools, grade-level/subject matter teams, teachers, and specific groups of students
(students with low prior achievement, English language learners (ELLs), gender, race
[African American, Caucasian, and Asian], and students with disabilities). In essence, the
AGT analysis produced an estimate of gain (whether in the negative or positive). Results
of the aggregate ranged between a low of one and a high of five and were presented for
single academic years and also for three-year averages whenever possible. Based upon
the results, LAUSD used a five color-coded rating scale to differentiate the varying levels
of effectiveness: (a) blue – far above predicted AGT: AGT estimate was significantly
more than four; (b) green – above predicted AGT: AGT estimate was significantly above the
district average of three; (c) gray – within the range of predicted AGT: AGT estimate was
not significantly different from the district average of three; (d) yellow – below predicted
AGT: AGT estimate was significantly below the district average of three; and (e) red –
far below predicted AGT: AGT estimate was significantly less than two.
Further, results were color-coded based on the location of the confidence interval
(CI) around the estimates: (a) blue if the CI is entirely above four; (b) green if the CI is
entirely above three; (c) gray if the CI crosses three, the district average; (d) yellow if the
CI is entirely below three; and (e) red if the CI is entirely below two.
For the purposes of this current research agenda, the scores were kept in the
original one-through-five rating scale. Single-year scores for 2009 and 2010 and three-year
averages ending with 2010 and 2011 were analyzed for this current research study.
Adjusted grade level equivalent scores. An adjusted grade level equivalent
(AGLE) score is an equal interval standard score (adjusted normal curve equivalent –
ANCE score). An AGLE represents a student’s grade-level tier, which ranges from .01 to
.99 within a grade; for example, 2.01 to 2.99 represent the achievement levels of second-
grade students. An
AGLE is calculated relative to a statistically matched normative group. That is, students
are compared to others who are in their respective statistically matched subgroup.
This study used AGLE scores for grade-level analysis, which are computed with
grade-level assessment scores and ANCE scores, which are calculated with school-level
student achievement results, for school-level accountability and evaluation. The steps for
the computation of the ANCE scores and the AGLE scores are similar:
1) Regress test scores on the school characteristics index (SCI).
2) Standardize the scores into equal interval scores (z-scores) through the use of
the following equation: z = (X - M) / SD, where z is the z-score, X is the raw
score, M is the mean of the set of raw scores, and SD is the standard deviation
from the mean.
3) The z-scores are then matched to their respective percentile scores to normalize
the data, and the percentile scores are transformed to ANCE (or AGLE)
scores using tabled values. The result of this transformation represents the
proficiency tier for the school or grade-level. (For example, an AGLE of 4 is
transformed to 0.04 after being divided by 100. If the data set represented
scores for grade level 3, the AGLE score would be 3.04.)
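A compact sketch of these three steps follows. The residualize-standardize-normalize pipeline matches the description above, but the final tabled transformation is approximated here with the conventional NCE formula (50 + 21.06 z applied to empirical percentiles), which may differ from the tables the study actually used.

```python
# Sketch of steps 1-3: residualize on the SCI, standardize, then map
# empirical percentiles to an NCE-like score and a grade-level tier.
import numpy as np
from scipy.stats import norm, rankdata

def agle_scores(test_scores, sci, grade_level):
    scores = np.asarray(test_scores, dtype=float)
    sci = np.asarray(sci, dtype=float)

    # 1) Regress test scores on the SCI and keep the residuals
    X = np.column_stack([np.ones_like(sci), sci])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    resid = scores - X @ beta

    # 2) Standardize into equal-interval z-scores: z = (X - M) / SD
    z = (resid - resid.mean()) / resid.std()

    # 3) Percentile-normalize, transform to the NCE scale, and divide
    #    by 100 to obtain the .01-.99 grade-level tier
    pct = (rankdata(z) - 0.5) / len(z)
    nce = np.clip(50 + 21.06 * norm.ppf(pct), 1, 99)
    return grade_level + nce / 100

# Four hypothetical schools' grade 3 scores and SCI values
print(agle_scores([310, 350, 390, 420], [120, 140, 160, 180], grade_level=3))
```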
Taken together, the indicators described above make up the instrumentation component
of this study.
Procedures
To appropriately respond to the research questions of this current study, several
sources of quantitative data needed to be obtained. The following steps were performed
in order to acquire the relevant information for data analysis and to bring this current
research to completion. (All online resources were in the public domain, accessible to all who
may have interest in the published data provided.)
1) ELA and math achievement results for all LAUSD elementary schools were
downloaded for analysis from a web-based information source provided to the
public by the California Department of Education located on the web at:
http://www.cde.ca.gov/ta/ac/ar/. Achievement accountability data files from the
2009 and 2010 school years were collected for students in grades 2 through 5.
Specific data gathered were: CST/CAPA/CMA scaled scores, CST/CAPA/CMA
AYP proficiency scores, CST/CAPA/CMA proficiency band scores, school API
scores, API scores for SED students, state decile rankings, similar schools
rankings, and similar schools scores.
2) Data on the SCI were also downloaded for interpretation from the above-
mentioned web-based data source on the California Department of Education
website.
3) Value-added data generated by Buddin (2009) on behalf of LA Times were
obtained from the Los Angeles Times website (http://projects.latimes.com/value-
added/).
4) AGT value-added scores at the school-level generated by VARC specifically for
LAUSD were obtained from the “LAUSD Academic Growth over Time Portal”
located on the web at
http://portal.battelleforkids.org/bfk/lausd/AGT_Reports.html?sflang=en. (Battelle
for Kids, 2011)
Data Analysis
To generate a salient comparative analysis of the information value generated by
the various accountability and evaluation models that are either currently in practice or
are available for use, accountability data files from the California Department of
Education, the LA Times website, and the LAUSD AGT website were obtained and
downloaded for the 2009 and 2010 school years for grades 2 to 5 from all LAUSD K-5
elementary schools. Specifically, the 2009 API Base and Growth, the 2010 API Base and
Growth, the 2009 AYP, the 2010 AYP, the AGT single-year, the AGT three-year, and the
LA Times value-added reports were downloaded. All cogent data were coded and
prepared for analysis using the SPSS-PC 19.0 program. Key variables from the multiyear
data files were linked and the comparative analysis of the various models of
accountability and evaluation was completed through a unified database for all LAUSD
K-5 elementary schools.
Descriptive statistics and statistical analyses were computed, including (a)
relevant indices of central tendency, variability, skewness, and kurtosis to establish and
provide foundational context into the distribution of the variables; (b) parametric
correlations using Pearson Product-Moment coefficient (r), to evaluate the strengths of
association between the varying achievement indicators; (c) parametric correlations,
using Pearson Product-Moment coefficient (r), to measure the reliability and validity of
the different evaluation approaches to evaluating the effects of teachers, teacher teams
and schools; (d) computation of the reliability/stability coefficient through the use of the
Spearman-Brown Split-Half formula provided a means to determine the degree of
internal consistency and stability among the different indicators available for use in the
accountability systems; (e) computations using the standard error of measurement
assisted in establishing the confidence intervals for the stability coefficient calculations;
(f) analyses of regression along with appropriate histograms and scatter graphs made it
possible to test and satisfy the assumptions of a correlational study: normality, linearity,
and homoscedasticity (homogeneity of variance); and (g) cross-tabulations allowed for the
comparison of the information value of the varying accountability indicators.
The following provides further discussion on how the information value of the
varying approaches to evaluation and accountability examined in the current
research was assessed.
Reliability
Reliability is a measure’s ability to repeatedly yield consistent results (Choi et al.,
2005).
Internal consistency reliability. Internal consistency reliability is the extent of
inter-item correlation. The Spearman-Brown reliability coefficient was computed for the
key variables in this study – single year AGT scores in ELA and math, three-year
averages of AGT scores in ELA and math, and ANCE scores in ELA and math. The
result of these computations provided a means to determine the reliability (internal
consistency) of these measures for accountability and evaluation purposes (Lissitz &
Samuelson, 2007).
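For concreteness, a sketch of the split-half computation with the Spearman-Brown step-up, r_sb = 2r / (1 + r), follows. The odd/even split and the data are hypothetical; the study's exact splitting of its measures is not reproduced here.

```python
# Split-half reliability with the Spearman-Brown correction:
# correlate the two halves, then step up with r_sb = 2r / (1 + r).
import numpy as np

def spearman_brown_split_half(items):
    items = np.asarray(items, dtype=float)  # rows: units, cols: components
    odd = items[:, 0::2].mean(axis=1)       # mean of odd-numbered components
    even = items[:, 1::2].mean(axis=1)      # mean of even-numbered components
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)

# Hypothetical data: 6 schools scored on 4 component indicators
data = [[3.1, 3.0, 3.3, 3.2],
        [2.2, 2.4, 2.1, 2.3],
        [4.0, 3.8, 4.1, 3.9],
        [2.9, 3.1, 3.0, 2.8],
        [3.6, 3.5, 3.7, 3.6],
        [1.9, 2.1, 2.0, 2.2]]
print(round(spearman_brown_split_half(data), 3))
```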
Test-retest stability. Test-retest stability is consistency over time. A test-retest
analysis was used to estimate the stability of the adjusted achievement indicators in this
study (Lissitz & Samuelson, 2007; Pedhazur & Schmelkin, 1991). For the purposes of
this study, the test-retest stability was derived from the application of the Spearman-
Brown Split-Half formula to the respective achievement variables for the 2009 and 2010
school years.
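A hedged sketch of the stability computation, shown here as a simple year-to-year Pearson correlation across schools (the study additionally applied the Spearman-Brown adjustment; the indicator values below are invented):

```python
# Test-retest stability: correlate the same indicator across two years.
import numpy as np

scores_2009 = np.array([712, 655, 801, 590, 764, 688])  # hypothetical schools
scores_2010 = np.array([725, 660, 795, 610, 770, 680])

stability = np.corrcoef(scores_2009, scores_2010)[0, 1]
print(round(stability, 3))
```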
Validity
Validity is the degree to which the overall evidence supports the intended use of
the measurement (American Psychological Association, 1999). It is the extent of
correspondence between the intended interpretations of test scores and the proposed purpose of
the test. Several types of validity – discriminant, convergent, and concurrent –
were critically analyzed in this study.
Convergent validity. Convergent validity is defined as the degree to which an
achievement indicator is similar to other achievement indicators. Convergent validity was
evaluated through the computation of the Pearson Product-Moment correlation
coefficient for the achievement indicators on one another (Lissitz & Samuelson, 2007).
Discriminant validity. Discriminant validity is defined as the extent to which an
achievement indicator is not similar to (or diverges from) another variable to which it
theoretically should not be similar. Correlations between the achievement indicators and
the institutional characteristics beyond the purview of the schooling system were
examined to evaluate their respective discriminant validity (Lissitz & Samuelson, 2007).
Concurrent validity. Concurrent validity means that a particular measure or set
of assessment results varies directly with another measure or test of the construct. It can
also be indicated by an inverse correlation with a measure of the opposite of
the construct (Lissitz & Samuelson, 2007). In this study, the achievement variables were
correlated to the API to determine the extent of correlation, thus, the extent to which they
are interchangeable.
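Taken together, the three validity checks amount to reading different entries of one correlation matrix. A sketch of that bookkeeping with a tiny invented dataset (all column names and values are hypothetical, chosen only to show where each check looks):

```python
# Convergent, discriminant, and concurrent validity as entries of one
# Pearson correlation matrix over school-level indicators.
import pandas as pd

df = pd.DataFrame({                      # one row per hypothetical school
    "api":      [720, 650, 810, 590, 760],
    "agt_ela":  [3.2, 2.8, 3.9, 2.5, 3.4],
    "agt_math": [3.0, 2.6, 4.1, 2.4, 3.6],
    "sci":      [150, 120, 175, 105, 160],
})

corr = df.corr(method="pearson")

# Convergent: adjusted indicators should correlate with one another
print(corr.loc["agt_ela", "agt_math"])
# Discriminant: they should correlate weakly with the SCI
print(corr.loc["agt_ela", "sci"])
# Concurrent: correlation with the API gauges interchangeability
print(corr.loc["agt_ela", "api"])
```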
Summary
In response to NCLB, states throughout the country have introduced
accountability systems based on student assessments linked to state content standards.
The current accountability model evaluates students and schools through the use of
various unadjusted and adjusted achievement indicators generated as result of NCLB.
Furthermore, the availability of quantified student assessment data, made readily
accessible by state and federal laws, has also allowed other alternative approaches to
performance evaluation to surface, such as value-added ratings.
The goal of all evaluation processes is to improve student achievement and close
the achievement gap for the nation’s disadvantaged students. Given this need of the
current education reform agenda, unreliable measures of accountability and the invalid
use of their results can have profound consequences for all stakeholders of the education
landscape, especially large urban districts faced with many factors that significantly
impact the achievement gap of the most at-risk students in our nation. It is, therefore,
crucial to evaluate the currently available indicators of performance evaluation in order to
generate relevant information for policy-making and practical decision-making processes
that have direct impact on teaching and learning.
CHAPTER 4
RESULTS
Since the publication of the Coleman Report (1966) and A Nation at Risk (1983),
there has been a shift in the education production process to hold teachers and schools
accountable for adequate student outcomes. Current educational reform agendas are
mandating or strongly inducing school districts and states to link student assessment
outcomes to performance evaluations of teachers, grade-level teams, and schools. A
growing number of states and districts are looking into or are already using a combination
of adjusted and unadjusted measures in their accountability systems to evaluate the
effectiveness of teachers, grade-level teams, and schools as they relate to student
achievement (Choi et al., 2005; Linn, 2008). Research questions from this present study
focus on the extent of correlation between unadjusted accountability achievement
indicators to student, teacher, and school characteristics and the reliability and validity of
adjusted achievement accountability indicators presently in use or available for use. The
information value generated from this comparative analysis will provide cogent data
to inform policy design and practical decision-making processes.
In the preceding chapter, the methodologies by which the adjusted and unadjusted
accountability measures are evaluated were detailed. In the process of the study, adjusted
grade-level equivalent (AGLE) scores, which are adjusted normal curve equivalent
(ANCE) scores computed at the grade level, were also introduced to provide another possible accountability
alternative for evaluating teachers, grade-level teams, and schools. This chapter presents
the results of the statistical analyses, which were guided by the three research questions
that catalyzed this current correlational research study. The subsequent discussion of
results will detail in depth (a) the essential descriptive statistics of key variables, (b) the
strengths of inter-correlations among the individual adjusted and unadjusted indicators,
(c) the degree of correlation between the accountability indicators under investigation
with the school characteristics index (SCI), (d) the stability of the achievement indicators
over time, and (e) the measure of internal consistency between key indicators. On this
basis, the following sections will describe the findings in order, according to the three
research questions of this study:
1. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the NCLB status indicator – Adequate Yearly Progress (AYP) – and
California’s PSAA improvement indicator – Academic Performance Index
(API) – correlated to the school characteristics index (SCI) and to what extent
are these two unadjusted achievement indicators inter-correlated.
2. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the adjusted indicators named below reliable (internal consistency
reliability and test-retest stability) and valid (discriminant, convergent, and
concurrent) for use as they relate to performance evaluation and
accountability:
a. API improvement scores
b. API (SED) scores (SED = socioeconomically disadvantaged)
c. Similar schools scores
d. LA Times value-added scores
e. Academic growth over time (AGT) value-added scores – ELA and
Math
f. Adjusted normal curve equivalent (ANCE) scores – ELA and Math
3. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are adjusted grade-level equivalent (AGLE) scores reliable (internal
consistency reliability and test-retest stability) and valid (discriminant and
concurrent) for use as they relate to performance evaluation and
accountability.
Unadjusted Accountability Indicators: AYP and API
The first research question was: within the constraints of K-5 elementary schools
in LAUSD, to what extent are the NCLB status indicator – Adequate Yearly Progress
(AYP) – and California’s PSAA improvement indicator – Academic Performance Index
(API) – correlated to the school characteristics index (SCI) and to what extent are these
two unadjusted achievement indicators inter-correlated. Unadjusted measures of
accountability are characterized as achievement indicators that do not account for factors
that set schools or student groups apart. For example, whether schools are situated in an
affluent neighborhood or are in a high poverty setting, the unadjusted measures are
computed in an invariable manner across all schools. In other words, teacher,
grade-level team, and school performance as indicated by the unadjusted indicators is
evaluated without consideration of between-school characteristics. In an attempt to
illustrate the degree of inter-correlation of the unadjusted indicators and their respective
correlation to student, teacher, and school characteristics, descriptive statistics of the
achievement accountability indicators were computed and presented in the next section to
establish context for the reporting of the results of research question one.
Descriptive Statistics – Unadjusted Accountability Indicators
Table 6 presents a summary of key descriptive statistics – the minimums,
maximums, means, and standard deviations – of the SCI and the unadjusted achievement
accountability indicators presently in use for performance evaluation purposes in
California elementary schools – the NCLB AYP and the PSAA API.
Table 6
Descriptive Statistics for Unadjusted Achievement Indicators for LAUSD Elementary Schools in 2010
N Minimum Maximum Mean Std. Deviation
SCI 501 158.10 199.15 172.06 7.26
API 501 574.00 978.00 782.05 69.42
ELA_AYP 493 8.00 46.75 24.36 7.67
Math_AYP 493 8.00 48.13 29.84 6.96
Note. LAUSD = Los Angeles Unified School District;
SCI = School Characteristics Index;
API = Academic Performance Index;
ELA_AYP = Adequate Yearly Progress in English Language Arts;
Math_AYP = Adequate Yearly Progress in Math.
Aside from the above-listed descriptive statistics, the skewness, standard error of
skewness, the standardized z-score of the skewness, kurtosis, standard error of kurtosis,
and the standardized z-score of the kurtosis for the SCI, API, and the AYP for both ELA
and math are provided in Table 7. Using the following two equations: z_skew = skewness
/ standard error of skewness and z_Kurtosis = kurtosis / standard error of kurtosis, two
key points were revealed from the dataset pertaining to the unadjusted variables included
in this research question: (a) The data exhibited some positive skewness suggesting a lack
of symmetry about the mean and (b) the distributions of the data values ranged from
relatively normal to platykurtic, which indicates relatively less concentration of the data
values around the mean than a normal curve. All skewness results were significant at z >
1.96, corresponding to a significance level of p < .05, whereas only the kurtosis of the SCI
was similarly significant (Tabachnik & Fidell, 2007).
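As a brief illustration of these computations – a hedged sketch in Python rather than the statistical package output reported here – the following code derives the standardized z-scores for skewness and kurtosis from a vector of school scores. The data are synthetic, and the standard errors use common large-sample approximations.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a school-level indicator (e.g., API-like values);
# the actual analysis used 2010 LAUSD school results.
scores = np.random.default_rng(0).normal(782, 69, size=501)
n = len(scores)

skew = stats.skew(scores)       # sample skewness
kurt = stats.kurtosis(scores)   # excess kurtosis

# Large-sample approximations to the standard errors of skewness and kurtosis.
se_skew = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = np.sqrt(4.0 * (n**2 - 1) * se_skew**2 / ((n - 3) * (n + 5)))

z_skew = skew / se_skew   # z_skew = skewness / standard error of skewness
z_kurt = kurt / se_kurt   # z_kurtosis = kurtosis / standard error of kurtosis

# |z| > 1.96 corresponds to significance at p < .05 (Tabachnik & Fidell, 2007).
print(f"z_skew = {z_skew:.2f}, z_kurtosis = {z_kurt:.2f}")
```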
Table 7
Skewness, Kurtosis, and z-Scores for the School Characteristics Index and the Unadjusted Achievement
Indicators for LAUSD K-5 Schools in 2010
Skewness   Std. Error of Skewness   z_Skewness   Kurtosis   Std. Error of Kurtosis   z_Kurtosis
SCI 1.322 0.109 12.128 1.132 0.218 5.192
API 0.360 0.109 3.303 0.082 0.218 0.376
ELA_AYP 0.719 0.110 6.536 0.010 0.220 0.045
Math_AYP 0.287 0.110 2.609 0.037 0.220 0.168
Correlation of Unadjusted Variables and SCI
A Pearson Product-Moment correlation was run to determine the relationship
between each of the unadjusted achievement indicators and also between the respective
unadjusted variables with the SCI. As displayed in Table 8, the SCI was highly and
significantly correlated at a significance level of p = .001 with API, AYP in ELA, and
with AYP in math with Pearson Product-Moment correlation coefficients of 0.82, 0.86,
100
and 0.73, respectively. Inter-correlations among the unadjusted variables – the API, AYP
in ELA, and the AYP in math – were statistically highly significant, with Pearson
Product-Moment correlation coefficients at or above r = 0.92 at a significance level of
p = .001.
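For illustration, a correlation matrix like Table 8 can be reproduced with a few lines of Python. The sketch below uses synthetic data and hypothetical column names; it is not the study's original code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the school-level data set (one row per school);
# the real analysis covered roughly 500 LAUSD K-5 elementary schools.
rng = np.random.default_rng(1)
sci = rng.normal(172, 7, 501)
df = pd.DataFrame({
    "SCI": sci,
    "API": 782 + 8 * (sci - 172) + rng.normal(0, 40, 501),
    "ELA_AYP": 24 + 0.9 * (sci - 172) + rng.normal(0, 4, 501),
    "Math_AYP": 30 + 0.7 * (sci - 172) + rng.normal(0, 5, 501),
})

# Pairwise Pearson Product-Moment correlations among the four indicators.
print(df[["SCI", "API", "ELA_AYP", "Math_AYP"]].corr(method="pearson").round(2))
```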
Table 8
Pearson Product-Moment Correlations for API, AYP, and SCI
            SCI     API     ELA_AYP  Math_AYP
SCI         1.00
API         0.82    1.00
ELA_AYP     0.86    0.97    1.00
Math_AYP    0.73    0.95    0.92     1.00
Note. N = 501 for correlations involving only the SCI and API; N = 493 for correlations involving ELA_AYP or Math_AYP.
Adjusted Accountability Indicators
The second research question was: within the constraints of K-5 elementary
schools in LAUSD, to what extent are the adjusted indicators reliable and valid for use in
evaluation and accountability models. Specifically, this second phase of the investigation
evaluated two types of reliability – internal consistency reliability and test-retest stability
– and three types of validity – discriminant, convergent, and concurrent. The adjusted
indicators examined in this second phase of the study were: API improvement scores, API
scores for the SED subgroup, similar schools scores, LA Times value-added scores,
academic growth over time (AGT) value-added scores in ELA and math, and adjusted
normal curve equivalent (ANCE) scores in ELA and math. Adjusted achievement
indicators are defined as measures that control for school characteristics that are not
under the purview of the schooling system. On this basis, adjusted accountability
indicators have been incorporated into the accountability system to equalize the
performance evaluations among schools of varying characteristics that may confound
achievement data. Accordingly, the adjusted variables control for factors that set students,
teachers, and schools apart, making the performance evaluation fairer and less biased
toward disadvantaged student groups and schools.
In order to adequately examine the reliability and validity of the adjusted
accountability indicators, descriptive statistics and the Pearson Product-Moment
correlations of the achievement accountability indicators were computed and are
presented in the following sections. The first section will detail the descriptive
statistics, which will be followed by a summary of results disaggregated by the type of
reliability and validity they address, as mentioned above.
Descriptive Statistics of Adjusted Variables
Descriptive statistics – the minimums, maximums, means, and standard deviations
– for the adjusted achievement indicators were computed to establish the foundation for
the analysis pertaining to research question two and are presented in Table 9.
Table 9
Descriptive Statistics for Adjusted Accountability Indicators for LAUSD K-5 Elementary Schools in 2010
N   Minimum   Maximum   Mean   Std. Deviation
API_Improvement 493 -63.00 124.00 10.49 24.62
API (SED) 473 570.00 931.00 762.95 53.03
Sim_School_Score 491 -144.00 156.00 12.88 40.16
LATimesValue-addedScore 467 0.00 99.00 49.95 28.55
AGT_ELA 478 1.42 4.98 3.04 0.55
AGT_Math 478 1.18 5.12 3.02 0.61
AGT_ELA_3yr 478 2.05 4.22 3.04 0.37
AGT_Math_3yr 478 1.57 4.43 3.02 0.45
ANCE_ELA 493 -3.56 0.00 0.00 1.00
ANCE_Math 493 -3.58 0.00 0.00 1.00
Note. LAUSD = Los Angeles Unified School District;
API_Improvement = Difference in Academic Performance Index between 2009 and 2010;
API (SED) = Academic Performance Index for the Socioeconomically Disadvantaged Student Group;
Sim_School_Score = Similar Schools Score;
LA TimesValue-addedScore = School Level Value-added Scores from LA Times;
AGT_ELA = LAUSD Academic Growth Over Time Score in English Language Arts;
AGT_Math = LAUSD Academic Growth Over Time Score in Math;
AGT_ELA_3yr = Three-year average for AGT in ELA;
AGT_Math_3yr = Three-year average for AGT in Math
ANCE_ELA = Adjusted Normal Curve Equivalent Scores in ELA;
ANCE_Math = Adjusted Normal Curve Equivalent Scores in Math.
Additionally, Table 10 displays the skewness, standard error of skewness, the
standardized z-score of the skewness, the kurtosis, standard error of kurtosis, and the
standardized z-score of the kurtosis for the adjusted variables. The z-score for skewness
was computed with the equation, z_skew = skewness / standard error of skewness and
the z-score for kurtosis was computed in a similar manner with the equation z_Kurtosis =
kurtosis / standard error of kurtosis. Of those showing positive skewness, only API
improvement and the four AGT variables were statistically significant with z-scores
above 1.96, which corresponds to p < .05 (Tabachnik & Fidell, 2007). API for SEDs and
ANCE scores in ELA were negatively skewed, but were not statistically significant. Data
pertaining to kurtosis revealed that the z-scores for the kurtosis of the adjusted indicators
investigated in phase two of this study were above the 1.96 (p < .05) significance level
except for the LA Times value-added scores, the three-year average AGT scores in ELA,
and the three-year average AGT scores in math (Tabachnik & Fidell, 2007). Overall, the
statistics for the adjusted achievement indicators of this research were highly and
significantly variable, skewed, and kurtotic.
Table 10
Skewness, Kurtosis, and z-Scores for Adjusted Achievement Accountability Indicators in 2010
Skewness   Std. Error of Skewness   z_Skewness   Kurtosis   Std. Error of Kurtosis   z_Kurtosis
API Improvement 0.590 0.110 5.364 1.526 0.220 6.936
API (SED) -0.041 0.112 -0.366 0.523 0.225 2.335
Sim_School_Score 0.212 0.110 1.927 1.060 0.220 4.818
LATimesValue-addedScore 0.023 0.113 0.204 -1.177 0.225 -5.231
AGT_ELA 0.285 0.112 2.545 0.443 0.223 1.987
AGT_Math 0.291 0.112 2.598 0.539 0.223 2.417
AGT_ELA_3yr 0.433 0.112 3.866 0.150 0.223 0.673
AGT_Math_3yr 0.303 0.112 2.705 0.319 0.223 1.430
ANCE_ELA -0.079 0.110 -0.718 0.984 0.220 4.473
ANCE_Math 0.040 0.110 0.364 0.995 0.220 4.523
Reliability
Reliability is a measure’s ability to repeatedly yield consistent results (Choi et al.,
2005). For the purposes of this study, internal consistency reliability and test-retest
stability reliability were examined for the adjusted accountability indicators. The next
two sections will detail each type of reliability separately.
Internal consistency reliability. Internal consistency reliability is defined as the
extent of inter-item correlation (Lissitz & Samuelson, 2007). The inter-items analyzed
were the ELA and the math components of the respective measures. On this basis, only
the adjusted indicators that have disaggregated results for ELA and math were included
in this portion of the analysis. Specifically, only the AGT scores, the AGT scores as
averaged among three years of data, and the ANCE scores were examined.
Table 11 provides the Pearson Product-Moment correlation coefficients and the
Spearman-Brown reliability coefficients for the three adjusted indicators – AGT, AGT 3-
year average, and ANCE. The correlations between the ELA and math components of the
AGT scores, the AGT scores as averaged among three years of data, and the ANCE
scores were r = 0.67, 0.61, and 0.85, respectively, and all were statistically significant at
p = .001. Using the Spearman-Brown Split-Half formula – r_sb = 2r_xy / (1 + r_xy), where r_sb
= the Spearman-Brown reliability coefficient and r_xy = the correlation between the two halves of
the scale – the internal consistency reliability coefficients were manually computed for
the AGT scores, the AGT 3-year average scores, and the ANCE scores (r_sb = 0.80, 0.76,
and 0.92, respectively).
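These computations are simple to replicate. The following Python sketch (not the study's original code) applies the Spearman-Brown Split-Half formula to the observed ELA-math correlations from Table 11.

```python
def spearman_brown(r_xy: float) -> float:
    """Spearman-Brown Split-Half reliability: r_sb = 2 * r_xy / (1 + r_xy)."""
    return 2.0 * r_xy / (1.0 + r_xy)

# Observed ELA-math correlations reported in Table 11.
for name, r_xy in [("AGT", 0.67), ("AGT_3yr", 0.61), ("ANCE", 0.85)]:
    print(f"{name}: r_xy = {r_xy:.2f}, r_sb = {spearman_brown(r_xy):.2f}")
# Prints r_sb = 0.80, 0.76, and 0.92, matching the table.
```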
Table 11
Internal Consistency Reliability of AGT Scores and ANCE Scores for 2010 (Correlation Between ELA and
Math)
Pearson Correlation Coefficient   Spearman-Brown Reliability Coefficient
AGT 0.67 0.80
AGT_3yr 0.61 0.76
ANCE 0.85 0.92
Test-retest stability reliability. Stability is consistency over time (Lissitz &
Samuelson, 2007; Pedhazur & Schmelkin, 1991). In other words, it is the extent to which
the same results are readily reproduced by an accountability measure. For this study, test-retest
stability reliability is the consistency of the scores on each variable across a
two-year period. The results are illustrated in Table
12. Aside from the correlation between API improvement scores in 2009 and API
improvement scores in 2010, which was r = -0.26, the Pearson Product-Moment
correlation coefficients for the remaining adjusted accountability indicators ranged from a
low of r = 0.09 (AGT ELA) to a high of r = 0.88 (API for SEDs). The AGT three-year
averages in both ELA and math and the API for SEDs had correlations in the range of r =
0.80, while the ANCEs in both ELA and math and the similar school scores had
correlations in the r = 0.70 range. Inputting the Pearson Product-Moment correlation
coefficients into the Spearman-Brown Split-Half formula, the stability coefficients for
each of the adjusted accountability indicators were manually generated. The test-retest
stability coefficients ranged from a low of 0.00 (API improvement) to a high of 0.94 (API
for SEDs). The one-year AGT ELA and one-year AGT math results revealed stability
coefficients of r_xx = 0.17 and 0.45, respectively, the lowest relative to the
other adjusted variables. The AGT three-year averages for ELA and math
were higher at r_xx = 0.89 and 0.91, respectively. Similar school scores had an r_xx of 0.85.
The ANCE scores for ELA and math had stability coefficients of r_xx = 0.87 and 0.86,
respectively.
Table 12
Test-Retest Stability of Adjusted Accountability Indicators (Correlations Between 2009 and 2010)
Pearson Correlation Coefficient   Spearman-Brown Stability Coefficient
API Improvement -0.26 0.00
API (SED) 0.88 0.94
Sim_School_Score 0.75 0.85
AGT_ELA 0.09 0.17
AGT_Math 0.29 0.45
AGT_ELA_3yr* 0.80 0.89
AGT_Math_3yr* 0.84 0.91
ANCE_ELA 0.77 0.87
ANCE_Math 0.75 0.86
* Note: The correlations for the AGT ELA 3-year and AGT Math 3-year scores were between three-year averages
ending in 2010 and 2011, as the three-year averages ending in 2009 were not available.
Validity
Validity is the extent to which a score is reflective of the user’s intent. For the
purposes of this current research study and for this second phase of the investigation,
three types of validity were analyzed using the results of the adjusted accountability
indicators – discriminant validity, convergent validity, and concurrent validity.
Discriminant validity. Discriminant validity is defined as the extent to which an
accountability indicator is not similar to (or diverges from) another variable that it
theoretically should not be similar to. The interest here is in discriminating between the
construct and not-the-construct. The researcher is interested in showing the dissimilarity
with measurements of skills, knowledge, attitudes, and so forth that have been shown or
are believed to be dissimilar from the construct of interest. The construct of interest, in
this present analysis, is the SCI. Of the correlations between the adjusted accountability
indicators and the SCI, two were statistically significant at p = 0.001 – the API (SED) scores
and the LA Times value-added scores, with r = 0.62 and 0.27, respectively – and one was
statistically significant at p = 0.003 – the API improvement scores, with r = -0.16
(see Table 13). The remaining indicators were neither highly nor significantly correlated,
with Pearson Product-Moment correlation coefficients near zero.
Table 13
Discriminant Validity for Adjusted Accountability Variables for the 2010 School Year (Correlation with
SCI)
SCI Observed Probability
API Improvement           -0.16    0.003
API (SED)                  0.62    0.001
Sim_School_Score          -0.01    0.906
LATimesValue-addedScore    0.27    0.001
AGT_ELA                   -0.04    0.441
AGT_Math                  -0.03    0.516
AGT_ELA_3yr                0.01    0.792
AGT_Math_3yr              -0.01    0.792
ANCE_ELA                  -0.02    0.715
ANCE_Math                  0.01    0.811
Convergent validity. Convergent validity is explained as the degree to which an
accountability indicator is similar to other accountability indicators. For the purposes of
this study, the interest in the examination of the convergent validity is to examine how
closely (to what extent) the accountability indicators are measuring similar information.
In order to illustrate the degree of convergence among the accountability indicators, the
Pearson Product-Moment correlations were computed and displayed in Table 14. The
highest correlations were between similar schools scores and both the ANCE scores in
ELA and the ANCE scores in math, at r = 0.86 and 0.84, respectively. The ANCE scores
in ELA were correlated with ANCE scores in math at r = 0.81. Further, the correlation
between similar schools scores and API(SED) scores was also high at r = 0.74.
The correlations of the AGT scores with the other adjusted indicators revealed the
following. The pairing of AGT math and AGT math 3-year average scores resulted in a
Pearson Product-Moment correlation of 0.70. Also largely correlated were the pairings of
AGT math with AGT ELA, API(SED) with ANCE ELA, API(SED) with ANCE math,
AGT ELA with AGT ELA 3-year, AGT ELA with API improvement scores, ANCE
math with AGT math 3-year, AGT math 3-year with LA Times value-added scores, and
AGT math and API improvement scores at r = 0.67, 0.66, 0.66, 0.64, 0.56, 0.55, 0.55,
0.52, respectively. The remaining correlations were below r = 0.47.
Table 14
Convergent Validity for Adjusted Accountability Indicators for the 2010 School Year (Correlations
Between Adjusted Indicators)
              API     API     SIM_    LAT_    AGT     AGT     AGT_     AGT_      ANCE    ANCE
              IMP     (SED)   SS      VAM     ELA     Math    ELA_3yr  Math_3yr  ELA     Math
API IMP       1
API (SED)     0.14    1
SIM_SS        0.38    0.74    1
LAT_VAM      -0.14    0.45    0.34    1
AGT ELA       0.56    0.24    0.38    0.14    1
AGT Math      0.52    0.31    0.47    0.24    0.67    1
AGT_ELA_3yr   0.15    0.31    0.41    0.46    0.64    0.46    1
AGT_Math_3yr  0.09    0.31    0.44    0.55    0.39    0.70    0.60     1
ANCE ELA      0.30    0.66    0.86    0.25    0.36    0.33    0.41     0.31      1
ANCE Math     0.33    0.66    0.84    0.39    0.33    0.61    0.35     0.55      0.81    1
Note: API IMP = API Improvement Scores;
API(SED) = API for Socioeconomically Disadvantaged Students;
SIM_SS = Similar School Scores;
LAT_VAM = LA Times Value-added Scores;
AGT ELA = Academic Growth Over Time Scores in ELA;
AGT Math = Academic Growth Over Time Scores in Math;
AGT_ELA_3yr = Academic Growth Over Time Three-Year Average Scores in ELA;
AGT_Math_3yr = Academic Growth Over Time Three-Year Average Scores in Math;
ANCE ELA = Adjusted Normal Curve Equivalent Scores in ELA;
ANCE Math = Adjusted Normal Curve Equivalent Scores in Math.
Concurrent validity. Concurrent validity means that a particular measurement or
test varies proportionally with another measurement or test of the construct – or indirectly
with a measure of the opposite of the construct. In other words, concurrent validity allows
for distinguishing between highly similar constructs. Essentially, concurrent validity
measures the extent to which two types of indicators can be used in place of one another. To
demonstrate concurrent validity of the adjusted scores with API results, the Pearson
Product-Moment correlation was computed for each adjusted variable against API. The
results of their associations are displayed in Table 15. The observed probabilities of all
correlations were p = 0.001, except for the API improvement scores, which were at p =
0.230. Of those that were significant at p = 0.001, the
largest correlation was between API(SED) and API at r = 0.95. The next highest
correlation was between similar schools scores and API at r = 0.57, which was
followed by the correlations with ANCE scores in ELA and math, both measured at r =
0.53. The remaining correlations with API were below r = 0.42.
Table 15
Concurrent Validity for Adjusted Accountability Indicators for 2010 School Year (Correlation with API)
API Observed Probability
API Improvement 0.06 0.230
API (SED) 0.95 0.001
Sim_Schools_Score 0.57 0.001
LATimesValue-addedScore 0.42 0.001
AGT_ELA 0.18 0.001
AGT_Math 0.25 0.001
AGT_ELA_3yr 0.26 0.001
AGT_Math_3yr 0.26 0.001
ANCE_ELA 0.53 0.001
ANCE_Math 0.53 0.001
Adjusted Grade Level Equivalent Scores
The third phase of this current research addressed the question: within the
constraints of K-5 elementary schools in LAUSD, to what extent are adjusted grade-level
equivalent (AGLE) scores reliable and valid. Similar to phase two of the investigation, the two
types of reliability examined were internal consistency reliability and test-retest stability
reliability. However, phase three of this study only evaluated discriminant validity and
concurrent validity. AGLE scores are adjusted normal curve equivalent scores at the
grade level. AGLE scores are calculated in the same manner as ANCE scores. However,
ANCE scores were computed using school-level data, whereas AGLE scores were
computed from grade-level data.
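A minimal sketch of how such adjusted, normalized scores can be produced is given below. Consistent with the later discussion that these scores were regressed on the SCI, it treats the adjusted score as the standardized residual from a regression of raw scores on the SCI; the function and data are hypothetical, and the study's exact transformation may differ.

```python
import numpy as np

def adjusted_score(raw: np.ndarray, sci: np.ndarray) -> np.ndarray:
    """Regress raw scores on the SCI and return standardized residuals.

    A hypothetical sketch: residuals from a simple OLS regression are, by
    construction, uncorrelated with the SCI (discriminant validity) and are
    standardized here to mean 0 and SD 1.
    """
    slope, intercept = np.polyfit(sci, raw, deg=1)
    residuals = raw - (intercept + slope * sci)
    return (residuals - residuals.mean()) / residuals.std()

# Hypothetical data: one raw score per school for one grade and subject.
rng = np.random.default_rng(2)
sci = rng.normal(172, 7, 489)
raw = 350 + 2.0 * sci + rng.normal(0, 25, 489)

adj = adjusted_score(raw, sci)
print(round(float(np.corrcoef(adj, sci)[0, 1]), 3))  # ~0.0 by design
```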
Descriptive Statistics for AGLE Scores
In order to set the context for the third phase of this current research study, the
means and standard deviations were computed for the AGLE scores and displayed in
Table 16. As the data for AGLE scores were normalized, the distribution of the AGLE
values closely resembled a normal bell curve.
Table 16
Descriptive Statistics for Adjusted Grade Level Equivalent Scores in ELA and Math for K-5 Elementary
Schools in the 2009 and 2010 School Years
Mean Std. Deviation N
AGLE_2ELA 2.502 .205 489
AGLE_3ELA 3.502 .205 489
AGLE_4ELA 4.502 .205 489
AGLE_5ELA 5.502 .205 489
AGLE_2MATH 2.502 .205 489
AGLE_3MATH 3.502 .205 489
AGLE_4MATH 4.502 .205 489
AGLE_5MATH 5.502 .205 489
AGLE_2ELA_09 2.502 .205 488
AGLE_3ELA_09 3.502 .205 488
AGLE_4ELA_09 4.502 .205 488
AGLE_5ELA_09 5.502 .205 488
AGLE_2MATH_09 2.502 .205 488
AGLE_3MATH_09 3.502 .205 488
AGLE_4MATH_09 4.502 .205 488
AGLE_5MATH_09 5.502 .205 488
Reliability
As previously mentioned, the two types of reliability examined in this section of
the analysis were internal consistency reliability and test-retest stability reliability. Hence,
the following two sections will detail the results for each type of reliability.
Internal consistency reliability. The first type of reliability analyzed for the
AGLE scores in grades 2 to 5 for K-5 elementary schools in LAUSD was internal
consistency reliability. Accordingly, the AGLE scores in ELA and the AGLE scores in
math for each grade level were correlated using the Pearson Product-Moment correlation
analysis. These correlations along with the Spearman-Brown reliability coefficients,
which were manually calculated through the use of the Spearman-Brown Split-Half
formula – r_sb = 2r_xy / (1 + r_xy) – are presented in Table 17. Correlations between ELA and
math components for grades 2 through 5 revealed a Pearson Product-Moment correlation
coefficient of 0.69 (grade 5) or higher, with the highest correlation of r = 0.80 in grade 2.
When the Spearman-Brown reliability equation was used to compute the internal
consistency between ELA and math at each grade level, the AGLE scores were internally
consistent for each grade level at or above r_sb = 0.82, with grade 2 having the highest
internal consistency reliability coefficient of r_sb = 0.89.
Table 17
Internal Consistency Reliability of the AGLE Scores for K-5 LAUSD Schools (Correlation Between ELA
and Math)
Pearson Correlation Coefficient   Spearman-Brown Reliability Coefficient
AGLE ELA/Math 2 0.80 0.89
AGLE ELA/Math 3 0.79 0.88
AGLE ELA/Math 4 0.79 0.88
AGLE ELA/Math 5 0.69 0.82
Test-retest stability reliability. In this third phase of the investigation, test-retest
stability reliability was also investigated as the second type of reliability for the AGLE
scores for grades 2 to 5. Test-retest stability was examined for each AGLE type and grade
level using the Spearman-Brown stability coefficient, which was derived manually by
inputting the Pearson Product-Moment correlation coefficients for the respective
variables from the school years 2009 and 2010 into the Spearman-Brown Split-Half
formula. Table 18 illustrates the results of the test-retest stability for the AGLE scores.
The analysis revealed that the Pearson Product-Moment correlations ranged from a low
of r = 0.51 (AGLE math for grade 2) to a high of r = 0.57 (AGLE math for grades 3 and
4). The results further demonstrated that, across the grade levels and content areas, the
lowest stability coefficient was 0.68 (AGLE ELA for grade 5 and AGLE math for
grade 2) and the highest was 0.72 (AGLE math for grades 3, 4, and 5).
Table 18
Test-Retest Stability – Pearson Product-Moment Correlation Coefficients and Spearman-Brown Stability
Coefficients for AGLE Scores in Grades 2 to 5
Pearson Correlation Coefficient   Spearman-Brown Stability Coefficient
AGLE ELA 2 0.55 0.71
AGLE ELA 3 0.53 0.69
AGLE ELA 4 0.53 0.69
AGLE ELA 5 0.52 0.68
AGLE Math 2 0.51 0.68
AGLE Math 3 0.57 0.72
AGLE Math 4 0.57 0.72
AGLE Math 5 0.56 0.72
Validity
The last portion of the analysis of this study was devoted to the examination of
the validity of the AGLE scores. The two types of validity investigated were discriminant
validity and concurrent validity. The two subsequent sections detail the results for each,
respectively.
Discriminant validity. The first type of validity that was analyzed for the AGLE
scores was discriminant validity. The intent of evaluating the discriminant validity was to
inspect the extent to which the AGLE scores are different from the SCI. To that end, the
Pearson Product-Moment correlation coefficients were computed for AGLE scores of
both ELA and math for grades 2 through 5. The results of the analysis are provided in
Table 19. The magnitude of the correlations of the AGLE scores to the SCI ranged from a
low of r = 0.00 (AGLE ELA for grade 4) to a high of r = 0.04 (AGLE ELA for grade 5).
Table 19
Discriminant Validity – Pearson Product-Moment Correlation Coefficients for AGLE Scores for ELA and
Math with the SCI for Grades 2 Through 5
SCI Observed Probability
AGLE ELA 2 0.01 0.765
AGLE ELA 3 0.01 0.785
AGLE ELA 4 0.00 0.976
AGLE ELA 5 0.04 0.425
AGLE Math 2 0.03 0.566
AGLE Math 3 0.01 0.901
AGLE Math 4 0.02 0.678
AGLE Math 5 0.02 0.681
Concurrent validity. This study also looked into the concurrent validity of the
AGLE scores with API scores. For the purposes of this research, concurrent validity was
analyzed through the computation of the Pearson Product-Moment correlation for the
varying AGLE scores with API. Table 20 presents the correlations used to determine the
extent of the concurrent validity of the AGLE scores with API. The results of the analysis
revealed that the correlations ranged between r = 0.37 (AGLE Math grade 3) and r = 0.44
(AGLE ELA grade 5).
Table 20
Concurrent Validity – Pearson Product-Moment Correlation Coefficients for AGLE Scores in ELA and
Math with API for Grades 2 Through 5
API Observed Probability
AGLE ELA 2 0.39 .001
AGLE ELA 3 0.40 .001
AGLE ELA 4 0.39 .001
AGLE ELA 5 0.44 .001
AGLE Math 2 0.38 .001
AGLE Math 3 0.37 .001
AGLE Math 4 0.39 .001
AGLE Math 5 0.38 .001
CHAPTER 5
DISCUSSION
The Coleman Report (1966) and A Nation at Risk (1983) illuminated the
mediocrity of the U.S. education system, especially as it relates to the adequate progress
of the nation’s disadvantaged students. The achievement gap, which has affected overall
graduation rates, adequate employment attainment, and the nation’s competitiveness as
compared to other developed and developing countries, has increased federal scrutiny on
what was once deemed a responsibility left to the states – K-12 education and
accountability. The large-scale commissioned reports catalyzed the federal government
into more stringent actions to improve achievement for all students.
Accordingly, the federal government’s enactment of the No Child Left Behind
(NCLB) Act of 2001 and the Race to the Top (RTTT) initiative of 2009 shifted the focus
of accountability to the impact made by classroom teachers on learning as measured by
student assessment scores. NCLB mandated a “highly qualified” teacher in each
classroom, and RTTT attempted to ensure that the impact of these teachers on student
learning is effectively measured in order to properly incentivize teachers, schools, and
districts or eliminate those that poorly enhance student outcomes.
Teacher evaluation has been in practice for quite some time; however, as more
sophisticated accountability and data systems, implemented as a result of federal
oversight, have become capable of tracking learning gains by individual students, greater
opportunities have arisen to measure the impact of teachers through the use of student
assessment scores. By linking performance evaluation with academic standards for
students, policymakers are hoping they can transform teacher evaluation into a more
effective tool for improving instructional practice and raising student achievement. To
that end, policymakers and researchers have begun to look into the link between teacher
effectiveness and student outcomes. Determining the best procedure to define and/or
estimate effective/ineffective teachers, grade-level teams, and schools and then
operationalize these results in a valid manner is necessary and warranted.
The purpose of this study was to critically examine the information value
generated by the varying approaches to performance evaluation as it relates to student
assessment scores. To do so, this current research agenda analyzed and compared the
results as produced by the varying achievement indicators that are in use or available for
use. The intent was to provide a comparative analysis of the varying approaches to
performance evaluation that are under the constraints of California’s accountability
system to provide policymakers with salient information value from which to design a
more robust and well-aligned performance evaluation model. In this high stakes
accountability environment, there is an imperative need to have evidence that supports
the uses and interpretations of assessment scores.
To that end, this study specifically evaluated the extent to which the achievement
accountability measures – unadjusted, adjusted, and grade-level equivalent indicators –
were inter-correlated and the extent to which each indicator system was correlated to poverty
status. Additionally, this research also evaluated the stability of the results of the various
achievement variables for the 2009 and 2010 school years for Los Angeles Unified
School District (LAUSD) K-5 elementary schools. The specific research questions that
were evaluated were:
1. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the NCLB status indicator – Adequate Yearly Progress (AYP) – and
California’s Public Schools Accountability Act (PSAA) improvement
indicator – Academic Performance Index (API) – correlated to the school
characteristics index (SCI) and to what extent are these two unadjusted
achievement indicators inter-correlated.
2. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are the adjusted indicators named below reliable (internal consistency
reliability and test-retest stability) and valid (discriminant, convergent, and
concurrent) for use as they relate to performance evaluation and
accountability:
a. API improvement scores
b. API (SED) scores (SED = socioeconomically disadvantaged)
c. Similar schools scores
d. LA Times value-added scores
e. Academic growth over time (AGT) value-added scores – ELA and
Math
f. Adjusted normal curve equivalent (ANCE) scores – ELA and Math
3. Within the constraints of K-5 elementary schools in LAUSD, to what extent
are adjusted grade-level equivalent (AGLE) scores reliable (internal
consistency reliability and test-retest stability) and valid (discriminant and
concurrent) for use as they relate to performance evaluation and
accountability.
Summary of Findings
There were three research questions, which catalyzed this current study. Each
guided one of the three phases of the investigation. As such, the following discussion on
the findings will be delineated into the three respective phases: (a) Phase one addresses
the unadjusted indicators – the NCLB AYP and the PSAA API; (b) Phase two focuses on
the adjusted indicators proposed as adjuncts to the presently operationalized
accountability system; and (c) Phase three illuminates results of adjusted grade-level
equivalent scores.
Phase One – Unadjusted Indicators
The current accountability system in California incorporates various unadjusted
assessment indicators to evaluate the performance of students, teachers, and schools.
Unadjusted assessment indicators are those that are used to assess performance in the
same manner from one student group to the next and from one school to another.
Unadjusted variables do not control for student or school factors such as
socioeconomic status or prior achievement. The two unadjusted accountability indicators
analyzed were the AYP and API. The AYP is the keystone of NCLB and the API is the
cornerstone of California’s PSAA. The AYP and the API are currently the two indicators
used at the school level to evaluate the progress of schools for corresponding rewards or
sanctions.
The unadjusted achievement measures – the API and the AYP disaggregated into
ELA and math – were correlated against each other and against the SCI, which is the
index of student, teacher, and school characteristics that may influence achievement
assessment results.
The inter-correlations between the unadjusted indicators were strong, positive,
and statistically significant (p = .001), with Pearson Product-Moment correlations of .92
or larger for all pairs of variables, indicating that they measure essentially the same
assessment information.
Further, the unadjusted indicators were correlated highly to the SCI with Pearson
Product-Moment correlations between 0.73 (Math AYP) and 0.86 (ELA AYP), p = .001.
Taken together, 53% (Math AYP) to 74% (ELA AYP) of the variance in the unadjusted
achievement scores was explained by differences in school characteristics.
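This variance-explained step follows directly from squaring the correlation coefficients:

\[
r^{2}_{Math\,AYP} = 0.73^{2} \approx 0.53, \qquad r^{2}_{ELA\,AYP} = 0.86^{2} \approx 0.74
\]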
The systems of accountability that use unadjusted measures of performance do
not recognize that each student has an educational history and performs on the basis of
current and past opportunities to learn new skills and knowledge (Goldschmidt et al.,
2005). On this basis, when these indicators are used for evaluation and accountability
purposes, teachers, grade-level teams, and schools with poorer students will be placed at
a relative disadvantage as compared to others. A fair and useful system of evaluation and
accountability “should not place some students, some teachers, or some schools at a
relative disadvantage in comparison to others" (Linn, 2001a, p. 5).
A large body of research suggests that socioeconomic status (SES) strongly
correlates to student achievement (Bradley & Corwyn, 2002; Ferguson, 2002; Flannagan
& Grissmer, 2006; Grinion, 1999; Meece, 2002; Pintrich & Schunk, 2002). It has been
well established since the Coleman Report (1966) that SES is highly correlated with
unadjusted measures of progress (Goldschmidt et al., 2005; Linn, 2001a). Furthermore,
according to Perry and McConney (2010), this is a worldwide issue. In their recently
published study examining the correlation between SES and mean student
achievement scores on the 2003 Australian Programme for International Student
Assessment (PISA), they found that the socioeconomic composition of a school mattered
greatly in terms of students' academic performance throughout the world's developed
nations. Altogether, evidence from this current research and other published work
strongly indicates that school performance is largely dependent on student characteristics
rather than the efficacy of the school and its teachers to induce new learning. To this point, Millman's
(1997) assertion speaks poignantly to the K-12 education accountability landscape:
The single most frequent criticism of any attempt to determine a teacher’s
effectiveness by measuring student learning is that factors beyond a teacher’s
control affect the amounts that students learn …. Educators want a level playing
field and do not believe such a thing is possible. (p. 244)
As a result of the high correlation of the unadjusted indicators to the SCI and in an
effort to remedy the biased nature of high stakes accountability, policymakers and
researchers have examined the fairness and usefulness of adjusted indicators of
performance as adjunct data to evaluate schools (Goldschmidt & Choi, 2007; Linn,
2008). Phase two of this investigation examined six different adjusted accountability
indicator systems currently being used or proposed as adjuncts for evaluation and
accountability.
Phase Two – Adjusted Accountability Indicators
Policymakers understand that there are differences in achievement among varying
student groups and closing the achievement gap was critical in the enactment of NCLB
and California’s PSAA. Adjusted assessment indicators have been implemented in the
California accountability system in an effort to differentiate the progress of varying
student groups and to separate teachers and schools with similar diverse student
populations as it relates to overall performance, for the purposes of benchmarking and
evaluation of best practices. Unlike the unadjusted indicators, the adjusted indicators
control for certain student background information, such as SES. The six different
adjusted accountability measures examined in this study consisted of: (a) API
improvement scores, (b) API (SED) scores, (c) similar schools scores, (d) LA Times
value-added scores, (e) academic growth over time (AGT) value-added scores in ELA
and math, and (f) adjusted normal curve equivalent (ANCE) scores in ELA and math.
The first three are currently operationalized under California’s improvement
accountability model in accordance with the PSAA. The LA Times value-added scores
were computed by the LA Times to demonstrate an alternative model of evaluation of
teachers and schools in LAUSD (http://projects.latimes.com/value-added/). The AGT
value-added model was implemented in LAUSD in 2010 as another added component of
its performance evaluation and accountability process. The last indicator type, ANCE
scores, was introduced in this study as a reliable and valid alternative for use in
comparison to other methods of performance evaluation and accountability.
Five assertions guided the second phase of the investigation. They include: (a)
Adjusted indicators should not depend on the measurements (ELA or math) sampled –
internal consistency reliability; (b) Adjusted indicators should be stable, specifically from
2009 to 2010 – test-retest stability reliability. The standard criterion for internal reliability
is .70 for research purposes and .90 for high stakes decision-making purposes (Lance,
Butts, & Michels, 2006). Test-retest is an estimate of the lower limit of internal
consistency because it is attenuated by the measurement error of the measure. The
observed test-retest correlation can be corrected for attenuation by dividing it by the
square root of the product of the year-one and year-two reliabilities. Conversely, the
observed test-retest correlation that would yield a corrected estimate of .70 can be
obtained by multiplying .70 by an estimate of the internal consistency. Therefore, .70
times .70 yields .49, which would be the minimum criterion for test-retest reliability for
research purposes, and .70 times .90 yields .63, which would be the minimum criterion
for high stakes decision-making purposes (Hocevar, 2010; a worked example follows
this list); (c) Adjusted indicators should not be correlated with
institutional characteristics that are beyond the purview of schooling system –
discriminant validity; (d) Adjusted indicators should be positively and moderately
correlated with each other – convergent validity; and (e) Adjusted indicators should be
positively and at least moderately correlated with the API – concurrent validity. The
following discussions of the findings will be divided into five sections – the first two will
detail the information value pertaining to the two types of reliability – internal
consistency reliability and test-retest stability reliability. Then, in the subsequent three,
the generated information value on the three types of validity that guided the second stage
of this current research agenda – discriminant validity, convergent validity, and
concurrent validity – will be discussed.
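As referenced in assertion (b) above, the attenuation reasoning behind the .49 and .63 criteria can be written out explicitly. Assuming equal year-one and year-two reliabilities (r_11 = r_22), the correction for attenuation and its inversion are:

\[
r_{tt}^{corrected} = \frac{r_{tt}^{observed}}{\sqrt{r_{11}\,r_{22}}}
\qquad\Longleftrightarrow\qquad
r_{tt}^{observed} = r_{tt}^{corrected}\,\sqrt{r_{11}\,r_{22}}
\]

With a corrected criterion of .70 and internal consistencies of .70, the minimum observed test-retest correlation is .70 × √(.70 × .70) = .70 × .70 = .49; with internal consistencies of .90, it is .70 × .90 = .63, matching the criteria stated above.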
Internal consistency reliability. The analysis of internal consistency reliability
was performed on the AGT scores, the three-year average AGT scores, and the ANCE
scores. Of the three types of adjusted indicators that could be disaggregated into ELA and
math components, the ANCE scores were the most internally consistent at r_sb = 0.92,
followed by the one-year AGT scores at r_sb = 0.80 and the AGT three-year average scores at
r_sb = 0.76 (Cohen, 1988).
Test-retest stability reliability. Test-retest stability reliability was examined for
the six varying types of adjusted accountability indicators as a way to check the
reliability of the sum of the data set from the two years. It was revealed that the most
stable was the API(SED) scores (r_xx = 0.94), which suggests that the gains made by SEDs
are stable across the years. Next, the AGT three-year average scores in math and ELA
were also high at r_xx = 0.91 and 0.89, respectively. Computations of test-retest stability
also disclosed high coefficients for the ANCE scores and similar school scores. ANCE
scores in ELA were highly stable at r_xx = 0.87. Similarly, ANCE scores in math had an r_xx
of 0.86 between 2009 and 2010. Then, similar school scores were stable between 2009
and 2010 at r_xx = 0.85. AGT scores in math, AGT scores in ELA, and API improvement
scores were low in stability with r_xx = 0.45, 0.17, and 0.00, respectively (Cohen, 1988).
Thus, although it can be surmised that ANCE scores, similar school scores, and
AGT three-year averages are reliable across the two years, the high stability coefficients
for the AGT three-year averages may be inflated due to auto-correlations that exist in
part-to-whole computations, as the overlapping three-year averages share two years of
common data.
Discriminant validity. Correlations between the SCI and the adjusted indicators
were computed to measure the extent to which the adjusted indicators diverge from the
school characteristics that may hinder an adequate measure of performance as a result of
factors beyond the control of the schooling system. API(SED) and LA Times value-
added scores were positively correlated to the SCI at r = 0.62 and 0.27, respectively.
While it is understandable that API(SED) should correspond positively with the SCI, it is
problematic that the LA Times value-added scores were also positively correlated to the
SCI, even though it is a small correlation. This indicates that although value-added
models are designed to control for factors that negatively impact performance evaluation,
they may not fully account for all targeted institutional characteristics. Moreover, the fact
that the correlation between the SCI and API improvement was r = -0.16 indicates that
schools with more disadvantaged students are showing greater gains in achievement.
However, it is not certain from this investigation whether this is a result of something that
the high poverty schools are doing or whether it is a “ceiling effect,” or simply an artifact
of the accountability system. Lastly, the remaining indicators were correlated to the SCI
with Pearson Product-Moment correlation coefficients of nearly zero (ranging from r =
-0.04 to r = 0.01). On this basis, similar school scores, AGT scores, and ANCE scores are
successful ways to control for the characteristics included in the school characteristics
index.
Convergent validity. In order to see the extent to which the varying types of
adjusted indicators converge, the Pearson Product-Moment correlation coefficient was
computed and analyzed for each pairing of variables. The correlations ranged from a low
of r = 0.09 to a high of r = 0.86 (Cohen, 1988). AGT three-
year averages in math and API improvement scores were correlated the lowest at r =
0.09. Also, low correlations were exhibited by the pairings of API improvement scores
with API(SED) scores and LA Times value-added scores with AGT ELA; both sets of
variables correlated at r = 0.14. API improvement scores and LA Times value-added
scores showed a low negative correlation at r = -0.14. The correlation between the LA
Times value-added scores and AGT math scores was also low (r = 0.24). Further, the
API improvement and AGT ELA three-year average scores were at the low end of the
correlation spectrum at r = 0.15. Lastly, the LA Times value-added scores and ANCE
ELA scores were low in correlation at r = 0.25 (Cohen, 1988). Hence, a significant
finding from this section is that the LA Times value-added scores lack convergence with
the AGT scores, similar school scores, and ANCE scores. This lack of correlation implies
that the value-added indicators do not measure the same assessment information as the
other adjusted indicators under examination in this current research study or the
indicators of the current accountability systems in California.
Furthermore, although similar schools scores are currently not used to reward or
sanction schools, they are considered critical indicators, which represent fair measures of
school progress as they do account for factors outside of the influence of the schooling
system (Linn, 2001a; Park, 2007). Data revealed that similar schools scores were highly
correlated to the ANCE scores in both ELA and math (r = 0.86 and 0.84, respectively),
and moderately correlated with three-year average AGT scores in both ELA and math (r
= 0.41 and 0.44, respectively). Finally, the three-year average AGT scores were
moderately correlated with ANCE scores with Pearson Product-Moment correlation
coefficients ranging from 0.31 to 0.55 (Cohen, 1988).
Concurrent validity. The concurrent validity of the six types of adjusted
indicators with API was examined through an analysis of Pearson Product-Moment
correlation coefficients. According to the results, the highest correlation was with
API(SED) scores at r = 0.95. The lowest correlation was with API improvement scores at
r = 0.06. The correlations with similar schools scores, ANCE ELA, and ANCE math
were at r = 0.57, 0.53, and 0.53 respectively. Both of the three-year AGT variables were
correlated with API at r = 0.26. The results used for examining concurrent validity
indicate that the more concurrently valid adjuncts are similar school scores and the
ANCE scores.
Findings From the Second Phase of the Investigation
Guided by the assertions of the second phase of the study, similar schools scores
and ANCE scores exhibited relatively more
information value than the other four types of adjusted accountability indicators in terms
of the two types of reliability – internal consistency and test-retest stability reliability –
and the three types of validity – discriminant, convergent, and concurrent – examined.
However, although both similar school scores and ANCE scores are reliable and valid for
use as adjuncts to the current evaluation and accountability models, ANCE scores, but
not similar schools scores, can be disaggregated into ELA and math components. These
findings indicate that ANCE scores are the preferred adjuncts.
Phase Three – Adjusted Grade Level Equivalent Scores
In the process of conducting this research, an alternative method of evaluating
performance of teachers, specifically grade-level teams, was also made available for
analysis – adjusted grade-level equivalent (AGLE) scores. AGLE scores are adjusted
normal curve equivalent (ANCE) scores, which are equal interval standard scores that
were transformed from percentile scores in the respective content areas of ELA and math
and for the specific grade levels. The AGLE or ANCE scores incorporate grade-level
tiers ranging from a low of .01 to a high of .99. This study differentiated ANCE scores to
represent school-level ratings, while AGLE scores represent grade-level rankings.
The third phase of the investigation was guided by four assertions: (a) AGLE
indicators should not depend on the measurements (ELA or math) sampled – internal
consistency reliability; (b) AGLE indicators should be stable from 2009 to 2010 – test-
retest stability reliability. The criterion, as discussed in phase two above, is a minimum of
.49 for research purposes and .63 for high stakes decision-making purposes; (c) AGLE
indicators should not be correlated with institutional characteristics that are not under the
purview of the schooling systems – discriminant validity; and (d) AGLE indicators
should be moderately and positively correlated with the API – concurrent validity. The
following sections will be divided into the respective types of reliability and validity that
were evaluated in the third stage of the investigation – internal consistency reliability,
test-retest stability reliability, discriminant validity, and concurrent validity.
Internal consistency reliability. An evaluation of the internal consistency
reliability was conducted for the AGLE scores for grades 2 to 5 between the ELA and
math components for each respective grade level. As indicated in the previous chapter,
the Pearson Product-Moment correlations for each grade level from two to five ranged
from 0.69 to 0.80. When these correlations were inputted into the Spearman-Brown
Split-Half formula, the resultant internal consistency coefficients ranged from r_sb = 0.82
to r_sb = 0.89. In this manner, the AGLE scores examined in this study for the 2010 school
year were highly internally consistent between the ELA and math components.
Test-retest stability reliability. The test-retest stability reliability was inspected
for the AGLE scores for school years 2009 and 2010 for grades 2 through 5 from all K-5
elementary schools in LAUSD. To obtain the stability coefficients for each grade level,
the Pearson Product-Moment correlation was computed. Results showed that the Pearson
Product-Moment correlation coefficients ranged from r = 0.51 to r = 0.57. Using these
results, the Spearman-Brown stability coefficients were manually computed, and the data
revealed a low of r_xx = 0.68 and a high of r_xx = 0.72. According to the results, AGLE
scores demonstrated a moderate level of stability between 2009 and 2010 (Cohen, 1988).
Based on the set criteria for this research study, the reliability of the two-year sum was
suitable for research purposes but unsuitable for high stakes decision-making.
Discriminant validity. The first type of validity analyzed for the AGLE scores
was discriminant validity. To evaluate the extent to which the AGLE scores are able to
discriminate against the characteristics included in the SCI, the Pearson Product-Moment
correlation was computed for each grade level in the content areas of ELA and
math. Results indicated correlations ranging from a low of r = 0.00 to a high of r = 0.04.
Thus, because the AGLE scores were regressed on the SCI, it is not surprising that
all correlations were near zero, suggesting strong discrimination between the AGLE scores
and the SCI. In other words, as a consequence of their design, the analysis reiterated that
AGLE scores are not influenced by factors that are outside the purview of
the schooling systems, such as SES.
Concurrent validity. The information value of the AGLE scores was also
evaluated on the basis of their concurrent validity with the API. Accordingly, a Pearson
Product-Moment correlation analysis was completed for the AGLE scores with the API.
Results indicated that the correlations ranged from r = 0.37 to r = 0.44, which is in the
moderate range (Cohen, 1988).
Findings From the Third Phase of the Investigation – AGLE Scores
Guided by the assertions of the third phase of the current investigation, the
information value generated indicates that AGLE scores are reliable measures and are, to
a large extent, valid for use in evaluation and accountability. Thus, the results from the
third phase of this investigation warrant attention from future research and policy makers
for policy design. Clearly, current evaluation and accountability models are focused
intently on teacher and school level performance. The debate on how to effectively
measure teacher impact for meaningful educative feedback for improving teaching and
enhancing student learning persists. This study provides evidence for consideration of an
alternative method of performance evaluation and accountability. Specifically, data from
131
this study on AGLE scores suggest that the unit of analysis for the purposes of
accountability can be reliably and validly implemented for grade-level teacher teams.
Taking the unit of analysis to the grade-level teacher team – a subgroup ignored by
NCLB and PSAA – sample size can be increased from that of the teacher-level analysis,
making results more statistically reliable (Linn et al., 2002). Furthermore, it is important
to note that AGLE scores can be simply averaged to generate reliable school composite
scores in an NCE metric that are valid for use for evaluation and accountability purposes.
On the whole, the information value generated from this study on AGLE scores deserves
further analysis, scrutiny, and consideration from education researchers and policy
designers.
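As a minimal sketch of the school-composite idea referenced above: because AGLE scores share a common NCE metric, grade-level team scores can be averaged into a school score. The column names and values below are hypothetical, not drawn from the study's data.

```python
import pandas as pd

# Hypothetical grade-level AGLE scores (NCE metric) for two schools.
agle = pd.DataFrame({
    "school":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "grade":    [2, 3, 4, 5, 2, 3, 4, 5],
    "agle_ela": [52.1, 48.7, 55.0, 50.3, 44.2, 46.8, 43.9, 45.5],
})

# A simple unweighted mean of the grade-level scores yields a school
# composite on the same NCE scale; weighting by grade enrollment is one
# possible refinement.
school_composite = agle.groupby("school")["agle_ela"].mean()
print(school_composite)
```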
Implications
This research generated a number of relevant implications for the practice of
performance evaluation. The ensuing sections group these implications, which are
supported by the findings of this study, into two categories: practical implications and
policy implications.
Practical Implications
The findings of this study have practical implications, which can catalyze
actionable change at the classroom and school level. First, as the findings discussed
above make evident, different accountability indicators produce vastly different results
depending on what each does or does not control for. If the purpose of evaluation and
accountability is to enhance teaching and improve learning, reliable and valid feedback
is critical. Second, as Linn (2001a) stated, “the (accountability) system should not place
some students, some teachers, or some schools at a relative disadvantage in comparison
to others” (p. 5). LAUSD is disproportionately populated with schools serving students
very low in SES. The unadjusted variables and the varying value-added scores currently
implemented in the district’s comprehensive accountability model are correlated with
SES, making their use invalid and detrimental to enhancing teaching and improving
student learning.
In the same vein, schools in Houston, New York City, and Chicago have, in effect,
attached high stakes to their value-added ratings of teachers (Corcoran, 2010; Jacob,
2011). If termination or retention decisions for LAUSD teachers were based on scores as
unstable as those examined in this research, many good teachers could be displaced
while many ineffective teachers remained in classrooms teaching our nation’s children.
Lastly, anecdotal evidence suggests that poorer schools are making progress, yet the
high-stakes accountability system does not take that progress into account because it
requires all students to meet proficiency status regardless of their starting points.
Teachers and schools serving disadvantaged students are therefore more likely to be
sanctioned. Retaining effective teachers at schools that are facing or about to face
sanctions may prove difficult, and critically unfavorable for all students, especially those
needing the most support.
Policy Implications
The current research also generated several cogent policy implications. First, the
NCLB AYP metric and California’s PSAA API indicator are not fair or valid estimates
of school progress. Their strong correlation with poverty status and other factors outside
the purview of the schooling system will continue to disadvantage the teachers and
schools serving the very students whose educational outcomes the accountability
systems purport to improve. Second, as indicated by the information value generated
from this study on the unadjusted and adjusted variables in use or available for use,
policymakers must look beyond the indicators currently in place. Data from this research
indicate that ANCE scores and AGLE scores are reliable and valid for use as alternative
measures of performance. Lastly, as a unique contribution to the literature on evaluation
and accountability models, this study produced relevant information value on AGLE
scores, which can be implemented to provide educative feedback for grade-level teams.
Policy designers can now initiate investigation into incorporating grade-level teams, an
instructionally significant subgroup ignored under NCLB and PSAA, into evaluation and
accountability design.
Future Research
This study generated several salient implications, both practical and for the
design of accountability policies that link performance evaluation models to student
assessment data. However, further research is needed to build a body of knowledge that
can substantiate and augment the information value of the various achievement
indicators in use or available for use in performance evaluation. More scrutiny of the
findings pertaining to ANCE and AGLE scores will greatly enhance their potential for
use in evaluation and accountability policy design. In addition, more than two school
years of data may establish more substantive information value, supporting AGLE and
ANCE scores as reliable measures that are valid for implementation as adjuncts to
accountability systems. Further, this research used data from a single urban school
district in California and only at the K-5 elementary level; research on different types of
school districts and on other grade levels and subject areas would therefore be beneficial.
Conclusion
As was clear in the literature review and within the current model of performance
evaluation, the achievement indicators that comprise an accountability system impact
the overall rating of a teacher, grade-level team, or school. At the heart of any
accountability system is the need for evidence that supports the uses and interpretations
of the assessment scores. Evaluation and accountability processes are ultimately
implemented to enhance teacher performance and improve student outcomes, closing the
learning gap for all students, but especially disadvantaged students. However, the
alignment between this goal and the achievement indicators used to generate evaluative
information is skewed. Understanding that no one model can address every possible
concern, it is imperative that policymakers devise a comprehensive evaluation and
accountability model that encompasses a multi-dimensional perspective and is fair,
reliable, and valid given the intended purposes of evaluation and accountability. The
information value generated from this study is timely and warrants attention from
education researchers, policymakers, and practitioners.
Limitations and Delimitations
Limitations
1. The analysis and interpretation of the findings from the longitudinal value-added
study performed by Richard Buddin for the LA Times are limited in scope by the
statistical model used.
2. The analysis and interpretation of the findings from the longitudinal value-added
study performed by the Value-added Research Center (VARC) on behalf of
LAUSD are limited in scope by the statistical model used.
3. The evaluation of the student achievement scores used to estimate teacher
effectiveness is limited by the statistical soundness of the tests used to measure
student learning gains. Different accountability indicators can produce different
results.
Delimitations
1. The data collected for this study were from a large urban school district in
California – LAUSD. The size and demographic make-up of LAUSD may hinder
the generalizability of the research findings. The specific sample used for this
study (430 elementary schools) may also affect the appropriate scaling of the
results.
2. Student achievement data were collected only for grades 2 through 5 for all K-5
elementary schools in LAUSD. Further, the dataset for this study comprised only
two years of data. As a consequence, the generalizability of the findings is
constrained by the sampling method operationalized in this study.
REFERENCES
Aitkin, M., & Longford, N. (1986). Statistical modeling issues in school effectiveness
studies. Journal of the Royal Statistical Society. Series A (General), 149(1), 1-43.
Alexander, K. L., Entwisle, D. R., & Olson, L. S. (2001). Schools, achievement, and
inequality: A seasonal perspective. Educational Evaluation and Policy Analysis,
23(2), 171-191.
American Psychological Association. (1999). Standards for Educational and
Psychological Tests and Manuals. Washington, DC: American Psychological
Association.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Archuleta, S. (2002). Elementary and Secondary Education Act of 1965. Retrieved
July 30, 2011 from
http://si.unm.edu/si2002/SUSAN_A/TIMELINE/TIM_0015.HTM
Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in
value-added assessment of teachers. Journal of Educational and Behavioral
Statistics, 29(1), 37-65.
Battelle for Kids. (2011). LAUSD Academic Growth Over Time reports. Retrieved
September 13, 2011, from
http://portal.batelleforkids.org/bfk/lausd/AGT_Reports.html?sflang=en
Borman, G. D., & Kimball, S. M. (2005). Teacher quality and educational equality: Do
teachers with higher standards-based evaluation ratings close student achievement
gaps? The Elementary School Journal, 106(1), 3-20.
Boyd, D., Goldhaber, D., Lankford, H., & Wyckoff, J. (2007). The effect of certification
and preparation on teacher quality. The Future of Children, 17(1), 45-68.
Boyd, D., Lankford, H., Loeb, S., Rockoff, J., & Wyckoff, J. (2008). The narrowing gap
in New York City teacher qualifications and its implications for student
achievement in high-poverty schools. Journal of Policy Analysis and
Management, 27(4), 793-818.
Bradley, R. H., & Corwyn, R. F. (2002). Socioeconomic status and child development.
Annual Review of Psychology, 53(1), 371-399.
Braun, H. I. (2005). Using student progress to evaluate teachers: A primer on value
added models. Princeton, NJ: Educational Testing Services. Retrieved April 1,
2010, from http://www.ets.org/research/policy_research_reports/pic-vam
Brophy, J. E. (1973). Stability of teacher effectiveness. American Educational Research
Journal, 10(3), 245–252.
Bryk, A., Thum, Y. M., Easton, J. Q., & Luppescu, S. (1997). Assessing school academic
productivity: The case of Chicago school reform. Social Psychology of Education,
2(1), 103-142.
Buddin, R. (2010). How effective are Los Angeles elementary teachers and schools?
Retrieved December 10, 2010, from
http://www.latimes.com/media/acrobat/201008/55538493.pdf
Cal Education Code Sections 44660-44665. The Stull Act (98-TC-25)
Caldas, S. J., & Bankston, C. L. (1999). Multilevel examination of student, school, and
district-level effects on academic achievement. The Journal of Educational
Research, 93(2), 91-100.
California Commission on Teacher Credentialing. (1997). California standards for the
teaching profession. Retrieved April 5, 2011, from
http://www.btsa.ca.gov/ba/pubs/pdf/cstpreport.pdf
California Department of Education. (2010a). Overview of the 2009-10 accountability
progress report. Retrieved March 5, 2011, from
http://www.edcoe.k12.ca.us/departments/curriculum_instruction/documents/0818
10CILAccountability.pdf
California Department of Education. (2010b). California standardized testing and
reporting: Post-test guide technical information. Retrieved May 1, 2011, from
http://www.startest.org/pdfs/STAR.post-test_guide.2009.pdf
California Department of Education. (2010c). 2009-10 Academic Performance Index
reports: Information guide. Retrieved March 5, 2011, from
http://www.cde.ca.gov/ta/ac/ap/documents/infoguide09.pdf
California Department of Education. (2010d). Overview of the 2009 similar schools rank
based on the Academic Performance Index. Retrieved March 11, 2011, from
www.cde.ca.gov/ta/ac/ap/documents/simschl09b.pdf
California Department of Education. (2011). Descriptive Statistics and Correlation Tables
for California’s 2010 School Characteristics Index and Similar Schools Ranks.
Retrieved October 25, 2011, from
http://www.cde.ca.gov/ta/ac/ap/documents/tdgreport1011.pdf
California Department of Education. (n.d.). Accountability Progress Reporting (APR).
Retrieved December 20, 2010, from http://www.cde.ca.gov/ta/ac/ar/
California Teachers Association. (n.d.). Understanding the Stull Act. Retrieved February
25, 2011, from www.cta.org/.../4C41B7A33B4B4FE8A28E88803AD34C1B.ashx
Carlson, D. E. (2000). All students or the ones we taught? Paper presented at the Council
of Chief State School Officers Annual National Conference on Large-scale
Assessment, Snowbird, UT.
Choi, K., Goldschmidt, P., & Yamashiro, K. (2005). Exploring models of school
performance: From theory to practice. Yearbook of the National Society for the
Study of Education, 104(2), 119-146.
Choi, K., Seltzer, M., Herman, J., & Yamashiro, K. (2007). Children left behind in AYP
and non-AYP schools: Using student progress and the distribution of student
gains to validate AYP. Educational Measurement, Issues and Practice, 26(3), 21.
Clark, D. (1993). Teacher evaluation: A review of the literature with implications for
educators. Unpublished Seminar Paper, California State University of Long
Beach.
Clark, D., Martorell, P., & Rockoff, J. (2009). School Principals and School Performance.
National Center for Analysis of Longitudinal Data in Education Research
(CALDER), Working Paper 38. Washington, DC: The Urban Institute. Retrieved
February 24, 2011, from http://www.urban.org/url.cfm?ID=1001427
Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2006). Teacher-student matching and the
assessment of teacher effectiveness. The Journal of Human Resources, 41(4),
778-820.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale, NJ: Erlbaum.
Coleman, J. S. (1966). Equality of educational opportunity (COLEMAN) study (EEOS).
[Electronic version]. Ann Arbor, MI: Inter-university Consortium for Political
and Social Research. Retrieved March 8, 2011, from
webapp.icpsr.umich.edu/cocoon/ICPSRSTUDY/06389.xml
Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should
they be? The use of value-added measures of teacher effectiveness in policy and
practice. Providence, RI: Annenberg Institute for School Reform at Brown
University. Retrieved September 21, 2010, from:
http://www.annenberginstitute.org/pdf/valueAddedReport.pdf
Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2010). Teacher effectiveness on
high- and low-stakes tests. Paper presented at the Institute for Research on
Poverty summer workshop, Madison, WI.
Croninger, R. G., Rice, J. K., Rathbun, A., & Nishio, M. (2007). Teacher qualifications
and early learning: Effects of certification, degree, and experience on first-grade
student achievement. Economics of Education Review, 26(3), 312-324.
Cuban, L. (2004). The blackboard and the bottom line: Why schools can't be businesses.
Cambridge, Mass: Harvard University Press.
Daley, G., & Valdés, R. (2006). Value added analysis and classroom observation as
measures of teacher performance: A preliminary report (Publication No. 311).
Los Angeles: Los Angeles Unified School District; Program Evaluation and
Research Branch; Planning, Assessment and Research Division. Retrieved
February 15, 2011, from http://notebook.lausd.net/pls/ptl/
Danielson, C. (2007). Enhancing professional practice: A framework for
teaching. Alexandria, VA: Association for Supervision and Curriculum
Development.
Darling-Hammond, L. (2000). Teacher quality and student achievement: A review of
state policy evidence. Education Policy Analysis Archives, 8(1). Retrieved
September 10, 2011 from http://olam.ed.asu.edu/epaa/v8n1/
Darling-Hammond, L. (2007). Evaluating “no child left behind.” The Nation, 284(20),
11-11. Retrieved August 25, 2011, from
http://www.thenation.com/doc/20070521/darlinghammond
Donaldson, M. L. (2009). So long, Lake Wobegon? Using teacher evaluation to raise
teacher quality. Retrieved February 15, 2011, from
http://www.americanprogress.org/issues/2009/06/teacher_evaluation.html
Doran, H., & Izumi, L. (2004). Putting education to the test: A value-added model for
California. San Francisco, CA: Pacific Research Institute. Retrieved May 5, 2008,
from http://www.heartland.org/custom/semod_policybot/pdf/15626.pdf
Drury, D. & Doran, H. C. (2003). The value of value-added analysis. Policy research
brief for National School Board Association, 3(1), 1-4. Retrieved February 24,
2011, from www.cgp.upenn.edu/ope_news.html
Eberts, R. W., & Stone, J. A. (1984). Unions and public schools: The effect of collective
bargaining on American education. Lexington, MA: Lexington Books.
EdSource. (2002). California’s student data system: In need of improvement. Retrieved
January 16, 2011, from http://www.edsource.org/pub_edfct_stdtdatasys.html
EdSource. (2005). School accountability under NCLB: Ambitious goals and competing
systems. Retrieved January 16, 2011, from
http://www.edsource.org/pub_NCLB8-05.html
EdSource. (2005). The state’s official measure of school performance. Retrieved January
16, 2011, from http://www.edsource.org/pub_PerfMeasures6-05.html
Elementary and Secondary Education Act (2003). Retrieved on February 16,
2011, from www2.edtrust.org/edtrust/esea
Ehrenberg, R. G., & Brewer, D. J. (1994). Do school and teacher characteristics matter?
evidence from high school and beyond. Economics of Education Review, 13(1), 1-
17.
Ellett, C. D., & Teddlie, C. (2003). Teacher evaluation, teacher effectiveness and school
effectiveness: Perspectives from the USA. Journal of Personnel Evaluation in
Education, 17(1), 101-128.
Elmore, R. F., Ableman, C. H., & Fuhrman, S. H. (1996). The new accountability in state
education reform: From process to performance. In H. F. Ladd (Ed.), Holding
schools accountable: Performance-based reform in education (pp. 65–98).
Washington, DC: Brookings Institution Press.
Ferguson, R. F. (2002). What doesn’t meet the eye: Understanding and addressing racial
disparities in high achieving suburban schools. Retrieved March 5, 2011, from
http://www.ncrel.org/gap/ferg/index.html
Ferguson, R. F., & Ladd, H. F. (1996). How and why money matters: An analysis of
Alabama schools. In H. F. Ladd (Ed.), Holding schools accountable:
Performance-based reform in education (pp. 265–298). Washington, DC:
Brookings Institution Press.
Flanagan, A., & Grissmer, D. (2006). Using test scores to rank performance of districts:
Findings from Illinois. Retrieved February 12, 2011, from
http://www.rand.org/pubs/working_papers/2006/RAND_WR379.pdf
Friedman, T. L. (2005). The world is flat: A brief history of the twenty-first century /
Thomas L. Friedman. New York: Farrar, Straus and Giroux.
Fuhrman, S. H. (2004). Introduction. In S. H. Furhman & R. F. Elmore (Eds.),
Redesigning accountability systems for education (pp. 3-14). New York: Teachers
College Press.
Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A
research synthesis. Washington, D.C.: National Comprehensive Center for
Teacher Quality. Retrieved January 17, 2011, from
http://www.tqsource.org/publications/EvaluatingTeachEffectiveness.pdf
Goldhaber, D. (2002). The mystery of good teaching: Surveying the evidence on student
achievement and teachers’ characteristics. Education Next, 2, 50–55.
Goldhaber, D. D., & Brewer, D. J. (1997). Why don't schools and teachers seem to
matter? assessing the impact of unobservables on educational productivity. The
Journal of Human Resources, 32(3), 505-523.
Goldhaber, D. D., & Brewer, D. J. (2000). Does teacher certification matter? high school
teacher certification status and student achievement. Educational Evaluation and
Policy Analysis, 22(2), 129-145.
Goldhaber, D. D., Brewer, D. J., & Anderson, D. J. (1999). A three-way error
components analysis of educational productivity. Education Economics, 7(3),
199-208.
Goldschmidt, P., & Choi, K. (2007). The practical benefits of growth models for
accountability and the limitations under NCLB. Retrieved February 24, 2011,
from http://cse.ucla.edu/products/policy/cresst_policy9.pdf
Goldschmidt, P., Roschewski, P., Choi, K., Auty, W., Hebbler, S., Blank, R., et al.
(2005). Policymakers’ guide to growth models for school accountability: How do
accountability models differ? Washington, D.C.: Council of Chief State School
Officers. Retrieved January 17, 2011, from
http://www.ccsso.org/projects/scass/projects/accountability_systems_and_reporting_consortium/8705.cfm
Gonzales, P., Williams, T., Jocelyn, L., Roey, S., Kastberg, D., & Brenwald, S. (2008).
Highlights from TIMSS 2007. Mathematics and science achievement of U.S.
fourth and eighth-grade students in an international context. Washington, DC:
IES, NCES.
Gordon, R., Kane, T. J., & Staiger, D. O. (2006). Identifying Effective Teachers Using
Performance on the Job. The Hamilton Project White Paper 2006-01.
Washington, DC: The Brookings Institution.
Grinion, P. E. (1999). Academic achievement and poverty: Closing the achievement gap
between rich and poor high school students [Electronic version]. Dissertation
Abstracts International, 60 (02A), 0386.
Guarino, C. M., Hamilton, L. S., Lockwood, J. R., & Rathbun, A. H. (2006). Teacher
qualifications, instructional practices, and reading and mathematics gains of
kindergartners. (NCES 2006-031). Washington, DC: U.S. Department of
Education, National Center for Education Statistics. Retrieved February 24, 2011,
from http://search.proquest.com/docview/62087800?accountid=14749
Hamilton, L. S. & Koretz, D. M. (2002). Tests and their use in test-based accountability
systems. In L. S. Hamilton, B. M. Stecher, & S. P. Klein (Eds.) Making sense of
test-based accountability in education (pp. 13 – 49). Santa Monica, CA:
RAND.
Hanushek, E. A. (1992). The trade-off between child quantity and quality. Journal of
Political Economy, 100(1), 84-117.
Hanushek, E. A., Kain, J. F., O’Brien, D. M., & Rivkin, S. G. (2005). The market for
teacher quality. (Working Paper No. 11154). Cambridge, MA: National Bureau of
Economic Research. Retrieved February 24, 2011, from
http://www.nber.org/papers/w11154.pdf?new_window=1
Hanushek, E. A., Raymond, M. E., & Rivkin, S. G. (2004). Does it matter how we judge
school quality?. Paper presented for the annual meetings of the American
Education Finance Association, Salt Lake City, UT.
Hanushek, E. A., & Woessmann, L. (2011). The economics of international differences in
educational achievement. In E. A. Hanushek, S. Machin, & L. Woessmann (Eds.),
Handbook of the economics of education (Vol. 3, pp. 89-200). Amsterdam:
North-Holland.
Hawk, P., Coble, C. R., & Swanson, M. (1985). Certification: It does matter. Journal of
Teacher Education, 36(3), 13–15.
Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking,
promotion, and graduation. Washington, DC: National Academy Press.
Hocevar, D. (2010). Can state test data be used by elementary school principals to make
teacher level and grade-level instructional decisions? Unpublished Paper,
University of Southern California.
Hocevar, D., Brown, R., & Tate, K. (2008). Leveled assessment modeling project.
Unpublished manuscript, University of Southern California.
Jackson, C. K., & Bruegmann, E. (2009). Teaching students and teaching each other: The
importance of peer learning for teachers. American Economic Journal: Applied
Economics, 1(4), 85-85.
Jacob, B. A. (2011). Do principals fire the worst teachers? Educational Evaluation and
Policy Analysis, 33(4), 403-434.
Kane, T. J., Rockoff, J. E., & Staiger, D. O. (2008). What does certification tell us about
teacher effectiveness? evidence from New York City. Economics of Education
Review, 27(6), 615-631.
Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school
accountability measures. The Journal of Economic Perspectives, 16(4), 91-91.
Kane, T. J., Staiger, D. O., Grissmer, D., & Ladd, H. F. (2002). Volatility in school test
scores: Implications for test-based accountability systems. Brookings Papers on
Education Policy, 2002(5), 235-283.
Kiesling, H. J. (1984). Assignment practices and the relationship of instructional time to
the reading performance of elementary school children. Economics of Education
Review, 3(4), 341-350.
Kim, J. S., & Sunderman, G. L. (2005). Measuring academic proficiency under the no
child left behind act: Implications for educational equity. Educational Researcher,
34(8), 3-13.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991). The effects of high stakes
testing on achievement: Preliminary findings about generalization across tests.
Paper presented at the annual meeting of the American Educational Research
Association, Chicago, IL.
Lachat, M. A. (2004). Standards-based instruction and assessment for English language
learners. Thousand Oaks, CA: Corwin Press.
Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly
reported cutoff criteria: What did they really say? Organizational Research
Methods, 9, 202–220.
Lankford, H., Loeb, S., & Wyckoff, J. (2002). Teacher sorting and the plight of urban
schools: A descriptive analysis. Educational Evaluation and Policy Analysis,
24(1), 37-62.
Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29(2), 4-16.
Linn, R. L. (2001a). The design and evaluation assessment of accountability systems.
(CSE Tech. Rep. No. 539). Los Angeles, CA: Center for the Study of Evaluation.
Retrieved January 25, 2011, from
http://www.cse.ucla.edu/products/reports/TR539.pdf
Linn, R. L. (2001b). Reporting school quality in standards-based accountability systems.
Los Angeles, CA: National Center for Research on Evaluation, Standards, and
Student Testing.
Linn, R. L. (2004). Accountability models. In S.H. Furhman & R. F. Elmore (Eds.),
Redesigning accountability systems for education (pp.73-95). New York:
Teachers College Press.
Linn, R. L. (2005). Test-based educational accountability in the era of No Child Left
Behind. (CRESST report 651). Los Angeles: University of California, Center of
Research and Evaluation, and Student Testing. Retrieved January 25, 2011, from
http://cse.ucla.edu/products/reports/r651.pdf
Linn, R. L. (2006). Validity of inferences from test-based educational accountability
systems. Journal of Personnel Evaluation in Education, 19(1), 5-15.
Linn, R. L. (2008). Educational accountability systems. In K. E. Ryan & L. A. Shepard
(Eds.), The future of test-based educational accountability (pp. 3–24). New York,
NY: Routledge.
Linn, R. L., Baker, E. L., & Betebenner, D. W. (2002). Accountability systems:
Implications of requirements of the no child left behind act of 2001. Educational
Researcher, 31(6), 3-16.
Linn, R. L., & Haug, C. (2002). Stability of school-building accountability scores and
gains. Educational Evaluation and Policy Analysis, 24(1), 29-36.
Lissitz, R., Doran, H., Schafer, W., & Willhoft, J. (2006). Growth modeling, value
added modeling and linking: An introduction. In R. W. Lissitz (Ed.), Longitudinal
and Value-Added Models of Student Performance (pp. 1-46). Maple Grove, MN:
JAM Press.
Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and
emphasis regarding validity and education. Educational Researcher, 36(8), 437-
448.
Mahoney, J. W. (2006). How value added assessment helps improve schools. Edge: Phi
Delta Kappa International, 1 (4).
Martinez-Garcia, C., LaPrairie, K., & Slate, J. R. (2011). Accountability ratings of
elementary schools: Student demographics matter. Current Issues in Education,
14(1). Retrieved May 11, 2011, from http://cie.asu.edu/
Marzano, R. J. (2003). What works in schools: Translating research into action.
Alexandria, VA: Association for Supervision and Curriculum Development.
McCaffrey, D. F., Koretz, D. M., Lockwood, J. R., & Hamilton, L. S. (2003). Evaluating
value-added models for teacher accountability. Santa Monica, CA: RAND.
McDonnell, L. M. (2004). Politics, persuasion, and educational testing. Cambridge, MA:
Harvard Univ. Press.
McKinsey & Company (2009). The Economic Impact of the Achievement Gap in
America’s Schools. Retrieved January 17, 2011, from
http://mckinseyonsociety.com/the-economic-impact-of-the-achievement-gap-in-americas-schools/
Meece, J. L. (2002). Child and adolescent development for educators. Boston: McGraw-
Hill.
Medley, D., & Mitzel, H. (1963). Measuring classroom behavior by systematic
observation. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 247-
328). Chicago, IL: Rand McNally.
Meyer, R. H. (1995). Educational performance indicators: A critique. Institute for
Research on Poverty. Discussion Paper no. 1052-94. Retrieved February 17,
2011, from http://www.ssc.wisc.edu/irpweb/publications/dps/pdfs/dp105294.pdf
Milanowski, A., Kimball, S. M., & White, B. (2004). The relationship between standards
based teacher evaluation scores and student achievement. Paper presented at the
2004 American Educational Research Association annual conference, San Diego,
CA. Retrieved January 17, 2011, from
http://www.wcer.wisc.edu/cpre/papers/3site_long_TE_SA_AERA04TE.pdf
Millman, J. (1997). Grading teachers, grading schools: Is student achievement a valid
evaluation measure?. Thousand Oaks, CA: Corwin Press.
Mullens, J. E., Leighton, M. S., Laguarda, K. G., & O'Brien, E. (1996). Student learning,
teaching quality, and professional development: Theoretical linkages, current
measurement, and recommendations for future data collection. ( No. 96-28).
Washington, DC: U.S. Dept. of Education, Office of Educational Research and
Improvement, National Center for Education Statistics: Educational Resources
Information Center.
Murnane, R. J. (1975). The impact of school resources on the learning of inner city
children. Cambridge, MA: Ballinger Pub. Co.
Murnane, R. J., & Cohen, D. K. (1986). Merit pay and the evaluation problem: Why most
merit pay plans fail and a few survive. Harvard Educational Review, 56(1), 1-17.
Murnane, R. J., & Steele, J. L. (2007). What is the problem? the challenge of providing
effective teachers for all children. The Future of Children / Center for the Future
of Children, the David and Lucile Packard Foundation, 17(1), 15-43.
National Education Association. (2006). Appendix II: Making public schools great – An
initial look at approved and rejected states’ models in the U.S. Department of
Education’s “Growth Model Pilot Project”. Retrieved October 6, 2006, from
http://www.nea.org/lac/esea/images/policy.pdf#search=%22making%20public%2
0school s%20great%22
No Child Left Behind (NCLB) Act of 2001, Pub. L. No. 107-110, § 115, Stat. 1425
(2002).
Novak, J., & Fuller, B. (2003). Penalizing diverse schools? Similar test scores, but
different students, bring federal sanction. Berkeley, CA: Policy Analysis for
California Education.
Nye, B., Hedges, L., & Konstantopoulos, S. (2004). How large are teacher effects?.
Educational Evaluation and Policy Analysis, 26(3), 237-257.
Oakes, J. (1989). What educational indicators? The case for assessing the school context.
Educational Evaluation and Policy Analysis, 11(2), 181-199.
Oakes, J. (2003) Education inadequacy, inequality, and failed state policy: A synthesis of
expert reports prepared for Williams v. State of California. Retrieved September
12, 2011 from www.mofo.com/decentschools/expert_reports/oakes_report.pdf
Odden, A., & Kelley, C. (2002). Paying Teachers for What They Know and Do: New and
Smarter Compensation Strategies to Improve Schools (2nd ed). Thousand Oaks,
CA: Corwin Press.
Olson, L., & Hoff, J. D. (2005). U. S. to pilot new gauge of “growth”. Education Week,
25(13), 1-16. Retrieved March 3, 2011, from
http://www.edweek.org/ew/articles/2005/11/30/13growth.h25.html
One Hundred Third Congress of the United States. (1999). Goals 2000: Educate America
Act. Retrieved January 5, 2011, from http://www.ed.gov/legislation/GOALS2000/
TheAct/index.html
Park, G. (2007). California under the microscope: An evaluation of the practical effect of
different approaches to accountability. (Doctoral dissertation, University of
Southern California, 2007). Dissertation Abstracts International, 68(04), 124.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An
integrated approach. Hillsdale, NJ: Erlbaum.
Pennsylvania Department of Education. (n.d.). Supplemental documentation for 1999:
Reading, mathematics, and writing assessment report. Retrieved January 12,
2011, from http://www.wde.psu/edu/pssa/99supl.pdf
Perry, L. B., & McConney, A. (2010). School socioeconomic composition and student
outcomes in Australia: Implications for educational policy. Australian Journal of
Education, 54(1), 72-85.
Phillips, K. J. (2010). What does highly qualified mean for student achievement?
Evaluating the relationship between teacher quality indicators and at-risk
students’ mathematics and reading achievement gains in first grade. The
Elementary School Journal, 110(4), 464-493.
Pianta, R. (2005). Spotlight: Classroom observation, professional development, and
teacher quality. The Evaluation Exchange. 6(4). Retrieved January 12, 2011,
from http://www.gse.harvard.edu/hfrp/eval/issue32/spotlight3.html
Pintrich, P. R., & Schunk, D. H. (2002). Motivation in education: Theory, research, and
applications (2nd ed.). Upper Saddle River, NJ: Merrill Prentice Hall.
Porter, A. (1988). Indicators: Objective data or political tool? Phi Delta Kappan, 69(7),
503-508.
President Barack Obama – The White House. (2009). Race to the Top Fact Sheet.
Retrieved March 16, 2011, from http://www.whitehouse.gov/the-press-office/fact-
sheet-race-top
Public Policy Institute of California. (2005). The progress of English learners in
California schools. Retrieved January 23, 2011, from http://www.ppic.org
Public Schools Accountability Act of 1999, Cal SBX1, Education Code Part 28, Chapter
6.1 Sections 52050-52056 (1999). Retrieved March 5, 2011, from
http://www.leginfo.ca.gov/pub/9900/bill/sen/sb_00010050/sbx1_1_bill_1999040
_chaptered.html
Raudenbush, S.W. (2004). School, statistics, and poverty: Can we measure school
improvement? Princeton, NJ: Educational Testing Service, Policy Evaluation and
Research Center.
Ravitch, D. (2002). A brief history of testing and accountability. In W. M. Evers, & H.
Walberg (Eds.), School accountability (pp. 9-21). Stanford: Hoover Institution
Press.
Riddle, W. C. (2004). Adequate yearly progress (AYP): Implementation of the no child
left behind act. Washington, DC: Congressional Research Service. Retrieved
October 6, 2011, from http://www.opencrs.com/rpts/RL32495_20051026.pdf
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic
achievement. Econometrica, 73(2), 417-458.
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and
student achievement. The Quarterly Journal of Economics, 125(1), 175-214.
Rouse, C. E. (2005). Accounting for schools: Economic issues in measuring school
quality. In C. A. Dwyer (Ed.), Measurement and research in the accountability
era (pp. 275-298). Mahwah, NJ: Lawrence Erlbaum Associates.
Rowan, B., Correnti, R., & Miller, R. (2002). What large-scale, survey research tells us
about teacher effects on student achievement: Insights from the prospects study of
elementary schools. Teachers College Record, 104(8), 1525-1567.
Sanders, W. L. (2000). Value-added assessment from student achievement data:
Opportunities and hurdles. Journal of Personnel Evaluation in Education, 14(4),
329-339.
Sanders, W. L. (2004). A summary of conclusions drawn from longitudinal analyses of
student achievement data over the past 22 years (1982-2004). Presentation to
Governors Education Symposium, Asheville, NC.
Sanders, W. L., & Horn, S. P. (1994). The Tennessee value-added assessment system
(TVAAS): Mixed-model methodology in educational assessment. Journal of
Personnel Evaluation in Education, 8(3), 299-311.
Sanders, W. L., & Horn, S. P. (1998). Research findings from the Tennessee value-added
assessment system (TVAAS) database: Implications for educational evaluation
and research. Journal of Personnel Evaluation in Education, 12(3), 247-256.
Sanders, W. L., & Rivers, J. C. (1996). Cumulative and residual effects of teachers on
future student academic achievement. Retrieved February 24, 2011, from
http://beteronderwijsnederland.net/files/cumulative%20and%20residual%20effects%20of%20teachers.pdf
Sanders, W. L., Wright, S. P., & Horn, S. P. (1997). Teacher and classroom context
effects on student achievement: Implications for teacher evaluation. Journal of
Personnel Evaluation in Education, 11(1), 57-67.
Sanders, W., Wright, S. P., & Rivers, J. C. (2006) Measurement of academic growth of
individual students toward variable and meaningful academic standards. In R. W.
Lissitz (Ed.), Longitudinal and value added modeling of student performance.
Maple Grove, MN: JAM Press.
Schunk, D. H. (2000). Learning theories: An educational perspective. Upper Saddle
River, NJ: Merrill Prentice Hall.
Shepard, L., Kupermintz, H., & Linn, R. (2000). Cautions regarding the Sanders value
added assessment system. Response panel comments presented at the annual
conference of the Colorado Staff Development Council, Denver, CO.
Sirotnik, K. A. (2004). Introduction: Critical concerns about accountability concepts and
practices. In K. A. Sirontnik (Ed.) Holding accountability accountable: What
ought to matter in public education. (pp. 1-17). New York, NY: Teachers
College, Columbia University.
Swanson, M., Hawk, P. P., & Coble, C. R. (1985). Certification: It does matter. Journal
of Teacher Education, 36(3), 13-15.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics. Boston:
Pearson/Allyn & Bacon.
Teacher Effectiveness Task Force: Los Angeles Unified School District Final Report.
(2010). Retrieved March 21, 2011, from
http://www.dkfoundation.org/PDF/LAUSD-TeacherEffectiveness-Task-Force
Report-FINAL.pdf
Teddlie, C., Reynolds, D., & Sammons, P. (2000). The methodology and scientific
properties of school effectiveness research. In C. Teddlie & D. Reynolds (Eds.),
The international handbook of school effectiveness research. (pp. 55-133).
London: Falmer Press.
The National Commission on Excellence in Education. (1983). A nation at risk.
Retrieved January 11, 2011, from http://www.ed.gov/pubs/NatAtRisk/risk.html
Thum, Y. M. (2003). No Child Left Behind: Methodological challenges &
recommendations for measuring adequate yearly progress. CSE Technical
Report. Los Angeles, CA: Center for the Research on Evaluation, Standards, and
Student Testing, University of California, Los Angeles.
Toch, T., & Rothman, R. (2008). Rush to judgment: Teacher evaluation in public
education. Education Sector Reports. Retrieved January 11, 2011, from
http://www.educationsector.org/research/research_show.htm?doc_id=656300
Tyler, J. H., Taylor, E. S., Kane, T. J., & Wooten, A. L. (2010). Using student
performance data to identify effective classroom practices. The American
Economic Review, 100(2), 256-256.
United States Department of Education. (2002). No Child Left Behind Legislation and
Policies. Retrieved December 19, 2010, from
http://www.ed.gov/policy/elsec/leg/esea02/index.html
United States Department of Education. (2009). Guidance on the state fiscal stabilization
fund program. Retrieved March 8, 2011, from
http://www.ed.gov/programs/statestabilization/guidance.pdf
United States Congress. (2001). No Child Left Behind Act of 2001. Retrieved January 10,
2011, from http://www.ed.gov/policy/elsec/leg/esea02/index.html
United States Department of Education. (2010) Race to the Top Fund. Retrieved March
1, 2011, from http://www2.ed.gov/programs/racetothetop/index.html
Wayne, A. J., & Youngs, P. (2003). Teacher characteristics and student achievement
gains. Review of Educational Research, 73(1), 89-122.
Webster, W., Mendro, R., Orsak, T., & Weerasinghe, D. (1998). An application of
hierarchical linear modeling to the estimation of school and teacher effect. Paper
presented at the annual meeting of the American Educational Research
Association, San Diego, CA.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our
national failure to acknowledge and act on differences in teacher effectiveness.
New York: The New Teacher Project. Retrieved March 11, 2011, from
http://www.widgeteffect.org
Wilson, B., & Wood, J. A. (1996). Teacher evaluation: A national dilemma. Journal of
Personnel Evaluation in Education, 10(1), 75-82.
Woody, E., Buttles, M., Kafka, J., Park, S., & Russell, J. (2004). Voices from the field:
Educators respond to accountability. Berkeley, CA: Policy Analysis for
California Education.
Xu, D. (2000). The relationship of school spending and student academic achievement
when achievement is measured by value-added scores. Unpublished doctoral
dissertation, Vanderbilt University, Nashville.
ABSTRACT

Current models of evaluation and accountability utilize varying unadjusted measures of
student achievement to reward or sanction schools. These unadjusted accountability
indicators do not account for differences in student or school characteristics that
contribute to variations in assessment results. Since the Coleman Report (1966), a
guiding principle in accountability design has been that educational outcomes data
should be used only after the effects of institutional characteristics have been statistically
removed. Such indices are called adjusted indicators, where an adjustment is either
statistical or through aggregation.

The purpose of this study is to analyze the reliability (internal consistency and test-retest)
and validity (discriminant, convergent, and concurrent) of six available accountability
indicator systems: (a) API improvement scores, (b) APISED scores (SED =
socioeconomically disadvantaged), (c) similar schools scores, (d) LA Times value-added
scores, (e) academic growth over time (AGT) value-added scores (ELA and math), and
(f) adjusted normal curve equivalent (ANCE) scores (ELA and math). Each system has
been proposed as an adjunct to the currently operationalized school status (average
achievement) scores.

The population included all K-5 elementary schools in LAUSD