A Meta-Analysis of Formative Assessment on Academic Achievement Among Black and
Hispanic Students
Alex Gilewski
Rossier School of Education
University of Southern California
A dissertation submitted to the faculty
in partial fulfillment of the requirements for the degree of
Doctor of Education
May 2024
© Copyright by Alex Gilewski 2024
All Rights Reserved
The Committee for Alex Gilewski certifies the approval of this Dissertation
Patricia Tobey
Adam Kho, Committee Co-chair
Erika Patall, Committee Co-chair
Rossier School of Education
University of Southern California
2024
Abstract
Formative assessment encompasses a variety of educational practices geared towards improving
both learning and instruction. Formative assessment has been explored through John Hattie’s
Visible Learning project, which found a moderately positive effect (g = 0.4) on achievement.
Most educational research on the topic, however, is conducted on populations that are
disproportionately White. A meta-analysis was conducted on formative assessment interventions,
limited to studies originally included within Hattie’s analysis that were conducted within the
United States with samples that were cumulatively at least 40% Black and Hispanic. Four
hundred and three reports were screened for eligibility; only 20 met the population/setting
criteria, and eight of those did not include sufficient information to calculate an effect size. A
meta-analysis of the remaining 12 reports, containing 22 effects, revealed a statistically
significant average effect on
achievement (g = 0.332). The results also indicated a very large degree of heterogeneity, which
was expected given the broad nature of formative assessment. Publication status, percent Black,
percent Hispanic, school level, subject domain, type of formative assessment intervention, and
intervention duration were evaluated as potential moderators. However, no factor significantly
moderated the effect. Overall, the meta-analysis highlights the dearth of studies conducted with
diverse populations on the topic of formative assessment and provides a glimpse into how the
practice of formative assessment can influence achievement among populations with significant
proportions of students of color.
Keywords: meta-analysis, formative assessment, Visible Learning, racially diverse
Acknowledgements
I have no known conflicts of interest to disclose. I wish to extend gratitude to Erika
Patall, Adam Kho, and Patricia Tobey for their guidance in developing this dissertation.
Additionally, I thank Amanda Vite for reviewing the coding, as well as Ani Aharonian and
Katarina Garcia for assistance and support.
Table of Contents
Abstract
Acknowledgements
List of Tables
List of Figures
Review of the Prior Literature
Defining Formative Assessment
Visible Learning
Theoretical Foundations for Formative Assessment
Review of Empirical Research on Formative Assessment
Factors Contributing to Variation in Intervention Effects
The Present Synthesis
Methods
Literature Search
Inclusion Criteria
Data Extraction
Computing Effect Sizes and Data Analysis
Results
Overall Average Effect
Publication Bias
Moderator Analyses
Discussion
Exploring Heterogeneity
Limitations and Implications for Practice
Conclusions
References
Tables
Figures
Appendix A: Studies Included in Meta-Analysis
Appendix B: Coding Guide
List of Tables
Table 1: Overall Average Effect of Formative Assessment Interventions on Achievement
Table 2: Results of Moderator Analyses
Appendix A: Studies Included in Meta-Analysis
Appendix B: Coding Guide
List of Figures
Figure 1: PRISMA Chart
A Meta-Analysis of Formative Assessment on Academic Achievement Among Black and
Hispanic Students
Formative assessment, a widely used approach within primary, secondary, and postsecondary education (Bennett, 2011), has been described as a crucial component of education
worldwide (Schildkamp et al., 2020). It has even been referred to as a “pillar of educational
significance” (Van der Kleij et al., 2018, p. 620). At its core, formative evaluation provides
feedback to both teachers and students on the progress of learning. Numerous empirical articles,
reviews (see Bennett, 2011; Black & Wiliam, 1998; Schildkamp et al., 2020) and meta-analyses
(see Briggs et al., 2012; Chen & Li, 2021; Fuchs & Fuchs, 1986; Graham et al., 2015; Kingston
& Nash, 2011; Lee et al., 2020; McMillan et al., 2013; Sanchez et al., 2017; Xuan et al., 2022)
have explored its benefits.
Related approaches that have gained attention in the context of formative evaluation are
peer assessment, self-assessment, and clickers. Peer assessment involves students providing
feedback and evaluating each other's work. It serves as an additional layer of feedback and
contributes to the development of critical thinking and self-regulatory skills, while reaping
additional benefits provided by social learning. The benefits include promoting deeper
understanding, fostering collaboration, and enhancing students’ metacognitive abilities (Van
Zundert et al., 2010). Meta-analyses have shown that peer assessment, along with formative
evaluation, can have a positive impact on student achievement (Double et al., 2020; Sanchez et
al., 2017). Self-assessment is also used in formative work; it encourages students to reflect
upon their own understanding, further developing metacognitive and self-regulation skills,
much as when work is assessed by peers. As with peer assessment, self-assessment has
also been shown to improve academic achievement (Gibbs & Taylor, 2016; Sanchez et al.,
2017).
While peer assessment and self-assessment change the source of feedback on a formative
assessment, clickers use technology to inform instructors in real time. A clicker is a device which
students use to input a response to a multiple-choice question posed by the instructor. The
instructor is immediately able to see the responses and address misconceptions, evaluate class
performance en masse, and adjust instruction accordingly. As such, peer assessment, self-assessment, and clickers all relate directly to formative assessment.
The Visible Learning project by John Hattie recognizes formative evaluation as having
the “potential to accelerate” achievement, with a weighted mean effect size of d = 0.40. Peer
assessment, self- and peer-grading, and clickers are similarly categorized, with weighted effect
sizes of 0.41, 0.54, and 0.24, respectively (Hattie, 2008, 2023). These findings underscore the
potential of peer/self-assessment and technology as complementary strategies to formative
evaluation in enhancing student learning outcomes.
However, it is crucial to consider the limitations of generalizing these benefits to all
student populations, as many educational studies in the United States have been found to have an
over-representation of White participants (Graham, 1992; Konkel, 2015). This raises concerns
about the applicability of findings from meta-analyses to diverse populations. Notably, there
exist significant racial and ethnic disparities in academic achievement, with Black and Hispanic
students graduating from high school and college at lower rates compared to their White
counterparts (Fry et al., 2021). Therefore, it is imperative to investigate whether the benefits of
formative evaluation and peer/self-assessment extend to Black and Hispanic students. By
conducting a meta-analysis of the intervention and correlation studies used in Visible Learning
that focus on the impact of these approaches on student achievement with significant (cumulative
40+%) Black and Hispanic populations, this research synthesis aims to contribute valuable
insights to address and mitigate racial and ethnic educational inequity within the United States.
Herein, the effects of formative evaluation will be examined to evaluate whether formative
evaluation is more, less, or equally effective for populations with larger proportions of Black and
Hispanic students.
Review of the Prior Literature
Defining Formative Assessment
Formative assessment, synonymous with formative evaluation, is an integral component
of the educational process that aims to gather information and provide feedback on students'
learning progress during the instructional period. It is an ongoing, interactive, and diagnostic
approach that focuses on understanding students’ strengths, weaknesses, and areas requiring
further development. Unlike summative assessment, which typically occurs at the end of a
learning unit or course to judge overall learning, formative assessment occurs throughout the
learning process, allowing students to enhance their understanding and teachers to adjust their
instruction. The first use of the term “formative evaluation” was to distinguish an ongoing
process of evaluation of curriculum development from a discrete summative evaluation of a
curriculum (Scriven, 1966). Soon thereafter, the term was modified to classroom practice by
instructors to improve learning outcomes (Bloom et al., 1971), again in contrast to summative
evaluation. While Bloom noted the same test could be used either formatively or summatively,
the formative value would be diminished if the test were used in the grading process (Wiliam,
2006). Going further, Wiliam highlights that, for Bloom and Scriven, an assessment was only
formative if its information was used in some way to shape a student’s
learning (and/or a teacher’s instruction).
While educational researchers agree that summative assessment is inherently not
formative, there is no singular definition for formative evaluation (Bennett, 2011). In general,
literature definitions can be categorized dichotomously. One school believes formative
assessment must refer to concrete, often written, assignments constructed by the instructor for
the purpose of providing feedback. The other expands the definition to include any activity
whose result is change based on feedback, regardless of whether the activity was planned or
intended to generate an opportunity for feedback. The first definition is quite narrow, which may
limit the number of studies incorporated into this synthesis. The second definition is very broad,
even including after-the-fact activities, and therefore exceedingly difficult to operationalize.
Accordingly, for the purposes of this synthesis, formative evaluation will be defined as any
activity or assignment in which improving learning through feedback is the primary goal. This
definition is similar to “assessment for learning,” another commonly used term for formative
assessment (Black et al., 2003).
In practice, peer and self-assessments are subtypes of formative evaluation in which the
feedback is produced by someone other than the instructor. While Hattie treats formative
evaluation, clickers, peer assessment, and peer- and self-grading as four distinct influences on
learning, the four have been combined in this research synthesis due to their formative nature to
increase the statistical power of the meta-analysis. Ultimately, while the source of feedback (i.e.,
instructor vs. peer) was going to be examined as a moderating factor, a lack of reports made this
analysis impossible. However, the use of technology (i.e., clickers) within the intervention was
explored as a potential moderator.
Visible Learning
Visible Learning (Hattie, 2008) was originally published in 2008 and has been widely
adopted by K–12 educators throughout the United States and the world. Indeed, Google Scholar
indicates that Visible Learning has been cited nearly 27,000 times in academic works. The work
is a meta-meta-analysis, or meta-analysis of meta-analyses. In it, over 800 meta-analyses are
categorized as influences and are synthesized to determine what is effective for teaching and
learning. The massive undertaking was designated the “holy grail” of education research
by the Times Educational Supplement in 2008, a label eagerly adopted by Hattie. It should be
noted that the work has been adapted into a for-profit training program offered by a company
owned by Hattie. The widespread adoption of the book notwithstanding, the work is not without
critique.
Critical responses to Visible Learning highlight multiple potential flaws with the work.
With over 800 meta-analyses containing 15,000 studies examined by one primary author,
mistakes both mathematical and methodological have become apparent (Terhart, 2011; Wecker
et al., 2017). Common criticisms center around the lack of transparency behind the quantitative
analysis conducted. For example, some individual studies used to calculate the effect of an
influence on achievement do not measure academic achievement directly, relying instead on
proxies such as attendance. Other critiques focus on non-methodological issues, such as
how the book promotes neoliberalism, sexism, and ableism (McKnight & Whitburn, 2020) or
how the book centers learning in a teacher-centric authoritarian classroom (Terhart, 2011).
This dissertation seeks to examine a different flaw within the work: the solely
quantitative nature of meta-analysis paints an incomplete picture; numbers are presented without
transparent context (Terhart, 2011). Indeed, Terhart continues by noting that only influences
which were of empirical interest at the time of (and preceding) the meta-analysis could be
included. Finally, it is important to establish that a meta-meta-analysis, such as Visible Learning,
presents results twice removed from the original data, making the actual effect of each
intervention less clear (Sundar & Agarwal, 2021). This dissertation continues Terhart’s argument
while incorporating the cautions of meta-meta-analysis noted by Sundar and Agarwal by
critically examining the populations of the individual studies within Visible Learning.
Theoretical Foundations for Formative Assessment
The benefits of formative assessment center around feedback. The purpose of feedback is
to both reveal and reduce discrepancies between a student’s current understanding of a
concept and a desired level of understanding (Hattie & Timperley, 2007), while simultaneously
informing instruction. A multitude of models have been developed to explain how feedback
influences learning (Lipnevich & Panadero, 2021). These include information-processing
(Kulhavy & Stock, 1989) and self-regulation (D. L. Butler & Winne, 1995; Nicol &
Macfarlane-Dick, 2006), among others.
Information-processing theory (IPT) is a collection of models of how memory functions.
A widely used model is the multi-store model developed in the 1960s (Atkinson & Shiffrin,
1968). It explains how stimuli are shifted between sensory memory, short-term (or working)
memory, and long-term memory. Of relevance to feedback and formative evaluation is the
notion that rehearsal of information retrieved from long-term memory into the short-term store
can help strengthen the duration and accuracy of the information when it is transferred back to
long-term memory. Formative assessment allows students the opportunity to draw information
from their long-term stores into working memory for the purpose of rehearsal. The information is
then transferred back to long-term memory. This process is iterative, and each cycle can bolster
the strength of the memory (Kulhavy & Stock, 1989).
Formative assessment additionally allows for misconceptions to be corrected before they
are solidified in long-term stores. A meta-analysis exploring the timing of feedback found that
for formative work, immediate feedback was more beneficial than delayed feedback for learning
(Kulik & Kulik, 1988). While learning is complex, these findings suggest that addressing
misconceptions early may play a role in helping students to learn. Additional theoretical
frameworks stress the importance of scaffolding new knowledge on prior knowledge to support
long-term retention (e.g., Vygotskiĭ, 1986). Formative assessment allows instructors to identify
and fill gaps in a student’s conceptual scaffold, thereby strengthening the scaffold and
leading to better understanding.
Formative assessment allows students to develop self-regulatory skills, which can lead to
higher motivation and achievement. The process of self-regulation includes goal setting, self-observation, self-evaluation, judgment, and self-rewarding, all oriented toward achieving goals
(Schunk & Zimmerman, 2013; Vohs & Baumeister, 2016). A prominent model for self-regulation and feedback explores how internal and external feedback mechanisms work together
to facilitate learning (Butler & Winne, 1995). In their original 1995 model, Butler and Winne
describe internal feedback as several paths that inform knowledge and belief domains, goals, and
strategies to meet those goals. Internal feedback can be used to adjust or set new goals as
students self-evaluate their progress. Alternatively, strategies for attaining goals can be
reassessed, or new strategies can be adopted. External feedback contributes by supporting the
regulatory decisions students make, either confirming or supplementing the student’s
understanding of their pathway and progress. External feedback can also contradict the student’s
self-assessment, leading the student to reexamine their goals and strategies, in turn leading to
higher achievement.
The Butler-Winne model was expanded by Nicol and Macfarlane-Dick in 2006 by
making a distinction between internal learning outcomes and externally observable outcomes,
both of which can be influenced by internal feedback. Teachers, peers, and other sources of
external feedback can only act on externally produced responses, such as those generated
through formative assessment. Formative work, therefore, can provide students with
opportunities to refine their internal self-assessment skills while simultaneously allowing for
ample external feedback.
The Nicol and Macfarlane-Dick model additionally highlights that the act of providing
quality feedback to students also yields important information for instructors regarding their
own instructional practices. This is an inherently useful aspect of formative assessment, upon
which an instructor can reflect and regulate their own instructional behaviors.
While the purpose of this synthesis is to explore the effects of formative evaluation on students
as it relates to their direct academic achievement, it is important to highlight the utility of
formative assessment to inform teaching practice (see Black & Wiliam, 1998, 2009). Of course,
improved instruction can also lead to improved academic achievement, although indicators of
improvement may lag in comparison to the benefits students gain directly from utilizing
feedback from formative assessment.
Quantitative Critical Race Theory
Formative assessment typically enhances educational achievement. However, this
assertion is based on quantitative results reported in peer-reviewed studies. Moreover, these
studies are often conducted with convenience samples of disproportionately White representation
(Graham, 1992). The aim of this synthesis is to explore whether the benefits of formative
assessment extend to more diverse populations. Quantitative critical race theory (QuantCrit) (see
Garcia et al., 2018; Gillborn et al., 2018) provides a basis for why this analysis should be
conducted.
QuantCrit initially emerged as a response to the limitations of traditional quantitative
research methods in addressing issues of race and racism. Quantitative methods allow for rapid
collection and analysis of large amounts of data to make generalizations but lack nuance and
make assumptions which may hide inequities if not critically examined. QuantCrit highlights
how choices made in data collection, analysis, and reporting inherently introduce bias into
research, and that findings therein are not necessarily objective and/or neutral.
QuantCrit further emphasizes the importance of understanding intersectionality: race,
gender, class, and other social constructs are invariably intertwined. The pervasive nature of
racism within educational institutions already tips the scales toward the privileged; a student’s
social identity will have been instrumental in the course of events leading them to even have the
opportunity to participate in a study. The eligible pool of participants for a study at a university is
therefore already overly populated by the privileged. And so, does it make sense to generalize
the findings of a study based upon participants who do not accurately reflect the diversity of
humanity? QuantCrit provides a clear answer: no. While this synthesis is not able to fully
overcome the inequities perpetuated through quantitative methodologies, it represents a step
toward research that challenges extant power dynamics and supports marginalized
communities. Meta-analysis as a methodology uses quantitative research as
its primary data source, and the assumptions therein, observed through the lens of QuantCrit, will
be explored within the limitations discussed at the end of this synthesis.
Review of Empirical Research on Formative Assessment
Formative assessment is widely implemented and studied; however, there are relatively
few empirical studies directly linking formative assessment with academic achievement (Black
& Wiliam, 1998; Carrillo-de-la-Peña et al., 2009; Nendaz & Tekian, 1999). Indeed, a relatively
recent United States Department of Education meta-analysis of formative assessment in
elementary education found a total of 2,622 studies relating formative assessment and
achievement published between 1988 and 2014; of these, only 23 were determined to be rigorous
enough to study a causal link between formative assessment and academic achievement (Klute et
al., 2017). Another meta-analysis examining the effects of formative assessment on achievement
in K–12 education screened 3,730 studies and eliminated all but 33 (Lee et al., 2020). As such, it
is clear that there is a relative dearth of empirical causal research on the effects of formative
evaluation. This conclusion is echoed by reviews (Bennett, 2011; Black & Wiliam, 1998;
McMillan et al., 2013). Of particular note is the lack of published studies examining the
relationship between race and the effects of formative evaluation on academic achievement.
However, while the vast majority of studies may not be sufficient for the purposes of meta-analysis, they still speak to the ability of formative assessment to affect academic
achievement, often citing relationships based on the theoretical underpinnings discussed
previously. Here, general trends and discrepancies within empirical formative assessment studies
are highlighted.
Formative Assessment in General
Overall, formative assessment has a positive effect on academic outcomes. Meta-analyses
report overall weighted average effect sizes of 0.26 (Klute et al., 2017), 0.29 (Lee et al., 2020),
and 0.20 (Kingston & Nash, 2011). However, individual effects within studies may be much
more varied; for example, Klute and colleagues (2017) found a range of –0.46 to 1.22 among the
studies they reviewed. The effect size of –0.46 was found in a multivariate study exploring the
effects of goal setting and self-instruction, two components of self-regulation theory, within the
context of formative assessment among primarily White special education elementary students
(Johnson et al., 1997). Students were randomly assigned to one of four groups: reading strategy
instruction only, strategy instruction and goal-setting instruction, strategy instruction and self-instruction, and strategy instruction plus goal-setting instruction and self-instruction. In this
study, during the implementation of a reading strategy, instructors would add verbal or
written questions (i.e., formative assessment), have students indicate their goals and progress,
and/or have them evaluate their self-instruction, all while providing direct immediate feedback to
the student during the process. While the reading strategy was overall effective in supporting the
students’ reading comprehension skills, the students receiving the goal-setting intervention
performed worse on a retelling exercise than students receiving the strategy intervention alone.
The authors stress that the study did not compare the presence of self-regulatory skills with the
absence thereof, only the effects of their instruction in these skills; they suggest their reading
comprehension strategy instruction may have included implicit self-regulatory information. As a
secondary note, the authors suggest that their strategy intervention may have made the task so
familiar to students as to render self-regulatory skills, such as goal-setting, unnecessary.
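For reference, the g values quoted throughout this review are standardized mean differences with Hedges’ small-sample correction, and the pooled averages are inverse-variance weighted means. A minimal statement of the conventional formulas (general meta-analytic practice, not a claim about any one study’s computation):

$$ g = J\,\frac{\bar{X}_T - \bar{X}_C}{S_p}, \qquad J = 1 - \frac{3}{4(n_T + n_C - 2) - 1}, $$

$$ \bar{g} = \frac{\sum_i w_i g_i}{\sum_i w_i}, \qquad w_i = \frac{1}{v_i + \hat{\tau}^2}, $$

where $S_p$ is the pooled standard deviation, $v_i$ is the sampling variance of effect $i$, and $\hat{\tau}^2$ is the estimated between-study variance (zero under a fixed-effect model).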
Interestingly, the intervention with the highest effect size in the Klute et al. (2017) meta-analysis
also examined a special education population; most studies focus on general education classes. In the
study, students (primarily White) were assigned randomly to a control group or experimental
group, where both groups received a curriculum-based measurement treatment (Fuchs et al.,
1989). However, the experimental group used formative assessment data to design instructional
programs within the semester-long design, while the control group collected formative data but
did not utilize it. On a standardized reading exam, the experimental students far outperformed the
control group students. The disparity between these two studies with similar populations
highlights a key difficulty when researching formative assessment. In the Johnson study, the
treatment was intended to develop self-regulatory skills, while in the Fuchs study, the treatment
was intended to inform instruction. These are both aspects of formative assessment which are
thought to be beneficial to students. As such, how formative data are used is a likely potential
moderator for the effects of formative assessment on achievement. The plot thickens. In a
previous meta-analysis, this moderator has been dubbed “treatment type” and has been found to
generate different statistically significant effect sizes depending on whether formative
assessment treatment types were predominantly in the form of professional development,
curriculum-embedded assessment, computer-based feedback systems, specific use of student
feedback, or assessment conversations, class activities, and student reflection (Kingston & Nash,
2011). Kingston and Nash found that curriculum-embedded assessment such as in the Fuchs et
al. (1989) study (calculated g = 1.22) yielded an average effect size of –0.05 across seven
studies, while self-regulatory treatments, such as the goal-setting intervention implemented by
Johnson and colleagues in 1997 (calculated g = –0.46), produced a mean effect of 0.10 over three
studies. As such, it appears the specifics of each study may render it difficult to pinpoint a
precise effect size for the broad domain of formative assessment. In keeping with evaluating the
effects of formative assessment interventions on populations with relatively large Black and
Hispanic representation, all the studies compared and contrasted hereafter fit the criterion of
having a study population of at minimum 40% Black and Hispanic students. The next section
serves to elucidate how formative assessment intervention studies are conducted and how they
vary.
Examples of Formative Assessment Studies
Most studies differ by the type of formative assessment intervention which is
implemented. Nonetheless, it is helpful to illustrate how studies examine formative assessment
interventions with an example. Following the previous Kingston classification, some formative
assessment interventions take the form of curriculum-embedded assessments. Curriculum-embedded assessments refer to pre-generated materials which include formative assessments to
be used during the process of learning and allow for consistency in implementation across
instructors and institutions. They are often generated by third-party companies, rather than by
instructors themselves. In these studies, the typical experimental or quasi-experimental design is
manifested in treatment groups being composed of teachers who have attended some amount of
professional development centered on implementing (often company-developed) formal
formative assessment within the classroom. The assessment curriculum itself may be of short
duration (a few lessons) or year-long.
Typical achievement measures are standardized examinations, especially if a pre-/post-test design is utilized. For example, in a study with random assignment, a mathematics formative
assessment strategy called “Powersource” was implemented in 27 middle schools in multiple
states in the southwestern United States, wherein teachers either received professional
development on how to implement Powersource (treatment) or an unrelated professional
development of identical duration on evaluating technical quality of districtwide exams (Phelan
et al., 2011). To measure the impact of this formative assessment intervention, students
completed a test of prerequisite knowledge at the beginning of the year and completed a state
math assessment at the end of the year. Given the temporal distance between these relatively
short interventions and the summative assessment at the end of the year, any change would be
construed as long-term retention. While pre-/post-intervention gains did not differ between the
groups overall, the treatment was effective in helping students to remember the distributive
property in arithmetic. This type of intervention is similar to “just-in-time” teaching, in which
feedback on formative questions is provided immediately back to students and instructors
immediately adjust their instruction; this approach has been linked with beneficial educational outcomes
(Liberatore et al., 2017; Novak, 2011).
Given such mixed findings from short interventions, one can evaluate the effects of a longer-term
formative intervention. One such intervention is the Arts Achieve project through New York
City’s Department of Education. This project had music instructors in treatment schools attend
ongoing professional development on how to implement formative assessment in music
education, along with biannual collaborative discussions between instructors (Valle, 2015).
Control schools did not receive any specialized professional development concerning formative
assessment. Schools were matched by demographics and blocked by school level. Summative
benchmark assessments were given each year. When students between groups were individually
matched by propensity (same probability to receive the treatment) to mirror randomizing by
student as opposed to by school, students receiving the criterion-referenced formative assessments
outperformed control students in the areas of content knowledge and music listening skill on the
citywide benchmark examinations, but not in actual musical performance skill.
A study of the Work-Sampling System, a curriculum-embedded performance assessment,
found that low-income urban third and fourth grade students who learned in classrooms utilizing
the system on average significantly outperformed both a demographically matched comparison
group as well as the rest of students in the district on a standardized exam when compared to
their performance the prior year (Meisels, 2003). The finding was strongest in reading, but still
significant for mathematics. Notably, this effect was strongest for lower-performing students,
especially in math. The assessment system involved careful documentation of student work and
interactions in a portfolio to monitor progress and included both students and parents in the
learning process. Effectively, it allowed perceptual formative data to be recorded and used
when evaluating achievement and informing instruction. Finally, in this study, classrooms used
for establishing the treatment group had instructors who had been using the system for at
minimum three years, indicating extensive familiarity with it.
Indeed, the familiarity with implementing curriculum-embedded formative assessments
may play a role in how effective the intervention is. In a broad, randomized study across seven
states with middle and elementary students, use of a continuous progress monitoring system to
inform instruction improved math performance on a standardized exam in a pre-/post design,
but only when implemented correctly (Ysseldyke & Bolt, 2007).
Of course, not all curriculum-embedded formative assessments have been successful in
improving achievement. For example, a 2011 United States Department of Education report on
the effectiveness of Classroom Assessment for Student Learning (CASL), a widely used
curriculum-embedded formative assessment program coupled with associated professional
development, reported no difference on standardized math exam scores between schools
randomly assigned to use the program and those not using it (Randel et al., 2011). The report
included data from across the central United States. Interestingly, the data from the study were
reanalyzed with additional survey data, and the reanalysis again found nonsignificant
differences in student achievement between the treatment and control groups (Randel et al.,
2016). However, the program was found to enhance teachers’ knowledge of assessment and the
frequency of student involvement in classroom assessment. The findings from the Randel studies
are echoed in other studies; one study examining the effects of embedding formative
assessments into a problem-based learning curriculum in middle school mathematics classes
found no significant difference on problem-solving performance (M. D. Butler, 2014). These
studies again highlight the difficulty in establishing the effects of formative assessment; even
among similar grade levels, and even within mathematics specifically, studies show an array of findings.
However, it is notable that these findings have either been positive or neutral among the most
common types of formative assessment interventions.
While they are less frequently studied, technology-mediated formative assessment and
self-regulatory interventions have been demonstrated to enhance student learning. Computer-mediated technologies refer to those in which a computer is used to rapidly summarize formative
data, such as measuring frequency of responses to a multiple-choice question asked in a
classroom in real time. In a pre/post quasi-experimental design, a student response system (e.g.,
clickers) was used in high school classrooms to generate immediate feedback to students on in-class formative math questions (Manuel, 2016). Following the semester-long intervention,
students who received immediate feedback on their formative work through the computer-mediated system significantly outperformed the control group on a researcher-generated end-of-semester summative examination; the two groups performed similarly on the pre-intervention
exam. This finding is similar to other implementations of student response systems, which tend
to have either neutral or positive effects on academic achievement but appear to universally increase
student engagement (see Heaslip et al., 2014; Preszler et al., 2007; Trees & Jackson, 2007).
Among the least-studied intervention types are interventions directly tied to self-regulatory skills. Many studies appear to give credence to self-regulation when attributing effects
of formative assessment on achievement; however, few study it directly. Dale Schunk explored
the effects of goal setting and self-evaluation in formative work among fourth grade students in
math classes (Schunk, 1996). Students were given formative assessments designed to either
promote a mastery orientation or a performance orientation. In the first of two studies in the
report, Schunk had half of each of the mastery and performance groups complete self-evaluation
tasks. He found that the mastery group, both with and without self-evaluation, along with the
self-evaluation performance group all demonstrated higher skill in completing fraction-based
exercises. In the second study, he had all students complete self-evaluation tasks, and compared
the mastery and performance groups again. He found the mastery condition led to higher
achievement outcomes. As such, his findings demonstrate support for direct self-regulatory
interventions within the context of formative assessment. However, as noted previously, this
finding is not universal (e.g., as in Johnson et al., 1997). To recapitulate, Johnson argued that the
self-regulatory skills developed in their intervention were made redundant by repeated exposure
to the content.
In summary, formative assessment tends to have either positive or neutral effects on
academic achievement for studies with populations with significant Black and Hispanic
populations. Most interventions consist either of professional development, assessments
embedded into the curriculum, or some combination of the two. Other interventions have
explored the use of technology or self-regulatory skills. Findings from studies exploring
formative assessment are not cut and dried; many factors may moderate the effect of the
intervention on achievement. Such factors include the duration of the implementation, the
familiarity of instructors with the assessment, and the fidelity with which instructors implement
the formative assessments as intended. While these results come from evaluating studies only
including the population of interest in this synthesis, other factors from studies in the general
population may be important to analyze for their potential to moderate the effects of formative
assessment on academic achievement.
Factors Contributing to Variation in Intervention Effects
The theoretical underpinnings behind the effect of formative evaluation on achievement,
coupled with findings within empirical literature, suggest a few moderating factors; other factors
have been indicated in prior meta-analyses and empirical studies. These moderators include
school level, subject domain, type of intervention, and intervention duration (Graham et al.,
2015; Kingston & Nash, 2011; Lee et al., 2020). Notably, race has not yet been included as a
potential moderator of the effects of formative assessment within meta-analyses.
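For context, the heterogeneity that motivates such moderator analyses is conventionally quantified with Cochran’s Q and the I² statistic; a standard formulation (a general meta-analytic convention, not specific to this synthesis) is:

$$ Q = \sum_{i=1}^{k} w_i (g_i - \bar{g})^2, \qquad I^2 = \max\!\left(0,\; \frac{Q - (k - 1)}{Q}\right) \times 100\%, $$

where $k$ is the number of effects and $w_i = 1/v_i$. A large I² indicates that most of the observed variation reflects true between-study differences, which is what moderator analyses attempt to explain.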
Race and Ethnicity
It is difficult to predict the relationship between race/ethnicity and the effects of
formative assessment on achievement. Race is a social construct, and social factors pervade
education, particularly as students develop and become more aware of the realities of their social
situations. On one hand, certain practices, such as peer assessment or culturally responsive
instruction, may be especially beneficial to Black and Hispanic students. On the other hand,
formative assessment may become more threatening to particular racial/ethnic groups as a result
of the classroom environment, students’ past experiences in school, or historical context.
Formative assessment inherently requires quality feedback to be given and received, and social
factors between the feedback provider and feedback recipient can affect how well the feedback is
utilized.
Peer feedback may be particularly beneficial to Black and Hispanic students. While
neither Black nor Hispanic cultures are monolithic, they both tend towards collectivism (Darwish
& Huber, 2003; Green et al., 2005). In a collectivist paradigm, relationships and group success
take precedence over individual efforts (Oyserman & Lee, 2008). Accordingly, Black and
Hispanic students may experience more collective fulfillment through peer assessment than
instructor assessment alone. Unfortunately, among the reports eligible for this analysis, only one
utilized peers as the primary source of feedback. It is not possible to conduct any meaningful
moderator analysis with only one data point, and so this synthesis is unable to examine the utility
of peer feedback within formative assessment. While peer feedback may be beneficial to Black
and Hispanic students, these populations may be less receptive to feedback from instructors with
whom they do not relate. Further, implicit bias, microaggressions, and stereotype threat all
undermine the relationship between instructor and student, potentially diminishing the ability of
formative assessment to improve achievement. With multiple positive and negative social factors
at play, it is difficult to parse whether race is related to formative assessment’s effect on
achievement. Accordingly, it is hypothesized that race will not moderate the effects of formative
assessment.
School Level
The school level at which formative assessment interventions are
implemented may contribute to variation in effects. Previous meta-analyses and reviews have
historically found no difference comparing the effects of formative assessment between grade
level brackets (Black & Wiliam, 1998; Kingston & Nash, 2011; Lee et al., 2020; Sanchez et al.,
2017). However, these meta-analyses have been conducted without regard to student race or
ethnicity. As mentioned previously, many social factors may moderate the ability of formative
assessment to influence achievement. Social factors change over time and with educational
context, so it is possible that grade level may have a significant effect on how
formative assessments are able to influence student achievement when looking solely at studies
with significant Black and Hispanic populations. Typically, social factors become more salient as
students progress through school levels. However, it is not obvious which factors are more
relevant to formative assessment, and therefore it is hypothesized that grade levels will not
moderate the effect of formative assessment.
Subject Domain
The subject domain of the course in which formative assessment is implemented is a
second factor which may moderate the effect of formative assessment upon academic
achievement. In general, the implementation of formative assessment in science yields lower effect
sizes than in subjects such as mathematics, English, or language arts. This may be
because the benefits of feedback interventions are more positive when the task is either familiar
or cognitively noncomplex (Kluger & DeNisi, 1996). Indeed, Lee and colleagues (2020) found a
relatively low effect size of 0.13 for formative assessment implementation in science, and higher
effect sizes for language (0.33) and arts (0.29). Kingston and Nash (2011) found a weighted
mean effect size of 0.17 for math and 0.09 for science (notably, the science estimate had a
standard error of 0.20, so the true effect could plausibly be negative), compared to a mean of 0.32
for English/language arts. Other meta-analyses calculated higher effect sizes than Kingston and
Nash (2011) in English/writing, including 0.46 in EFL courses in China (Chen & Li, 2021) and
0.61 by Graham and colleagues (2015), who studied writing in Grades 1–8 in the United States. For math, other
effect sizes have been found to be 0.34 (Lee et al., 2020) and 0.36 (Klute et al., 2017). Kingston
and Nash (2011) posit the complexity of science courses as central to this effect. While this
claim is conjecture on its own, a review of the relative difficulties of subject area examinations
found that science courses are not only often perceived as more difficult, but additionally have
more difficult outcome measures (Coe et al., 2008). A second suggested cause is
variability in the specific formative assessment process or assignment, a limitation that applies
across all disciplines. Kingston and Nash (2011) also found that content area was the only
moderator to explain more variance than chance alone, and other setting factors, such as the
grade level of the student/course, were nonsignificant.
Peer assessment is more prevalent in English/writing domains than any other subject.
One meta-analysis found 40.74% of eligible studies with peer assessment were conducted in
writing courses, with 12 other disciplines ranging from 1.85% to 14.81% (Double et al., 2020);
another found 65.67% of eligible studies with peer assessment to be conducted in “social science
and arts,” which included writing courses (Li et al., 2020). This is likely due to the iterative and
flexible nature of the writing process; revision is commonplace in English and writers have
freedom to explore a multitude of approaches, styles, and stances. Peer review is also somewhat
common in performance disciplines, such as music, art, and dance. Math and science, by
contrast, are typically assessed through solving problems which have one right answer,
particularly at primary and secondary levels, an approach inherently less receptive to peer review
and/or revision. As such, science and math may be more likely to benefit from peer assessment at
the post-secondary level, when writing assessments (i.e., journal-style lab reports) are used.
Nonetheless, despite most empirical studies being conducted in writing-specific courses, the
effect sizes of formative assessment interventions which include peer assessment are similar
between disciplines, with social sciences producing a weighted effect size of 0.284 standard
deviations and science and engineering yielding an effect size of 0.345, according to one meta-
analysis (Li et al., 2020). Due to the relative lack of use of peer feedback within STEM fields, it
is predicted that STEM domains will negatively moderate the effect of formative assessment, and
peers as the source of feedback may covary.
Type of Intervention
As noted in the review of extant literature on formative assessment, the type of
intervention implemented may contribute to variability in effects on achievement. To reiterate,
common types of formative assessment interventions include professional development,
curriculum-embedded assessment, technology-mediated assessment, and others. Professional
development-based interventions are the most common and have the highest potential to
positively influence the effects of formative assessment in previous meta-analyses (Kingston &
Nash, 2011). However, none of the studies which ultimately were included within this analysis
had professional development as the independent variable. Curriculum-embedded assessments
are packages purchased by school districts or states which include formal formative assessments
as a component. Technology-mediated formative assessments, such as those involving clickers,
use computers to provide real-time feedback to both students and instructors. Curriculum-embedded formative assessments have been found to either negatively affect achievement or not
affect it, while technology-mediated formative assessments tend to produce positive effects.
Other less frequently examined types of formative assessment include performance assessment,
portfolios, and low-stakes assignments (which can be peer-reviewed or instructor-reviewed). This
synthesis will explore whether previous trends are similar when limited to studies with diverse
populations. It is predicted here that the results from this meta-analysis will be similar to
previous ones; curriculum-embedded assessments will negatively moderate the effectiveness of
formative assessments, while technology-mediated formative assessments will positively
moderate it. Because other types of formative assessment are less frequently studied and may be
variable, it is predicted that other formative assessments will not moderate the effect on
achievement.
Intervention Duration
The factor of intervention duration is also discussed within the literature review. The
relationship is straightforward and logical; a longer duration of intervention will likely lead to
more pronounced effects. It is hypothesized that duration will positively moderate the effects of
formative assessment on achievement.
The Present Synthesis
No single study has had as its primary aim demonstrating the relationship between race and the
effectiveness of formative evaluation. Indeed, one of the strengths of meta-analysis is to
investigate heretofore “hidden” relationships that may exist in large amounts of data. As such, it
is the hope that this synthesis will be able to explore how race relates to the
ability of formative assessment to influence achievement.
Formative assessment has been shown to enhance student achievement in meta-analyses.
However, it is crucial to consider the limitations of generalizing these benefits to all student
populations, as many educational studies in the United States have been found to have an overrepresentation of White participants (Graham, 1992; Konkel, 2015). This raises concerns about
the applicability of findings from meta-analyses to diverse populations. Notably, there exist
significant racial and ethnic disparities in academic achievement, with Black and Hispanic
students graduating from high school and college at lower rates compared to their White
counterparts (Fry et al., 2021). As formative assessment is commonly utilized throughout
American educational systems, it is imperative to investigate whether the benefits of formative
evaluation and peer/self-assessment extend to Black and Hispanic students and that these
populations are not negatively affected by the use of formative assessment.
John Hattie’s Visible Learning, as of January 2023, produced an effect size of 0.40 for the
influence of formative assessment (which he calls formative evaluation), based on five meta-analyses
containing a cumulative 229 studies and 872 effect sizes. For clickers, three meta-analyses which
calculated 236 effects from 132 studies generated an effect size of 0.24. For peer assessment, two
meta-analyses examining 178 effects across 91 studies yielded an effect size of 0.44. Finally, for
peer- and self-grading, he found an effect size of 0.42 from four meta-analyses covering 86
effects in 61 studies.
It is hypothesized that formative assessment benefits Black and Hispanic students to a
similar degree as originally calculated by Hattie; as such, this synthesis is
primarily exploratory. By conducting a meta-analysis of the interventional studies used in Visible
Learning that focus on the impact of these approaches on student achievement with significant
(cumulative 40+%) Black and Hispanic populations, this research synthesis aims to contribute
valuable insights to address and mitigate racial and ethnic educational inequity within the United
States.
While Hattie originally presented formative evaluation, peer assessment, peer- and self-grading, and clickers as four distinct influences, there is a significant degree of overlap between
the four. Often, activities which undergo peer or self-evaluation are formative in nature, seeking
to provide feedback to improve learning. Clickers are primarily used in formative contexts to
both inform instruction and inform learning. Additionally, combining the effects of all four will
allow for a deeper, more nuanced exploration of potential moderators.
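To make the pooling concrete, the following is a minimal sketch of a random-effects combination of study-level effects, using the DerSimonian-Laird estimator as one common choice; it is illustrative only and not necessarily the exact estimator used in this dissertation, and the example values are hypothetical.

```python
import numpy as np

def random_effects_pool(g, v):
    """Pool study-level Hedges' g values under a random-effects model,
    using the DerSimonian-Laird estimator of between-study variance."""
    g, v = np.asarray(g, dtype=float), np.asarray(v, dtype=float)
    w = 1.0 / v                            # fixed-effect (inverse-variance) weights
    g_fixed = np.sum(w * g) / np.sum(w)    # fixed-effect pooled mean
    Q = np.sum(w * (g - g_fixed) ** 2)     # Cochran's Q (heterogeneity)
    df = len(g) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / c)          # DerSimonian-Laird tau^2
    w_re = 1.0 / (v + tau2)                # random-effects weights
    g_re = np.sum(w_re * g) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return g_re, se, tau2, Q

# Hypothetical illustration: three effects with their sampling variances.
print(random_effects_pool(g=[0.20, 0.50, 0.33], v=[0.02, 0.05, 0.03]))
```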
This meta-analysis seeks to address the following research questions:
1. To what extent is formative assessment related to academic achievement among
Black and Hispanic students?
2. To what extent does the relationship between formative assessment and achievement
among Black and Hispanic students vary by the proportion of Black or Hispanic
students?
3. To what extent does the relationship between formative assessment and achievement
among Black and Hispanic students vary across school level?
4. To what extent does the relationship between formative assessment and achievement
among Black and Hispanic students vary across subject domain?
5. To what extent does the relationship between formative assessment and achievement
among Black and Hispanic students vary across the type of formative assessment
intervention implemented?
6. To what extent does the relationship between formative assessment and achievement
among Black and Hispanic students vary by intervention duration?
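Questions 2 through 6 are moderator questions; in meta-analytic terms, each asks whether a study-level covariate explains between-study variance. A standard way to frame this (a general convention, not a claim about the exact model fitted here) is the mixed-effects meta-regression:

$$ g_i = \beta_0 + \beta_1 x_i + u_i + \varepsilon_i, \qquad u_i \sim N(0, \tau^2), \quad \varepsilon_i \sim N(0, v_i), $$

where $x_i$ codes the moderator (e.g., percent Black, school level) and a significant $\beta_1$ indicates moderation.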
Hypotheses
Based on the theoretical frameworks and empirical literature reviewed, I hypothesize the
following: Formative assessment will have a positive influence on achievement, with an
effect similar to that calculated by Hattie, ranging from 0.24 to 0.44 across the
individual influences of formative evaluation, peer assessment, peer- and self-grading, and
clickers. Race will not be related to the effects of formative assessment because the numerous
social factors at play, both positive and negative, cannot readily be parsed and may offset one
another. However, this exploration is critical to
ensuring formative assessments are not harmful to students. School level will similarly not
moderate the effects of formative assessment, but is still of interest to the field, and may also
indicate the effects of social factors that become more salient at higher grade levels. By contrast,
the subject domain will moderate the effect, with subjects such as English and history having
larger positive effects than math or science because past research has found STEM disciplines to
have more difficult outcome measures. The type of formative assessment intervention will
moderate the effect, with curriculum-embedded assessments negatively moderating the effects of
formative assessment relative to technology-mediated assessments. Finally, intervention duration
will moderate the effectiveness of formative assessment on enhancing achievement, such that
longer interventions produce stronger effects because prolonged exposure to formative
assessment provides students more opportunity to learn, overcome misconceptions, and develop
self-regulatory skills.
Methods
As the aim of this study was to re-examine an existing meta-analysis of meta-analyses,
the methodology employed here differs from that of a traditional meta-analysis. This section will
describe the process by which studies were retrieved, screened, coded, and analyzed.
Literature Search
The first step in re-examining the Visible Learning influences was retrieving their source
material: the original meta-analyses. These were identified using the Visible Learning MetaX website by searching each influence and retrieving the citations of the meta-analyses employed. I
utilized the following search platforms/databases, in this order, until the meta-analysis was
found: Google Scholar, the University of Southern California Library page, ProQuest, ERIC, and
Google. If a meta-analysis was not retrievable through this sequence, I enlisted the aid of the
University of Southern California librarians through the University of Southern California
interlibrary document delivery system. All 14 meta-analyses were retrieved, and none were duplicates. Finally, I cross-checked each meta-analysis with those listed on the MetaX website to ensure the correct reports were found.
Once each meta-analysis was retrieved, the next step was identifying which reports were
used as data. A graphical representation of this process can be found in the PRISMA (Page et al.,
2021) chart found as Figure 1. References for each were generally available within tables,
indicated as contributing to the meta-analysis in the reference section, or provided as a list in the
supplemental information. Following the same procedure as for the meta-analyses, I attempted to
retrieve each of the reports used therein, with one exception. Because the screening criteria for this meta-analysis included only studies conducted in the United States, and one meta-analysis (Kim, 2005) evaluated studies conducted in both South Korea and the United States, the Korean reports (k = 107) were excluded prior to screening. A total of 403 reports were eligible for screening after those 107 were removed from consideration. I then endeavored to retrieve these reports and was able to retrieve all but 27 after exhausting all search methods. Seven reports were duplicated across multiple meta-analyses, and the duplicates were removed.
Inclusion Criteria
The remaining 369 studies were screened for eligibility. A graphical representation of the
screening process can be found in Figure 1. To be included in this study, an empirical study must
have met three criteria. The first criterion was that a study must have been included in Hattie’s
Visible Learning project, which, by dint of the search method, was met for all studies retrieved.
The second criterion was that the study must have been conducted within the United States; 103 reports were excluded for being conducted elsewhere (in addition to the 107 studies in Korea
mentioned previously). The third criterion was that the sample population was at minimum 40%
Black and/or Hispanic students. Pursuant to this criterion, 178 reports were excluded for not
reporting race/ethnicity demographics and 67 were excluded for samples that did not meet the
40% threshold. One report was retracted and was therefore excluded from analysis. The
remaining 20 reports were eligible for inclusion in the present study. Designs represented within
the eligible reports were either quasi-experimental or experimental.
Data Extraction
To extract relevant data from individual studies, a coding guide was developed by the
principal investigators of this project (Adam Kho and Erika Patall). This guide was refined with input from me and other graduate students. I was trained over a period of 2
months in coding reports in alignment with this coding guide. Trainees would attempt to code
given studies before meeting as a group to discuss, explore, and resolve discrepancies. The
varied studies examined included effect sizes which had to be computed from correlation
coefficients, mean differences, sample sizes, standard deviations, F ratios, t-statistics, chi-square
statistics, and contingency tables. The studies used in training were experimental, quasi-experimental, or correlational. A formative assessment (no irony intended) was administered to
all trainees to allow them to determine whether they were prepared for individual coding. Upon
completion of coder training, I individually coded each of the 20 studies, and the coding was verified by another graduate student. Nine discrepancies arose during verification, and these were resolved
through discussion, consulting with additional coders and/or the principal investigators as
necessary. The total error rate was approximately 0.46%.
The following general categories were used in coding: report characteristics,
participant/sample attributes, predictor descriptions (including how the study defines formative
assessment), outcome measures (such as GPA, standardized test scores, or scores on researcher-developed materials), research design, and effect sizes. The coding focus follows the design of
the research questions; the racial demographics, feedback source, school level, subject domain,
type of formative assessment, and intervention duration were explicitly coded from each study.
The coding guide, along with specific codes used within each category, can be found in
Appendix B. After coding, eight reports were removed. Four did not provide sufficient
information to calculate an effect size and four had an outcome measure that was not
achievement (e.g., attendance). A list of all 12 included studies, along with their characteristics,
can be found in Appendix A.
All reports included provided racial demographics, as this was a criterion in screening.
Only one report included peer assessment, while the remaining eleven used the instructor as the
primary feedback source. Interventions took place at all school levels: elementary (Grades K–5),
middle (Grades 6–8), high (Grades 9–12), and collegiate (Grades 13+). The subject domains
included were math, history, English, psychology, and music. The types of assessment
interventions included in this analysis were curriculum-embedded assessment (CEA; assessments produced by third-party companies), technology-mediated formative assessment (TMFA)
including clickers, and general formative assessment (GFA), which includes any type of
formative assessment not fitting within the prior categories. Examples of GFA interventions
include ongoing performance assessment, instructor-generated formative assessment, criterion-referenced formative assessment, and low-stakes peer assessment. Duration varied among the reports from less than 10 minutes to 2 years. Notably, duration of implementation does not imply frequency of implementation; rather, it refers to the time from the onset of the first iteration to the point when achievement data were collected. Frequency refers to how often within the duration window the
intervention was reiterated. Frequency of intervention is explained in more detail in the
discussion section.
Computing Effect Sizes and Data Analysis
Effect sizes were calculated for intervention studies using standardized mean differences
on academic achievement between treatment and control groups. Whenever possible, effect sizes
were directly computed from the means, standard deviations, and sample sizes of the
intervention and control groups. In cases where this approach was not feasible, effect sizes were
derived from means and standard errors, t-statistics, p-values of t-tests, or F-statistics, with
conversion formulas provided by Lipsey and Wilson (2001).
For students who were exposed to multiple treatment conditions compared to a single
control condition, separate effect sizes were calculated for each intervention condition. To
account for a slight positive bias in effects observed with small samples, intervention effect sizes
were converted to bias-corrected Hedges’ g, a standardized effect size (Hedges, 1981).
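To illustrate these computations, the following is a minimal R sketch using the metafor package (Viechtbauer, 2010); the data values and column names (m_t, sd_t, and so on) are hypothetical and show only the form of the calculation, not the actual analysis data.

library(metafor)

# Hypothetical descriptives for two study arms (treatment vs. control).
dat <- data.frame(
  m_t = c(78.2, 81.5), sd_t = c(10.1, 9.4), n_t = c(45, 60),
  m_c = c(74.0, 79.9), sd_c = c(11.3, 9.8), n_c = c(43, 58)
)

# escalc(measure = "SMD") returns the bias-corrected standardized mean
# difference (Hedges' g) and its sampling variance for each row.
dat <- escalc(measure = "SMD",
              m1i = m_t, sd1i = sd_t, n1i = n_t,
              m2i = m_c, sd2i = sd_c, n2i = n_c, data = dat)

# When only a t-statistic is reported, d = t * sqrt(1/n1 + 1/n2)
# (Lipsey & Wilson, 2001); the small-sample correction
# J = 1 - 3 / (4 * (n1 + n2 - 2) - 1) then converts d to g (Hedges, 1981).
g_from_t <- function(t, n1, n2) {
  d <- t * sqrt(1 / n1 + 1 / n2)
  d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))
}
g_from_t(2.1, 30, 32)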
Meta-analysis of the intervention data was performed using the metafor and
clubSandwich R packages (Pustejovsky, 2019; Viechtbauer, 2010) under the framework of
random-effects modeling. To address the dependency between multiple effect size estimates
within studies and mitigate potential model misspecification, a multilevel modeling approach
was employed in conjunction with a robust variance estimator developed by Pustejovsky and
Tipton (2022). A random-effects model was used to estimate the pooled effect size for the
relationship between formative assessment and achievement. The heterogeneity among effect
sizes was assessed using Q, τ², and I² statistics. Additionally, a 95% confidence interval was reported for the weighted average effect, following the guidelines of Borenstein et al. (2011).
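As a minimal sketch of this modeling workflow, assume a data frame dat with one row per effect size, where yi holds Hedges' g, vi its sampling variance, study a report identifier, and esid an effect identifier (all names hypothetical); the specification used in the actual analysis may differ.

library(metafor)
library(clubSandwich)

# Multilevel random-effects model: effect sizes (level 1) nested within
# studies (level 2) to account for dependent effects within reports.
res <- rma.mv(yi, vi, random = ~ 1 | study / esid, data = dat)

# Cluster-robust (CR2) standard errors and a 95% CI for the pooled
# average effect (Pustejovsky & Tipton, 2022), clustering on study.
coef_test(res, vcov = "CR2")
conf_int(res, vcov = "CR2", level = 0.95)

# Cochran's Q and the variance components (tau-squared at each level)
# are reported directly by the fitted model.
res$QE        # Q statistic for residual heterogeneity
res$sigma2    # estimated variance components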
To further explore the heterogeneity in effect size estimates, mixed-effects meta-regression models were employed. Each moderator, including formative assessment type, school level, ethnicity, and subject domain, was examined in a separate model. Additionally, potential
publication bias and funnel plot asymmetry were investigated by conducting Egger’s regression
test (Egger et al., 1997) and exploring whether publication status acted as a moderator in the
meta-regression models.
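A sketch of these analyses, continuing the hypothetical dat and model specification above (school_level is an illustrative moderator column):

# Mixed-effects meta-regression with a single categorical moderator.
mod_level <- rma.mv(yi, vi, mods = ~ factor(school_level),
                    random = ~ 1 | study / esid, data = dat)
coef_test(mod_level, vcov = "CR2")

# Egger-type asymmetry test adapted to the multilevel model: regress
# effect sizes on their standard errors; a nonzero slope suggests
# funnel plot asymmetry (Egger et al., 1997).
egger <- rma.mv(yi, vi, mods = ~ sqrt(vi),
                random = ~ 1 | study / esid, data = dat)
coef_test(egger, vcov = "CR2")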
This synthesis is designed to illuminate the ability of formative assessment to improve
the academic achievement of Black and Hispanic students. However, there are limitations. This
study is directly limited by the choice of data sources. Many studies have been published which
were not included in Hattie’s Visible Learning, and as such, their data are not considered here.
This limits the generalizability of the findings this research will produce. Another limitation is
that many studies simply do not include racial demographics, further limiting the amount of data
which may have been able to be included in this meta-analysis. It is possible that some of the
studies excluded for not including these data may have met the inclusion criteria. Finally,
formative assessment is particularly difficult to measure. Implementation can vary widely and,
given that only 12 reports are included in this meta-analysis, it is possible that these data are not
sufficient for the statistical purposes of meta-analysis.
Results
A total of 12 reports were identified for inclusion within this meta-analysis. The sample
sizes within these interventional reports ranged from 52 to 9,956 and the total sample size was
13,359. The year of publication ranged from 1993 to 2016, with the majority released after 2010.
Outcome measures represented within the reports were instructor/researcher-generated
summative assessment scores, course grades, and standardized test scores. A complete list of the
studies included, along with their individual effect sizes, sample sizes, and additional
characteristics can be found in Appendix A.
Overall Average Effect
The primary aim and first research question of this research was to examine if formative
assessments affected achievement among Black and Hispanic students. The average effect for
formative assessment interventions on achievement was positive and statistically significant (g =
0.335, p < 0.01; see Table 1). No outliers were detected when employing Tukey's definition (i.e., values more than 1.5 interquartile ranges beyond the first or third quartile). The value of the calculated effect size is small-to-medium and is slightly lower than Hattie's reported effect size of 0.4. However, it is important to note that this meta-analysis included reports categorized as peer assessment, peer- and self-grading, and clickers, in addition to those categorized as formative assessment. Accordingly,
formative assessments are effective in increasing achievement among populations with at least
40% Black and Hispanic students.
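For reference, the outlier screen can be expressed in a few lines of R (a sketch over the hypothetical dat from the Methods section, where dat$yi holds the effect sizes):

# Tukey's fences: flag effects beyond 1.5 IQRs from the quartiles.
q <- quantile(dat$yi, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
outliers <- dat$yi < q[1] - fence | dat$yi > q[2] + fence
sum(outliers)  # no effects were flagged in this synthesis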
Additionally, despite the small number of studies and effects analyzed, a large amount of heterogeneity was observed, as described by Cochran's Q (Q = 353.4, p < .0001), tau-squared (τ² = 0.058), and I-squared (I² = 93.72%). Cochran's Q has relatively low power with small sample sizes, and tau-squared, while insensitive to sample size, can be difficult to interpret at face value. However, a nonzero tau-squared indicates a degree of variation among true effects. Finally, the I-squared value indicates that 93.72% of the variation among effects is due to heterogeneity, a substantially high value. A very large degree of heterogeneity is expected given how diverse formative assessment interventions can be, coupled with the small sample size. Nevertheless, some of the variation can be accounted for by examining potential
moderators.
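Because rma.mv does not report I-squared directly, a multilevel I² can be computed from the variance components together with a "typical" sampling variance. The sketch below follows one conventional approach, reusing the hypothetical res and dat from the Methods sketches.

# I^2: share of total variance attributable to heterogeneity rather
# than sampling error, generalized to the multilevel model.
W <- diag(1 / dat$vi)                        # inverse sampling variances
X <- model.matrix(res)
P <- W - W %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W
typical_v <- (res$k - res$p) / sum(diag(P))  # "typical" sampling variance
100 * sum(res$sigma2) / (sum(res$sigma2) + typical_v)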
Publication Bias
The results from the modified Egger’s regression model suggested that there was no
evidence of funnel plot asymmetry for the dataset (b = 1.2752, SE = 1.2254, t(20) = 1.0406, p =
.310). Likewise, the moderator analysis comparing published and unpublished reports indicated
that the pooled effect sizes did not statistically significantly differ (b = –0.058, p = .786) by
publication status for achievement (see Table 2). Examined separately for published and
unpublished studies, the average effects of formative assessment were not statistically
significantly different from zero.
Moderator Analyses
In this meta-analysis, in addition to publication status, the percentage of Black students,
percentage of Hispanic students, school level, subject domain, type of formative assessment
intervention, and intervention duration were analyzed as moderators of the effect of formative
assessment on achievement. The small sample size limits the statistical power of moderator
analyses. However, most subgroups included three or more calculated effects; when there were
two or fewer reports in each category, larger categories were constructed. For example, English,
music, and history were combined into the subject domain category of humanities. All
categorizations employed are detailed within the description of individual moderator analyses.
Presented in each moderator category are results from both no-intercept models and intercept
models with the category with the greatest number of effects as the reference group. A full report
of all moderator analyses is found in Table 2.
Race and Ethnicity
The second goal and research question of this analysis was to examine variation in the
effect of formative assessment on achievement relating to the racial diversity of the sample.
Therefore, a primary potential moderator to explore was the proportion of Black and Hispanic
students within studies. Neither the percentage of Black students nor the percentage of Hispanic
students was a significant moderator of the effect of formative assessment on achievement (b =
0.0045, p = .489 and b = 0.0014, p = .737, respectively, see Table 2). This analysis found that the
proportion of Black or Hispanic students did not moderate the effectiveness of formative
assessments.
School Level
The next potential moderator examined was school level, corresponding with the third
research question. Reports included elementary school (Grades K–5), middle school (Grades 6–
8), high school (Grades 9–12), and college (Grades 13+) samples. No group level effects were
statistically different from zero. Neither middle school (b = –0.299, p = .273), nor high school (b
= –0.314, p = .344) nor college (b = –0.512, p = .140) effects were statistically different from
that of elementary school effects. Additionally, effects within middle school were not distinct
from high school (b = –0.016, p = .944) or college (b = –0.219, p = .320). Finally, high school
compared with college similarly produced no difference (b = –0.203, p = .470). Table 2
illustrates the results obtained when elementary school acted as the reference group. While none
of the comparisons reached statistical significance, there is a trend within the results: formative assessment appears to be
more effective for elementary students and its effectiveness diminishes as students progress
through school levels.
Subject Domain
The fourth research question posed was aimed at determining whether the effectiveness
of formative assessment interventions varied by subject domain. Reports included within the
meta-analysis contained achievement measures in the subject domains of math (k = 8), history (k
= 2), psychology (k = 1), music (k = 1), and English (k = 1). Music, psychology, and English
were represented within one report each, and so all three were combined with history into a
single category: humanities. The results of the analysis are presented in Table 2. Math and
humanities were not statistically different from each other (b = 0.192, p = .489). However, the
effect of math interventions was significantly different from zero (g = 0.258, p = .021).
Accordingly, the effects of formative assessment did not vary by subject domain, but math
interventions alone positively affected achievement.
Type of Formative Assessment Intervention
The fifth research question examined variation of effect based on type of intervention. As
noted in the review of empirical research, formative assessment interventions can be categorized
by general type of formative assessment. The types included in this analysis were curriculum-embedded assessment (CEA; assessments produced by third-party companies), technology-mediated formative assessment (TMFA), and general formative assessment (GFA), which
includes any type of formative assessment not fitting within the prior categories. The results of
the analysis are presented in Table 2. Curriculum-embedded assessments produced achievement
measures that were not different from TMFA (b = –0.250, p = .360) or GFA (b = –0.251, p = .417). Technology-mediated effects were nearly identical to those from GFA (b = –0.0005, p = .998). As such, the type of formative assessment did not moderate the effect on achievement.
Intervention Duration
The final research question explored whether the intervention duration moderated the
effect of formative assessments. Duration varied among the reports from less than 10 minutes to 2 years.
When duration of the intervention was examined as a continuous variable, it did not statistically
significantly moderate the effect of formative assessment on achievement (b = 0.180, p = .536,
see Table 2).
Source of Feedback
Although a review of empirical studies and past meta-analyses indicated that the source
of feedback (i.e., instructor feedback vs. peer feedback) could be a potential moderator,
particularly for Black and Hispanic students, 11 of the 12 reports which met all the criteria for
inclusion within this synthesis had instructors as the primary source of feedback. The one other
report had peer feedback. With such disparity, it is not meaningful to draw conclusions, and thus
a moderator analysis was not conducted.
Discussion
This synthesis reports an exploration of the effects of formative assessment interventions
among sample populations with relatively high proportions of Black and Hispanic students. A
statistically significant positive overall effect was found for formative assessment interventions
on achievement, combining effects from 12 reports. The reports included in this synthesis were
largely heterogeneous, and while moderator analyses were conducted, no significant moderators
were detected. Herein, the results and limitations of this study are explored, and future
implications for research are discussed.
This synthesis sought to explore six research questions. While the first question focused
on the overall effect, the other five addressed potential moderators. For the first research question, it was
hypothesized that the effect would be both positive and similar in magnitude to what Hattie
calculated. Hattie found effects ranging from 0.24 through 0.44 for the four influences combined
within this synthesis. The current finding (g = 0.335) also falls within this range. The result
indicates that, in general, formative assessment as a practice is still effective for racially diverse
student populations. A promising potential moderator, peer assessment, could not be examined. Only one report within the entire influence of peer assessment qualified for inclusion within this study; many were conducted outside of the United States, particularly in
English as a Foreign Language classes. Accordingly, it is suggested that additional quantitative
studies be undertaken examining the effects of peer assessment with racially diverse populations.
The remaining research questions explored moderators of the effect of formative
assessment. The moderators explored were proportion of Black or Hispanic students, school
level, subject domain, type of formative assessment intervention, and intervention duration.
None of the analyses performed uncovered any statistically significant moderating effect. The
lack of significance may be due in large part to a very small sample size. Lack of statistical
power notwithstanding, one trend deserves further exploration.
Previous meta-analyses have found no difference in the effects of formative assessment interventions
across grade levels. However, while nonsignificant, the effect sizes illustrate a decreasing trend.
Formative assessment interventions may be most effective among racially diverse elementary
school students with the effect dropping off as students progress to college. A core component of
formative assessment is feedback, often from the instructor. This trend may be correlated with
social factors, such as belongingness and stereotype threat, which can impact how receptive a
student is to feedback. As they age, students may feel less secure within a classroom as social
realities become apparent to them. Belongingness has been demonstrated to decrease over time,
especially among students of color (Wang & Eccles, 2012). Indeed, belongingness tends to drop
precipitously during middle school (Loukas et al., 2016), which matches the largest drop in
formative assessment effectiveness within this analysis.
Additionally, students who progress through the American education system find
themselves exposed to instructor implicit bias. Negative academic stereotypes are associated
with Black and Hispanic students, and instructors may subconsciously reinforce these negative
stereotypes within their classrooms. When students are exposed to these biases, academic
performance can be undermined, a phenomenon known as stereotype threat (Spencer et al.,
2016). Stereotype threat causes students to expend a great degree of effort to mitigate negative
effects. An investigation into the processes by which stereotype threat diminishes achievement
found that both physiological and psychological effects contribute. Physiologically, frontal
cortex processing is impaired due to stress responses, while psychologically, students must
expend additional effort to suppress and contain negative thoughts about themselves (Schmader
et al., 2008). Stereotype threat has been linked with multiple factors, such as attendance and
engagement, which affect achievement in populations of color (Osborne & Walker, 2006).
Accordingly, as students progress to higher grades, Black and Hispanic students may be less
willing to participate in formative assessment practices or less receptive to instructor feedback,
and therefore may be less positively affected by the implementation of formative assessment
practices.
To examine the decreasing trend in formative assessment effectiveness among Black and
Hispanic students over school level, several studies are suggested. Primarily, qualitative studies
should be undertaken to explore conscious and subconscious thought processes among both
Black and Hispanic students as well as their instructors, specifically among those
students/instructors who use formative assessments within their educational contexts. These
studies should be conducted at all school levels to determine a potential cause for the decrease in
effectiveness. Methodological tools should include both interviews and observations. Instructors
are often not aware of potential implicit biases and actions taken as a result, which observations
may reveal. Additionally, more quantitative studies involving formative assessment interventions
among these same populations would add the statistical power needed to determine whether this trend is significant.
Exploring Heterogeneity
Only 2.35% of the studies used to generate effect sizes for the influences of formative
assessment, peer assessment, peer- and self-grading, and clickers within Hattie’s Visible
Learning project were conducted in the United States among populations with at least 40% Black
and/or Hispanic students. As such, this dearth of racially diverse data illustrates a need for
practitioners and scholars alike to critically assess whether interventions they are considering
implementing have been examined within a population predominantly White or one more
representative of their institution’s demographics. Nonetheless, the primary finding of this
synthesis was overall positive, but the few reports which met the criteria for inclusion merely
scratched the surface of the breadth of formative assessment interventions extant within the
literature.
Formative assessment takes many forms, can be difficult to measure, and informs both
learning and instruction, each of which yield outcomes with different timing. For these reasons,
Black and Wiliam in their seminal 1998 work caution against drawing general conclusions on the
impacts of formative assessment, even those derived meta-analytically. The
heterogeneity of reports included within this work highlights the difficulty in studying formative
assessment. Theoretically simple measures, such as frequency of intervention (i.e., “dosage” of
the treatment), become complex in the context of formative assessment. For example, an
instructor asking a question aloud in a classroom, having students reply, and providing feedback
on their responses is a sequence of events which happens very frequently in classrooms. The
sequence at its core is an informal formative assessment, but how many instructors keep track of
how often it occurs? How do they measure and report to what degree students are participating?
And yet, even these small interactions can influence achievement.
More formal formative assessment practices are easier to track and report. However, the
degree of familiarity and implementation fidelity can greatly impact the effect of an intervention.
Attempting to measure and account for these more “hidden” variables contributes to the
complexity present within many educational interventions, formative assessment included. As
such, a great degree of heterogeneity is expected for formative assessment, an expectation met in
the results of this synthesis. While the potential moderators explored did not yield significant
differences, it is possible that with a greater amount of data more nuances will be revealed. As
more and more research is conducted and as educational psychologists highlight the importance
of diverse participant groups, the likelihood of discovering which moderators influence
formative assessment’s impact on achievement rises.
Limitations and Implications for Practice
This work depends entirely on past research and is restricted to reports previously included in Visible Learning. The pool is further limited to studies conducted within the United States that reported significant Black and Hispanic populations, rendering the sample size extremely small for a meta-analysis. Most studies within the United
States did not include racial demographics. Researchers are encouraged to report all
demographic data they collect, and to intentionally seek representative samples for their studies
on formative assessment. While there were sufficient data to calculate an overall effect for formative assessment on achievement, the moderator analyses lacked statistical power. Of
course, it is possible that the potential moderators highlighted in this meta-analysis do not
moderate the effects of formative assessment, but until a greater amount of data is analyzed,
conclusions about moderators should not yet be drawn.
Additional moderators not explored here are certainly possible, notably the frequency of
intervention. The regularity with which formative assessments are given can be complex to
measure in a meaningful way. On the surface, one might be tempted to report that a particular
assessment intervention was conducted four times over the course of a semester. However, the
number four does not carry information about whether all four were implemented in one week, at
regular intervals, or at variable ones. The frequency of intervention may also covary with several
other factors, such as the type of intervention. Certain types of formative assessments, such as
curriculum-embedded ones, are difficult to construct and time-consuming to implement and may
be administered less frequently. By contrast, technology-mediated interventions are generally straightforward to develop and administer and can be implemented with high frequency. Frequency of
intervention can also covary with the total duration, as frequency is inherently the number of
iterations over time; the effect of conducting two iterations over a week could differ from four
iterations over a month. With a far larger pool of data, it may be possible to assay the effects of
more complex moderators.
While the overall effect calculated was similar to the ones calculated by Hattie, caution
should be employed when considering these results; only 12 reports provided sufficient
information to calculate an effect on achievement. However, it is unlikely that formative
assessment interventions carry risk of harming students within Black and Hispanic populations.
So, while prudence is recommended, practitioners should feel free to continue implementing
formative assessment interventions within their classrooms, schools, or districts. Other reports
included outcomes which were not direct achievement, such as attendance. As these reports were
originally used to establish a causal link between formative assessment and achievement,
practitioners are cautioned to carefully examine the exact outcome measures utilized within
studies when deciding upon practices to implement.
Finally, it is important to recognize that multiple moderators may be correlated. For
example, within this synthesis, most studies examining technology-mediated formative
assessment interventions took place within colleges; the only exception was conducted in a high
school. While nonsignificant, college was the school level within which formative assessments
were least effective. As such, it is possible that the relationship between technology mediation
and the effectiveness of formative assessment is skewed. Another trend was that all high school
interventions were in the domain of math, which was nonsignificantly lower than humanities,
which again potentially obfuscates moderating effects. With few reports, it is not possible to
control for this covariation within moderator analyses. The covariation limits the generalizability
of the findings of this research. For a more complete picture, studies should seek to explore the
effectiveness of a variety of intervention types at all school levels within multiple subject
domains, while still seeking racially diverse populations.
Conclusions
This meta-analysis demonstrated that formative assessment interventions are effective at
enhancing achievement within racially diverse populations. Another primary finding was a
dearth of reported racial demographics within studies, along with an overabundance of White
participants in studies. Moderator analyses were conducted but were not statistically significant
for any moderating factor examined. This research highlights a need for additional research to be
conducted on formative assessment interventions among racially diverse populations.
References
Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control
processes. In The psychology of learning and motivation: II. Academic Press.
https://doi.org/10.1016/S0079-7421(08)60422-3
Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education:
Principles, Policy & Practice, 18(1), 5–25.
https://doi.org/10.1080/0969594X.2010.513678
Black, P., Harrison, C., & Lee, C. (2003). Assessment for learning: Putting it into practice.
McGraw-Hill Education.
Black, P., & Wiliam, D. (1998). Assessment and Classroom Learning. Assessment in Education:
Principles, Policy & Practice, 5(1), 7–74. https://doi.org/10.1080/0969595980050102
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational
Assessment, Evaluation and Accountability (Formerly: Journal of Personnel Evaluation
in Education), 21(1), 5–31. https://doi.org/10.1007/s11092-008-9068-5
Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971). Handbook on formative and summative evaluation of student learning. McGraw-Hill.
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2011). Introduction to
Meta-Analysis. John Wiley & Sons.
Briggs, D. C., Ruiz-Primo, M. A., Furtak, E., Shepard, L., & Yin, Y. (2012). Meta-Analytic
Methodology and Inferences About the Efficacy of Formative Assessment. Educational
Measurement: Issues and Practice, 31(4), 13–17. https://doi.org/10.1111/j.1745-
3992.2012.00251.x
Butler, D. L., & Winne, P. H. (1995). Feedback and Self-Regulated Learning: A Theoretical
Synthesis. Review of Educational Research, 65(3), 245–281.
https://doi.org/10.3102/00346543065003245
Butler, M. D. (2014). The effects of embedding formative assessment measures in a problem-based learning mathematics curriculum for middle school students [Doctoral dissertation,
University of Kentucky].
https://www.proquest.com/docview/1824398361/abstract/FB6520DA11B64348PQ/1
Carrillo-de-la-Peña, M. T., Baillès, E., Caseras, X., Martínez, À., Ortet, G., & Pérez, J. (2009).
Formative assessment and academic achievement in pre-graduate students of health
sciences. Advances in Health Sciences Education, 14(1), 61–67.
https://doi.org/10.1007/s10459-007-9086-y
Chen, Q., & Li, H. (2021). Formative Assessment in China and Its Effects on EFL Learners’
Learning Achievement: A Meta-Analysis from Policy Transfer Perspective. The
Educational Review, USA, 5(9), 355–366. https://doi.org/10.26855/er.2021.09.005
Coe, R., Searle, J., Barmby, P., Jones, K., & Higgins, S. (2008). Relative difficulty of
examinations in different subjects (Science Community Supporting Education).
Darwish, A.-F. E., & Huber, G. L. (2003). Individualism vs Collectivism in Different Cultures:
A cross-cultural study. Intercultural Education, 14(1), 47–56.
https://doi.org/10.1080/1467598032000044647
Double, K. S., McGrane, J. A., & Hopfenbeck, T. N. (2020). The Impact of Peer Assessment on
Academic Performance: A Meta-analysis of Control Group Studies. Educational
Psychology Review, 32(2), 481–509. https://doi.org/10.1007/s10648-019-09510-3
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by
a simple, graphical test. BMJ, 315(7109), 629–634.
https://doi.org/10.1136/bmj.315.7109.629
Fry, R., Bennett, J., & Barroso, A. (2021). Racial and ethnic gaps in the U.S. persist on key
demographic indicators. Pew Research Center.
https://www.pewresearch.org/interactives/racial-and-ethnic-gaps-in-the-u-s-persist-on-key-demographic-indicators/
Fuchs, L. S., & Fuchs, D. (1986). Effects of Systematic Formative Evaluation: A Meta-Analysis.
Exceptional Children, 53(3), 199–208. https://doi.org/10.1177/001440298605300301
Fuchs, L. S., Fuchs, D., & Hamlett, C. L. (1989). Effects of instrumental use of curriculum-based
measurement to enhance instructional programs. RASE: Remedial & Special Education,
10(2), 43–52. https://doi.org/10.1177/074193258901000209
Garcia, N. M., López, N., & Vélez, V. N. (2018). QuantCrit: Rectifying quantitative methods
through critical race theory. Race Ethnicity and Education, 21(2), 149–157.
https://doi.org/10.1080/13613324.2017.1377675
Gibbs, J. C., & Taylor, J. D. (2016). Comparing student self-assessment to individualized
instructor feedback. Active Learning in Higher Education, 17(2), 111–123.
https://doi.org/10.1177/1469787416637466
Gillborn, D., Warmington, P., & Demack, S. (2018). QuantCrit: Education, policy, ‘Big Data’
and principles for a critical race theory of statistics. Race Ethnicity and Education, 21(2),
158–179. https://doi.org/10.1080/13613324.2017.1377417
Graham, S. (1992). “Most of the subjects were White and middle class”: Trends in published
research on African Americans in selected APA journals, 1970–1989. American
Psychologist, 47(5), 629–639. https://doi.org/10.1037/0003-066X.47.5.629
Graham, S., Hebert, M., & Harris, K. R. (2015). Formative Assessment and Writing: A Meta-Analysis. The Elementary School Journal, 115(4), 523–547.
https://doi.org/10.1086/681947
Green, E. G. T., Deschamps, J.-C., & Páez, D. (2005). Variation of Individualism and
Collectivism within and between 20 Countries: A Typological Analysis. Journal of
Cross-Cultural Psychology, 36(3), 321–339. https://doi.org/10.1177/0022022104273654
Hattie, J. (2008). Visible Learning: A Synthesis of Over 800 Meta-Analyses Relating to
Achievement. Routledge. https://doi.org/10.4324/9780203887332
Hattie, J. (2023). Visible Learning: The Sequel: A Synthesis of Over 2,100 Meta-Analyses
Relating to Achievement. Routledge. https://doi.org/10.4324/9781003380542
Hattie, J., & Timperley, H. (2007). The Power of Feedback. Review of Educational Research,
77(1), 81–112. https://doi.org/10.3102/003465430298487
Heaslip, G., Donovan, P., & Cullen, J. G. (2014). Student response systems and learner
engagement in large classes. Active Learning in Higher Education, 15(1), 11–24.
https://doi.org/10.1177/1469787413514648
Hedges, L. V. (1981). Distribution Theory for Glass’s Estimator of Effect size and Related
Estimators. Journal of Educational Statistics, 6(2), 107–128.
https://doi.org/10.3102/10769986006002107
Johnson, L., Graham, S., & Harris, K. R. (1997). The effects of goal setting and self-instruction
on learning a reading comprehension strategy: A study of students with learning
disabilities. Journal of Learning Disabilities, 30(1), 80–91.
https://doi.org/10.1177/002221949703000107
Kim, S. (2005). Effects of implementing performance assessments on student learning: Meta-analysis using HLM [Doctoral dissertation, Pennsylvania State University]. ProQuest
Dissertations & Theses Global.
Kingston, N., & Nash, B. (2011). Formative Assessment: A Meta-Analysis and a Call for
Research. Educational Measurement: Issues and Practice, 30(4), 28–37.
https://doi.org/10.1111/j.1745-3992.2011.00220.x
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A
historical review, a meta-analysis, and a preliminary feedback intervention theory.
Psychological Bulletin, 119, 254–284. https://doi.org/10.1037/0033-2909.119.2.254
Klute, M., Apthorp, H., Harlacher, J., & Reale, M. (2017). Formative assessment and elementary
school student academic achievement: A review of the evidence (REL 2017-259).
Regional Educational Laboratory Central.
Konkel, L. (2015). Racial and Ethnic Disparities in Research Studies: The Challenge of Creating
More Diverse Cohorts. Environmental Health Perspectives, 123(12), A297–A302.
https://doi.org/10.1289/ehp.123-A297
Kulhavy, R. W., & Stock, W. A. (1989). Feedback in Written Instruction: The Place of Response
Certitude. Educational Psychology Review, 1(4), 279–308.
Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of Feedback and Verbal Learning. Review of
Educational Research, 58(1), 79–97. https://doi.org/10.3102/00346543058001079
Lee, H., Chung, H. Q., Zhang, Y., Abedi, J., & Warschauer, M. (2020). The Effectiveness and
Features of Formative Assessment in US K-12 Education: A Systematic Review. Applied
Measurement in Education, 33(2), 124–140.
https://doi.org/10.1080/08957347.2020.1732383
Li, H., Xiong, Y., Hunter, C. V., Guo, X., & Tywoniw, R. (2020). Does peer assessment promote
student learning? A meta-analysis. Assessment & Evaluation in Higher Education, 45(2),
193–211. https://doi.org/10.1080/02602938.2019.1620679
Liberatore, M. W., Morrish, R. M., & Vestal, C. R. (2017). Effectiveness of Just in Time
Teaching on Student Achievement in an Introductory Thermodynamics Course. Advances
in Engineering Education, 6(1). https://eric.ed.gov/?id=EJ1138854
Lipnevich, A. A., & Panadero, E. (2021). A Review of Feedback Models and Theories:
Descriptions, Definitions, and Conclusions. Frontiers in Education, 6.
https://www.frontiersin.org/articles/10.3389/feduc.2021.720195
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Sage Publications.
Loukas, A., Cance, J. D., & Batanova, M. (2016). Trajectories of School Connectedness Across
the Middle School Years: Examining the Roles of Adolescents’ Internalizing and
Externalizing Problems. Youth & Society, 48(4), 557–576.
https://doi.org/10.1177/0044118X13504419
Manuel, A. K. (2016). The effects of immediate feedback using a student response system on
math achievement of eleventh grade students [Doctoral dissertation, Mercer University].
https://www.proquest.com/docview/1812929713/66452A9817834EC1PQ/1
McKnight, L., & Whitburn, B. (2020). Seven reasons to question the hegemony of Visible
Learning. Discourse: Studies in the Cultural Politics of Education, 41(1), 32–44.
https://doi.org/10.1080/01596306.2018.1480474
McMillan, J. H., Venable, J. C., & Varier, D. (2013). Studies of the Effect of Formative
Assessment on Student Achievement: So Much More Is Needed. Practical Assessment,
Research & Evaluation, 18(2). https://eric.ed.gov/?id=EJ1005135
Meisels, S. J. (2003). Impact of Instructional Assessment on Elementary Children’s
Achievement. Education Policy Analysis Archives, 11, 9–9.
https://doi.org/10.14507/epaa.v11n9.2003
Nendaz, M. R., & Tekian, A. (1999). Assessment in Problem-Based Learning Medical Schools:
A Literature Review. Teaching and Learning in Medicine, 11(4), 232–243.
https://doi.org/10.1207/S15328015TLM110408
Nicol, D. J., & Macfarlane‐Dick, D. (2006). Formative assessment and self‐regulated learning: A
model and seven principles of good feedback practice. Studies in Higher Education,
31(2), 199–218. https://doi.org/10.1080/03075070600572090
Novak, G. M. (2011). Just-in-time teaching. New Directions for Teaching and Learning,
2011(128), 63–73. https://doi.org/10.1002/tl.469
Osborne, J. W., & Walker, C. (2006). Stereotype Threat, Identification with Academics, and
Withdrawal from School: Why the most successful students of colour might be most
likely to withdraw. Educational Psychology, 26(4), 563–577.
https://doi.org/10.1080/01443410500342518
Oyserman, D., & Lee, S. W. S. (2008). Does culture influence what and how we think? Effects
of priming individualism and collectivism. Psychological Bulletin, 134(2), 311–342.
https://doi.org/10.1037/0033-2909.134.2.311
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow, C. D.,
Shamseer, L., Tetzlaff, J. M., Akl, E. A., Brennan, S. E., Chou, R., Glanville, J.,
Grimshaw, J. M., Hróbjartsson, A., Lalu, M. M., Li, T., Loder, E. W., Mayo-Wilson, E.,
McDonald, S., … Moher, D. (2021). The PRISMA 2020 statement: An updated guideline
for reporting systematic reviews. Systematic Reviews, 10(1), 89.
https://doi.org/10.1186/s13643-021-01626-4
Phelan, J. C., Choi, K., Vendlinski, T., Baker, E., & Herman, J. (2011). Differential
Improvement in Student Understanding of Mathematical Principles Following Formative
Assessment Intervention. The Journal of Educational Research, 104(5), 330–339.
https://doi.org/10.1080/00220671.2010.484030
Preszler, R. W., Dawe, A., Shuster, C. B., & Shuster, M. (2007). Assessment of the Effects of
Student Response Systems on Student Learning and Attitudes over a Broad Range of
Biology Courses. CBE—Life Sciences Education, 6(1), 29–41.
https://doi.org/10.1187/cbe.06-09-0190
Pustejovsky, J. E. (2019). Procedural sensitivities of effect sizes for single-case designs with
directly observed behavioral outcome measures. Psychological Methods, 24(2), 217–235.
https://doi.org/10.1037/met0000179
Pustejovsky, J. E., & Tipton, E. (2022). Meta-analysis with Robust Variance Estimation:
Expanding the Range of Working Models. Prevention Science, 23(3), 425–438.
https://doi.org/10.1007/s11121-021-01246-3
Randel, B., Apthorp, H., Beesley, A. D., Clark, T. F., & Wang, X. (2016). Impacts of
professional development in classroom assessment on teacher and student outcomes. The
Journal of Educational Research, 109(5), 491–502.
https://doi.org/10.1080/00220671.2014.992581
Randel, B., Beesley, A. D., Apthorp, H., Clark, T. F., Wang, X., Cicchinelli, L. F., & Williams,
J. M. (2011). Classroom Assessment for Student Learning: Impact on Elementary School
Mathematics in the Central Region. Final Report. NCEE 2011-4005. In National Center
for Education Evaluation and Regional Assistance. National Center for Education
Evaluation and Regional Assistance. https://eric.ed.gov/?id=ED517969
Sanchez, C. E., Atkinson, K. M., Koenka, A. C., Moshontz, H., & Cooper, H. (2017). Self-grading and peer-grading for formative and summative assessments in 3rd through 12th
grade classrooms: A meta-analysis. Journal of Educational Psychology, 109(8), 1049–
1066. https://doi.org/10.1037/edu0000190
Schildkamp, K., van der Kleij, F. M., Heitink, M. C., Kippers, W. B., & Veldkamp, B. P. (2020).
Formative assessment: A systematic review of critical teacher prerequisites for classroom
practice. International Journal of Educational Research, 103, 101602.
https://doi.org/10.1016/j.ijer.2020.101602
Schmader, T., Johns, M., & Forbes, C. (2008). An integrated process model of stereotype threat
effects on performance. Psychological Review, 115(2), 336–356.
https://doi.org/10.1037/0033-295X.115.2.336
Schunk, D. H. (1996). Goal and Self-Evaluative Influences During Children’s Cognitive Skill
Learning. American Educational Research Journal, 33(2), 359–382.
https://doi.org/10.3102/00028312033002359
Schunk, D. H., & Zimmerman, B. J. (2013). Self-regulation and learning. In Handbook of
psychology: Educational psychology, Vol. 7, 2nd ed (pp. 45–68). John Wiley & Sons,
Inc.
Scriven, M. (1966). The methodology of evaluation (Publication 110). Social Science Education Consortium. https://eric.ed.gov/?id=ED014001
Spencer, S. J., Logel, C., & Davies, P. G. (2016). Stereotype Threat. Annual Review of Psychology, 67, 415–437. https://doi.org/10.1146/annurev-psych-073115-103235
Sundar, K., & Agarwal, P. K. (2021). Effect Sizes and Meta-analyses: How to Interpret the
“Evidence” in Evidence-Based. Retrieval Practice.
https://pdf.retrievalpractice.org/MetaAnalysisGuide.pdf
Terhart, E. (2011). Has John Hattie really found the holy grail of research on teaching? An
extended review of Visible Learning. Journal of Curriculum Studies, 43(3), 425–438.
https://doi.org/10.1080/00220272.2011.576774
Trees, A. R., & Jackson, M. H. (2007). The learning environment in clicker classrooms: Student
processes of learning and involvement in large university‐level courses using student
response systems. Learning, Media and Technology, 32(1), 21–40.
https://doi.org/10.1080/17439880601141179
Valle, C. (2015). Effects of criteria-referenced formative assessment on achievement in music
[Doctoral dissertation, State University of New York at Albany].
https://www.proquest.com/docview/1752252579/abstract/3A30A1F5C5464603PQ/1
Van der Kleij, F. M., Cumming, J. J., & Looney, A. (2018). Policy expectations and support for
teacher formative assessment in Australian education reform. Assessment in Education:
Principles, Policy & Practice, 25(6), 620–637.
https://doi.org/10.1080/0969594X.2017.1374924
Van Zundert, M., Sluijsmans, D., & Van Merriënboer, J. (2010). Effective peer assessment
processes: Research findings and future directions. Learning and Instruction, 20(4), 270–
279. https://doi.org/10.1016/j.learninstruc.2009.08.004
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of
Statistical Software, 36(3), 1–48.
Vohs, K. D., & Baumeister, R. F. (2016). Handbook of Self-Regulation, Third Edition: Research,
Theory, and Applications. Guilford Publications.
Vygotsky, L. S. (1986). Thought and language. MIT Press.
http://archive.org/details/thoughtlanguage0000vygo
Wang, M.-T., & Eccles, J. S. (2012). Social Support Matters: Longitudinal Effects of Social
Support on Three Dimensions of School Engagement from Middle to High School. Child
Development, 83(3), 877–895. https://doi.org/10.1111/j.1467-8624.2012.01745.x
Wecker, C., Vogel, F., & Hetmanek, A. (2017). Visionär und imposant – aber auch belastbar?
Zeitschrift für Erziehungswissenschaft, 20(1), 21–40. https://doi.org/10.1007/s11618-016-
0696-0
Wiliam, D. (2006). Formative Assessment: Getting the Focus Right. Educational Assessment,
11(3–4), 283–289. https://doi.org/10.1080/10627197.2006.9652993
Xuan, Q., Cheung, A., & Sun, D. (2022). The effectiveness of formative assessment for
enhancing reading achievement in K-12 classrooms: A meta-analysis. Frontiers in
Psychology, 13. https://www.frontiersin.org/articles/10.3389/fpsyg.2022.990196
Ysseldyke, J., & Bolt, D. M. (2007). Effect of technology-enhanced continuous progress
monitoring on math achievement. School Psychology Review, 36(3), 453–467.
Tables
Table 1
Overall Average Effect of Formative Assessment Interventions on Achievement
Outcome k NS NES g 95% CI τ² I² Q
Achievement 12 12 22 0.335** 0.099 / 0.572 0.058 93.72 353.4****
Note. k = number of studies. NS = number of samples. NES = number of effects. g = Hedges' g (average pooled effect). CI = confidence interval (low estimate / high estimate). ** p < 0.01. **** p < 0.0001.
**** p < 0.0001.
55
55
Table 2
Results of Moderator Analyses
Moderator k NS NES b (SE) g 95% CI
Publication status
Published 7 7 11 – 0.358 –0.0652 / 0.7806
Unpublished 5 5 11 –0.058 (0.205) 0.300 –0.0273 / 0.6277
Ethnicity
%Black 11 11 20 0.0045 (0.0059) – –0.0115 / 0.0204
%Hispanic 11 11 20 0.0014 (0.0040) – –0.0089 / 0.0118
School level
Elementary 3 3 7 – 0.632 –0.6312 / 0.8600
Middle 3 3 8 –0.299 (0.231) 0.333 –0.4238 / 1.6883
High 3 3 5 –0.314 (0.285) 0.318 –0.6105 / 1.2460
College 3 3 5 –0.518 (0.275) 0.114 –0.0585 / 0.7251
Subject domain
Math 8 8 13 – 0.258* 0.0534 / 0.4621
Humanities 5 5 9 0.192 (0.258) 0.450 –0.4628 / 0.8472
Type of FA
CEA 5 5 8 – 0.474 –0.0971 / 1.0451
TMFA 4 4 6 –0.250 (0.252) 0.223 –0.2637 / 0.7115
GFA 4 4 8 –0.251 (0.280) 0.224 –0.6478 / 1.0946
Duration (years) 11 11 20 0.180 (0.258) – –0.6380 / 0.9976
Note. k = number of studies. NS = number of samples. NES = number of effects. b =
unstandardized regression slope coefficient (moderator effect). SE = standard error. g = Hedges’
g (average pooled effect). CI = confidence interval (low estimate / high estimate). FA =
formative assessment. CEA = curriculum-embedded assessment. TMFA = technology-mediated
formative assessment. GFA = general formative assessment. * p < 0.05.
Figures
Figure 1
PRISMA Chart
Identification:
  Meta-analyses identified from Visible Learning (n = 14)
  Reports analyzed within the meta-analyses (k = 510)
    Reports removed before screening: known not to be in the United States (k = 107)
Screening:
  Reports sought for retrieval (k = 403)
    Reports unable to be retrieved (k = 27)
    Duplicate reports (k = 7)
  Reports assessed for eligibility (k = 369)
    Reports excluded:
      Outside United States (k = 103)
      Race/ethnicity not reported (k = 178)
      Black/Hispanic population < 40% (k = 67)
      Study retracted (k = 1)
  Reports coded (k = 20)
    Insufficient information to calculate effect size (k = 4)
    Outcome measure not achievement (k = 4)
Included:
  Reports included in analysis (k = 12)
Appendix A: Studies Included in Meta-Analysis
First author Year Grade %Black %Hispanic FA type FA duration Feedback source Subject domain N g v
Bartsch 2011 College 15.0 29.0 Tech 0.003 Instructor History 52 0.592 0.0803
Baxter 1993 Middle 0 53.3 CEA – Instructor Math 105 0.367 0.0410
Baxter 1993 Middle 0 53.3 CEA – Instructor Math 90 0.806 0.0590
Crowe 2015 College 53.0 25.0 GFA 0.080 Peer Psych. 170 –0.269 0.0238
Lim 2011 College – 75.0 Tech 0.003 Instructor Math 56 0.364 0.0730
Lim 2011 College – 75.0 Tech 0.003 Instructor Math 56 0.409 0.0733
Lynch 2013 High 42.0 2.0 Tech 0.310 Instructor Math 61 –0.213 0.0664
Manuel 2015 High 35.8 5.70 CEA 0.333 Instructor Math 53 0.680 0.0800
Meisels 2003 Elem. 73.1 – CEA 2.000 Instructor Reading 163 1.410 0.0314
Meisels 2003 Elem. 73.1 – CEA 2.000 Instructor Math 168 0.755 0.0256
Phelan 2011 Middle 3.74 65.3 CEA 0.750 Instructor Math 1475 0.070 0.0028
Phelan 2011 Middle 7.9 36.0 CEA 0.750 Instructor Math 2616 0.306 0.0016
Randel 2016 Elem. 3.9 37.3 CEA 2.000 Instructor Math 9596 0.003 0.0004
Seals 2001 High 63.0 6.0 GFA 0.350 Instructor Math 73 0.412 0.0568
Seals 2001 High 63.0 6.0 GFA 0.350 Instructor Math 73 0.326 0.0564
Seals 2001 High 63.0 6.0 GFA 0.350 Instructor Math 73 0.499 0.0573
Steeley 2012 College 1.5 40.4 Tech 0.090 Instructor History 67 0.129 0.0602
Steeley 2012 College 1.5 40.4 Tech 0.150 Instructor History 67 0.178 0.0602
Valle 2015 E/M/H 19.3 45.3 GFA 1.000 Instructor Music 342 0.540 0.0121
Valle 2015 E/M/H 19.3 45.3 GFA 1.000 Instructor Music 342 0.645 0.0123
Valle 2015 E/M/H 19.3 45.3 GFA 1.000 Instructor Music 342 0.509 0.0121
Valle 2015 E/M/H 19.3 45.3 GFA 1.000 Instructor Music 324 –0.103 0.0124
Note. E/M/H = elementary, middle, and high school. FA type = type of formative assessment. CEA = curriculum-embedded assessment. Tech = technology-mediated formative assessment. GFA = general formative assessment. FA duration = duration of the intervention in years. N = number of students in sample. g = Hedges' g (effect size). v = sampling variance of g.
Appendix B: Coding Guide
Code name Code description Code options
Coder information
C-1 Date coded [text entry]
C-2 Coder [text entry]
Meta-analysis characteristics
M-1 Meta-analysis’ first author’s last name [text entry]
M-2 Meta-analysis Google Drive link [text entry]
Report characteristics
R-2 Article Google Drive link [text entry]
R-3 First author's last name [text entry]
R-4 Year [text entry]
R-5 Title [text entry]
R-6 APA reference [text entry]
R-7 Publication type 1. Journal article
2. Book or book chapter
3. Dissertation
4. Master’s thesis
5. Policy report
6. Government report
7. Conference paper
8. Other
–99. Can’t tell
R-8 Data sources 1. Independent study
2. Regional/national data set
3. Other
–99. Can’t tell
R-9 Dataset name [text entry];
–99 Missing/can’t tell/not applicable
R-10 Data collection year indicated 0. No
1. Yes
R-11 Year(s) data collected [text entry];
–99 Missing/can’t tell/not applicable
R-12 On what page(s) did you find the data
source?
[text entry]
R-13 Overlapping datasets [text entry];
–99 No
Setting characteristics
S-1 Study number 0. Single study
1. Study 1
2. Study 2
3. Study 3
etc.
S-2 Location [text entry];
–99 Missing/can’t tell/not applicable
S-3 Region 1. Northeast
2. South
3. Midwest
4. West
5. National
–99. Can’t tell
S-4 On what page(s) did you find the
location?
[text entry];
–99 Missing/can’t tell/not applicable
S-5 School level 1. Preschool
2. Elementary school: K–5
3. Middle school: 6–8
4. High school: 9–12
5. Undergraduate
6. Graduate school
7. Other (specify)
–99. Can’t tell
S-6 Other school level (specify) [text entry];
–99 Missing/can’t tell/not applicable
Participant and sample characteristics
P-1 Sample 0. Overall sample
1. Subgroup
P-2 Subgroup specification [text entry];
–99 Missing/can’t tell/not applicable
P-3 Subgroup overlap 0. No
1. Yes
–99. N/A
P-4 Subgroup overlap explanation [text entry];
–99 Missing/can’t tell/not applicable
P-5 Sample size (at start) [text entry];
–99 Missing/can’t tell/not applicable
P-6 On what page(s) did you find the
sample size?
[text entry];
–99 Missing/can’t tell/not applicable
P-7 Sample characteristics 1. Sample at start
2. Analysis sample
3. Both, but they are the same
4. Both, and they are not the same
5. Neither
–99 Missing/can’t tell/not applicable
P-8 Sample characteristics specification [text entry];
–99 Missing/can’t tell/not applicable
P-9 %White [text entry];
–99 Missing/can’t tell/not applicable
P-10 %Black [text entry];
–99 Missing/can’t tell/not applicable
P-11 %Hispanic [text entry];
–99 Missing/can’t tell/not applicable
P-12 %Asian or Pacific Islander [text entry];
–99 Missing/can’t tell/not applicable
P-13 %Native American or American
Indian
[text entry];
–99 Missing/can’t tell/not applicable
P-14 %Other [text entry];
–99 Missing/can’t tell/not applicable
P-15 On what page(s) did you find the
racial/ethnic distribution?
[text entry];
–99 Missing/can’t tell/not applicable
P-16 Grade level -1. Preschool
0. Kindergarten
1. Grade 1
2. Grade 2
3. Grade 3
4. Grade 4
5. Grade 5
6. Grade 6
7. Grade 7
8. Grade 8
9. Grade 9
10. Grade 10
11. Grade 11
12. Grade 12
13. Undergraduate
14. Graduate
15. Other (specify)
–99. Can’t tell
P-17 Grade level (if other) [text entry];
–99 Missing/can’t tell/not applicable
P-18 On what page(s) did you find the
grade level?
[text entry];
–99 Missing/can’t tell/not applicable
P-19 % Female [text entry];
–99 Missing/can’t tell/not applicable
P-20 On what page(s) did you find the %
female statistic?
[text entry];
–99 Missing/can’t tell/not applicable
P-21 % Low income / economically
disadvantaged
[text entry];
–99 Missing/can’t tell/not applicable
P-22 On what page(s) did you find the %
low income statistic?
[text entry];
–99 Missing/can’t tell/not applicable
P-23 % Special education [text entry];
–99 Missing/can’t tell/not applicable
P-24 On what page(s) did you find the %
Special education statistic?
[text entry];
–99 Missing/can’t tell/not applicable
P-25 % English learners [text entry];
–99 Missing/can’t tell/not applicable
P-26 On what page(s) did you find the %
English learner statistic?
[text entry];
–99 Missing/can’t tell/not applicable
Predictor/influence
I-1 Report's name for influence [text entry]
I-2 Influence definition [text entry];
–99 Missing/can’t tell/not applicable
I-3 On what page(s) did you find the
influence definition?
[text entry];
–99 Missing/can’t tell/not applicable
I-4 How is the influence measured? [text entry];
–99 Missing/can’t tell/not applicable
I-5 On what page(s) did you find the
description of how the influence
was measured?
[text entry];
–99 Missing/can’t tell/not applicable
I-6 Reliability 0. No
1. Yes
–99. Unsure, not applicable
I-7 Alpha coefficient (reliability) [text entry];
–99 Missing/can’t tell/not applicable
I-8 Alpha coefficient from what source? 1. Data from this coded study
2. Data from the study for which the survey
was derived
–99. Unsure, not applicable
I-9 On what page did you find the alpha
coefficient?
[text entry];
–99 Missing/can’t tell/not applicable
I-10 How was the influence manipulated
by the researcher?
[text entry];
–99 Missing/can’t tell/not applicable
I-11 On what page(s) did you find the
description of how the researcher
manipulated the influence?
[text entry];
–99 Missing/can’t tell/not applicable
I-12 What type of formative assessment
intervention was used?
1. Professional development
2. Curriculum-embedded assessment
3. Technology-mediated formative
assessment
4. Other
–99. Unsure
I-13 Other type used? [text entry]; –99 not applicable
I-14 What was the duration of the
intervention in years?
[text entry]; –99 not applicable
I-15 Who was the primary feedback
source?
1. Instructor
2. Peers
3. Self
4. Computer-generated
5. Other
Outcome measures
O-1 Outcome type 1. Standardized test (e.g., NAEP, state standardized assessment, Woodcock-Johnson test)
2. Grades (e.g., course, GPA)
3. Knowledge diagnostic test developed by
the researcher/instructor
4. Local assessment (e.g., local school
district)
5. Other achievement
O-2 Outcome name [text entry];
–99 missing/can’t tell/not applicable
O-3 Outcome description [text entry];
–99 missing/can’t tell/not applicable
O-4 On what page(s) did you find the
description of the outcome?
[text entry];
–99 missing/can’t tell/not applicable
O-5 Domain of outcome 1. Mathematics
2. English language arts
3. Science
4. Social science
5. General academics
6. Other (specify)
O-6 Domain of outcome (specified) [text entry];
–99 missing/can’t tell/not applicable
O-7 What is the unit of analysis? 1. Student
2. Teacher
3. Classroom
4. School
5. Other (specify)
–99. Unsure/not applicable
O-8 Other unit of analysis [text entry];
–99 missing/can’t tell/not applicable
O-9 Timing of influence & outcome
measure collection
1. Simultaneously
2. Longitudinally
–99. Unsure
O-10 Specify timing [text entry];
–99 missing/can’t tell/not applicable
O-11 On what page(s) did you find the
timing of data collection described?
[text entry];
–99 missing/can’t tell/not applicable
Research design and effect sizes
E-1 Sample size (for relationship/effect) [text entry];
–99 missing/can’t tell/not applicable
E-2 On what page(s) did you find the
sample size?
[text entry];
–99 missing/can’t tell/not applicable
E-3 Direction of relationship between
influence and outcome
0. Null/no relationship
1. Positive
2. Negative
3. Mixed
–99. Unclear
E-4 Evidence of direction 1. Sign of correlation coefficient
2. Comparing means or rate of success
3. Indication in text
–99. Can’t tell/unclear
E-5 On what page(s) did you find the
direction of the relationship?
[text entry];
–99 missing/can’t tell/not applicable
E-6 In what table did you find the
direction of the relationship?
[text entry];
–99 missing/can’t tell/not applicable
E-7 Type of research design 1. Descriptive study
2. Correlational study
3. One-group/single-group pre-experimental design
4. Quasi-experiment
5. Rct/true experiment (2+ groups)
–99. Can’t tell
E-8 Is there a treatment group and a control group? 0. No
1. Yes
–99. Unclear
E-9 Is there random assignment to treatment and control groups? 0. No
1. Yes
–99. N/A
E-10 On what page did the researchers
specify random assignment?
[text entry];
–99 Missing/can’t tell/not applicable
E-11 Level of assignment 1. Student
2. Teacher
3. Classroom
4. School
5. Other (specify)
–99. Unsure/not applicable
E-12 Other level of assignment [text entry];
–99 missing/can’t tell/not applicable
E-13 Is there matching of treatment units to comparison units? 0. No
1. Yes
–99. N/A
E-14 Matching characteristics [text entry];
–99 missing/can’t tell/not applicable
E-15 On what page(s) did the researchers
indicate matching and matching
characteristics?
[text entry];
–99 missing/can’t tell/not applicable
E-16 Did the researchers report prior-influence or pre-test statistics?
0. No
1. Yes
–99. Can’t tell
E-17 On what page(s) did the researchers
report pre-test statistics?
[text entry];
–99 missing/can’t tell/not applicable
E-18 In what table did the researchers
report pre-test statistics?
[text entry];
–99 missing/can’t tell/not applicable
E-19 Regression
0. No
1. Yes
–99. Can’t tell
E-20 On what page did the researchers
specify using regression?
[text entry];
–99 missing/can’t tell/not applicable
E-21 Multi-level/hierarchical modeling
0. No
1. Yes
–99. Can’t tell
E-22 On what page did the researchers
specify multi-level modeling?
[text entry];
–99 missing/can’t tell/not applicable
Experimental studies
EE-1 What is Nt? [text entry];
–99 Missing/can’t tell/not applicable
EE-2 On what page did you find Nt? [text entry];
–99 Missing/can’t tell/not applicable
EE-3 In what table did you find Nt? [text entry];
–99 Missing/can’t tell/not applicable
EE-4 What is Nc? [text entry];
–99 Missing/can’t tell/not applicable
EE-5 On what page did you find Nc? [text entry];
–99 Missing/can’t tell/not applicable
EE-6 In what table did you find Nc? [text entry];
–99 Missing/can’t tell/not applicable
EE-7 What is Mt? [text entry];
–99 Missing/can’t tell/not applicable
EE-8 On what page did you find Mt? [text entry];
–99 Missing/can’t tell/not applicable
EE-9 In what table did you find Mt? [text entry];
–99 Missing/can’t tell/not applicable
EE-10 What is SDt? [text entry];
–99 Missing/can’t tell/not applicable
EE-11 On what page did you find SDt? [text entry];
–99 Missing/can’t tell/not applicable
EE-12 In what table did you find SDt? [text entry];
–99 Missing/can’t tell/not applicable
EE-13 What is Mc? [text entry];
–99 Missing/can’t tell/not applicable
EE-14 On what page did you find Mc? [text entry];
–99 Missing/can’t tell/not applicable
EE-15 In what table did you find Mc? [text entry];
–99 Missing/can’t tell/not applicable
EE-16 What is SDc? [text entry];
–99 Missing/can’t tell/not applicable
EE-17 On what page did you find SDc? [text entry];
–99 Missing/can’t tell/not applicable
EE-18 In what table did you find SDc? [text entry];
–99 Missing/can’t tell/not applicable
EE-19 What is the effect size (d)? [text entry];
–99 Missing/can’t tell/not applicable
EE-20 What is the variance (v)? [text entry];
–99 Missing/can’t tell/not applicable
EE-21 Screenshot of effect size calculation [Image];
–99 not applicable
EE-22 What is SEt? [text entry];
–99 Missing/can’t tell/not applicable
EE-23 On what page did you find SEt? [text entry];
–99 Missing/can’t tell/not applicable
EE-24 In what table did you find SEt? [text entry];
–99 Missing/can’t tell/not applicable
EE-25 What is SEc? [text entry];
–99 Missing/can’t tell/not applicable
EE-26 On what page did you find SEc? [text entry];
–99 Missing/can’t tell/not applicable
EE-27 In what table did you find SEc? [text entry];
–99 Missing/can’t tell/not applicable
EE-28 What is the effect size (d)? [text entry];
–99 Missing/can’t tell/not applicable
EE-29 What is the variance (v)? [text entry];
–99 Missing/can’t tell/not applicable
EE-30 Screenshot of effect size calculation [Image];
–99 not applicable
EE-31 What is the t-statistic? [text entry];
–99 Missing/can’t tell/not applicable
EE-32 On what page did you find the t-statistic?
[text entry];
–99 Missing/can’t tell/not applicable
EE-33 In what table did you find the t-statistic?
[text entry];
–99 Missing/can’t tell/not applicable
EE-34 What is the effect size (d)? [text entry];
–99 Missing/can’t tell/not applicable
EE-35 What is variance (v)? [text entry];
–99 Missing/can’t tell/not applicable
EE-36 Screenshot of effect size calculation [Image];
–99 not applicable
EE-37 What is the p-value of the t-test? [text entry];
–99 Missing/can’t tell/not applicable
EE-38 On what page did you find the p-value?
[text entry];
–99 Missing/can’t tell/not applicable
EE-39 In what table did you find the p-value? [text entry];
–99 Missing/can’t tell/not applicable
EE-40 What is the effect size (d)? [text entry];
–99 Missing/can’t tell/not applicable
EE-41 What is variance (v)? [text entry];
–99 Missing/can’t tell/not applicable
EE-42 Screenshot of effect size calculation [Image];
–99 not applicable
EE-43 How many groups are compared in the
F-test?
[text entry];
–99 Missing/can’t tell/not applicable
EE-44 What is the F-statistic of the F-test? [text entry];
–99 Missing/can’t tell/not applicable
EE-45 On what page did you find the F-statistic?
[text entry];
–99 Missing/can’t tell/not applicable
EE-46 In what table did you find the F-statistic?
[text entry];
–99 Missing/can’t tell/not applicable
EE-47 What is the effect size (d)? [text entry];
–99 Missing/can’t tell/not applicable
EE-48 What is variance (v)? [text entry];
–99 Missing/can’t tell/not applicable
EE-49 Screenshot of effect size calculation [Image];
–99 not applicable
EE-50 Frequency of yes/favorable outcome
for treatment group
[text entry];
–99 Missing/can’t tell/not applicable
EE-51 Frequency of no/unfavorable outcome
for treatment group
[text entry];
–99 missing/can’t tell/not applicable
EE-52 Frequency of yes/favorable outcome
for control group
[text entry];
–99 missing/can’t tell/not applicable
EE-53 Frequency of no/unfavorable outcome
for control group
[text entry];
–99 missing/can’t tell/not applicable
EE-54 On what page did you find the
contingency table/data for the
contingency table?
[text entry];
–99 missing/can’t tell/not applicable
EE-55 In what table did you find the
contingency table/data for the
contingency table?
[text entry];
–99 missing/can’t tell/not applicable
EE-56 What is the effect size (d)? [text entry];
–99 missing/can’t tell/not applicable
EE-57 What is the variance (v)? [text entry];
–99 missing/can’t tell/not applicable
EE-58 Screenshot of effect size calculation [image];
–99 not applicable
EE-59 d-index calculated?
0. No
1. Yes
–99. Not applicable
EE-60 Effect size from original meta-analysis [text entry];
–99 missing/can’t tell/not applicable
EE-61 On what page did you find the effect size from the original meta-analysis?
[text entry];
–99 missing/can’t tell/not applicable
EE-62 In what table did you find the effect size from the original meta-analysis?
[text entry];
–99 missing/can’t tell/not applicable
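Items EE-31 through EE-58 handle studies that report a test statistic or a 2 × 2 contingency table rather than group means and standard deviations. The sketch below (function names are hypothetical) illustrates the standard meta-analytic conversions to Cohen's d that this coding supports; the resulting d would then receive the same small-sample correction shown after Appendix A.

import math

def d_from_t(t, n_t, n_c):
    # Independent-samples t-statistic to Cohen's d (items EE-31 to EE-36)
    return t * math.sqrt(1 / n_t + 1 / n_c)

def d_from_f(f, n_t, n_c, positive=True):
    # Two-group F-test to Cohen's d (items EE-43 to EE-49).
    # With two groups F = t**2, so the sign must come from the coded
    # direction of the relationship (item E-3).
    d = math.sqrt(f) * math.sqrt(1 / n_t + 1 / n_c)
    return d if positive else -d

def d_from_2x2(t_yes, t_no, c_yes, c_no):
    # 2 x 2 contingency table to Cohen's d via the log odds ratio
    # (items EE-50 to EE-58); returns d and its variance.
    log_or = math.log((t_yes * c_no) / (t_no * c_yes))
    v_log_or = 1 / t_yes + 1 / t_no + 1 / c_yes + 1 / c_no
    return log_or * math.sqrt(3) / math.pi, v_log_or * 3 / math.pi**2

When only a p-value is available (items EE-37 to EE-39), the t-statistic can first be recovered by inverting the t-distribution at the study's degrees of freedom (e.g., with scipy.stats.t.ppf) and then passed to d_from_t.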