POOLING HISTORICAL INFORMATION WHILE ADDRESSING UNCERTAINTY AND BIAS FOR POWER ANALYSIS: A BAYESIAN APPROACH FOR DESIGNING SINGLE-LEVEL AND MULTILEVEL STUDIES

by

Winnie Wing-Yee Tse

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (PSYCHOLOGY)

August 2024

Copyright 2024 Winnie Wing-Yee Tse

ACKNOWLEDGMENTS

I am deeply grateful to my committee members, Dr. Hok Chio (Mark) Lai, Dr. Richard John, Dr. Christopher Beam, Dr. Erika Patall, and Dr. Samantha Anderson, for their exceptional guidance and support throughout the journey of completing this dissertation. In particular, I want to express my heartfelt appreciation to Dr. Lai for his unparalleled mentorship, which not only nurtured my academic growth but also inspired critical thinking. These five years have been a journey filled with growth and challenges, and I am truly thankful to Dr. Lai for all his invaluable guidance and inspiration.

I would also like to extend my gratitude to my undergraduate mentors, Dr. Victoria Savalei, Dr. Xijuan Zhang, and Dr. Jeremy Biesanz, who played key roles at the start of my academic journey. I am thankful to Dr. Savalei for providing me with opportunities to engage in fascinating projects; Dr. Zhang for her generous guidance and support in paving my early academic path; and Dr. Biesanz for inspiring these dissertation projects.

I am immensely grateful to my labmates, Yichi Zhang, Meltem Ozcan, and Jimmy Zhang, for their unwavering companionship and support during the completion of this dissertation. I would like to express special thanks to Yichi for providing both intellectual and emotional support and for walking through the ups and downs of this journey with me. I would also like to acknowledge the valuable contributions of Meltem for her time and assistance in developing the R package, and the collective efforts of my labmates in naming the R package, BACpowr.

I would like to acknowledge the support of the Doctoral Fellowship from the Social Sciences and Humanities Research Council for this dissertation. Lastly, I would like to express my deepest gratitude to my partner, who has been by my side throughout this journey, and to my friends who have anonymously offered their support to my academic endeavors.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter I: INTRODUCTION
  1.1 Uncertainty in Historical Information
  1.2 Bias in Historical Information
  1.3 Pooling Historical Information
  1.4 Purposes of the Current Dissertation
Chapter II: A BAYESIAN PROCEDURE FOR CLASSICAL POWER ANALYSIS
  2.1 A Review of Power Analysis Practices
  2.2 Bayesian Procedure for Power Analysis
    2.2.1 Working Examples
    2.2.2 Input Parameter Distribution
    2.2.3 Power Analysis Goals
  2.3 Conclusion
Chapter III: ADDRESSING PUBLICATION BIAS
  3.1 An Illustrative Example
  3.2 Publication Bias and Uncertainty Adjusted Bayesian Method (PUB)
    3.2.1 Likelihood Function
    3.2.2 Posterior Distribution
    3.2.3 Power Analysis
  3.3 Simulation Study
    3.3.1 Design Factors
    3.3.2 Procedure
    3.3.3 Results
  3.4 Discussion
    3.4.1 The Effect of Publication Bias
    3.4.2 The Effect of Uncertainty
    3.4.3 Suggested Group Size and P-Value
    3.4.4 The Use of Stronger Priors
  3.5 General Recommendations
  3.6 Limitations and Future Directions
  3.7 Conclusion
Chapter IV: SYNTHESIZING INTRACLASS CORRELATION ESTIMATES
  4.1 Random-Effects Meta-Analysis (RMA)
  4.2 Bayesian Power Analysis Procedure
    4.2.1 Illustration of the Bayesian Power Analysis Approaches
  4.3 Power Analysis for Multilevel Studies
  4.4 Meta-Analytic Methods for Pooling ICC Estimates
    4.4.1 Standard Random Effects Meta-Analysis (SRMA)
    4.4.2 Robust Variance Estimation (RVE)
    4.4.3 Bayesian Random Effects Meta-Analysis (BRMA)
  4.5 Simulation Study
    4.5.1 Data Generation
    4.5.2 Data Analysis
    4.5.3 Evaluation Criteria
    4.5.4 Results
  4.6 Illustrative Examples of Meta-Analyzing ICC
  4.7 Discussion
  4.8 Limitations and Future Directions
  4.9 Conclusion
Chapter V: EXTENSION TO MULTILEVEL STUDIES
  5.1 Model Equations
    5.1.1 Continuous Level-2 Predictor
    5.1.2 Continuous Level-1 Predictor With a Fixed Effect
    5.1.3 Continuous Level-1 Predictor With a Random Slope
  5.2 Bayesian Procedure for Power Analysis
    5.2.1 Input Parameter Distributions
    5.2.2 Mean Power and Assurance
  5.3 Simulation Study
    5.3.1 Design of Study 1
    5.3.2 Design of Study 2
    5.3.3 Design of Study 3
    5.3.4 Results
  5.4 Discussion
    5.4.1 Limitations and Future Directions
Chapter VI: CONCLUSION
References
Appendices
  Appendix A: Illustration of the Bayesian Power Analysis Procedure
  Appendix B: An Illustrative Example of Meta-Analyzing ICC

LIST OF TABLES

1. Degrees of Freedom and Definition of ñ for Different t-Tests
2. Summary Statistics of the Estimated Effect Sizes
3. Mean Power of the Traditional and Bayesian Methods
4. Actual Assurance (%) of the Traditional Method, and the Bayesian and TM Methods With an Intended 80% Assurance
5. Actual Assurance (%) of the Traditional Method, and the Bayesian and TM Methods With an Intended 95% Assurance
6. Number of Times a Sample Size Planning Goal Was Used With the TM Method Across 5,000 Iterations
1. Degrees of Freedom and Noncentrality Parameters for Two-Level Models
2. Mean Estimates and Empirical Standard Errors of Study 1
3. Mean Estimates and Empirical Standard Errors of Study 2
4. Mean Estimates and Empirical Standard Errors of Study 3
5. Mean Estimates and Empirical Standard Errors of Study 3 (cont.)

LIST OF FIGURES

1. Likelihood Functions, Prior Distributions, and Posterior Distributions of Examples 1-3
2. Likelihood Function, Prior Distribution, and Posterior Distribution of Example 4
3. Relationship Between Group Size and Power, Mean Power, and Assurance
1. Sampling Distributions of Effect Size
2. Likelihood Functions With and Without Publication Bias
3. Prior and Posterior Distributions
4. Distributions of Suggested Group Sizes Across Methods
5. Relationship Between Suggested Group Sizes and p-Values
6. Distributions of Suggested Group Sizes by Power Analysis Goals of the TM Method
1. Sampling Distributions of ICC Estimated With Restricted Maximum Likelihood Estimation
2. Sampling Distributions of ICC Estimated With Penalized Likelihood Estimation
3. Data Generation Process of the Present Simulation Study
4. Between-Study Distributions of True ICCs Across Conditions
5. Bias in Estimating the Overall ICC
6. RMSE in Estimating the Overall ICC
7. Bias in Estimating the Between-Study Variance
8. RMSE in Estimating the Between-Study Variance
9. 95% CI Coverage
10. Average 95% CI Bounds
11. 95% PI Coverage
12. Average 95% PI Bounds
13. ICC Distributions Obtained From SRMA and BRMA
1. Sampling Distributions of Standardized Fixed Effect Coefficient
2. Sampling Distributions of Effect Heterogeneity
3. Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 1
4. Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 2
5. Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 3
6. Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 1
7. Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 2
8. Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 3
1. Bayesian Procedure for Classical Power Analysis

ABSTRACT

Power analysis requires a priori knowledge about the true values of input parameters, such as an effect size. Ignoring uncertainty in the parameters may result in reduced statistical power in subsequent studies; thus, previous studies have developed a Bayesian procedure to account for uncertainty in classical power analysis. This procedure enables researchers to integrate prior beliefs and historical information from various sources into probability distributions. By utilizing these distributions, researchers can make informed decisions on sample sizes that appropriately address uncertainty. This dissertation introduces three advancements in the Bayesian procedure for classical power analysis. The existing Bayesian procedure has assumed that all historical findings are published, regardless of their statistical significance. Given the prevalence of publication bias in the behavioral and psychological sciences literature, I developed a Bayesian method that addresses both publication bias and uncertainty. In addition to effect size, power analysis for multilevel studies involves more parameters, such as the intraclass correlation (ICC), which measures the strength of association among units within a cluster. While the Bayesian procedure allows the integration of synthesized results to construct the probability distribution, the focus has been on effect size estimates. Therefore, I extended the Bayesian procedure to meta-analyze ICC estimates and utilize the resulting probability distribution for power analysis. Whereas multilevel studies are growing in popularity, the existing Bayesian procedure has been developed for single-level designs and a few multilevel designs with binary predictors (e.g., cluster randomized trials). To bridge this gap, I expanded the Bayesian procedure for power analysis of three two-level designs with continuous predictors. These advancements offer researchers methodologies to address uncertainty and bias in designing various types of studies.
CHAPTER I
INTRODUCTION

Sample size planning is a fundamental consideration in designing a new scientific or experimental study. To yield meaningful results, the sample size must be sufficiently large to ensure a high probability of detecting a true effect if it exists—a concept known as statistical power. In light of concerns about replicability in research (e.g., Open Science Collaboration, 2015), it has become increasingly important to justify sample size before data collection, such as in preregistration (Moore, 2016; Van 't Veer and Giner-Sorolla, 2016; Simmons et al., 2021) and registered reports (Chambers and Tzavella, 2021). Insufficient sample sizes can lead to missing important effects due to low statistical power or to falsely detecting non-existent effects, impacting the credibility of research outcomes (Simmons et al., 2011). Therefore, determining an adequately powered sample size is a crucial aspect of study planning.

1.1 Uncertainty in Historical Information

Power analysis requires a priori knowledge about the true values of design parameters, such as the effect size (i.e., the strength of an effect). As researchers often have limited information about the true values, the conventional practice is to replace these values with the best educated estimates, which entail uncertainty. However, past research has shown that using a single best guess and ignoring uncertainty could result in an underestimated sample size (Anderson et al., 2017; Du and Wang, 2016; McShane and Böckenholt, 2016; Tse and Lai, 2023). For example, suppose that a historical study reported a standardized mean difference estimate of 0.31 with a 95% confidence interval of [0.01, 0.61]. McShane and Böckenholt (2016) found that 214 participants are required to achieve .80 power on average over the uncertainty in the effect size estimate for a one-sided independent samples t-test. Ignoring the uncertainty and using 0.31 as the best guess, however, one would believe that the sample size requirement to achieve .80 power is 130, with which the power of a new study is only .71 on average over the uncertainty in the effect size estimate.

An approach that addresses uncertainty is to conduct classical power analysis in the Bayesian framework. Due to its hybridity, this approach has also been known as a hybrid classical-Bayesian (Spiegelhalter et al., 2004) or Bayesian-classical hybrid (Pek and Park, 2019) approach. In essence, this approach utilizes historical information to update one's beliefs about the true value of a parameter, described with a probability distribution. Based on historical information, one may believe that certain values are more probable to be the population effect size, to a degree of uncertainty. Instead of a single best guess, the distribution of the population effect size, which incorporates uncertainty, is then used to perform power analysis (Du and Wang, 2016; Kruschke, 2013; McShane and Böckenholt, 2016; Pek and Park, 2019; Spiegelhalter and Freedman, 1986). The Bayesian approach has been shown to achieve a higher mean power and a higher probability of achieving the desired level of power than the approach that ignores uncertainty in the parameter estimates (McShane and Böckenholt, 2016; Tse and Lai, 2023).
For single-level designs, previous studies have explored using the Bayesian approach to account for uncertainty in the effect size estimates (e.g., Du and Wang, 2016; McShane and Böckenholt, 2016) or the true effect size (e.g., Pek and Park, 2019), and related software programs are available (e.g., the R Shiny app PCES, McShane and Böckenholt, 2016; the R package HybridPower, Park and Pek, 2019). With the Bayesian approach, researchers can address different sources of uncertainty, including the sampling variability in an estimate from a historical study (McShane and Böckenholt, 2016; Pek and Park, 2019), between-study variability in estimates from multiple historical studies (Du and Wang, 2016), and limited knowledge of the true effect size (Pek and Park, 2019). Discussions in the literature have focused on how to determine the distribution of effect size to incorporate historical information, and the uncertainty in that information, for power analysis. Returning to the previous example, given the information from a historical study, one can use the best guess value, 0.31, as the mean of a normal distribution with a standard deviation of 0.15 that reflects the sampling variability. Power analysis with this effect size distribution accounts for uncertainty due to sampling variability in the historical effect size estimate. Other approaches to address various types of uncertainty will be introduced in subsequent discussions.

For multilevel designs, the recent development of the Bayesian approach extends to cluster randomized trials (CRTs; Tse and Lai, 2023; Williamson et al., 2023; Wilson, 2023) and multisite randomized trials (MSRTs; Tse and Lai, 2022). These designs are commonly used for evaluating a clinical or educational treatment by randomizing clusters or participants into treatment conditions. To design new CRTs or MSRTs, the Bayesian approach incorporates the prior distributions of not only the treatment effect size but also the intraclass correlation (ICC) and the effect size heterogeneity across clusters (Tse and Lai, 2023; Tse and Lai, 2022). Sample size planning for these designs focuses on the main effect of a binary predictor, the treatment, on an outcome variable. However, many other multilevel designs aim to evaluate the effect of a continuous predictor on an outcome, such as the impact of school bullying and student-teacher connectedness on academic performance (Konishi et al., 2010), the correlations in couples' hormone levels on postpartum relationship quality (Saxbe et al., 2017), and the effect of daily stress exposure and affective responses to daily stressors on mortality risk (Chiang et al., 2018). A valuable research direction is to expand the Bayesian approach for planning multilevel designs that involve continuous predictors.

1.2 Bias in Historical Information

In addition to uncertainty, information from historical studies can be biased if the literature only includes publications with statistically significant findings—an issue known as publication bias (Rothstein et al., 2005), the statistical significance filter (Gelman, 2018; Vasishth et al., 2018), and the file drawer problem (Rosenthal, 1979). With publication bias, the findings that remain after the filter is applied are likely to have an overestimated effect size (Anderson et al., 2017; Franco et al., 2014; Gelman and Carlin, 2014).
For example, Open Science Collaboration (2015) conducted 100 replication studies and found that the mean effect size of the original studies was double the magnitude of the mean effect size of the replication studies. Although multiple factors could have contributed to the difference in the mean effect sizes, publication bias might have played a key role in inflating the mean effect size, as the nonsignificant findings were filtered out of the literature (Open Science Collaboration, 2015). As another example, Anderson et al. (2017) found that, with 25 participants per group for a two-sample t-test, effect size estimates have to be at least as large as 0.57 to be published in the literature if p < .05 is required. Nonetheless, Anderson et al. (2017) reported that in this case the population effect size is most likely 0.16, which is far lower than the minimum publishable effect size estimate in this example (Hedges, 1984).

Despite its strength in addressing uncertainty, the previously mentioned Bayesian approach has required the assumption that no publication bias exists in the historical information. When information is biased, the Bayesian approach may also underestimate the sample size requirement, leading to studies with low power. To mitigate this issue, it is beneficial to extend the Bayesian approach to incorporate methods for addressing publication bias in historical information.

1.3 Pooling Historical Information

Du and Wang (2016) proposed a Bayesian procedure to incorporate uncertainty in effect size estimates from multiple studies, including the within- and between-study variabilities, for power analysis. Historical studies that examined a similar outcome may have resulted in variable effect size estimates due to different study designs, population characteristics, and sampling variability. Random-effects meta-analysis is a popular methodology for synthesizing historical results to estimate the overall effect size, as well as the variability in the effect size estimates across studies. The conventional practice is to use the overall effect size as the best guess of the population effect size for power analysis, ignoring the uncertainty in the effect size estimates. The strength of the Bayesian procedure by Du and Wang (2016) is that it captures the probability distribution of the population effect size using Bayesian random-effects meta-analysis. Using the resulting distribution in power analysis allows researchers to address uncertainty due to within- and between-study variabilities.

For multilevel designs, Tse and Lai (2023) have discussed the extension of this approach to CRTs to incorporate within- and between-study variances of the effect size and ICC estimates from multiple studies. The general idea is to first obtain the overall parameter estimate and the total variance (i.e., the sum of the within- and between-study variances) from a meta-analysis and then construct the probability distributions of the parameters for power analysis (Tse and Lai, 2023). However, there has been limited discussion on the estimation method for the between-study variability of ICCs. A valuable extension of Du and Wang's (2016) procedure and Tse and Lai's (2023) idea is to perform Bayesian random-effects meta-analysis and obtain the probability distribution of the ICC for power analysis. When pooling ICC estimates, a concern is that their distribution is typically nonnormal and positively skewed (Bhat and Beretvas, 2022; Hedberg and Hedges, 2014).
The standard random-effects meta-analysis model and the Bayesian model in Du and Wang (2016) require the assumption that the errors follow a normal distribution. Modeling ICC values with a normal distribution imposes the assumption that ICC values could potentially be negative, whereas the ICC is bounded between 0 and 1. Past research has evaluated random-effects meta-analysis methods in modeling parameters with nonnormality at the between-study level (e.g., Kontopantelis and Reeves, 2012a). However, there has been limited research examining random-effects meta-analysis methods in pooling parameters with nonnormally distributed errors at the within-study level. Given that the ICC violates the assumption of normally distributed errors, it remains a question whether the standard random-effects meta-analysis provides accurate estimation and inferences in pooling ICC estimates. It would be useful to explore and assess meta-analysis methods that relax the normality assumption, as well as an extended procedure of Du and Wang (2016) for power analysis that addresses nonnormality.

1.4 Purposes of the Current Dissertation

The goal of this dissertation is to develop a Bayesian procedure for classical power analysis for single-level and multilevel studies. This procedure (a) addresses uncertainty and publication bias in designing single-level studies (Chapter III), (b) meta-analytically pools historical information from multiple studies to determine the probability distribution of parameters, particularly the ICC (Chapter IV), and (c) determines adequately powered sample sizes for multilevel studies (Chapter V).

In the next chapter, I will present the findings from a brief review of how researchers justify their sample sizes in preregistrations on the Open Science Framework. I summarize the conventional procedure for power analysis based on the review. Building upon the current practices, I then introduce the general Bayesian procedure for power analysis.

In Chapter III, I will explore how to address publication bias with the proposed Bayesian approach, named the Publication-Bias-and-Uncertainty-Adjusted Bayesian (PUB) method, when designing single-level studies. The PUB method incorporates historical information as well as the understanding that this information is likely affected by publication bias. With this method, researchers can determine an effect size distribution that accounts for both uncertainty and publication bias for power analysis. The chapter opens with an example illustrating the impact of publication bias on power analysis. Following the example, I delve into the mathematical details of the PUB method, in particular how to obtain an effect size distribution that addresses publication bias and uncertainty. I then present a simulation study that evaluates power analysis approaches in their ability to address uncertainty and bias.

In Chapter IV, I will propose Bayesian random-effects meta-analysis methods to pool historical ICC estimates. The goals of this synthesis are two-fold: to derive the probability distribution of the ICC for power analysis, and to estimate the overall ICC and the variability of the ICC estimates across studies. The chapter will begin with an overview of random-effects meta-analysis methods, followed by the introduction of the proposed Bayesian meta-analysis methods. Next, I will discuss a simulation study that compares the performance of random-effects meta-analysis methods in pooling ICC estimates.
The chapter concludes with an example demonstrating the proposed Bayesian method for pooling ICC estimates and obtaining the ICC probability distribution for power analysis.

In Chapter V, I will expand the Bayesian approach to power analysis for two-level studies with continuous predictors. In a two-level model, power analysis requires the input of variance components, such as the cross-cluster random effects and error variances (Snijders, 2005), which are often unintuitive for researchers to define. In this chapter, I reparameterize the formulas in Snijders (2005) in terms of fixed effects, the ICC, and effect size heterogeneity, which are common parameters in the power analysis literature for CRTs and MSRTs (e.g., Dong and Maynard, 2013). Researchers may more easily interpret these parameters and determine their values for power analysis based on the existing literature on these parameters. I further derive the formulas for cluster-mean-centered level-1 predictors and provide recommendations on determining the probability distributions of the parameters. I close the chapter with three simulation studies that assess the performance of the Bayesian approach in designing two-level studies with a continuous predictor.

CHAPTER II
A BAYESIAN PROCEDURE FOR CLASSICAL POWER ANALYSIS

Sample size justification before data collection has received growing emphasis in research practice. It is also a major component of preregistration, a process in which researchers document their study plans, such as research designs, hypotheses, and statistical analyses, before initiating the study. One motivation behind this practice is to underscore the importance of determining a sample size with sufficient statistical power in a new study (Open Science Collaboration, 2017; Van Den Akker et al., 2023). To better understand recent practices in power analysis, I conducted a brief review of how researchers justify their sample size selection in preregistrations on the Open Science Framework. Based on the review, I outline a typical procedure and common practices in power analysis and introduce how the Bayesian procedure builds upon the current practices. The mathematical foundations of this procedure then follow.

2.1 A Review of Power Analysis Practices

To understand recent practices in power analysis, I scraped and reviewed the sections "Sample size" and "Sample size rationale" in recent preregistrations on the Open Science Framework. Due to the massive amount of data, I narrowed the scope to the last 500 preregistrations in 2023. All 500 preregistrations were manually annotated, and 219 of them were retained in the review after excluding records that were (a) duplicated (n = 19), (b) missing or unclear in their sample size justification (n = 103), (c) not written in English (n = 34), or (d) not involving statistical hypothesis testing or data collection with human subjects (e.g., scoping reviews; n = 110). The scope of the present review is on power analysis, with which researchers determine a sample size that achieves the desired level of power. Among the reviewed preregistrations, 133 justified the sample size selection with a power analysis. Based on the review, the procedure of power analysis can generally be summarized into five steps:

1. Selecting statistical models and tests to be performed in the new study,
2. Reviewing historical information to draw insights into the true values of the input parameters,
3. Selecting input parameter values,
4. Identifying the power analysis goal, and
5. Determining the sample size that achieves the goal.

In the first step, researchers identify the statistical models and tests that will be performed in the new study. Depending on the statistical tests, power analysis requires different input parameters. For example, in a two-sample t-test, the input parameter is the true effect size—the standardized mean difference in the outcome between the two populations of interest. In a two-level CRT, the input parameters include not only the true cluster-level effect size for the outcome of interest but also the true ICC—a parameter that measures the strength of association among units within a cluster.

While the true values of input parameters are often unknown, in Steps 2 and 3, researchers often draw insights from historical studies and their own knowledge to select the input parameter values for power analysis. As shown in the review, common practices include adopting the parameter estimates as the best guess values from a pilot study (n = 11), an original study to be replicated (n = 6), a historical study that had a similar outcome (n = 34), multiple historical studies, including meta-analyses (n = 21), or a mix of different sources of information (n = 9). When historical information was unavailable, researchers selected an effect size based on their theoretical understanding (n = 85), an effect size that has been commonly reported in the field (n = 2), or the smallest effect size that is deemed meaningful (n = 2).

In several registrations, researchers acknowledged different sources of uncertainty when selecting the input parameter values. One source of uncertainty is the lack of historical studies on the effect size of interest, in which case the input value of the effect size was selected based on theoretical understanding (e.g., Meliksetian et al., 2023; Chhabra et al., 2023). Another source of uncertainty reported was the difference in study characteristics (e.g., methods) between the historical study and the new study. In such cases, researchers (e.g., Sattelmayer et al., 2023; Sampaio and Boggio, 2023) decided to adopt a smaller effect size than the historical estimate in the power analysis.

In Step 4, researchers identify the power analysis goal; that is, the level of power the new study will achieve. As found in the review, common goals are to have .80, .90, and .95 power to detect the true effect size. Finally, Step 5 entails calculating the necessary sample size to meet this goal. The resulting sample size that achieves the selected goal (e.g., .80 power) was reported as the target sample size for the new study.

Accurate power analysis requires that the input parameter values align with the true parameter values. However, choosing these values accurately in Step 3 poses a particular challenge because the parameter values are often unknown. The sources of uncertainty include differences in study characteristics and the lack of investigation of the effect of interest in the literature. Although not explicitly discussed in the reviewed preregistrations, the variability in historical parameter estimates due to sampling may introduce further complexity. Therefore, methods that enable researchers to incorporate both the available information and the associated uncertainty are beneficial for conducting power analysis.
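To make Steps 3 through 5 concrete, the sketch below carries out the conventional procedure in R with a single best-guess value; the effect size of 0.5 and the .80 target are hypothetical inputs chosen for illustration rather than values taken from a reviewed preregistration.

# Conventional procedure, Steps 3-5: a single best-guess effect size (here a
# hypothetical d = 0.5), a target power, and the sample size that achieves it
best_guess_d <- 0.5                    # Step 3: one input value, no uncertainty
target_power <- .80                    # Step 4: power analysis goal
out <- power.t.test(delta = best_guess_d, sd = 1, sig.level = .05,
                    power = target_power, type = "two.sample",
                    alternative = "two.sided")   # Step 5
ceiling(out$n)                         # suggested size per group (about 64 here)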
2.2 Bayesian Procedure for Power Analysis

The Bayesian procedure builds upon the current practices and involves an approach that integrates information and uncertainty into power analysis. In this procedure, Steps 1 and 2 are the same as in typical practice: researchers select the statistical methods and gather information from the literature. However, in Step 3, instead of selecting an input parameter value, researchers determine an input parameter distribution reflecting the available information and uncertainty for power analysis. Steps 4 and 5 parallel the typical procedure, focusing on setting a power analysis goal and calculating the necessary sample size to achieve the goal. Because the input takes the form of a distribution incorporating uncertainty instead of a single value, the power analysis goals differ from typical practice. Specifically, the considerations include the average probability of detecting a true effect (i.e., mean power) and the probability of achieving the intended level of power (i.e., assurance) across the uncertainty. I will return to the discussion of these goals towards the end of this chapter.

2.2.1 Working Examples

The essence of the Bayesian approach is to determine a distribution of the parameter of interest given the available information and uncertainty. In the following discussion, I consider three examples. All of these examples aim to perform an independent samples t-test to compare the means between two groups. The goal is to determine the necessary sample size that meets .80 power to detect an effect size (i.e., a standardized group mean difference) in a t-test at a .05 significance level. These examples consider situations where researchers have (a) uncertainty in an effect size estimate from a historical study, (b) uncertainty in effect size estimates from multiple historical studies, or (c) a lack of historical information about the effect size of interest.

The first example comes from McShane and Böckenholt (2016). Suppose that researchers want to have a sufficient sample size to replicate the findings of Study 2 of Iyengar and Lepper (2000), which investigated the impact of the number of choices on students' academic performance by comparing the scores of the treatment and control groups. Given the reported estimates in Iyengar and Lepper (2000), the effect size estimate can be calculated as 0.43 and its associated standard error as 0.18 for group sizes of 52 and 74. [Footnote 1: The estimate and uncertainty in McShane and Böckenholt (2016) are in the unstandardized metric. Here, the effect size estimate is calculated as d = (mean difference) / (pooled standard deviation) = 0.4 / 0.92 = 0.43, and the standard error of the effect size estimate is √[(n1 + n2)/(n1 n2) + d²/(2(n1 + n2))] = √[(52 + 74)/((52)(74)) + 0.43²/(2(52 + 74))] = 0.18.] The standard error of 0.18 reflects the uncertainty in the effect size estimate due to sampling variability. To incorporate the available information and uncertainty, researchers can set 0.43 as the mean and 0.18 as the uncertainty (i.e., standard deviation) of the effect size distribution. Assuming that the effect size follows a normal distribution, Figure 1c shows the distribution of the effect size as the input parameter distribution for power analysis.

The second example is drawn from Du and Wang (2016). Consider that researchers are planning a new study to compare mental spatial rotation ability between males and females. They reviewed a meta-analysis across 29 studies conducted by Linn (1985). The overall effect size estimate was 0.72, and the effect size estimates varied from -0.08 to 1.28 across the 29 studies, reflecting a substantial amount of between-study heterogeneity (Du and Wang, 2016; Linn, 1985). Du and Wang (2016) adopted a Bayesian meta-analysis technique to determine the distribution of the effect size, incorporating the information about the effect size as well as the uncertainty due to between- and within-study variabilities. As will be discussed in Chapter IV, researchers can alternatively opt for a frequentist meta-analysis technique to obtain the necessary information to construct the effect size distribution. Based on the information synthesized with a frequentist meta-analysis, Figure 1f shows the effect size distribution, which can be used as the input parameter distribution for power analysis.

The third example, taken from Park and Pek (2022), focuses on determining the sample size for a new study investigating the differences in short-term memory between conditions in which objects are grouped together or apart. In this example, researchers plan to introduce a novel experimental manipulation whose effect has received limited investigation in the literature. Relying on theoretical understanding, they believe that the population effect size is most likely 0.5, with a 50% chance of falling between 0.2 and 0.8. In this scenario, Park and Pek (2022) recommended setting the input distribution as a normal distribution with a mean of 0.5 and a standard deviation of 0.44. Figure 1i shows the input distribution of the effect size, which reflects the researchers' understanding, for conducting power analysis.

Figure 1: Likelihood Functions, Prior Distributions, and Posterior Distributions of Examples 1-3.

2.2.2 Input Parameter Distribution

A practical question is how to determine the input parameter distribution. Here I introduce a general framework to conceptualize the construction of the input parameter distribution for power analysis. Within this framework, the input parameter distribution is defined as the posterior distribution, which comprises the likelihood function and the prior distribution. The likelihood function captures information from historical studies, while the prior distribution represents researchers' beliefs about the parameter. In essence, we synthesize available data and prior knowledge to construct the probability distribution of the input parameter, known as the posterior distribution, for power analysis.

In statistical analysis, a common question that arises is how likely we would have been to observe the historical parameter estimate for a given population value. The likelihood function, p(y|θ), denotes the likelihood of observing the data, y, for various population parameter values, θ. Figure 1a presents the likelihood function of Example 1. With the total sample size of 126 (52 + 74), there is inherent uncertainty in the effect size estimate of 0.43 attributable to sampling variability. While the likelihood function indicates that 0.43 is the most likely population effect size, it also suggests that other population effect sizes could have yielded the effect size estimate, although with lower likelihood. Similarly, in Example 2, the likelihood function in Figure 1d indicates that a population effect size of 0.72 is most likely given an overall effect size estimate of 0.72 across 29 studies. This likelihood function accounts for the uncertainty arising from sampling variability within each of the studies, as well as the heterogeneity among the studies.
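As a concrete illustration of Example 1, the R sketch below reproduces the footnote calculation of the effect size and its standard error and then evaluates the likelihood over a grid of candidate population effect sizes, under the same normal approximation to the sampling distribution assumed in the text.

# Example 1: effect size, standard error, and likelihood function (Figure 1a)
m_diff <- 0.4; sd_pooled <- 0.92       # reported mean difference and pooled SD
n1 <- 52; n2 <- 74                     # group sizes in Iyengar and Lepper (2000)
d_hat <- m_diff / sd_pooled            # about 0.43
se_d  <- sqrt((n1 + n2) / (n1 * n2) + d_hat^2 / (2 * (n1 + n2)))   # about 0.18

# Likelihood of the observed estimate for candidate population effect sizes,
# assuming the estimate is approximately normally distributed around the truth
delta_grid <- seq(-0.5, 1.5, by = 0.01)
likelihood <- dnorm(d_hat, mean = delta_grid, sd = se_d)
delta_grid[which.max(likelihood)]      # peaks near the observed estimate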
Conversely, in Example 3, where historical data are lacking, the likelihood function appears as a flat line, providing no information for constructing the input parameter distribution.

A prior distribution, denoted as p(θ), reflects researchers' prior beliefs about the population parameter, θ. Based on the review in Section 2.1, in the absence of historical information, researchers often turned to theoretical understanding to determine the input parameter value. Within the proposed framework, an informative prior distribution can be specified to reflect researchers' beliefs about the most probable parameter value, with a degree of uncertainty. For instance, in Example 3, researchers may opt to define a normal prior distribution with a mean of 0.5 and a standard deviation of 0.44 to incorporate their belief and uncertainty about the population effect size. The prior distribution depicted in Figure 1h illustrates that the most probable value is 0.5, with a 50% probability that the population effect size lies between 0.2 and 0.8. In contrast, in Examples 1 and 2, researchers may choose a non-informative prior distribution, as shown in Figures 1b and 1e, which imposes no prior beliefs in constructing the input parameter distribution.

A posterior distribution, denoted as p(θ|y), is the product of a likelihood function, p(y|θ), and a prior distribution, π(θ), reflecting the combination of historical information and prior belief. Bayes' rule (Bayes and Price, 1763) defines the posterior distribution as follows:

p(θ|y) ∝ p(y|θ) π(θ). (2.1)

In Examples 1 and 2, as the prior distributions provide no information, the shape of the posterior distribution is the same as that of the likelihood function. On the other hand, in Example 3, the likelihood function provides no information, whereas the prior does, and thus the posterior distribution takes the same shape as the prior distribution. The resulting posterior distribution can then be used as the input distribution for conducting power analysis.

As readers may notice, a simplification of this framework involves directly constructing the input parameter distribution when relying solely on historical information or prior belief (but not both). For instance, in Example 1, assuming that the effect size follows a normal distribution, we can establish the input parameter distribution as a normal distribution with a mean of 0.43 and a standard deviation of 0.18, yielding the posterior distribution in Figure 1c. Of note, the approach proposed by McShane and Böckenholt (2016) involves directly utilizing a normal distribution as the input parameter distribution for the unstandardized mean difference. In Example 3, the posterior distribution is the same as the prior distribution, as the likelihood function provides no information. In this scenario, setting the input parameter distribution to the normal distribution proposed by Park and Pek (2022) aligns exactly with the posterior distribution in Figure 1i.

The proposed framework is general and flexible for researchers to construct input parameter distributions. While many methods require researchers to estimate the parameter values based on historical data and theoretical insights, this framework allows researchers to differentiate between these information sources and incorporate them into a probability distribution.
Since most studies aim to explore unknown aspects in the literature, there is inherent uncertainty around the parameter of interest that necessitates researchers' judgment even in the presence of historical information. Let us consider the final example of this chapter, which is extracted from the review conducted in Section 2.1. Sampaio and Boggio (2023) planned a study to compare the difference in the perception of contributions of out-group members between two groups, such as racial or gender groups. They referenced a similar study by Campanhã et al. (2011), who evaluated the perception of fairness with a dependent samples t-test; the effect size estimate was 1.16 with a total sample size of 15. [Footnote 2: The effect size can be derived from the reported t-statistic, t(14) = 4.5, and the total sample size of 15 in Campanhã et al. (2011) for a dependent samples t-test.] However, they noted the differences between the methods of Campanhã et al. (2011) and their study, and decided to adopt a more conservative effect size of 0.5 in the Cohen's d metric (Cohen, 1988).

Within the proposed framework, we can derive the likelihood function using the historical data of an effect size estimate of 1.16 with a sample size of 15. Given the small sample size, the likelihood function suggests that a broad range of population effect size values between 0 and 2 are likely (Figure 2a). Considering the differences in study designs (e.g., within-subject vs. between-subject) and methods (e.g., dependent vs. independent samples t-tests), we can choose a relatively conservative and informative normal prior distribution. Here I use a normal prior with a mean of 0.5 and a standard deviation of 0.2 as an example. This prior reflects our prior belief about the effect size for the perception of contributions, with the most probable value being 0.5 and a 95% prior probability that the population effect size falls between 0.11 and 0.89 (Figure 2b). By integrating historical data and prior beliefs, the resulting posterior distribution indicates 0.78 as the most probable effect size value, with a 95% posterior probability that the population effect size lies between 0.48 and 1.08 (Figure 2c). In cases where a more conservative prior is preferred, researchers can consider lowering the mean of the prior. They may also further reduce the standard deviation to indicate a belief that there is a higher probability that the population effect size is close to the chosen, conservative mean.

Figure 2: Likelihood Function, Prior Distribution, and Posterior Distribution of Example 4.
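The update behind Example 4 can be sketched as a conjugate normal-normal calculation: a normal approximation to the likelihood of the historical estimate is combined with the N(0.5, 0.2) prior described above. Several standard error approximations exist for a dependent samples design, so the standard error used below is an assumption, and the resulting posterior summaries will differ somewhat from the 0.78 and [0.48, 1.08] quoted in the text; the sketch shows the mechanics of Equation 2.1 rather than reproducing those values exactly.

# Normal approximation to the likelihood from Campanhã et al. (2011):
# d = t / sqrt(n); the standard error below is one of several possible
# approximations for a dependent samples design (an assumption)
n_hist  <- 15
d_hist  <- 4.5 / sqrt(n_hist)                     # about 1.16
se_hist <- sqrt(1 / n_hist + d_hist^2 / (2 * n_hist))

# Informative prior reflecting the design differences
prior_mean <- 0.5; prior_sd <- 0.2

# Conjugate normal-normal update (precision weighting)
w_prior <- 1 / prior_sd^2
w_data  <- 1 / se_hist^2
post_mean <- (w_prior * prior_mean + w_data * d_hist) / (w_prior + w_data)
post_sd   <- sqrt(1 / (w_prior + w_data))
c(mode = post_mean, post_mean + c(lower = -1, upper = 1) * qnorm(.975) * post_sd)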
As found in the review of current practices (2.1), researchers commonly select .80, .90, and .95 as the target power level to determine the necessary sample size for a new study. With the Bayesian power analysis procedure, since the input is a distribution rather than a value, power also takes the form of a distribution. The power distribution is often summarized with the mean (e.g., McShane and Bockenholt, ¨ 2016; Spiegelhalter and Freedman, 1986; Liu and Wang, 2019; Pek and Park, 2019) and assurance (e.g., Du and Wang, 2016; 16 Anderson et al., 2017; Pek and Park, 2019). The mean power (Liu and Wang, 2019), as known as the expected power (McShane and Bockenholt, ¨ 2016; Spiegelhalter et al., 2004; Spiegelhalter and Freedman, 1986) and average power (Gillett, 1994), is the weighted average of power. The mean power indicates the average probability of detecting a true effect over the uncertainty in the population parameter. The general equation of mean power is given by E(Power) = Z Θ p(Reject H0|θ, n)p(θ|y)dθ, (2.2) where θ is the input parameter vector for θ ∈ Θ, p(Reject H0|θ, n) is the power function for a given sample size vector n and θ, and p(θ|y) is the posterior probability density function. For an independent samplest-test, the effect size (δ) is the only input parameter, and the mean power can be expressed as E(Power) = R ∞ −∞ p(Reject H0|δ, n1, n2)p(δ|y)dδ, for a specified size of group 1 (n1) and group 2 (n2). For example, if a new study with n1 = n2 = 50 has a mean power of .76, this study will have an average of 76% probability of detecting the effect size of interest adjusting across the specified uncertainty in the effect size. The assurance (Anderson et al., 2017; Liu and Wang, 2019), also known as the assurance level (Du and Wang, 2016; Pek and Park, 2019), denotes the probability of achieving the intended level of power accounting for the uncertainty. The general equation of assurance is given by A(Power) = Z ΘL p(θ|y)dθ, (2.3) where ΘL is the parameter space of θ associated with power values at or higher than the intended level of power, L, i.e., p(Reject H0|θ, n) ≥ L. For instance, for L = .80 and a study design of n1 = n2 = 50, a 60% assurance means that there is a 60% chance of the study to achieve .80 power, accounting for the uncertainty specified in the effect size distribution. Note that as the sample size of the new study increases, so do the power, mean power, and assurance. However, the relationship between sample size and power, mean power, or assurance is non-linear, as shown in Figure 3. Specifically, increasing the sample size has a stronger impact when the initial power, mean power, and assurance are low, but a weaker impact otherwise. Moreover, achieving a higher assurance requires a larger sample size 17 Figure 3: Relationship Between Group Size and Power, Mean Power, and Assurance. compared to achieving a higher mean power. Referring to Example 1, with a group size of 100, a new study will have .75 mean power and 57% assurance of achieving .80 power. To reach a mean power of .80 and an 80% assurance, a group size of 129 and 206 are required, respectively. The selection of target mean power and assurance depends on the researchers’ goals of sample size planning. The first general recommendation is to set the mean power as the intended power level. 
The selection of the target mean power and assurance depends on the researchers' goals for sample size planning. The first general recommendation is to set the mean power at the intended power level. For example, if the intended power is .80 under the conventional procedure, the mean power can be set at .80 with the Bayesian procedure, to ensure that the average probability of detecting the true effect is 80% over the specified uncertainty. The second recommendation is to choose an assurance larger than 50% to account for the uncertainty (Anderson et al., 2017) and ensure a greater chance of achieving the desired level of power.

A strength of the Bayesian procedure for power analysis is that it enables researchers to probabilistically describe the outcomes of power analysis (Du and Wang, 2016). While increasing the sample size can be challenging due to budget constraints and other practical concerns, the Bayesian procedure provides insights into the outcomes of a study with a specific sample size in terms of its mean power level and its chance of achieving the intended power.

2.3 Conclusion

In this chapter, I reviewed recent power analysis practices and presented the Bayesian procedure for power analysis. The Bayesian procedure parallels the typical procedure, with the key difference of using an input parameter distribution rather than a best guess value for power analysis. The input parameter distribution reflects the available information and uncertainty about the input parameter of interest. Power analysis with this distribution addresses uncertainty and allows researchers to understand the outcomes of power analysis probabilistically, such as the average probability of detecting a true effect and the probability of achieving the intended level of power. In Section 2.2.2, the discussion of the likelihood function relied on the assumption that the historical information is free of bias. In Chapter III, I will delve into the issue of publication bias and develop a Bayesian approach to address this bias. In Chapter IV, I will revisit Example 2 of this chapter on the Bayesian meta-analysis technique to determine the effect size distribution and extend this technique to construct the intraclass correlation distribution for multilevel studies. Finally, in Chapter V, I will expand the proposed Bayesian framework to multilevel studies.

CHAPTER III
ADDRESSING PUBLICATION BIAS

Sample size planning often relies on information from historical studies, such as parameter estimates. However, such information can be biased if only publications with statistically significant findings are present in the literature. This issue is often known as publication bias (Rothstein et al., 2005) and the statistical significance filter (Gelman, 2018; Vasishth et al., 2018). With publication bias, findings that achieve statistical significance are likely to have an overestimated effect size (Anderson et al., 2017; Franco et al., 2014; Gelman and Carlin, 2014).

Inflated effect size estimates in historical studies can have a detrimental effect on sample size planning for new studies. Researchers commonly employ power analysis to ensure that a new study has a high probability of detecting a true effect if it exists. Power analysis requires the input of the true effect size, which is unknown and often replaced by an estimate taken at face value from a historical study (Anderson et al., 2017). When publication bias is present, an overestimated effect size will yield smaller-than-necessary sample sizes, leading to low power in the new studies. Furthermore, given that data are sampled from populations and naturally vary from one sample to another, effect size estimates are subject to variability.
This sampling variability can aggravate the risk of underestimating the sample size requirement (Anderson et al., 2017; Du and Wang, 2016; McShane and Böckenholt, 2016). To address publication bias and uncertainty, Taylor and Muller (1996) introduced a likelihood-based method that adjusts for the overestimation in historical information for power analysis. Taylor and Muller's (TM, 1996) method has been applied to various statistical methods, including t-tests, ANOVA (Anderson et al., 2017), and regression (Anderson, 2021), and implemented in an R package named Bias- and Uncertainty-Corrected Sample Size (BUCSS; Anderson and Kelley, 2020). Anderson and colleagues (2017; 2021) have systematically evaluated the TM method by repeatedly simulating historical studies and planning the sample size for a new study with the TM method. They found that the TM method adequately adjusted for publication bias and produced appropriate power in most conditions (Anderson et al., 2017; Anderson and Maxwell, 2017). However, challenges were observed when both the population effect size and the size of the historical study were small, where the TM method could face difficulty in adjusting for the strong bias in the effect size estimates (Anderson et al., 2017).

A potential alternative method to address uncertainty and publication bias is to adopt a Bayesian method for classical power analysis. In essence, this method utilizes historical information to update one's beliefs about the population effect size, described with a probability distribution. Based on historical information, one may believe that certain values are more probable than others to be the population effect size. Instead of a single effect size estimate, the distribution of the population effect size, which incorporates uncertainty, is then used to compute power and determine the sample size (Du and Wang, 2016; Kruschke, 2013; McShane and Böckenholt, 2016; Pek and Park, 2019; Spiegelhalter and Freedman, 1986). Despite its strength in addressing uncertainty, the past development of the Bayesian method relied on the assumption that historical data were free from publication bias. However, building upon the work of Taylor and Muller (1996) and Anderson et al. (2017), this assumption can potentially be relaxed to adjust for publication bias. Due to the flexibility in defining the probability distribution, the Bayesian method may also overcome the challenges the TM method faced when the historical study was severely underpowered.

In the current study, I develop a Bayesian method, named the Publication-Bias-and-Uncertainty-Adjusted Bayesian (PUB) method, to address both publication bias and uncertainty. In the process of updating a belief, the PUB method incorporates historical information, as well as the fact that the information entails publication bias. The resulting distribution of the population effect size accounts for both uncertainty and publication bias and is then used to determine the sample size. In the following, I will begin with an illustrative example to demonstrate the impact of publication bias and the overarching picture of the PUB method. Next, I will provide the mathematical details of the proposed method, including the derivation of the probability distribution that addresses uncertainty and publication bias. I will then discuss a simulation study evaluating sample size planning methods in their ability to address uncertainty and bias.
Finally, I conclude with the strengths and limitations of the proposed method and provide recommendations for sample size planning.

3.1 An Illustrative Example

In this section, I provide a simulated example to illustrate the impact of publication bias and a general idea of how the PUB method addresses this issue for power analysis. Mathematical details and discussions will then follow.

Suppose a large team of 10,000 researchers aims to compare the anxiety levels of participants who receive treatment with those who do not. While unknown to the researchers, the population effect size (the standardized mean difference in anxiety levels between the treatment and control groups) is 0.4. To estimate the effect size, each of the researchers recruits and randomizes 40 participants into two groups (i.e., each group has 20 participants) and performs a two-sided independent samples t-test at a significance level of .05. As shown in Figure 1, if all the researchers report their effect size estimates, the mean effect size estimate is 0.4, which is equal to the population effect size. However, if the researchers only report statistically significant findings, the sampling distribution of the effect size becomes a truncated normal distribution, which includes only estimates with an absolute magnitude larger than 0.64. The mean of this sampling distribution becomes 0.84, more than double the population effect size.

Figure 1: Sampling Distributions of Effect Size.

Imagine a new researcher reviews the filtered study results and takes the mean of the filtered results as the population effect size to determine the sample size for a replication study. With 0.84 as the input value, the sample size formula yields 48, 62, and 76 total participants to achieve .80, .90, and .95 power, respectively. Nonetheless, because the true population effect size is 0.4, the actual power of this replication study with a total sample size of 48, 62, or 76 is only .27, .34, or .41, respectively.

In statistical analysis, the likelihood function depicts the likelihood of observing the data for various effect size values. For example, Figure 2a shows the likelihood function of observing a mean effect size estimate of 0.4 when all historical findings are reported. Without publication bias, we would most likely have observed the mean effect size estimate of 0.4 if the population effect size was 0.4, which aligns with the actual population effect size. With publication bias at the significance level of .05, although the mean effect size estimate is 0.84, the likelihood function that adjusts for the bias indicates that it is more likely to observe this estimate if the population effect size is closer to 0.4 than 0.84, as shown in Figure 2b. The likelihood function is in line with what we know from an omniscient view in this simulated example: with the presence of publication bias, we would most likely have observed an overestimated effect size of 0.84 if the population effect size is a value close to 0.4 instead.

Figure 2: Likelihood Functions With and Without Publication Bias.

To address publication bias, the proposed approach utilizes the likelihood function by Taylor and Muller (1996), as shown in Figure 2b, for power analysis. Although the population effect size is unknown, one can obtain the probability of a particular population effect size given the observed data.
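Before turning to the Bayesian machinery, the arithmetic of this illustrative example can be checked with base R. The sketch below is my own illustration rather than the dissertation's code: it computes the smallest effect size estimate that reaches significance with 20 participants per group, the per-group size a face-value power analysis would suggest from the filtered mean of 0.84, and the actual power of that design under the true effect size of 0.4. The results match the values reported above up to rounding.

```r
# A minimal check of the illustrative example in base R (variable names are mine)
alpha   <- .05
n_hist  <- 20                                 # per-group size of each historical study
nu      <- 2 * n_hist - 2                     # degrees of freedom
n_tilde <- n_hist * n_hist / (2 * n_hist)     # n1 * n2 / (n1 + n2)

# Smallest absolute effect size estimate that reaches significance (about 0.64)
qt(1 - alpha / 2, df = nu) / sqrt(n_tilde)

# Per-group size suggested when the filtered mean estimate (0.84) is taken at face value
power.t.test(delta = 0.84, power = .80, sig.level = .05)$n   # about 24 (48 in total)

# Actual power of that design when the population effect size is really 0.4
power.t.test(n = 24, delta = 0.4, sig.level = .05)$power     # about .27
```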
In Bayesian statistics, the probability distribution of the effect size given the observed data is known as the posterior distribution. Broadly speaking, the posterior distribution comprises the likelihood function and the prior distribution. One can adopt the likelihood function in Figure 2b to account for the fact that only statistically significant findings are reported, or the likelihood function in Figure 2a if all results are observed. The prior distribution reflects our prior belief about the population effect size. For example, without any particular knowledge, one may believe all effect size values are equally likely, which is known as a non-informative prior. As shown in Figure 2c, with a non-informative prior, the posterior distribution has the same shape as the likelihood function. In other words, one relies solely on the historical data to construct the probability distribution of the population effect size for power analysis.

Figure 3: Prior and Posterior Distributions.

Whereas traditional power analysis requires an input of the population effect size value, the proposed approach adopts the posterior distribution of the effect size in power analysis. In the posterior distribution, different effect size values are assigned distinct weights (i.e., probability densities), visually indicated by their respective heights in Figure 3e. For example, the effect size of 0.4 has a larger weight (greater height) than the effect size of 0.84. Correspondingly, the power value associated with an effect size of 0.4 is weighted more heavily than the power value associated with an effect size of 0.84. Since the population effect size takes on a distribution, power for a new replication study also has a distribution. For easier interpretation, we can summarize the power distribution with the mean (e.g., McShane and Böckenholt, 2016; Spiegelhalter and Freedman, 1986) and the assurance (e.g., Anderson et al., 2017; Du and Wang, 2016), as discussed in Section 2.2.3. In this example, a new study with a group size of 100 (i.e., a total sample size of 200) has a mean power of .76 and a 64.81% assurance of achieving the intended power (e.g., .80 in this example) across the specified uncertainty in the probability distribution of the effect size. In other words, because we are uncertain about the population effect size, based on the specified probability distribution, there is a .76 average probability of detecting the true effect and a 64.81% chance of achieving the intended .80 power in a new study with a group size of 100.

Although we knew the population effect size is 0.4 in this simulated example, in practice, we would only observe an overestimate such as 0.84 if publication bias exists. To address this issue, the PUB method utilizes a probability distribution, rather than a best guess of the effect size, for power analysis. The probability distribution of the effect size (a) incorporates historical information and prior knowledge, (b) accounts for the statistical filtering of historical information, if it exists, and (c) reflects our uncertainty about the population effect size. The resulting power distribution informs us about the average power that a study with a certain design will achieve across the specified probability distribution. Furthermore, we can describe the probability of a future event, such as the chance that a new study of a particular design will achieve the intended power. In the next section, I provide mathematical details about the PUB method for power analysis.
3.2 Publication-Bias-and-Uncertainty-Adjusted Bayesian Method (PUB)

Building upon the framework introduced in Section 2.2, the PUB method allows adjustment for both uncertainty and publication bias. The gist of the proposed approach for power analysis is to use the posterior distribution of the effect size, which encompasses historical information and prior knowledge about the population effect size and publication bias. As discussed, the posterior distribution is composed of the likelihood function and the prior distribution. In this section, I provide the mathematical details of how to construct these three distributions and how the probability distribution of the effect size is used for power analysis. While I focus on independent samples t-tests in the following, the method applies to a variety of statistical tests.

Table 1: Degrees of Freedom and Definition of ñ for Different t-tests.

  Test                            ν                        ñ
  One-sample t-test               N − 1                    N
  Independent-samples t-test      N − 2 = n1 + n2 − 2      n1 n2 / (n1 + n2)
  Dependent-samples t-test        N/2 − 1 = n − 1          n

Note. ν = degrees of freedom. N = the total sample size. n1 = size of group 1. n2 = size of group 2. n = size of the paired samples.

3.2.1 Likelihood Function

In an independent samples t-test, the effect size is the standardized mean difference between the two groups being compared. The population effect size is denoted as δ, and an estimate of the effect size is d. As illustrated in the simulated example, when only statistically significant findings are reported in the literature, the observed effect size estimate (d) is likely an overestimate of the population effect size (δ). For a two-sided t-test at a significance level of α, a historical study with a total sample size of N would only achieve statistical significance if its t statistic, t_s, has an absolute value larger than the critical t value, t_crit(1 − α/2; ν), where ν is the degrees of freedom according to Table 1. Correspondingly, to achieve statistical significance, its effect size estimate needs to have an absolute value larger than t_crit(1 − α/2; ν)/√ñ (Taylor and Muller, 1996), where ñ is defined in Table 1. The t statistic is a function of the effect size; in particular, δ = t √((n1 + n2)/(n1 n2)) for an independent samples t-test with group sizes of n1 and n2.

For example, with a group size of n1 = n2 = 100, a two-sided test is statistically significant if its t statistic falls beyond the interval of (−1.97, 1.97), for t_crit(.975; ν = 198) = 1.97, corresponding to an effect size estimate with an absolute value larger than approximately 0.28. If the group size of the historical study is instead n1 = n2 = 20, the effect size estimates are significant only if they fall beyond the interval of (−0.64, 0.64), which is wider than the interval in a study with a larger group size. Namely, when historical studies have smaller sample sizes and statistical filtering applies, fewer effect size estimates will fall outside the interval and be reported in the literature.

With statistical filtering, the sampling distribution of the effect size estimates is a truncated distribution where estimates falling inside of the interval are unobservable. To account for this issue, Taylor and Muller (1996) proposed the use of a likelihood function that follows from the truncated distribution for power analysis (see also Anderson et al., 2017).
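The effect size thresholds just described can be computed directly. The helper below is a sketch of my own (it is not part of BUCSS or BACpowr) that returns the minimum absolute effect size estimate needed for two-sided significance, t_crit(1 − α/2; ν)/√ñ, for an independent samples t-test.

```r
# Minimum |d| required for two-sided significance in an independent samples t-test
d_crit <- function(n1, n2, alpha = .05) {
  nu      <- n1 + n2 - 2             # degrees of freedom (Table 1)
  n_tilde <- n1 * n2 / (n1 + n2)     # n-tilde (Table 1)
  qt(1 - alpha / 2, df = nu) / sqrt(n_tilde)
}

d_crit(100, 100)   # approximately 0.28
d_crit(20, 20)     # approximately 0.64
```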
With truncation at the lower limit a and the upper limit b, the general form of the likelihood function, for any test statistic distribution (e.g., t and F distributions), is (Taylor and Muller, 1996)

\[
L(\lambda, \nu) = \frac{f(x \mid \lambda, \nu)}{1 - F(b \mid \lambda, \nu) + F(a \mid \lambda, \nu)}, \qquad (3.1)
\]

for x ≤ a or x ≥ b (i.e., for test statistics that fall outside the nonsignificance interval), where f(·) and F(·) are the probability density function and cumulative distribution function of the nontruncated distribution with a population noncentrality parameter λ and a vector of degrees of freedom ν. Specifically, for a truncated t distribution, the likelihood function is (Taylor and Muller, 1996)

\[
L(t_s; \lambda, \nu) = \frac{f_t(t_s \mid \lambda, \nu)}{1 - F_t(b \mid \lambda, \nu) + F_t(a \mid \lambda, \nu)}, \qquad (3.2)
\]

where a = t_crit(α/2; ν) and b = t_crit(1 − α/2; ν), and f_t(·) and F_t(·) are the probability density function and cumulative distribution function, respectively, of a nontruncated noncentral t distribution with a noncentrality parameter λ = δ√ñ and ν degrees of freedom. The numerator denotes the likelihood of observing t_s when all findings are reported; the denominator reflects the truncation of t statistics within the interval of statistical nonsignificance. The resulting quantity is the likelihood of observing t_s given that publication bias occurred.

The TM method, in essence, adjusts the overestimated noncentrality parameter for statistical filtering based on the likelihood function in Equation 3.2. The adjusted noncentrality parameter estimate, λ̂_A, is then treated as the population noncentrality parameter and used for power analysis. This method also allows the adjustment of different degrees of uncertainty in λ̂_A by choosing the λ̂_A associated with different percentiles of the likelihood distribution (Taylor and Muller, 1996; Anderson et al., 2017). The choice of percentile is tied to the assurance (Anderson et al., 2017) or confidence level (Taylor and Muller, 1996) of achieving the intended power. For example, there is 80% assurance of achieving the intended power if a λ̂_A at the 20th percentile is used to determine sample size. Since selecting the 50th percentile results in a λ̂_A that does not adjust for uncertainty (Anderson et al., 2017), Taylor and Muller (1996) recommended choosing the λ̂_A associated with a more conservative 5th percentile to have 95% assurance of reaching the intended power. Following the procedure in Taylor and Muller (1996), Anderson et al. (2017) proposed Bias- and Uncertainty-Corrected Sample Size (BUCSS) and systematically compared different approaches for addressing publication bias and uncertainty, including the TM method using the 50th, 20th, and 5th percentiles, in a simulation study. The simulation results showed that while the sample size requirement is larger, higher power and assurance are guaranteed using more conservative percentiles.

Instead of choosing a λ̂_A associated with a certain percentile, the PUB method adopts the entire likelihood function to construct the posterior distribution of the population effect size. Since t = d√ñ and λ = δ√ñ, Equation 3.2 can be re-expressed to indicate the likelihood of the observed effect size estimate, given various population effect size values and publication bias, as follows:

\[
L(d \mid \delta, \mathbf{n}) = \frac{f_t(d\sqrt{\tilde{n}} \mid \lambda, \nu)}{1 - F_t(b \mid \lambda, \nu) + F_t(a \mid \lambda, \nu)}, \qquad (3.3)
\]

where n is the sample size vector (e.g., n = [n1, n2] for an independent samples t-test) that yields ñ and ν according to Table 1. (Note that the likelihood function within the Bayesian framework is conditional on the population value, L(d | δ, n), whereas within the frequentist framework, because the population value is considered fixed, the likelihood function is written unconditionally as L(t; λ, ν).)
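As a concrete sketch, the likelihood in Equation 3.3 can be evaluated with base R's noncentral t functions. The function below is my own minimal implementation for an independent samples t-test with two-sided filtering at significance level alpha_p; it is not the BUCSS implementation, and the function and argument names are assumptions made for illustration. In this example the likelihood is highest for δ values well below the observed estimate of 0.84, mirroring Figure 2b.

```r
# Likelihood of observing d given delta under two-sided statistical filtering (Eq. 3.3)
lik_pub <- function(delta, d, n1, n2, alpha_p = .05) {
  nu      <- n1 + n2 - 2
  n_tilde <- n1 * n2 / (n1 + n2)
  lambda  <- delta * sqrt(n_tilde)           # noncentrality implied by delta
  t_obs   <- d * sqrt(n_tilde)               # t statistic implied by the observed d
  b       <- qt(1 - alpha_p / 2, df = nu)    # upper truncation bound
  a       <- -b                              # lower truncation bound
  dt(t_obs, df = nu, ncp = lambda) /
    (1 - pt(b, df = nu, ncp = lambda) + pt(a, df = nu, ncp = lambda))
}

# Likelihood of d = 0.84 from a historical study with 20 per group, over a grid of delta
delta_grid <- seq(-0.5, 1.5, by = .01)
plot(delta_grid, sapply(delta_grid, lik_pub, d = 0.84, n1 = 20, n2 = 20),
     type = "l", xlab = "delta", ylab = "likelihood")
```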
3.2.2 Posterior Distribution

Bayesian inference allows us to obtain what we want (the posterior probability distribution of the effect size) from what we possess (the likelihood function and the prior distribution). To adjust for publication bias and uncertainty, the goal is to use a probability distribution of the population effect size that factors in the historical information and the knowledge about the bias. The likelihoods in Equation 3.3 indicate how likely we would have observed the sample effect size estimate(s) if the population effect size is of certain values, i.e., how likely is d given δ. However, what we want to know is the probability of the population effect size given a sample effect size estimate, i.e., how probable is δ given d, which is the posterior probability. With the likelihood function, we only make probabilistic statements about the historical data, such as how likely we would have observed the data if the population effect size is of a certain value. With the posterior distribution, on the other hand, we can directly describe how probable a population effect size value is given the historical information and knowledge. With Bayes' rule (Bayes and Price, 1763), the posterior probability density function is

\[
p(\delta \mid d, \mathbf{n}) = \frac{L(d \mid \delta, \mathbf{n})\, \pi(\delta)}{\int_{-\infty}^{\infty} L(d \mid \delta, \mathbf{n})\, \pi(\delta)\, d\delta}, \qquad (3.4)
\]

where L(·) is the likelihood function defined in Equation 3.3, and π(·) is the prior probability density function for the population effect size (see Section 2.2.2 for details). The numerator is the product of the likelihood and the prior probability density for a given population effect size value, and the denominator is the integral of these products over the space of the population effect size. In essence, the posterior distribution is a combination of the likelihood function, which encompasses historical information about the population effect size and publication bias, and the prior distribution, which reflects our prior belief about the population effect size.

3.2.3 Power Analysis

As discussed in Section 2.2.3, with the effect size distribution as an input, power takes the form of a distribution and is often summarized with the mean and the assurance. Here I define mean power and assurance for power analysis that addresses publication bias and uncertainty. For a historical study with a sample size vector of n_hist and a new study with a sample size vector of n_plan, the mean power is given by

\[
E(\mathrm{Power}) = \int_{-\infty}^{\infty} p(\text{Reject } H_0 \mid \delta, \mathbf{n}_{\mathrm{plan}})\, p(\delta \mid d, \mathbf{n}_{\mathrm{hist}})\, d\delta, \qquad (3.5)
\]

where a power value p(Reject H0 | ·) for a δ value is weighted by the posterior probability density, p(δ | d, n_hist), of that δ value based on the historical information, accounting for publication bias. The mean power indicates the average power level across the specified uncertainty in the population effect size (Pek and Park, 2019). For example, if a new study with n1 = n2 = 50 has a mean power of .76, this study will achieve a .76 power averaging across the uncertainty in the population effect size specified in the posterior distribution. The assurance is given by

\[
A(\mathrm{Power}) = \int_{\delta_L}^{\infty} p(\delta \mid d, \mathbf{n}_{\mathrm{hist}})\, d\delta, \qquad (3.6)
\]

where δ_L is the lower bound of the effect size values associated with power at or higher than the intended level L, i.e., p(Reject H0 | δ, n_plan) ≥ L. A 64.81% assurance of achieving .80 power denotes that the new study with n1 = n2 = 50 will have a 64.81% chance of reaching at least the intended level of power.
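The posterior, mean power, and assurance in Equations 3.4 to 3.6 can be approximated with one-dimensional numerical integration. The sketch below assumes a flat prior over (−4, 4) and relies on the lik_pub() helper from the previous sketch; base R's integrate() stands in for the cubature package used in the dissertation, and the function names are my own. With the values from the illustrative example (d = 0.84, 20 per group, a planned group size of 100), the two integrals should land near the .76 mean power and roughly 65% assurance reported in Section 3.1, though the exact figures depend on the prior range and integration settings.

```r
# Posterior density of delta under a flat prior (Eq. 3.4), normalized numerically
make_posterior <- function(d, n_hist, alpha_p = .05) {
  joint <- function(delta)
    sapply(delta, lik_pub, d = d, n1 = n_hist, n2 = n_hist, alpha_p = alpha_p)
  k <- integrate(joint, lower = -4, upper = 4)$value   # normalizing constant
  function(delta) joint(delta) / k
}

# Power of a two-sided independent samples t-test for a given delta and per-group size
power_t <- function(delta, n_plan, alpha = .05)
  sapply(delta, function(x)
    power.t.test(n = n_plan, delta = abs(x), sig.level = alpha)$power)

post <- make_posterior(d = 0.84, n_hist = 20)

# Mean power (Eq. 3.5) for a planned per-group size of 100
integrate(function(x) power_t(x, n_plan = 100) * post(x), -4, 4)$value

# Assurance of reaching .80 power (Eq. 3.6): posterior mass above delta_L
delta_L <- uniroot(function(x) power_t(x, n_plan = 100) - .80, c(.01, 1))$root
integrate(post, lower = delta_L, upper = 4)$value
```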
3.3 Simulation Study I performed a Monte Carlo simulation study to systematically evaluate the performance of the PUB method in addressing publication bias for power analysis and compared it with the methods discussed in the literature. As outlined below, the present simulation design followed the design in Anderson et al. (2017). Suppose that a historical study was analyzed and its finding was published only if it was statistically significant. A researcher reviewed the published historical study and planned the sample size for a new study based on the published result with various methods that either ignore or address publication bias and uncertainty, including the traditional method, the Bayesian methods, and the TM method (Taylor and Muller, 1996). This simulation study addresses the following questions: • What are the mean power and assurance of the replication studies if the uncertainty and publication bias are ignored in sample size planning? • Do the PUB methods with an intended .80 mean power indeed achieve .80 mean power? What is the mean power of the naive Bayesian (NAB) method that ignores publication bias? • Do the PUB methods with an intended 80% and 95% assurance indeed achieve 80% and 95% assurance, respectively? How do the proposed methods perform compared to the NAB method and TM method with an intended 80% and 95% assurance? 3.3.1 Design Factors The current simulation study has a 3 × 3 design with two design factors manipulated: the population effect size (δ) and the group size of the historical study (nhist). The population effect size (δ) was set to be 0.2, 0.5, or 0.8, indicating a small, medium, or large effect size, respectively, according to Cohen’s guidelines (Cohen, 1988). Following the simulation design in Anderson and Maxwell (2017), I chose the group size of the historical study (nhist) 31 to be 20, 40, and 80, which are commonly seen in psychological studies as reviewed in Anderson and Maxwell (2017). 3.3.2 Procedure Per iteration, I generated data for a historical study by drawing two samples, each of a size of nhist, from N(0, 1) and N(δ, 1) respectively, where δ is the standardized mean difference. The data were then analyzed with an independent samples t-test. To recreate the scenario of publication bias, I filtered out statistically nonsignificant findings and resampled until a simulated historical study achieved statistical significance at the .05 level. The resulting effect size estimate (d) was then used to perform power analysis and plan sample size for a replication study (nplan). Power analysis was performed with (a) the traditional method, (b) the NAB method, (c) the PUB method, and (d) the TM method. Different methods allow sample size planning with different goals. With the traditional method, I chose .80 as the intended power. With the NAB and PUB methods, I planned a sample size that achieved either a .80 mean power or an 80% assurance. With the TM method, the sample size was determined to achieve an 80% assurance. Details of each of the methods are provided in the following. I recorded the suggested group size by each method and computed the actual power of such a group size based on the population effect size of the corresponding condition. For example, in the condition with a population effect size of 0.2, if a method suggested a group size of 253, the resulting actual power is .61 given δ = 0.2. Across the 5,000 iterations, I summarized the mean of the actual power and the assurance of achieving .80 power. 
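To make the data-generation step concrete, the sketch below (my own reconstruction, not the dissertation's simulation script) produces one "published" historical study: data are drawn for two groups, nonsignificant results are discarded, and sampling is repeated until significance is reached, after which the standardized mean difference is retained for the sample size planning step.

```r
# One iteration of the publication-bias filter: resample until p < alpha
simulate_historical <- function(delta, n_hist, alpha = .05) {
  repeat {
    y0 <- rnorm(n_hist, mean = 0,     sd = 1)   # control group
    y1 <- rnorm(n_hist, mean = delta, sd = 1)   # treatment group
    if (t.test(y1, y0, var.equal = TRUE)$p.value < alpha) break
  }
  s_pooled <- sqrt(((n_hist - 1) * var(y1) + (n_hist - 1) * var(y0)) /
                     (2 * n_hist - 2))
  (mean(y1) - mean(y0)) / s_pooled              # published effect size estimate, d
}

set.seed(1)
d_obs <- simulate_historical(delta = 0.2, n_hist = 20)
d_obs   # typically far above 0.2, because only significant results survive the filter
```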
Traditional Method The traditional method, also named as ”effect size at face value” method (Anderson et al., 2017), takes the effect size estimate reported in a historical study as the population effect size for power analysis. I chose this method for comparison because it has been commonly used to justify sample size planning (Anderson et al., 2017). However, this method ignores the fact that the effect size estimate in the historical study is only an estimate of the population effect size and contains uncertainty. Moreover, this method does not allow adjustment for publication bias. To perform power analysis with the traditional method, I input the effect 32 size estimate of a simulated historical study to the power.t.test() function in the stats R package (R Core Team, 2023) and plan sample size that achieves the intended power of .80. The Naive Bayesian Method The Bayesian method introduced in Chapter II accounts for the uncertainty in the population effect size but not the potential publication biases in the effect size estimate of a historical study. For brevity, I refer to this method as the NAB method to distinguish it from the PUB method. The procedure of the NAB method is similar to the proposed method, except that the likelihood function in Equation 3.3 does not account for the truncation of effect size estimates due to publication bias and assumes a normal distribution for the effect size. The prior distribution is also chosen to be noninformative, and hence the posterior distribution of the effect size is also a normal distribution, having the same shape as the likelihood function. To perform power analysis with the NAB method, I took the effect size estimate and standard error of the estimate, SE(d), from the generated historical study and constructed the distribution of the population effect size with a normal distribution, δ ∼ N(d, SE[d]). With this distribution of effect size, I computed the mean power using numerical integration with the cubature R package (Narasimhan et al., 2023) and determined sample sizes that achieve .80 mean power, 80% assurance, and 95% assurance. Publication-Bias-and-Uncertainty-Adjusted Bayesian Method With the PUB method, I followed Equation 3.3 to derive the likelihood function, Equation 3.4 to obtain the posterior distribution, Equation 3.5 to compute mean power, and Equation 3.6 to compute assurance. Two priors were selected to determine sample size: (a) a noninformative uniform prior, Unif(−∞,∞) and (b) a weakly informative normal prior, N(0.1, 1). As discussed, the noninformative prior indicates no prior belief about the population effect size and the resulting posterior distribution has the same shape as the likelihood function. The weakly informative normal prior represents a belief that the population effect is conservatively small at 0.1, which is smaller than the population effect size in all conditions, but with a large degree of uncertainty in this belief. In the following, I refer to this prior as a ”weak and conservative” prior. The resulting posterior distribution still heavily depends on 33 the likelihood function but is slightly more condensed around 0.1. The mean power and assurance were computed using numerical integration with the cubature R package, and sample sizes were determined with a goal to achieve .80 mean power, 80% assurance, and 95% assurance. 
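A sketch of the face-value and NAB calculations described above is given below. It reuses power_t() from the sketch in Section 3.2.3 and treats d_obs as a published (filtered) estimate from a simulated historical study with 20 per group; the coarse grid search for the smallest adequate group size is my own simplification of the sample size determination step.

```r
d_obs <- 0.74    # e.g., a filtered estimate like the one produced by the earlier sketch

# Traditional method: take the estimate at face value
power.t.test(delta = d_obs, power = .80, sig.level = .05)$n

# NAB method: delta ~ N(d, SE[d]), ignoring the truncation due to publication bias
se_d <- function(d, n1, n2) sqrt((n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)))

nab_mean_power <- function(n_plan, d, n_hist, alpha = .05) {
  s <- se_d(d, n_hist, n_hist)
  integrate(function(x) power_t(x, n_plan, alpha) * dnorm(x, mean = d, sd = s),
            lower = d - 6 * s, upper = d + 6 * s)$value
}

# Smallest per-group size whose mean power reaches .80 (coarse grid search)
n_grid <- seq(5, 400, by = 5)
n_grid[which(sapply(n_grid, nab_mean_power, d = d_obs, n_hist = 20) >= .80)[1]]
```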
The TM Method As discussed in Section 3.2.1, the TM method is a likelihood-based method that adjusts the noncentrality parameter estimate based on the likelihood function in Equation 3.2. Choosing an adjusted noncentrality parameter estimate (λˆ A) associated with the 20th and 5th percentile corresponds to a goal of achieving 80% and 95% assurance, respectively (Anderson et al., 2017). The TM method is designed to determine sample size that achieves a certain level of assurance but not mean power. Therefore, I did not compare this method with the methods with an intended mean power of .80. In reference to the simulation design in Anderson et al. (2017), I determined sample size with the TM method using the 20th and 5th percentiles and compared it with the methods with an intended assurance of 80% and 95%, respectively, using the BUCSS R package (Anderson and Kelley, 2020). Note that Anderson et al. (2017) discussed a difficulty with the TM method, where the resulted λˆ A could have a value of zero, yielding an infinitely large sample size requirement. This difficulty occurs often when the simulated history study is severely underpowered. BUCSS returns an error message when this happens. To circumvent this issue, Anderson et al. (2017) adopted the following procedures, which gradually relax the power analysis goal. For the TM method with 20th percentile (i.e., intended assurance of 80%), if λˆ A was zero, they first reduced the intended assurance to 50%, and if λˆ A was still zero, they further increased the significance level for statistical filtering (αP) from .05 to .10. For the TM method with 5th percentile (i.e., intended assurance of 95%), if λˆ A was zero, they reduced the intended assurance to 80%, then to 50%, and if λˆ A was still zero, they further increased αP from .05 to .10. I replicated these procedures in the simulation study when planning sample size with TM method and recorded the number of iterations in the simulation study when λˆ A was zero and the procedures were applied. 34 Table 2: Summary Statistics of the Estimated Effect Sizes. nhist δ Mean Median Minimum 0.2 0.71 0.75 0.64 0.5 0.87 0.82 0.64 20 0.8 0.99 0.93 0.64 0.2 0.55 0.54 0.45 0.5 0.65 0.62 0.45 40 0.8 0.83 0.82 0.45 0.2 0.41 0.39 0.31 0.5 0.54 0.52 0.31 80 0.8 0.81 0.80 0.32 Note. nhist = the group size of the historical study. δ = population effect size. 3.3.3 Results Table 2 shows the mean and median effect size estimates and the minimum absolute effect size estimates in this simulation study. The impact of the publication bias was strongest in the condition with a small population effect size and a small size of the historical study (δ = 0.2, nhist = 20), where the minimum effect size estimate was much higher than the population effect size. Conditions with a larger population effect size and a larger size of the historical study had a weaker impact of publication bias, where the mean effect size estimates were closer to the population effect size. The condition that had the weakest publication bias was when δ = 0.8 and nhist = 80, where the mean effect size was an accurate estimate of the population effect size. When Uncertainty and Publication Bias is Ignored As shown in Table 3, the mean power of the traditional method, which ignores uncertainty and publication bias, had a mean power below .80 across all conditions. 
The mean power was as low as .11 when the population effect size was small and the size of the historical study was small (δ = 0.2, nhist = 20) and increased to close to .80 as the population effect size and the size of the historical study increased. The assurance of attaining .80 power for the traditional method was as low as 0%, and increased to 50% as the population effect size increased and the size of the historical study increased (Table 4), which is consistent with 35 Table 3: Mean Power of the Traditional and Bayesian Methods. nhist δ Traditional NAB PUB-U PUB-W 0.2 .11 .14 .58 .66 0.5 .39 .51 .84 .89 20 0.8 .65 .74 .86 .91 0.2 .18 .25 .75 .77 0.5 .60 .71 .87 .89 40 0.8 .76 .81 .86 .88 0.2 .30 .42 .83 .84 0.5 .74 .80 .86 .87 80 0.8 .79 .81 .82 .84 Note. nhist = the group size of the historical study. δ = population effect size. NAB = the NAB method. PUB-U = the PUB method with a uniform prior. PUB-W = the PUB method with a weak prior. the literature (e.g., Anderson et al., 2017; Du and Wang, 2016). Methods With an Intended .80 Mean Power The NAB method, which accounts for the uncertainty but not publication bias, had a mean power lower than .80 in most conditions, except when the population effect size was large and the historical study was not small (Table 3). Contrarily, the PUB methods achieved more than .80 mean power in most conditions, except when δ = 0.2 and nhist = 20, 40. In these conditions, although failing to achieve .80 mean power, the PUB method attained a higher mean power (e.g., .75) than the NAB method (e.g., .25). the PUB method with a weak and conservative prior achieved slightly higher mean power (e.g., .77) than the Bayesian method with a noninformative prior. Methods With an Intended 80% Assurance Similarly, the NAB method had an assurance below 80% in all conditions but close to 80% when the population effect size was large (Table 4). On the other hand, the PUB method achieved 80% assurance in most conditions, again except when the population effect size was small and the size of the historical study was small. Although failing to achieve the intended level of assurance when δ = 0.2 and nhist was small, the PUB method with the weak 36 Table 4: Actual Assurance (%) of the Traditional Method, and the Bayesian and TM Methods with an Intended 80% Assurance. nhist δ Traditional NAB PUB-U PUB-W TM 20th 0.2 0 0 62 71 26 0.5 0 35 83 89 72 20 0.8 27 67 83 90 80 0.2 0 0 76 79 42 0.5 16 64 85 88 80 40 0.8 46 77 82 87 82 0.2 0 15 83 84 63 0.5 44 76 82 84 80 80 0.8 50 78 80 83 81 Note. nhist = the group size of the historical study. δ = population effect size. NAB = the NAB method. PUB-U = the PUB method with a uniform prior. PUB-W = the PUB method with a weak prior. TM 20th = Taylor and Muller’s (1996) 20th percentile method. and conservative prior yielded the highest assurance (71.08% when nhist = 20 and close to 80% when nhist = 40) among all methods. The TM method generally had a slightly lower assurance than the PUB method. In particular, when δ = 0.2 and nhist = 20, the assurance was 25.66% for the TM method, consistent with the finding in Anderson et al. (2017) and Anderson and Maxwell (2017). All methods performed similarly when the population effect size was large and the size of the historical study was relatively large. Methods With an Intended 95% Assurance The result patterns for the methods with an intended 95% assurance were similar to those for the methods with an intended 80% assurance (Table 5). 
The NAB method, again, did not achieve the intended 95% assurance. the PUB method with the noninformative prior achieved 95% assurance in almost all conditions except for one (δ = 0.2 and nhist = 20). With the weak and conservative prior, the PUB method attained 95% assurance in all conditions. The TM method, however, achieved 95% or close to 95% assurance only when the population effect size was medium to large. 37 Table 5: Actual Assurance (%) of the Traditional Method, and the Bayesian and TM Methods with an Intended 95% Assurance. nhist δ NAB PUB-U PUB-W TM 5th 0.2 26 94 96 29 0.5 77 95 98 86 20 0.8 87 92 98 94 0.2 57 96 97 48 0.5 89 96 97 95 40 0.8 92 95 97 96 0.2 77 96 97 73 0.5 92 95 96 94 80 0.8 92 94 95 95 Note. nhist = the group size of the historical study. δ = population effect size. NAB = the NAB method. PUB-U = the PUB method with a uniform prior. PUB-W = the PUB method with a weak prior. TM 5th = Taylor and Muller’s (1996) 5th percentile method. Suggested Group Size Figure 4 presents the distributions of group size suggested by each method. The horizontal line denotes the group size required to achieve .80 power given the population effect size was 0.2, 0.5, or 0.8. Therefore, in this figure, assurance can also be considered as the number of times when the suggested group size is larger than the required group size to achieve .80 power. The traditional method had the lowest suggested group size among all methods across conditions. In the conditions with a strong impact of publication bias (e.g., δ = 0.2, nhist = 20), all suggested group sizes were below the group size needed to achieve .80 power, explaining the 0% assurance by the traditional method in those conditions. More suggested group sizes by the traditional method reached the required group size to achieve .80 power as the population effect size and the size of the historical study increased. In the condition with the weakest impact of the publication bias (δ = 0.8, nhist = 80), about 50% of the suggested group sizes were at or above the required group size, indicating a 50% assurance, consistent with the observation before. 38 Figure 4: Distributions of Suggested Group Sizes Across Methods. 39 The NAB method had the second lowest suggested group size among all methods across conditions. Although higher than the traditional method, the NAB method suggested a group size lower than the required group size to achieve .80 power in the conditions with a strong publication bias. Like the traditional method, the NAB method suggested a group size that reached the required size as δ and nhist increased; unlike the traditional method, the NAB method suggested a group size higher than the required size about 80% of the time in the condition with small publication bias. Overall, the Bayesian methods with uniform and weak priors performed similarly. However, with the weak prior, the Bayesian method suggested larger group sizes and attained higher assurance. Particularly in the condition with δ = 0.2 or nhist = 20, where all methods failed to achieve 80% assurance, using a weak and conservative prior yielded more group sizes beyond the required size for .80 power. Notice that the weak prior had a stronger effect and suggested a larger group size in the conditions with a small size of the historical studies. When the historical studies had a larger group size, using either the uniform or weak prior yielded similar results. Across conditions, the TM method suggested smaller group sizes than the PUB methods. 
In the conditions with δ = 0.2, the TM method had an assurance lower than the intended level as few group sizes reached beyond the required size for .80 power. By contrast, in the conditions with a larger population effect size, the TM method appeared to require a smaller group size than the Bayesian methods to achieve 80% assurance. It is important to note that, with the TM method, the power analysis goal was relaxed (e.g., from aiming at 80% assurance to 50% assurance) if this method suggested an infinitely large group size. When this situation happened, the comparison of the suggested group size was unfair because the power analysis goal differed between the TM method and the Bayesian methods. The next section provides a deeper investigation. A Closer Comparison Between the TM and Bayesian Methods As discussed before, I adopted the procedure in Anderson et al. (2017) that gradually relaxed the power analysis goal when the TM method resulted in an adjusted noncentrality parameter λˆ A = 0 and an infinitely large group size. Here I focus the discussion on the methods with an intended 80% assurance, which can be generalized to the methods with an intended 40 95% assurance. Recall that if the goal was to achieve 80% assurance but λˆ A = 0, the power analysis goal was first relaxed to 50% assurance, adjusting for statistical filtering at .05 significance level. If λˆ A was still 0, the goal was further relaxed to 50% assurance, adjusting for statistical filtering at .10 significance level. The first investigation is on when the adjusted noncentrality parameter becomes zero. To correct for the overestimated noncentrality parameter due to publication bias, the TM method adjusts the estimate downward using Equation 3.2. Essentially, if the estimated effect size was small, the estimated noncentrality parameter would also be small, and a downward adjustment on the estimate could lead to a value of zero. Table 6 summarizes when and how many times the power analysis goal was relaxed over the 5,000 iterations. Across conditions, when the p-value of the historical study was larger than about .01, the power analysis goal was relaxed from 80% assurance to 50% assurance. When the p-value of the historical study was larger than about .025, the goal was further relaxed from adjusting for filtering at .05 level to .10 level. The second investigation is on the suggested group size by the TM method for different power analysis goals. As shown in Figure 5, the suggested group size of the TM method grew exponentially from p = 0 and approached a vertical asymptote at about p = .01. Per relaxing the power analysis goal from aiming at 80% assurance to 50% assurance, the suggested group size dropped and grew exponentially until reaching another asymptote at about p = .025. Per further relaxing the goal to adjusting for filtering at αP = .10 instead of αP = .05, the suggested group size dropped and grew exponentially again until reaching the third asymptote at p = .05. This figure shows that with the TM method, each power analysis goal has its own limit, at which the suggested group size approaches infinity. The final investigation is to revisit the distributions of the suggested group size of the TM and Bayesian methods. In Figure 6, the boxplots are now split into three phases—when the TM method aimed to achieve (a) 80% assurance adjusting filtering at .05 level (p ≤ .01), (b) 50% assurance adjusting filtering at .05 level (.01 < p ≤ .025), and (c) 50% assurance adjusting filtering at .10 level (.025 < p ≤ .05). 
When p ≤ .01 where both methods aimed to achieve 80% assurance, the two methods had similar median suggested group sizes, with the TM method showing a larger variability. When p > .01 where the TM method had less 41 Table 6: Number of Times a Sample Size Planning Goal was Used With the TM Method Across 5,000 Iterations 80% Assurance (.05 sig.) 50% Assurance (.05 sig.) 50% Assurance (.10 sig.) nhist δ N iterations d p N iterations d p N iterations d p 20 1288 2.01 .011 1571 0.86 .026 2141 0.74 .050 40 1577 1.16 .010 1543 0.59 .025 1880 0.51 .050 0.2 80 1916 0.85 .010 1433 0.41 .025 1651 0.36 .050 20 2136 2.01 .011 1402 0.86 .026 1462 0.74 .050 40 2931 1.32 .010 1135 0.59 .025 934 0.51 .050 0.5 80 4045 1.12 .010 549 0.41 .025 406 0.36 .050 20 3108 2.21 .011 1063 0.86 .026 829 0.74 .050 40 4382 1.62 .010 381 0.59 .025 237 0.51 .050 0.8 80 4965 1.43 .010 24 0.41 .025 11 0.36 .047 Note. nhist = the group size of the historical study. δ = population effect size. 80% Assurance (.05 sig.) = the goal of achieving 80% assurance, adjusting the statistical filtering at the .05 significance level. 50% Assurance (.05 sig.) = the goal of achieving 50% assurance, adjusting the statistical filtering at the .05 significance level. 50% Assurance (.10 sig.) = the goal of achieving 50% assurance, adjusting the statistical filtering at the .10 significance level. N iterations = number of iterations that a goal was used. d = the maximum effect size estimate in the iterations that a goal was used. p = the maximum p-value in the iterations that a goal was used. 42 Figure 5: Relationship Between Suggested Group Sizes and p-values. 43 stringent power analysis goals than the Bayesian method, the TM method had smaller median suggested group sizes than the Bayesian method. By relaxing the power analysis goals, the TM method suggested smaller group sizes to achieve the initial goal (i.e., 80% assurance) in the conditions with a larger population effect size and a large size of the historical study. However, this behavior could lead to a smaller than required group size to achieve the initial goal in the conditions with a small population effect size or a small size of the historical study. 3.4 Discussion 3.4.1 The Effect of Publication Bias Due to publication bias, both the traditional and the NAB methods, which ignore the bias, resulted in power, mean power, or assurance lower than the intended level. Notice that when the population effect size or the size of the historical study increased, the mean power or assurance of these methods increased. In these conditions where the effect of the publication bias was weaker, more generated effect size estimates were large enough to be detected as statistically significant, or more studies were large enough to detect smaller generated effect size estimates. Therefore, these methods were able to reach mean power or assurance closer to the intended level in these conditions. 3.4.2 The Effect of Uncertainty In this simulation study, the only method that ignored uncertainty was the traditional method, which resulted in the lowest mean power among all compared methods. In the conditions with a weak effect of the publication bias, the assurance of the traditional method increased to nearly 50%. This observation is consistent with the literature (e.g., Anderson et al., 2017)— even without publication bias, the method that ignores uncertainty reaches 50% assurance, indicating that about 50% of the time the traditional method yielded a power at or higher than .80. 
In such conditions, half of the time the effect size estimates were below the population effect size, leading to a group size with power smaller than the intended level, whereas the other half of the time the estimates were above the population effect size, resulting in a group size with power larger than the intended level. In the conditions with a strong effect of publication bias, compounded with the effect of uncertainty, the traditional method had an assurance of as low as 0%. 44 Figure 6: Distributions of Suggested Group Sizes by Power Analysis Goals of the TM Method. 45 3.4.3 Suggested Group Size and P-Value As shown in Figure 5, the Bayesian method increasingly overestimated the required group size as the p-value in the historical study increased. The implication is that, for larger pvalues (i.e., smaller estimated effect size), the Bayesian method tended to over-adjust for the bias and uncertainty and overestimate the group size needed. The TM method shared a similar pattern, where the adjustment was stronger and the suggested group size was larger for larger p-values in the historical studies. The implemented procedure for the TM method was to gradually relax the power analysis goal when the method hit its limit at certain pvalues. With this procedure, the TM method generally had a lower median group size than the Bayesian method, which can be beneficial in conditions with a larger population effect size and larger size of the historical study but otherwise detrimental to the power of the new studies. At this point, one may wonder whether it is advised to relax the power analysis goal for larger p-values in the historical studies, given that the Bayesian method tended to over-adjust and overestimate the required group size for larger p-values. The answer is no for three reasons. First, given the population effect size is unknown in practice, whether or not the methods over-adjust is also unknown. For example, when we observed a p-value of .01 in a historical study, the Bayesian method suggested a group size beyond the required group size if δ = 0.5 but below the required group size if δ = 0.2. Second, assuming that we have some degree of certainty that the Bayesian method over-adjusts (e.g., nhist = 80 and the historical p-value was larger than .01), it is unknown how much to relax while securing the initial goal (e.g., 80% assurance). Third, relaxing the power analysis goals hinders the interpretation of the result. It is important to reiterate that the Bayesian method allows us to describe the probability of achieving the intended power for a given group size. If a goal is relaxed from 80% assurance to 60% assurance, given the posterior distribution of the effect size remains the same, the resulting group size is expected to achieve only 60% assurance but not the intended 80%. Therefore, it is unadvised to relax the goal depending on the observed p-value using the PUB method. 46 3.4.4 The Use of Stronger Priors In this simulation study, the weakly informative prior was selected to be a normal prior with a mean of 0.1 and a standard deviation of 1. This weak prior is considered more conservative than the uniform prior because the population effect size was set to be at least 0.2. As can be seen in the simulation results, the mean power and assurance were higher with this weak and conservative prior than with the noninformative prior. In the condition where δ = 0.2 and nhist = 20, all methods failed to achieve the intended mean power and assurance. 
Extending the observation where a more informative but conservative prior yielded higher mean power and assurance, I reran the simulation study for the δ = 0.2 and nhist = 20 condition with a stronger normal prior, δ ∼ N(0.1, 0.5), which has a smaller standard deviation. With this stronger prior, the PUB method had .80 mean power when the intended mean power was .80, 84.72% when the intended assurance was 80%, and 98.88% when the intended assurance was 95%. The corresponding mean suggested sample sizes were 546, 1126, and 8599 for an intended mean power of .80, an intended assurance of 80%, and an intended assurance of 95%, respectively. 3.5 General Recommendations If one suspects publication bias exists in the historical information, I advise against using the traditional method and the NAB method. Both methods ignore the bias and can largely underestimate the required group size, particularly when the population effect size is small to medium. That being said, both methods perform fairly well when the population effect size is large (i.e., weak publication bias) and the size of the historical study is large (i.e., small uncertainty). In these conditions, the traditional method achieves a mean power close to the intended level of power, and the NAB method reaches the intended level of mean power and assurance. The PUB method and the TM method are recommended to address publication bias and uncertainty. There are two circumstances I suggest the Bayesian method over the TM method. First, when the TM method approaches its limit and results in infinitely large group size, instead of relaxing the power analysis goal, I recommend the use of the Bayesian method, which determines the required group size for the intended goal. Second, if one believes 47 the population effect size is small, the Bayesian method, particularly with a stronger and conservative prior, will yield a mean power and assurance closer to the intended level. 3.6 Limitations and Future Directions A common criticism of Bayesian methods is the subjectivity in the use of prior distribution. However, it is important to recognize that traditional power analysis practices, similarly, require subjective judgments when selecting the best-educated guess about the population effect size. In both cases, researchers need to carefully consider and justify their choices. The present simulation study investigated the use of different priors. Based on the findings, my general recommendation is to use a non-informative or weakly informative prior distribution if there is limited prior knowledge about the population effect size. With a non-informative or weakly informative prior, the results were similar because the posterior distribution relied heavily on the information from a historical study and was relatively unaffected by the choice of priors. In the context of power analysis, I generally discourage the use of ”strong and optimistic” priors that have a large mean effect size and a small standard deviation, unless there are strong justifications. However, the use of relatively strong but conservative priors may be beneficial in circumstances with a strong publication bias and uncertainty, as found in the simulation study. The present simulation study focuses on addressing bias and uncertainty in information from one historical study. The uncertainty in the effect size estimate is represented by the withinstudy variability. 
When there are multiple studies, such as from a meta-analysis, the Bayesian method requires further development to incorporate the uncertainty due to between-study variability. Du and Wang (2016) proposed a Bayesian procedure to incorporate uncertainty in effect size estimates from a meta-analysis. Similar to the PUB method, this Bayesian procedure also adopts the posterior distribution derived based on historical information for power analysis. On the other hand, this procedure uses a Bayesian meta-analytical technique to obtain the posterior distribution of the effect size, which accounts for both within- and between-study variabilities (Du and Wang, 2016; Pek and Park, 2019). Despite its strength in addressing different sources of uncertainty, this procedure has not been developed to account for publication bias (Du and Wang, 2016). An extension of the Bayesian procedure or the PUB method will be useful to address both uncertainty and bias in incorporating information 48 from multiple studies for power analysis. The derivation of the likelihood distribution requires an assumption that all nonsignificant findings are filtered out from the literature. Practically, this assumption may not hold given the rising awareness of publication bias and the growing acceptance of publishing nonsignificant results, such as in registered reports (Nosek and Lakens, 2014). In such cases, a proportion of nonsignificant findings are published while others remain unobserved. With both the TM and Bayesian methods, we can relax the assumption by adjusting statistical filtering at a higher level of significance (e.g., relaxing from αP = .05 to .10). However, this assumption does not nicely translate to that only a proportion of nonsignificant findings are unpublished, but rather, it indicates all findings with a p-value less than .10 are trimmed out. With this stringent assumption, both methods provide more conservative results. In other words, these two methods would achieve at or above the intended level of mean power and assurance if statistical filtering applies to only a portion but not all nonsignificant findings. In theory, a simulation procedure that rejects nonsignificant findings at a probability (e.g., 60%) may permit the estimation of the likelihood for various effect size values given an effect size estimate, but this methodology needs further investigation. 3.7 Conclusion This chapter focuses on a question about how to incorporate our prior knowledge about publication bias into power analysis. The PUB methods allow researchers to construct a likelihood function based on historical results, factoring in the truncation in statistically nonsignificant effect sizes. By combining this with the prior belief about the population effect size of a new study, researchers can establish a posterior distribution to describe the probability of various population effect sizes, thereby adjusting for uncertainty and bias. Compared to a point estimate, the posterior distribution retains richer information about the population effect size from historical findings. Based on the posterior distribution, the Bayesian method determines a sample size that achieves the intended level of mean power or assurance. 
While aiming at a more stringent power analysis goal may pose challenges due to budget constraints and practical considerations, the Bayesian method provides insights into the probability of the outcomes in the sample size planning, such as how probable a study with a given sample size achieves the intended level of power. By understanding these properties of sample 49 size, researchers can make more informed decisions on choosing a study design that achieves adequate power. 50 CHAPTER IV SYNTHESIZING INTRACLASS CORRELATION ESTIMATES Meta-analysis is a statistical technique for synthesizing results from a set of historical studies to understand the overall findings of a body of research. This technique can also provide useful information for planning future studies. Sample size planning requires the input of true parameter values. For designing multilevel studies, these parameters include the effect of interest and intraclass correlation (ICC), which measures the strength of association among units within a cluster (e.g., students within a school). While the true parameter values are often unknown, researchers often draw insight from previous research findings to make educated guesses. Although study findings may vary from multiple sources, meta-analysis allows estimation of the overall effect and potential generalization of aggregated findings to new studies (Hall and Rosenthal, 2018). The estimates from meta-analysis are particularly useful and informative for planning future studies. One often overlooked aspect of power analysis is the uncertainty associated with the parameter estimate taken from historical studies or meta-analyses. Historical findings with a similar research question can vary between studies due to differences in study designs and population characteristics. Furthermore, even with the same design and characteristics, studies may produce diverse parameter estimates due to sampling variability. To address these variabilities, random-effects meta-analysis (RMA) serves as a meta-analytical method for aggregating parameter estimates. In the context of power analysis for single-level studies, Du and Wang (2016) proposed a Bayesian procedure to incorporate the uncertainty in the overall effect size estimate using a Bayesian RMA (BRMA). The idea is to utilize the effect size distribution that accounts for the uncertainty due to between-study and within-study variabilities in power analysis. For multilevel studies, a natural extension of the Bayesian procedure by Du and Wang (2016) involves aggregating ICC estimates with BRMA to obtain an ICC distribution that incorporates between-study and within-study variabilities for power analysis. However, two features of ICC may impede the application of RMA. First, the distribution of ICC is typically 51 nonnormal and positively skewed (Bhat and Beretvas, 2022; Hedberg and Hedges, 2014). Second, ICC is bounded between [0, 1]. Given the focus on pooling effect size estimates in the meta-analysis literature, standard RMA (SRMA) assumes the estimates to be unbounded and follows approximately a normal distribution. Research has shown that nonnormality at the between-study level can lead to biased overall effect size and between-study variance estimates (Rubio-Aparicio et al., 2018; Blazquez-Rinc ´ on et al., ´ 2023). 
Whereas past simulation studies have evaluated nonnormality at the between-study level (Kontopantelis and Reeves, 2012a; Kontopantelis and Reeves, 2012b; Rubio-Aparicio et al., 2018; Blazquez-Rinc ´ on et al., ´ 2023), there has been limited investigation into nonnormality at the within-study level in the meta-analysis literature. For example, Kontopantelis and Reeves (2012a) performed simulation studies exploring different forms of true effect size distributions at the between-study level (e.g., skew-normal) while assuming that normally distributed within-study errors. Similarly, in the simulation studies, Rubio-Aparicio et al. (2018) and Blazquez-Rinc ´ on et al. ( ´ 2023) manipulated the skewness and kurtosis of the between-study distributions of standardized mean differences but simulated within-study observations from normal distributions. Nonetheless, due to the bounded nature of ICC, the within-study sampling distribution is likely skewed, particularly when the true ICC is close to the boundaries at zero and one. Therefore, exploring the effects of nonnormality at the within-study level is crucial for synthesizing ICC estimates. An alternative method to model ICC estimates is BRMA, which offers flexibility in selecting distributions at the between-study and within-study levels. While Du and Wang (2016) employed normal distributions at both levels, alternative distributions that accommodate the features of ICC can be explored. With a focus on effect size estimates, methodologists have recommended using skewed distributions or t distributions to handle outliers at the betweenstudy level (Baker and Jackson, 2008; Lee and Thompson, 2008; Beath, 2014). However, since the theoretical limits of ICC are [0, 1], t distributions may not be the most suitable choice. In the context of sample size planning, studies have proposed using Beta distributions (Singh and Mukhopadhyay, 2016; Spiegelhalter, 2001; Sarkodie et al., 2023) and normal distributions restricted within the range of [0, 1] (Moerbeek and Teerenstra, 2015; Sarkodie et al., 2023) for the distribution of ICC. Therefore, to extend the Bayesian power analysis procedure, a potential direction involves using Beta distributions or normal distri52 butions with a restricted range in a BRMA to synthesize ICC estimates. In this chapter, I introduce an extension of the Bayesian power analysis procedure to metaanalyze ICC estimates and incorporate the ICC distribution into power analysis. The following sections provide an overview of RMA and the Bayesian power analysis procedure that adopts RMA to establish effect size distributions for power analysis of single-level studies. After a review of power analysis for multilevel studies, I discuss existing RMA methods, including the standard RMA (SRMA) and robust variance estimation (RVE), and outline the proposed BRMA models for pooling ICC estimates. Subsequently, I present the simulation study that assesses the performance of the meta-analytic methods for synthesizing ICC estimates. Based on the simulation results, I recommend methods for pooling ICC estimates and conclude the chapter with an illustration of the Bayesian power analysis procedure for designing multilevel studies. 4.1 Random-Effects Meta-Analysis (RMA) As meta-analysis is commonly used to aggregate effect size estimates from multiple studies, I begin by providing an introduction to this method for estimating the overall effect size. 
Across studies, effect size estimates may vary drastically due to within-study sampling variability and between-study variability. RMA, also known as the random-effects model, is a meta-analytic technique that estimates the overall effect size, as well as the sampling variability in the effect size estimates and the heterogeneity in the true effect size values across studies. RMA models effect size estimates at the within-study and between-study levels. Due to study-specific properties, RMA allows the true effect size values, δ_k, to vary across K studies. At the between-study level, the SRMA assumes that the true effect size is a random variable that follows a normal distribution with a variance of τ², referred to as the between-study variance. At the within-study level, the SRMA assumes that the effect size estimate of a study, d_k, also follows a normal distribution, with a variance of σ²_k, denoted as the within-study variance. To pool effect size estimates, a general RMA is given by (DerSimonian and Laird, 1986)

Within-Level: d_k = δ_k + e_k,  e_k ~ N(0, σ²_k)
Between-Level: δ_k = μ_δ + u_k,  u_k ~ N(0, τ²)   (4.1)

where μ_δ is the overall effect size and, for the kth study, δ_k is the true effect size, d_k is the effect size estimate, u_k is the random effect that signifies the deviation from the overall effect size, and e_k is the sampling error. For example, for independent-samples t-tests, the effect size estimate (i.e., the standardized mean difference estimate) for the kth study is d_k = (x̄_1k − x̄_2k) / s_k, where x̄_1k and x̄_2k are the means of the two groups and s_k is the pooled standard deviation. The within-study variance (σ²_k) is estimated by the squared standard error, SE(d_k)² = (n_1k + n_2k) / (n_1k n_2k) + d_k² / [2(n_1k + n_2k)], where n_1k and n_2k are the sample sizes of the two groups.

One of the most popular estimation methods for τ² is restricted maximum likelihood (REML) estimation, which has been recommended for its approximate unbiasedness and efficiency (Viechtbauer, 2005; Viechtbauer, 2010) and its broad application to a variety of models (Pustejovsky and Tipton, 2022). REML estimation is also a default option in popular meta-analysis R packages, metafor (Viechtbauer, 2010) and clubSandwich (Pustejovsky, 2022). In this dissertation, I refer to the RMA with REML estimation as the SRMA.

4.2 Bayesian Power Analysis Procedure

As discussed in Chapter 2, the essence of the Bayesian procedure is to utilize an input parameter distribution for power analysis. With information from multiple studies, Du and Wang (2016) proposed using Bayesian random-effects meta-analysis (BRMA) to pool effect size estimates and obtain the posterior distribution of effect size as the input distribution for power analysis. The BRMA can be expressed as (Harrer et al., 2021; Higgins et al., 2009)

Within-Level: d_k | δ_k, σ²_k ~ N(δ_k, σ²_k)
Between-Level: δ_k | μ_δ, τ²_δ ~ N(μ_δ, τ²_δ)
(μ_δ, τ²_δ) ~ p(·),  τ²_δ > 0   (4.2)

which is equivalent to Model 4.1, except that this model requires the specification of prior distributions, p(·), for μ_δ and τ²_δ. Du and Wang (2016) suggested noninformative priors, which are common in the Bayesian meta-analysis literature (e.g., Jansen et al., 2008; Sutton and Abrams, 2001); specifically, μ_δ ~ N(0, 10000) and τ²_δ ~ Uniform(0, 100).
The proposed procedure by Du and Wang (2016) involves (a) fitting Model 4.2 to construct the posterior distribution of effect size, (b) drawing effect size values from the posterior distribution with Markov chain Monte Carlo (MCMC) methods, and (c) using the effect size distribution to obtain the power distribution and determine sample size. Du and Wang (2016) implemented this procedure in an R function, pas(), which uses the rjags package (Plummer, 2023) to perform MCMC.

In practice, researchers may only have the results of an SRMA from a published study or may prefer using the SRMA over BRMA. In such cases, I suggest establishing the effect size distribution using the overall effect size estimate and the total variance estimate obtained from a non-Bayesian RMA. The total variance estimate is the sum of the estimates of the between-study variance and the within-study variance, τ̂² + SE(μ̂_δ)² (Higgins et al., 2009; Riley et al., 2011). Assuming δ follows a normal distribution, the effect size distribution can be constructed as δ ~ N(μ̂_δ, τ̂² + SE[μ̂_δ]²).

4.2.1 Illustration of the Bayesian Power Analysis Approaches

There are two possible approaches for performing Bayesian power analysis with RMA. The first approach relies on BRMA (Du and Wang, 2016), whereas the second approach allows the use of non-Bayesian methods, such as SRMA. To demonstrate the comparability of the two approaches in terms of the power analysis outcomes, here I meta-analyze the data in Working Example 2 of Chapter II, an example taken from Du and Wang (2016). A replicable R script for this illustration is available in Appendix VI.

Recall that the goal of Working Example 2 is to plan a study using the historical information from 29 studies. The effect size estimates refer to the estimates of the standardized mean difference in mental spatial rotation ability between males and females. To address any potential bias, the effect size estimates reported in Du and Wang (2016, Table 1, p. 593) are in the Hedges' g metric.

With the first approach, I first pool the effect size estimates of the 29 studies with BRMA and then obtain the posterior distribution of effect size for power analysis. The pas() function written by Du and Wang (2016), available at http://www3.nd.edu/lwang4/power/, takes care of all these steps. Users only need to supply to the function the effect size estimate and sample size of each of the 29 studies, as well as the sample size of the new study. The function then calculates and reports the mean power (the average chance of detecting the true effect) and the assurance (the chance of achieving the intended level of power). If the new study has n_1 = n_2 = 60, the procedure by Du and Wang (2016) suggests that the mean power is .82 and the assurance of achieving .80 power is 72%.

With the second approach, I first use the rma() function from the metafor R package (Viechtbauer, 2010) to pool the effect size estimates with the SRMA. The results include an overall effect size estimate of μ̂_δ = 0.72, a between-study variance of τ̂² = 0.11, and a within-study variance of SE(μ̂_δ)² = 0.005. Next, I specify the input distribution of effect size as N(μ̂_δ, τ̂² + SE[μ̂_δ]²) to evaluate mean power (Equation 2.2) and assurance (Equation 2.3). The functions defined in Appendix VI utilize the cubature R package (Narasimhan et al., 2023) to compute mean power and assurance with numerical integration; a simplified sketch of this computation appears below.
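To make the second approach concrete, the following is a minimal sketch rather than the Appendix VI code itself: it pools hypothetical Hedges' g estimates (vectors g and vi of estimates and sampling variances) with metafor::rma() and then propagates the resulting effect size distribution into mean power and assurance. For simplicity, the sketch uses Monte Carlo draws in place of the numerical integration performed by the cubature-based functions.

```r
library(metafor)

# Pool the historical estimates with SRMA (REML); 'g' and 'vi' are hypothetical
fit    <- rma(yi = g, vi = vi, method = "REML")
mu_hat <- as.numeric(coef(fit))         # overall effect size estimate
sd_es  <- sqrt(fit$tau2 + fit$se^2)     # sqrt(tau^2 + SE(mu_hat)^2)

# Power of a two-sided independent-samples t-test for a given true effect size
power_t <- function(delta, n1 = 60, n2 = 60, alpha = .05) {
  ncp  <- delta * sqrt(n1 * n2 / (n1 + n2))   # noncentrality parameter
  df   <- n1 + n2 - 2
  crit <- qt(1 - alpha / 2, df)
  pt(crit, df, ncp = ncp, lower.tail = FALSE) + pt(-crit, df, ncp = ncp)
}

set.seed(1)
delta_draws <- rnorm(1e5, mu_hat, sd_es)  # draws from N(mu_hat, tau^2 + SE^2)
power_draws <- power_t(delta_draws)
mean(power_draws)                         # mean power
mean(power_draws >= .80)                  # assurance of achieving .80 power
```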
Consistent with the results of the procedure by Du and Wang (2016), the proposed procedure yields a mean power of .83 and an assurance of 73% of achieving .80 power for a new study with n_1 = n_2 = 60.

4.3 Power Analysis for Multilevel Studies

Moving beyond the discussion of power analysis for single-level studies, here I introduce power analysis for multilevel studies, which requires the input of not only the effect size but also the ICC. ICC is a measure of the strength of association among units within a cluster. For data with a two-level structure, ICC is typically defined using the variance components of an unconditional two-level model:

Level 1: Y_ij = β_0j + e_ij,  e_ij ~ N(0, σ²)
Level 2: β_0j = γ_00 + u_j,  u_j ~ N(0, τ²)   (4.3)

where Y_ij is the outcome variable for the ith unit in the jth cluster, β_0j is the random intercept of each cluster, γ_00 is the grand intercept across clusters, u_j is the cluster-level random effect, and e_ij is the unit-level sampling error. In a two-level model, ICC is given by

ρ = τ² / (τ² + σ²),   (4.4)

which can be interpreted as the proportion of variance due to between-cluster differences. As shown in Equation 4.4, ICC is a ratio of variance components and is bounded between [0, 1].

For clustered data, the assumption of independent observations fails to hold, and the strength of dependence is quantified by ICC. This dependence, if unaccounted for, inflates the sampling variance of a sample statistic (e.g., a regression coefficient) by a factor of

Deff = 1 + (n − 1)ρ,   (4.5)

where n is the average cluster size (Lai and Kwok, 2015; Muthén and Satorra, 1995; Snijders and Bosker, 2012; Kish, 1965). This variance inflation factor is known as the design effect (Kish, 1965). A design effect larger than 1.1 signifies a substantial degree of dependence among units that necessitates the use of a multilevel model (Lai and Kwok, 2015). For instance, in a study assessing the impact of an educational program on students' mathematics achievement, an ICC of .1 implies that students within the same school tend to have more similar mathematics achievement than those from different schools. Failure to address this dependence would result in an inflation of the variance of the program effect by a design effect of 1 + (200 − 1)(.1) = 20.9 for an average cluster size of 200. The inflation in variance can lead to erroneous statistical inferences. Therefore, employing a multilevel model is essential to appropriately address the clustering structure within the data.

When planning a study with a clustering structure, researchers need to assess the degree of dependence among units and specify an ICC value for power analysis. In determining the ICC value, researchers can draw upon existing literature. For example, Hedges and Hedberg (2013) reported ICC estimates for mathematics achievement among 6th graders in 11 states, varying from .045 to .273, with a mean of .188. Given the variability, researchers can employ RMA to synthesize ICC estimates and assess the amount of variability. The overall ICC estimate and the variability estimates can then inform power analysis. A brief sketch of computing the ICC and the design effect from a fitted two-level model follows below.
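As a quick illustration of Equations 4.4 and 4.5, the sketch below computes the ICC and the design effect from an unconditional two-level model fitted with lme4. The data frame dat, with outcome math and cluster identifier school, is hypothetical and stands in for whatever clustered data a researcher has at hand.

```r
library(lme4)

# Unconditional two-level model (Model 4.3); 'dat' is hypothetical
fit <- lmer(math ~ 1 + (1 | school), data = dat)

vc     <- as.data.frame(VarCorr(fit))
tau2   <- vc$vcov[vc$grp == "school"]    # between-cluster variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-cluster variance

icc  <- tau2 / (tau2 + sigma2)           # Equation 4.4
nbar <- mean(table(dat$school))          # average cluster size
deff <- 1 + (nbar - 1) * icc             # Equation 4.5
```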
4.4 Meta-Analytic Methods for Pooling ICC Estimates

In this section, I discuss three meta-analytic methods for synthesizing ICC estimates: SRMA, RVE, and BRMA.

4.4.1 Standard Random-Effects Meta-Analysis (SRMA)

Treated as effect size estimates, ICC estimates can be pooled using an SRMA as follows:

Within-Level: ρ_k = θ_k + e_k,  e_k ~ N(0, σ²_k)
Between-Level: θ_k = μ_θ + u_k,  u_k ~ N(0, τ²)   (4.6)

where, for the kth study, θ_k is the true ICC, ρ_k is the ICC estimate, u_k is the between-study random effect (the deviation from the overall ICC) with a variance of τ², e_k is the within-study sampling error with a variance of σ²_k, and μ_θ denotes the overall ICC. Notice that the SRMA makes three assumptions: (a) the within-level errors, e_k, follow a normal distribution, (b) the between-level random effects, u_k, also follow a normal distribution, and (c) the ICC estimates, ρ_k, and true ICCs, θ_k, are unbounded and can take values outside the range of [0, 1]. However, the distribution of ICC is often reported to be skewed and nonnormal (Hedberg and Hedges, 2014; Bhat and Beretvas, 2022), potentially violating the normality assumptions at both the within level and the between level.

Previous research has investigated the effects of nonnormality at the between level using the SRMA (i.e., with REML estimation) on the overall effect size estimates (Rubio-Aparicio et al., 2018), the confidence interval coverages of the true effect size (Kontopantelis and Reeves, 2012b; Rubio-Aparicio et al., 2018), and the between-study heterogeneity estimates (Blázquez-Rincón et al., 2023). With nonnormality at the between-study level, SRMA has been shown to produce biased overall effect sizes (Rubio-Aparicio et al., 2018), confidence interval coverages below the nominal level (Kontopantelis and Reeves, 2012b; Rubio-Aparicio et al., 2018), and biased heterogeneity estimates (Blázquez-Rincón et al., 2023). Moreover, larger between-study heterogeneity exacerbates the biases and reduces the coverages of SRMA in cases where the between-level distribution is nonnormal. Whether synthesizing ICC estimates with SRMA results in similar issues requires further investigation.

4.4.2 Robust Variance Estimation (RVE)

Robust variance estimation (RVE) is a method that relaxes the strict assumptions that SRMA makes when synthesizing parameter estimates. RVE is commonly used to pool dependent effect size estimates (e.g., estimates from the same study) with REML estimation, an approach implemented in the clubSandwich R package (Pustejovsky and Tipton, 2022). However, when the estimates are independent (e.g., all drawn from distinct studies), RVE with REML estimation yields the same result as SRMA. Moreover, while RVE with REML estimation is efficient in fitting a variety of models (Pustejovsky and Tipton, 2022), this approach also relies on the normality assumption. On the other hand, RVE with the method of moments estimation, implemented in the robumeta R package (Z. Fisher et al., 2017), is more accommodating of nonnormal distributions (Hedges et al., 2010; Pustejovsky and Tipton, 2022). RVE with the method of moments estimation does not require the sampling distribution to follow a particular shape, focusing instead on equating the theoretical moments with the sample moments (Hedges et al., 2010; Pustejovsky and Tipton, 2022). For these reasons, past research has recommended the use of RVE (Bhat and Beretvas, 2022) with the method of moments estimation (Kivlighan et al., 2020) to meta-analyze ICC estimates, whose distribution may be nonnormal. Despite relaxing the normality assumption, RVE still requires the assumption that ICC estimates are unbounded and can take on any real numbers. A short sketch of fitting SRMA and RVE to a set of ICC estimates in R follows below.
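The sketch below shows what pooling independent ICC estimates with SRMA and RVE might look like in R. The vectors icc and se_icc, holding K ICC estimates and their standard errors, are hypothetical; the RVE call follows the robumeta interface with small-sample corrections and its default within-study effect-size correlation.

```r
library(metafor)
library(robumeta)

# Hypothetical inputs: one ICC estimate and standard error per study
dat <- data.frame(study = seq_along(icc), icc = icc, v = se_icc^2)

# SRMA: random-effects model with REML estimation (Model 4.6)
srma <- rma.mv(yi = icc, V = v, random = ~ 1 | study, data = dat)

# RVE with the method of moments estimation and small-sample corrections
rve <- robu(icc ~ 1, data = dat, studynum = study, var.eff.size = v, small = TRUE)

summary(srma)
print(rve)
```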
4.4.3 Bayesian Random-Effects Meta-Analysis (BRMA)

BRMA offers a viable alternative by relaxing the normality assumption and accounting for the bounded nature of ICC estimates. Previous studies have recommended Bayesian methods to address nonnormality, with a focus on the between-level distribution (Baker and Jackson, 2008; Lee and Thompson, 2008; Beath, 2014). Specifically, these methods model the between-level distribution with skewed distributions (Lee and Thompson, 2008) or with t distributions to handle outliers (Baker and Jackson, 2008; Lee and Thompson, 2008). The advantage of Bayesian methods lies in their flexibility in employing different forms of distributions to accommodate the characteristics of a parameter.

The development of these Bayesian meta-analytic methods has mainly focused on addressing outliers within the distribution of effect size estimates at the between-study level. However, for ICC estimates, potential deviations from normality are a concern not only at the between-study level but also at the within-study level. Furthermore, unlike most effect size estimates, ICC estimates have a restricted range between [0, 1]. Thus, the currently available Bayesian strategies might not be directly applicable for synthesizing ICC estimates, highlighting the need to investigate alternative distributions.

ICC Distribution at the Between-Study Level

For the ICC distribution at the between-study level, I suggest employing either a Beta proportion distribution or a normal distribution restricted to the range of [0, 1]. In the context of sample size planning with Bayesian approaches, methodologists have recommended modeling the distribution of ICC using a Beta distribution (Singh and Mukhopadhyay, 2016; Spiegelhalter, 2001) or a normal distribution restricted to a range above zero (Moerbeek and Teerenstra, 2015). Following these suggestions, the first proposed model uses a Beta proportion distribution for the between-level distribution. A Beta proportion distribution is identical to a Beta distribution but has a different parameterization that allows for direct modeling of the overall ICC. The second proposed model employs a normal distribution restricted within [0, 1] to ensure that the true ICC values, θ_k, remain within their theoretical bounds.

ICC Distribution at the Within-Study Level

One complication with modeling the within-level ICC distribution lies in the estimation of ICC in the original studies. In studies with a small number of clusters, Model 4.3 with maximum likelihood (ML) estimation methods, including REML, can sometimes result in a singular fit, where the between-cluster variance estimate, and hence the ICC estimate, is zero (Chung et al., 2013; McNabb and Murayama, 2021). Even when the true ICC is substantially large, an ICC estimated with ML methods can be zero due to sampling variability, especially when the number of clusters is small (McNabb and Murayama, 2021). Strategies have been proposed to address this issue. For instance, one approach involves maximum penalized likelihood estimation (Chung et al., 2013), which prevents the variance estimates from approaching zero and has been implemented in the blme R package (Chung et al., 2013). Another recommended strategy is to fit a full Bayesian multilevel model that accounts for the uncertainty in the variance components, implemented in R packages such as brms (Bürkner, 2017).
Due to the singularity issue with ML methods, the sampling distribution of ICC estimates can deviate from the typical expectation. I conducted a small-scale simulation to compare the sampling distributions of ICC with theoretical distributions, including Beta and normal distributions. Specifically, I simulated 5,000 sets of two-level data with Model 4.3 and estimated ICC using REML with the lme4 R package (Bates et al., 2015) and the partial Bayesian method with the blme R package (Chung et al., 2013). I manipulated four true ICC values (.1, .2, .5, and .8) and fixed the number of clusters at 20 and the cluster size at 5 to simulate a relatively small sample size condition. Figures 1 and 2 show the sampling distributions of ICC (black, dotted lines) estimated using REML and the partial Bayesian method, respectively. All figures are overlaid with a normal distribution (blue, thin line) and a Beta distribution (red, thick line) constructed using the mean and variance of the simulated ICC estimates.

Figure 1: Sampling Distributions of ICC Estimated With Restricted Maximum Likelihood Estimation
Note: Black, dotted lines indicate the distributions of simulated ICC estimates. Blue, solid, thin lines denote the normal distribution with the same mean and variance as the simulated ICC estimates. Red, solid, thick lines denote the Beta distribution with the same mean and variance as the simulated ICC estimates.

Figure 2: Sampling Distributions of ICC Estimated With Penalized Likelihood Estimation
Note: Black, dotted lines indicate the distributions of simulated ICC estimates. Blue, solid, thin lines denote the normal distribution with the same mean and variance as the simulated ICC estimates. Red, solid, thick lines denote the Beta distribution with the same mean and variance as the simulated ICC estimates.

In the presence of the singularity issue, Figure 1 shows that, when the true ICC is small (.1, .2), the sampling distributions of ICC are zero-inflated and the non-zero ICC estimates follow a normal distribution more closely than a Beta distribution. When the true ICC is larger (.5, .8), both the normal and Beta distributions approximate the sampling distribution of ICC, with the Beta distribution aligning more closely with the sampling distribution. By contrast, without the singularity issue, Figure 2 shows that the Beta distribution provides a closer fit to the sampling distribution of ICC than the normal distribution across conditions. This illustration demonstrates that, due to the singularity issue, the sampling distribution of ICC estimated with REML can deviate from its expected forms.

Moreover, the estimation of standard errors of ICC estimates presents another challenge. A widely used standard error formula for an ICC estimate (ρ) is given by (R. A. Fisher, 1970; Hedberg and Hedges, 2014; Snijders and Bosker, 2012)

SE(ρ) = (1 − ρ)[1 + (n − 1)ρ] √{2 / [n(n − 1)(Jn − 1)]},   (4.7)

where J is the number of clusters and n is the mean cluster size. However, this formula was developed based on large-sample approximations (R. A. Fisher, 1970; Ukoumunne, 2002) and may yield biased standard error estimates in cases of small sample sizes. The bias in the standard error estimates could be exacerbated by the presence of the singularity issue. Without relying on large-sample approximations, an alternative method to estimate SE(ρ) involves using Bayesian multilevel analysis and obtaining the posterior standard deviation of ICC. A brief sketch of these estimation options follows below.
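The following is a minimal sketch of the estimation options discussed above: REML with lme4 (which can produce singular fits), penalized likelihood with blme, and the large-sample standard error in Equation 4.7. The data frame dat, with outcome y and cluster identifier cluster, is hypothetical.

```r
library(lme4)
library(blme)

icc_from_fit <- function(fit) {
  vc <- as.data.frame(VarCorr(fit))
  vc$vcov[1] / sum(vc$vcov)                 # tau^2 / (tau^2 + sigma^2)
}

fit_reml <- lmer(y ~ 1 + (1 | cluster), data = dat)   # may yield a singular fit
fit_pen  <- blmer(y ~ 1 + (1 | cluster), data = dat)  # penalized estimation discourages a zero tau^2

rho_reml <- icc_from_fit(fit_reml)
rho_pen  <- icc_from_fit(fit_pen)

# Large-sample standard error of an ICC estimate (Equation 4.7)
se_icc_eq47 <- function(rho, J, n) {
  (1 - rho) * (1 + (n - 1) * rho) * sqrt(2 / (n * (n - 1) * (J * n - 1)))
}
se_icc_eq47(rho_reml, J = 20, n = 5)
```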
Based on the above investigation, my general recommendation is to use Bayesian methods to estimate ICC and its standard error if the raw data of the original studies are available. This approach is particularly useful when the true ICC is expected to be small and the original studies involve a small number of clusters. As shown in Figure 2, the resulting sampling distribution of ICC aligns more closely with a Beta distribution than with a normal distribution. In cases without the singularity issue, such as when using Bayesian estimation, I suggest adopting a Beta distribution to model the within-level ICC distribution. However, access to raw data may often be limited for secondary analyses. Given that ICC is often estimated using ML estimation methods, in this chapter I focus the discussion on pooling ICC estimates from models that employ ML estimation. As shown in Figure 1, a Beta distribution aligns poorly with the sampling distribution of ICC in the presence of the singularity issue. Therefore, in cases where ML methods were used to analyze the historical data, I recommend employing a normal distribution restricted within the range of [0, 1] to model the within-level ICC distribution.

Proposed BRMA Models

In summary, I propose two BRMA models to meta-analyze ICC estimates. The first model is

Within-Level: ρ_k | θ_k, σ²_k ~ N(θ_k, σ²_k),  ρ_k ∈ [0, 1]
Between-Level: θ_k | μ_θ, κ ~ Beta proportion(μ_θ, κ)
(μ_θ, κ) ~ p(·)
τ²_θ = μ_θ(1 − μ_θ) / (κ + 1)   (4.8)

where the between-level distribution is a Beta proportion distribution with a mean of μ_θ and a concentration of κ, and the within-level distribution is a normal distribution restricted within [0, 1]. The overall ICC is represented by μ_θ, and the between-study variance, τ²_θ, can be derived from μ_θ and κ. Similar to Du and Wang (2016), I select a uniform prior for μ_θ, which assigns equal probability to any value between 0 and 1, and a vague prior for κ, namely κ ~ Gamma(0.01, 0.01) (Kruschke, 2015).

The second model is given by

Within-Level: ρ_k | θ_k, σ²_k ~ N(θ_k, σ²_k),  ρ_k ∈ [0, 1]
Between-Level: θ_k | μ_θ, τ²_θ ~ N(μ_θ, τ²_θ),  θ_k ∈ [0, 1]
(μ_θ, τ²_θ) ~ p(·)   (4.9)

where both the between-level and within-level distributions are normal distributions restricted within the range of [0, 1]. Similarly, I choose a uniform prior for μ_θ, which assigns equal probability to any value between 0 and 1, and a uniform prior for τ²_θ, which can be any real number above zero.

4.5 Simulation Study

In the previous section, I outlined three meta-analytic methods for aggregating ICC estimates: (a) SRMA, (b) RVE, and (c) BRMA. Moreover, I highlighted how the presence of singular fits in the original studies affects the sampling distribution of ICC. The present simulation study aims to investigate the following questions:
1. Given the potential bias in the ICC estimates and their standard errors due to the singularity issue, how well do these methods estimate the true overall ICC and between-study heterogeneity?
2. To what extent do the confidence intervals (CIs) constructed by these methods capture the true overall ICC?
3. To what extent do the prediction intervals (PIs) from these methods encompass the true ICC of a future study?
The first two questions are crucial in evaluating the performance of these meta-analytic methods in making statistical inferences about the true overall ICC, given the potential bias in the estimates.
The third question ties closely to the central idea of this chapter: to obtain a distribution of the true ICC that incorporates the between-study and within-study variability for planning future studies. Thus, examining the third question sheds light on appropriate methods for aggregating ICC estimates for conducting power analysis.

4.5.1 Data Generation

Figure 3: Data Generation Process of the Present Simulation Study

Figure 3 summarizes the data generation process of the present simulation study. Per replication, true ICC values, θ_k, for K studies are drawn from either a Beta distribution or a truncated normal distribution with a restricted range of [0, 1]. The Beta distributions have a mean of μ_θ, which denotes the true overall ICC, and a variance of τ²_θ, which indicates the between-study heterogeneity.¹ The truncated normal distribution has a mean of μ_θ and a variance of τ²_θ and lies within the interval [0, 1]. I manipulated the number of studies to be small or moderately large (K = 10, 40), the true overall ICC to be small, medium, or large (μ_θ = .2, .5, .8), and the between-study variance to be small or large (τ²_θ = .1², .2²). Figure 4 shows the between-study distribution of the true ICCs across these conditions. Notice that when μ_θ = .5, the between-study distribution is symmetric, whereas the distribution is positively skewed when μ_θ = .2 and negatively skewed when μ_θ = .8.

¹Beta distributions are typically parameterized with the shape parameters α and β. We can derive the shape parameters from the mean and variance of the Beta distribution: α = μ[μ(1 − μ)/σ² − 1] and β = (1 − μ)[μ(1 − μ)/σ² − 1].

Figure 4: Between-Study Distributions of True ICCs Across Conditions
Note: μ_θ = true overall ICC. τ²_θ = true between-study variance. The between-study distribution is either a Beta distribution or a truncated normal distribution.

For each of the K studies, I then generated two-level data with a varying number of clusters, averaging J̄ clusters, each of size n, based on Model 4.3. In this model, the grand intercept γ_00 was set to 0, and the total variance of y_ij was set to 1. Recall that the relationship between the true ICC and the true variance components is θ_k = τ²_k / (τ²_k + σ²_k). With Var(y_ijk) = τ²_k + σ²_k = 1, the true between-cluster variance was τ²_k = θ_k and the true within-cluster variance was σ²_k = 1 − θ_k. Next, the ICC, ρ_k, was estimated by fitting a two-level model with REML using the lme4 R package (Bates et al., 2015). The standard error of the ICC estimate, SE(ρ_k), was calculated using Equation 4.7. With K studies, the data generation process resulted in K sets of ICC estimates and standard error estimates to be aggregated. This study has a 2 × 2 × 3 × 2 × 2 × 2 factorial design, with a total of 96 conditions. Figure 3 details the six design factors manipulated. Per condition, I simulated 1,000 replications using the SimDesign R package (Chalmers and Adkins, 2020).

4.5.2 Data Analysis

Per replication, the generated K sets of ICC estimates were meta-analyzed by the SRMA, RVE, and BRMA. Following Model 4.6, SRMA was performed using the rma.mv() function in the metafor R package (Viechtbauer, 2010), and RVE was conducted using the robu() function in the robumeta R package with small-sample corrections and a within-study effect-size correlation of .8. The rma.mv() function utilizes REML, whereas robu() uses the method of moments estimation.
The two BRMA models, 4.8 and 4.9, were fitted to the data using the cmdstanr R package (Gabry and Češnovar, 2022) with four chains, each with 1,000 warmup iterations and 4,000 sampling iterations. At the within-study level, both models shared the same normal distribution restricted within [0, 1]. However, at the between-study level, the first model uses a Beta proportion distribution, whereas the second model uses a normal distribution restricted within [0, 1]. For the two BRMA models, I calculated the model convergence rate as the proportion of replications in which the diagnostic summary in cmdstanr reported no divergent transitions.

From each of the fitted models, I obtained (a) the overall ICC estimate, μ̂_θ, (b) the between-study variance estimate, τ̂²_θ, (c) the 95% CI, (d) the 80% PI, and (e) the 95% PI. For SRMA and RVE, the 95% CI can be calculated as (Sánchez-Meca and Marín-Martínez, 2008)

μ̂_θ ± t_(.975, K−1) SE(μ̂_θ),   (4.10)

where t_(.975, K−1) is the 97.5th percentile of the central t distribution with K − 1 degrees of freedom, and SE(μ̂_θ) is the estimated standard error of the overall ICC estimate. A CI provides a range of values within which the true ICC is expected to lie. A 95% CI indicates that, over repeated sampling, 95% of the calculated intervals are expected to contain the true overall ICC. Moreover, the 80% PI can be approximated by (Viechtbauer, 2010; Viechtbauer and López-López, 2022; Higgins et al., 2009; Riley et al., 2011)

μ̂_θ ± t_(.9, K−1) √[τ̂²_θ + SE(μ̂_θ)²],   (4.11)

where t_(.9, K−1) is the 90th percentile of the central t distribution with K − 1 degrees of freedom.² The 95% PI can be calculated using Equation 4.11 but with t_(.975, K−1). A PI provides a range of values within which the true ICC of a future study will fall (Riley et al., 2011). For example, a 95% PI denotes that, over repeated sampling, 95% of the calculated intervals are expected to contain a future true ICC.

For BRMA, the 95% CI, or credible interval, is given by

[μ*_θ,.025, μ*_θ,.975],   (4.12)

the 2.5th and 97.5th percentiles of the posterior draws of the overall ICC, μ*_θ. A 95% credible interval indicates that the interval has a 95% chance of containing the true overall ICC. The PIs can be obtained from the posterior draws of the predicted true ICC, θ*, which are drawn from the between-level distribution fitted in the model (i.e., a Beta proportion distribution or a normal distribution restricted between zero and one). The 80% PI can be constructed as

[θ*_.1, θ*_.9],   (4.13)

the 10th and 90th percentiles of the posterior draws of θ*. Similarly, the 95% PI is given by the 2.5th and 97.5th percentiles of the posterior draws of θ*. The interpretation of a 95% PI, for example, is that there is a 95% chance that the interval captures the true ICC of a future study.

²As noted in Viechtbauer (2010), a t distribution with either K − 1 (Viechtbauer, 2010; Viechtbauer and López-López, 2022) or K − 2 (Higgins et al., 2009; Riley et al., 2011) degrees of freedom is an approximation. I followed the metafor package (Viechtbauer, 2010; Viechtbauer and López-López, 2022) and used K − 1 degrees of freedom.
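As a concrete illustration of Model 4.8, the sketch below shows one way the BRMA-Beta model could be written as a Stan program and fitted through cmdstanr. This is a simplified sketch rather than the exact program used in the simulation: the variable names are illustrative, the priors follow the text (a uniform prior on μ_θ and a Gamma(0.01, 0.01) prior on κ), and icc and se_icc are hypothetical vectors of ICC estimates and standard errors.

```r
library(cmdstanr)

brma_beta_code <- "
data {
  int<lower=1> K;
  vector<lower=0, upper=1>[K] rho;   // ICC estimates
  vector<lower=0>[K] se;             // their standard errors
}
parameters {
  real<lower=0, upper=1> mu_theta;   // overall ICC (uniform prior implied by the bounds)
  real<lower=0> kappa;               // concentration of the Beta proportion distribution
  vector<lower=0, upper=1>[K] theta; // study-specific true ICCs
}
model {
  kappa ~ gamma(0.01, 0.01);
  for (k in 1:K) {
    theta[k] ~ beta_proportion(mu_theta, kappa);
    rho[k] ~ normal(theta[k], se[k]) T[0, 1];  // within-study normal restricted to [0, 1]
  }
}
generated quantities {
  real tau2_theta = mu_theta * (1 - mu_theta) / (kappa + 1);  // between-study variance
  real theta_new = beta_proportion_rng(mu_theta, kappa);      // predicted true ICC of a new study
}
"

mod <- cmdstan_model(write_stan_file(brma_beta_code))
fit <- mod$sample(data = list(K = length(icc), rho = icc, se = se_icc),
                  chains = 4, iter_warmup = 1000, iter_sampling = 4000)
fit$summary(c("mu_theta", "tau2_theta"))
```

The posterior draws of theta_new form the predicted ICC distribution from which the 80% and 95% PIs in Equation 4.13 can be read off as quantiles.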
4.5.3 Evaluation Criteria

I evaluated the following criteria for each of the four methods: SRMA, RVE, BRMA with a Beta between-study distribution (BRMA-Beta), and BRMA with a normal between-study distribution restricted within [0, 1] (BRMA-TN).

Bias

I computed the bias of the estimators of the overall ICC (μ_θ) and the between-study variance (τ²_θ) as

Bias = (Σ_{r=1}^{R} ϑ̂_r) / R − ϑ,   (4.14)

where ϑ is the true overall ICC or the true between-study variance, ϑ̂_r is the corresponding estimate in the rth replication, and (Σ_{r=1}^{R} ϑ̂_r) / R is the mean of the estimates across R = 1,000 replications.

Root Mean Squared Error (RMSE)

In addition, I computed the RMSE of the estimators of μ_θ and τ²_θ as

RMSE = √[ Σ_{r=1}^{R} (ϑ̂_r − ϑ)² / R ],   (4.15)

the square of which can be expressed as the sum of the squared bias of the estimator and the variance of the estimator. When two methods demonstrate comparable bias levels, a smaller RMSE indicates lower sampling variability and better estimation accuracy.

CI and PI Coverages and Widths

The CI coverage rate was calculated as the proportion of times the CIs of a method captured the true overall ICC across 1,000 replications. Valid 95% CIs should have a CI coverage of 95%. Similarly, the PI coverage was determined as the proportion of times the PIs of a method contained a new true ICC drawn from the true between-study distribution across 1,000 replications. The CI and PI widths were computed as the average differences between the upper and lower bounds of the corresponding intervals. A method that achieves adequate coverage with narrower CIs or PIs indicates greater precision in containing the true overall ICC or the true ICC of a new study, respectively.

4.5.4 Results

The result patterns are similar in the conditions with a Beta between-study distribution and a truncated normal distribution. Therefore, I present the results for the conditions with a Beta between-study distribution.

The convergence rates of BRMA-Beta exceeded 90% in most conditions. In five conditions with a small number of studies, average number of clusters, and cluster size, and a large between-study variance, the model convergence rates ranged between 52% and 85%. BRMA-TN had poorer convergence rates, with a range of 6.5% to 87% in 13 conditions and a convergence rate above 90% in the rest of the conditions. BRMA-TN resulted in poorer convergence when the number of studies, average number of clusters, and cluster size were small.

Bias and RMSE in Overall ICC

Figures 5 and 6 show the bias and RMSE, respectively, of the methods in estimating the overall ICC. When the true overall ICC was μ_θ = .5, where the between-study distribution is symmetric, all methods had close to zero bias in the overall ICC and comparable RMSE. When the true between-study variance was small (τ²_θ = .1²), all methods performed similarly in terms of biases and RMSEs. Specifically, all methods exhibited small negative biases for μ_θ = .2 and small positive biases for μ_θ = .8 in the condition with a small average number of clusters (J̄ = 20). Increasing the average number of clusters (J̄ = 50) reduced the bias to close to zero. RMSE decreased with a larger average number of clusters, cluster size, and number of studies.

Figure 5: Bias in Estimating the Overall ICC
Note: The true between-study distribution is a Beta distribution. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

Figure 6: RMSE in Estimating the Overall ICC
Note: The true between-study distribution is a Beta distribution. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.
When the true between-study variance was large (τ²_θ = .2²), RVE had a larger negative bias for μ_θ = .2 and n = 50 and a substantially large positive bias for μ_θ = .8 across different cluster sizes. The bias was reduced as the average number of clusters increased. RVE also demonstrated larger RMSE in the conditions with μ_θ = .8 and in the conditions where μ_θ = .2 and n = 50. The other three methods performed similarly in terms of bias and RMSE. Of note, in the conditions with a truncated normal between-study distribution, the biases and RMSEs of all methods were smaller than in the conditions with a Beta between-study distribution.

Bias and RMSE in Between-Study Variance

Figures 7 and 8 present the bias and RMSE, respectively, of the methods in estimating the between-study variance. When the true between-study variance was small (τ²_θ = .1²), SRMA and RVE had close to zero biases and the lowest RMSEs across conditions. However, BRMA-Beta demonstrated small positive biases for μ_θ = .2 and n = 5, and BRMA-TN had positive biases for μ_θ = .5 and n = 5. When the average number of clusters (J̄) or the number of studies (K) was larger, both BRMA methods had biases and RMSEs comparable to SRMA and RVE.

Figure 7: Bias in Estimating the Between-Study Variance
Note: The true between-study distribution is a Beta distribution. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

Figure 8: RMSE in Estimating the Between-Study Variance
Note: The true between-study distribution is a Beta distribution. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

When the true between-study variance was large (τ²_θ = .2²), RVE had substantial biases across conditions. In particular, the biases were positive when μ_θ = .5 and negative when μ_θ = .8 and when μ_θ = .2 with n = 50. BRMA-TN also demonstrated positive biases, to a smaller degree, when the number of studies was small (K = 10). SRMA and BRMA-Beta had the lowest biases, with SRMA having a smaller bias in most conditions. Similarly, the biases and RMSEs of all methods were smaller in the conditions with a truncated normal between-study distribution.

95% CI Coverage and Width

Figure 9 shows the 95% CI coverages, and Figure 10 shows the average 95% CI lower and upper bounds of all methods across 1,000 replications. The bar between a lower bound and an upper bound indicates the average 95% CI width. When μ_θ = .5, where the between-study distribution of true ICCs was symmetric, all methods reached 95% CI coverage. When the number of studies was small (K = 10), BRMA-Beta had the narrowest CI width and BRMA-TN had the widest CI width. When the number of studies was larger (K = 40), all methods had similar CI widths.

Figure 9: 95% CI Coverage
Note: The true between-study distribution is a Beta distribution. The horizontal line denotes the 95% coverage. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

Figure 10: Average 95% CI Bounds
Note: The true between-study distribution is a Beta distribution. The dotted lines are the theoretical limits of ICC. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.
Across conditions, BRMA-Beta had the highest CI coverage, followed by BRMA-TN and SRMA, whereas RVE had the lowest CI coverage. BRMA-Beta achieved 95% CI coverage across conditions, except when both the average number of clusters and the cluster size were small. BRMA-TN and SRMA had lower than 95% CI coverage in the conditions with a small number of studies (K = 10) and a small average number of clusters (J̄ = 20) and reached 95% CI coverage as K or J̄ increased. RVE had low CI coverage, particularly when the true overall ICC was large (μ_θ = .8), the average number of clusters was small, and the true between-study variance was large (τ²_θ = .2²). All methods also showed improved CI coverage in the conditions with a truncated normal between-study distribution.

95% PI Coverage and Width

Figure 11: 95% PI Coverage
Note: The true between-study distribution is a Beta distribution. The horizontal line denotes the 95% coverage. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

Figure 12: Average 95% PI Bounds
Note: The true between-study distribution is a Beta distribution. The dotted lines are the theoretical limits of ICC. μ_θ = true overall ICC. τ²_θ = true between-study variance. K = number of studies. J̄ = average number of clusters. n = cluster size.

Since the result patterns for the 80% and 95% PIs are similar, here I report the details of the 95% PI coverage and width. Figure 11 presents the 95% PI coverage, and Figure 12 summarizes the average lower and upper bounds across 1,000 replications, as well as the PI widths, of all methods. Similar to the other evaluation criteria, when the true between-study distribution was symmetric (μ_θ = .5), all methods reached 95% PI coverage, although RVE and SRMA had slightly lower coverage when J̄ = 20. Despite the adequate PI coverage when μ_θ = .5, RVE and SRMA had wide PIs, particularly when τ²_θ = .04.

When μ_θ = .2 or .8, BRMA-Beta had the highest 95% PI coverage, closely followed by SRMA in most conditions. SRMA had lower than 95% PI coverage when the average number of clusters was small (J̄ = 20) and the number of studies was small (K = 10). Increasing J̄ and K improved the 95% PI coverage for SRMA. BRMA-TN had lower PI coverage when the true between-study variance was large (τ²_θ = .2²). RVE had the lowest PI coverage and the narrowest PI width when the true overall ICC was large (μ_θ = .8). The low PI coverage and narrow PI width could be explained by the strong bias in the estimation of the between-study variance by RVE. When τ²_θ = .04, the average PI bounds of SRMA and RVE fell outside of the theoretical limits of ICC in most conditions. Note also that all methods had better PI coverage when the true between-study distribution was a truncated normal distribution, although the average PI bounds again went beyond the theoretical limits when τ²_θ = .04.

Summary

The challenges with meta-analyzing ICC estimates included not only the skewness in the distributions but also their theoretical limits. Due to the singularity issue, the potential bias in the ICC estimates and their standard errors added a layer of complexity. Contrary to expectation, despite the violation of the normality assumption, SRMA performed well in terms of bias and RMSE when estimating the overall ICC and the between-study variance. BRMA-Beta also had bias and RMSE comparable to SRMA.
On the other hand, when the true between-study variance was large, BRMA-TN had unstable performance, and RVE could at times have substantial biases with high RMSEs. In terms of CI and PI coverages, BRMA-Beta reached the desired level of coverage with the narrowest width in almost all conditions. Although BRMA-TN had lower coverages in some conditions, it had reasonable coverages in the conditions with a larger number of studies and a larger average number of clusters. SRMA also had reasonable CI and PI coverages, but its average PI bounds extended beyond the theoretical limits. RVE had the most unstable performance, particularly when the true between-study variance was large.

Overall, both SRMA and BRMA-Beta had the best performance. As supported by this simulation, these two methods were reasonably accurate in making statistical inferences about the true overall ICC. Moreover, the simulation results revealed that BRMA-Beta effectively recovers the between-study distribution of the true ICCs, despite the challenges posed by the skewed distributions and the singularity issue. It is important to note that, when the true ICCs followed a truncated normal distribution, BRMA-Beta also maintained decent performance. With SRMA, although the PI coverage was not ideal, the true ICC distribution could still be approximated using the unbiased estimates of the true overall ICC and between-study variance. In the next section, I demonstrate the use of these two methods to obtain the true ICC distribution for conducting power analysis.

4.6 Illustrative Examples of Meta-Analyzing ICC

Hedges and Hedberg (2013) reported the ICC estimates and their standard errors for Grade 1-11 Mathematics achievement across 11 states: Arkansas, Arizona, Colorado, Florida, Kansas, Kentucky, Louisiana, Massachusetts, North Carolina, West Virginia, and Wisconsin. The reported ICC estimates represent the strength of association in Mathematics achievement among students within a school. Across the 11 states, the average number of schools per grade ranges from 44 to 920, and the average school size per grade ranges from 49 to 187. In this section, I illustrate how to meta-analyze the ICC estimates across states and obtain the ICC distribution for power analysis using SRMA and BRMA. Specifically, I aggregate ICC estimates across states for each of Grades 3-8 and 10, each of which has ICC estimates from at least five states.

Researchers who plan a two-level study on Mathematics achievement in one of the 11 states can directly adopt the estimates from Hedges and Hedberg (2013) to construct the ICC distribution for power analysis. On the other hand, researchers who plan a two-level study in multiple states or in states other than the 11 states can adopt RMA to obtain the predicted ICC distribution. RMA allows estimation of the overall ICC and the random deviations from the overall ICC across states, as well as generalization of the results to other states (Hall and Rosenthal, 2018). The resulting ICC distribution from RMA represents the distribution of true ICCs across states for Mathematics achievement in each grade. Consistent with the simulation study, I utilize Models 4.6 and 4.8 to conduct SRMA and BRMA, respectively. The BRMA model has a Beta distribution at the between-state level and a normal distribution restricted within [0, 1] at the within-state level.
With SRMA, I first aggregate the ICC estimates using the metafor R package (Viechtbauer, 2010) and extract the estimates of the overall ICC (μ̂_θ), the between-state variance (τ̂²_θ), and the within-state variance (SE[μ̂_θ]²). Next, I approximate the true ICC distribution with a Beta distribution that has a mean of μ̂_θ and a standard deviation of √(τ̂²_θ + SE[μ̂_θ]²). The resulting ICC distributions are outlined in blue in Figure 13. To fit the BRMA model, I use the cmdstanr R package (Gabry and Češnovar, 2022) with four chains, each with 1,000 warmup iterations and 4,000 sampling iterations, leading to a total of 16,000 posterior draws. The posterior distribution of ICC, illustrated with black, dashed lines in Figure 13, comprises the 16,000 posterior draws of the predicted true ICC, θ*. Appendix VI includes the R script for aggregating the ICC estimates of Grade 8 Mathematics achievement with the two methods.

Figure 13: ICC Distributions Obtained From SRMA and BRMA
Note: θ = true ICC. The histograms, outlined by black, dashed lines, denote the posterior distributions of ICC obtained from BRMA. The blue, solid lines outline the Beta distributions constructed using the overall ICC and between-study variance estimates from SRMA.

As shown in Figure 13, the shapes of the ICC distributions constructed using SRMA and BRMA are fairly similar. The simulation results showed that BRMA with a Beta between-level distribution tended to have a larger τ̂²_θ than SRMA. Consistently, in these examples, the ICC distributions from SRMA are narrower than those from BRMA. Note that both distributions are approximations of the true ICC distributions, and slight deviations between the two distributions are expected. These examples show that the posterior distribution of ICC from BRMA is reasonably similar to a Beta distribution constructed using estimates from SRMA.

Suppose that we plan a school-randomized study that evaluates an educational treatment on Grade 3 Mathematics achievement in California. To perform power analysis with the Bayesian procedure, we can use my R package, BACpowr, which stands for Bayesian Approach for Classical power analysis. This package accepts input parameter distributions to perform power analysis. Users can select a theoretical distribution, such as a Beta distribution, or supply posterior draws to construct an input parameter distribution. With the Bayesian approach, power analysis for two-level cluster randomized studies requires the input distributions of the effect size and the ICC. Suppose we believe the effect size follows a normal distribution with a mean of 0.5 and a standard deviation of 0.15 and select this distribution as the effect size distribution. In addition, we specify a Beta distribution with a mean of μ̂_θ = .182 and a standard deviation of √(τ̂²_θ + SE[μ̂_θ]²) = .155 from SRMA as the ICC distribution. The results suggest that a study with 50 schools, each with 50 students (J = 50, n = 50), will have a mean power of .82 and an assurance of 67%. In other words, incorporating the uncertainty in the effect size and the ICC, our new study will have an average 82% probability of detecting the treatment effect and a 67% chance of achieving .80 power. Alternatively, we can provide the posterior draws of ICC to the BACpowr package. The results indicate that a study with J = 50 and n = 50 will have a mean power of .85 and an assurance of 73% of achieving .80 power, incorporating the uncertainty in the effect size and ICC specified in the distributions.
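For readers who want to see the mechanics behind such a calculation, the following is a rough sketch, not BACpowr itself, of a Monte Carlo computation of mean power and assurance for this design. It assumes a balanced two-arm cluster randomized design, uses the design-effect-based standard error of the standardized treatment effect and a normal approximation to the power function, and derives the Beta shape parameters from the stated mean and standard deviation, so its output may differ slightly from the BACpowr results reported above.

```r
set.seed(1)
R <- 1e5

# Effect size distribution: N(0.5, 0.15)
delta <- rnorm(R, mean = 0.5, sd = 0.15)

# ICC distribution: Beta with mean .182 and SD .155 (shape parameters from mean/SD)
mu <- .182; s <- .155
nu  <- mu * (1 - mu) / s^2 - 1
rho <- rbeta(R, shape1 = mu * nu, shape2 = (1 - mu) * nu)

# Two-arm cluster randomized design with J = 50 schools of n = 50 students,
# equal allocation; SE of the standardized treatment effect via the design effect
J <- 50; n <- 50
se_delta <- sqrt(4 * (1 + (n - 1) * rho) / (J * n))

# Two-sided test at alpha = .05, normal approximation to the power function
power <- pnorm(delta / se_delta - qnorm(.975)) + pnorm(-delta / se_delta - qnorm(.975))

mean(power)          # mean power
mean(power >= .80)   # assurance of achieving .80 power
```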
4.7 Discussion Meta-analysis has been useful for understanding an overall finding of a body of research and informing future study planning. While the meta-analysis literature has mostly focused on synthesizing effect size estimates, there has been growing interest in pooling ICC estimates (e.g., Hedberg and Hedges, 2014; Bhat and Beretvas, 2022). However, given the unique properties of ICC and potential assumption violations, it has been a question whether standard meta-analytic methods reasonably estimate the overall ICC and the between-study variance. To relax the assumptions, I proposed Bayesian methods that appropriately bound ICC within its theoretical limits and account for skewness. Moreover, I systematically compared and assessed the performance of meta-analytic methods in a simulation study. 88 The simulation study provided support to the standard method (SRMA) for meta-analyzing ICC estimates, even in conditions with violations of the normality assumption at the betweenstudy level. Overall, SRMA consistently had low biases and RMSEs in estimating the overall ICC and between-study variance. However, the CI coverage and PI coverage could be lower than the nominal level when the number of studies and the sample size of the historical studies were small. Moreover, the CI and PI widths were relatively wide and the boundaries of the intervals could reach beyond the theoretical limits of ICC when the true overall ICC is close to zero or one. Another method that produced comparably low biases and RMSEs was the Bayesian method (BRMA) with a Beta between-level distribution. Although this method had slightly higher biases in estimating the between-study variance than SRMA in some conditions, the Bayesian method yielded the highest CI and PI coverages among methods and reached the nominal level in almost all conditions. The CIs and PIs were also properly bounded within the theoretical limits. Section 4.4.3 presented an interesting discovery of the within-study sampling distribution of ICC. When the number of clusters is small, ML estimation methods could sometimes result in a zero estimate of the between-cluster variance and thus ICC, even though their true values are non-zero. Whereas past research has commonly approximated the ICC distributions with Beta distributions (Singh and Mukhopadhyay, 2016; Spiegelhalter, 2001; Sarkodie et al., 2023), the sampling distribution of ICC estimated with ML estimation appears to be a mixture of a normal distribution and a zero-inflated distribution. By contrast, with the penalized likelihood estimation which avoids the singularity issue, ICC follows more closely a Beta distribution than a normal distribution, consistent with the literature. With ML estimation, given the sampling distribution approximately follows a normal distribution, the normality assumption of SRMA may hold, yielding relatively unbiased results. At the end of the chapter, I illustrated how to meta-analyze ICC estimates with SRMA and BRMA and construct the ICC distribution for power analysis. Extended from the procedure by Du and Wang (2016), BRMA was employed to obtain the posterior distributions of ICC, which were then used in power analysis. To provide a convenient alternative, I proposed a method to construct the ICC distribution using estimates from SRMA, which could be a preferable method or a method employed in a previous study. 
This extension completes a piece of the Bayesian procedure for power analysis, which allows the incorporation of a 89 collection of research findings to inform power analysis for multilevel studies. 4.8 Limitations and Future Directions Given the popularity of ML estimation, the present simulation focused on pooling ML estimates of ICC. The simulation results are generalizable to aggregating historical findings estimated using ML estimation but not other estimation methods. A potential direction for a simulation study is to generate historical ICC estimates using penalized likelihood estimation or full Bayesian estimation and meta-analyze these ICC estimates. These ICC estimates, as shown in Figure 2, follow more closely to a Beta distribution than a normal distribution. Given the within-study distribution can be heavily skewed, it is worth exploring the performance of (a) SRMA, (b) BRMA that has a normal within-study distribution restricted within [0, 1], and (c) BRMA that has a Beta within-study distribution. Like ICCs, correlation coefficients also have theoretical bounds within [−1, 1]. Meta-analysis with correlation estimates often involves transforming the estimates to Fisher’s z scores or Hedges’ g, which asymptotically follow a normal distribution (e.g., Harrer et al., 2021; Du and Wang, 2016). For example, McShane and Bockenholt ( ¨ 2016) and Du and Wang (2016) have applied z-transformation on the correlation estimates and used methods with normality assumptions pooling the transformed estimates for power analysis. Similar transformations may also apply to ICCs to allow the use of methods that require the normality assumption. However, one of the challenges lies in the difficulty of estimating the standard errors of the transformed ICCs. Another point of consideration is how to use the aggregated, transformed overall ICC estimate for subsequent analysis. For correlation coefficients, power analysis can be performed directly on the transformed coefficients, without loss of generality (Du and Wang, 2016). However, some forms of back-transformation would be necessary for the transformed ICC estimates for subsequent power analysis. Applying transformations on ICC estimates for meta-analysis and power analysis requires further investigation. Although this chapter expands on the Bayesian procedure for aggregating ICC estimates across studies, it relies on the assumption of no publication bias in these estimates. However, ICC estimates are not consistently reported in the literature. For instance, in the metaanalysis conducted by Ahlen et al. (2015) on randomized and cluster-randomized trials, only four out of 30 studies included ICC estimates. The question of whether the absence of 90 reported ICC estimates introduces biases remains. To mitigate potential biases, a method similar to the PUB approach discussed in Chapter III could be developed. A future direction includes developing an approach that can derive the posterior distribution of ICC while accounting for potential biases arising from missing data. 4.9 Conclusion This chapter establishes the groundwork for the Bayesian power analysis procedure which incorporates different sources of uncertainty of the estimates of parameters, including the effect size and ICC. Although the emphasis in this chapter has been on the ICC, conducting power analysis for multilevel studies with a more complex structure requires the consideration of additional parameters. 
Nonetheless, information on these parameters may often be scarce in the literature. The next chapter will delve into the extension of the Bayesian power analysis procedure to accommodate more complex multilevel studies. 91 CHAPTER V EXTENSION TO MULTILEVEL STUDIES Power analysis for multilevel studies with continuous predictors requires the input of population parameter values, including the fixed effect coefficients and random effect components (Hox and Roberts, 2011; Snijders, 2005). Similar to designing single-level studies, these population values are typically unknown and often replaced by educated estimates. As discussed in earlier chapters, previous studies have highlighted the importance of considering uncertainty, as neglecting the uncertainty can lead to insufficient power in single-level studies (McShane and Bockenholt, ¨ 2016; Du and Wang, 2016; Anderson et al., 2017) and multilevel studies with a binary predictor (e.g., treatment groups; Tse and Lai, 2023; Tse and Lai, 2022). Similarly, for multilevel studies with continuous predictors, ignoring uncertainty in the fixed effect coefficients and random effect components is also likely to diminish power. While there has been limited research on this issue, further exploration is needed to understand how ignoring uncertainty in these parameters may affect power analysis for multilevel studies with continuous predictors. In addition to uncertainty, a key challenge in designing multilevel studies is the complexity of defining random effects components (e.g., cross-cluster random effects; Scherbaum and Pesner, 2018). For example, one aims to replicate the study of Fan and Bains (2008), who examined the effect of students’ prior mathematics achievement on their current mathematics achievement. This study has a two-level structure, with students nested within schools, students’ prior mathematics achievement as a level-1 continuous predictor, and students’ current achievement as the outcome. When the effect of prior mathematics achievement is believed to differ across schools, a two-level model with random slopes on this predictor can be used. Power analysis for this design requires the input of the between-school variance of the outcome, the within-school variance of the outcome, and the within-school variance of the predictor (Snijders, 2005). These parameter values are difficult to define as they have study-specific units and may often lack historical reference values. For designing two-level studies with a level-1 predictor, a recommended practice is to disag92 gregate the between- and within-level effects by cluster-mean centering (Enders and Tofighi, 2007; Raudenbush and Bryk, 2002). Building on the previous example, the impact of students’ prior mathematics achievement on their current performance can stem from individual or school-specific factors. Student performance may be influenced by their individual abilities and learning progression, as well as the educational programs offered by their school. By centering a level-1 predictor with cluster means (e.g., school averages of prior mathematics achievement), researchers can disentangle the within- and between-level effects, facilitating interpretations of the predictor-outcome relationship. However, existing closed-form formulas for designing two-level studies are specific to uncentered level-1 predictors (Snijders, 2005). Developing formulas for cluster-mean centered predictors would better inform sample size planning for two-level studies with a level-1 continuous predictor. 
In this chapter, I will achieve three goals: (a) reparameterizing existing formulas with more easily interpretable parameters for researchers to define, (b) deriving the formulas for cluster-mean-centered level-1 predictors, and (c) extending the Bayesian method for designing two-level studies with a continuous level-2 or level-1 predictor. In the following, I begin with a review of the model equations for three types of two-level designs. I will then reparameterize the standard error formulas by Snijders (2005) and extend these formulas for cluster-mean-centered level-1 predictors. In a series of three simulation studies, I evaluate the performance of the proposed Bayesian method for designing two-level studies.

5.1 Model Equations

This chapter focuses on three types of two-level models with (a) a continuous level-2 predictor, (b) a continuous level-1 predictor that has a constant effect across clusters, and (c) a continuous level-1 predictor that has a random effect across clusters.

5.1.1 Continuous Level-2 Predictor

The first model is a two-level model with a continuous predictor at level 2. For instance, a study with students nested within schools has school resources (e.g., teacher-to-student ratio) as a school-level variable predicting students' mathematics achievement. The equations of this two-level model are

Level 1: $Y_{ij} = \beta_{0j} + e_{ij}, \quad e_{ij} \sim N(0, \sigma^2)$
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} X_j + u_{0j}, \quad u_{0j} \sim N(0, \tau^2)$
Combined: $Y_{ij} = \gamma_{00} + \gamma_{01} X_j + u_{0j} + e_{ij}, \quad (5.1)$

where $Y_{ij}$ is the outcome variable, $X_j$ is a level-2 predictor, $\gamma_{00}$ is the grand mean of the outcome, $\gamma_{01}$ is the average effect of $X_j$, $u_{0j}$ is the cluster-specific random deviation from the grand mean with a variance of $\tau^2$, and $e_{ij}$ is the participant-specific random error with a variance of $\sigma^2$. The total variance of the outcome variable is $s_y^2 = \gamma_{01}^2 s_x^2 + \tau^2 + \sigma^2$, where $s_x^2$ is the total variance of $X$.

Power analysis for this two-level design focuses on the average effect of the level-2 predictor, $\gamma_{01}$, such as the effect of school-level resources on student-level achievement. The variance of $\gamma_{01}$ is (Snijders, 2005)

$\mathrm{Var}(\gamma_{01}) = \dfrac{n\tau^2 + \sigma^2}{J n s_x^2}, \quad (5.2)$

where $J$ is the number of clusters and $n$ is the cluster size, assumed constant across clusters. A challenge lies in defining the values of the design parameters, including $\tau^2$, $\sigma^2$, and $s_x^2$. The variances of the cluster-specific random effect ($\tau^2$) and the participant error ($\sigma^2$) depend on the scale of $Y$. When performing a power analysis for a new study, one may find it difficult to adopt estimates from a historical study that uses a different metric than the new study. As $\gamma_{01}$ depends on the scale of $Y$, it is common practice to standardize $\gamma_{01}$ by the total variance of $Y$ to facilitate interpretation. Standardizing $Y$ yields the standardized average effect of $X$ on $Y$ as $\Gamma_{01} = \gamma_{01} / s_y$. Furthermore, $X$ can also be standardized, allowing $\Gamma_{01}$ to represent the change in $Y$, expressed in standard deviation units, for each standard deviation increase in $X$. The standard error of the standardized effect of $X$ can be derived as

$SE(\Gamma_{01}) = \sqrt{\dfrac{(1 - \Gamma_{01}^2)\,[1 + (n - 1)\rho_y]}{J n}}, \quad (5.3)$

where $\rho_y = \tau^2 / (\tau^2 + \sigma^2)$ is the ICC of $Y$. The reparameterized formula expresses the standard error of the effect size in terms of (a) $\Gamma_{01}$, the standardized coefficient of $X$, and (b) $\rho_y$, the ICC of $Y$. Both $\Gamma_{01}$ and $\rho_y$ are unitless and generalizable from studies of similar designs, making their interpretations broadly applicable.
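To make the reparameterized formula concrete, a minimal R sketch of Equation 5.3 is shown below. The input values (e.g., $\Gamma_{01} = .30$, $\rho_y = .20$, 50 clusters of 20) are hypothetical and chosen only for illustration; the function is an illustration, not the BACpowr implementation.

```r
# Minimal sketch of Equation 5.3: standard error of the standardized
# level-2 fixed effect, given Gamma01, the ICC of the outcome, J, and n
se_gamma01 <- function(Gamma01, rho_y, J, n) {
  sqrt((1 - Gamma01^2) * (1 + (n - 1) * rho_y) / (J * n))
}

# Hypothetical design: 50 schools of 20 students, Gamma01 = .30, rho_y = .20
se_gamma01(Gamma01 = .30, rho_y = .20, J = 50, n = 20)
# A larger ICC inflates the standard error through the design effect 1 + (n - 1) * rho_y
se_gamma01(Gamma01 = .30, rho_y = .50, J = 50, n = 20)
```

Holding the total sample size $Jn$ constant, the standard error grows with $\rho_y$, which is one reason the number of clusters typically matters more than the cluster size for detecting a level-2 effect.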
The standardized fixed effects can be understood as the correlation between the predictor and the outcome, which ranges within [−1, 1]. On the other hand, the ICCs quantify the strength of association in the outcome among units within a cluster, with values ranging within [0, 1].

5.1.2 Continuous Level-1 Predictor With a Fixed Effect

The second model is a two-level model with a level-1 predictor that has a constant effect across clusters. For instance, a school study predicts students' mathematics achievement with their prior mathematics scores. Prior mathematics performance is a student-level predictor and is assumed to have the same effect across schools. With a level-1 continuous predictor, a recommended practice is to center this predictor at its cluster mean, $X^{cm}_{j}$, to disentangle the between- and within-cluster effects (Enders and Tofighi, 2007; Raudenbush and Bryk, 2002). The cluster-mean-centered predictor is denoted as $X^{cmc}_{ij}$. For a level-1 predictor with a fixed effect across clusters, the random intercept model is

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} X^{cmc}_{ij} + e_{ij}, \quad e_{ij} \sim N(0, \sigma_y^2)$
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} X^{cm}_{j} + u_{0j}, \quad u_{0j} \sim N(0, \tau_y^2)$
$\beta_{1j} = \gamma_{10}$
Combined: $Y_{ij} = \gamma_{00} + \gamma_{01} X^{cm}_{j} + \gamma_{10} X^{cmc}_{ij} + u_{0j} + e_{ij}, \quad (5.4)$

where $\gamma_{01}$ is the average level-2 effect of $X$, $\gamma_{10}$ is the level-1 effect of $X$ on the outcome $Y$, $\sigma_y^2$ is the within-cluster variance of $Y$, and $\tau_y^2$ is the between-cluster variance of $Y$. The total variance of $Y$ is $s_y^2 = \gamma_{01}^2 \tau_x^2 + \gamma_{10}^2 \sigma_x^2 + \tau_y^2 + \sigma_y^2$, where $\tau_x^2$ and $\sigma_x^2$ are the between- and within-cluster variances of $X$, respectively. The total variance of $X$ is $s_x^2 = \tau_x^2 + \sigma_x^2$. Because $X$ may have both between-cluster and within-cluster variability, the ICC of $X$ is $\rho_x = \tau_x^2 / (\tau_x^2 + \sigma_x^2)$.

The focus of this two-level model is often the effect of the cluster-mean-centered predictor on the outcome, for instance, how students' prior performance affects their current achievement. Extended from Snijders (2005), the formula for the variance of the coefficient of the cluster-mean-centered effect of $X$ is

$\mathrm{Var}(\gamma_{10}) = \dfrac{\sigma_y^2}{J n \sigma_x^2}. \quad (5.5)$

The variance is a function of the variances of the participant-specific random errors of $X$ and $Y$, which depend on the units of the predictor and the outcome, respectively. Similarly, $X$ and $Y$ can be made scale-free through standardization, yielding $s_y^2 = 1$ and $s_x^2 = 1$. With $Y$ standardized, the effect of the cluster-mean-centered predictor, $X^{cmc}_{ij}$, is $\Gamma_{10} = \gamma_{10} / s_y$, and the effect of the cluster means of the predictor, $X^{cm}_{j}$, is $\Gamma_{01} = \gamma_{01} / s_y$. With $X$ also standardized, $\tau_x^2 = \rho_x$ and $\sigma_x^2 = 1 - \rho_x$. Using some algebraic manipulation, the standard error formula of $\Gamma_{10}$ can be expressed as

$SE(\Gamma_{10}) = \sqrt{\dfrac{[1 - \Gamma_{01}^2 \rho_x - \Gamma_{10}^2 (1 - \rho_x)]\,(1 - \rho_y)}{J n (1 - \rho_x)}}. \quad (5.6)$

The reparameterized formula takes as input (a) the standardized coefficient of the average level-2 fixed effect, $\Gamma_{01}$, (b) the standardized coefficient of the level-1 fixed effect, $\Gamma_{10}$, (c) the ICC of the outcome, $\rho_y$, and (d) the ICC of the predictor, $\rho_x$. All these parameters are unit-free and therefore generalizable.

5.1.3 Continuous Level-1 Predictor With a Random Slope

The final model is a two-level model with a level-1 predictor that has varying effects across clusters. Using the previous example, the effect of students' prior mathematics scores on their current performance may vary across schools.
For a two-level study with a level-1 predictor that has random slopes across clusters, the random slope model is

Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j} X^{cmc}_{ij} + e_{ij}, \quad e_{ij} \sim N(0, \sigma^2)$
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01} X^{cm}_{j} + u_{0j}$
$\beta_{1j} = \gamma_{10} + u_{1j}$
$\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{bmatrix} \right)$
Combined: $Y_{ij} = \gamma_{00} + \gamma_{01} X^{cm}_{j} + \gamma_{10} X^{cmc}_{ij} + u_{0j} + u_{1j} X^{cmc}_{ij} + e_{ij}, \quad (5.7)$

where $u_{0j}$ is the cluster-specific random deviation from the grand mean with a variance of $\tau_0^2$, $u_{1j}$ is the cluster-specific random deviation from the average effect of the predictor $X$ with a variance of $\tau_1^2$, and the covariance between the two random effects is $\tau_{01}$. The total variance of $Y$ is $s_y^2 = \gamma_{01}^2 \tau_x^2 + (\gamma_{10}^2 + \tau_1^2)\sigma_x^2 + \tau_0^2 + \sigma^2$, where $\tau_x^2$ and $\sigma_x^2$ are the between- and within-cluster variances of $X$, respectively. The total variance of $X$ is $s_x^2 = \tau_x^2 + \sigma_x^2$, and the ICC of $X$ is $\rho_x = \tau_x^2 / (\tau_x^2 + \sigma_x^2)$. In this model, the heterogeneity of the effect of $X$ across clusters can be quantified with $\omega = \tau_1^2 / \tau_0^2$, a parameter commonly considered in the power analysis literature with a binary level-1 predictor (e.g., Dong and Maynard, 2013).

Similar to the case with a fixed slope, the focus in this design is also the cluster-mean-centered effect of the predictor at level 1, $X^{cmc}_{ij}$. Extended from Snijders (2005), the variance of the coefficient of the cluster-mean-centered effect is

$\mathrm{Var}(\gamma_{10}) = \dfrac{n \tau_1^2 \sigma_x^2 + \sigma^2}{J n \sigma_x^2}. \quad (5.8)$

In the same manner, $X$ and $Y$ can be standardized such that $s_y^2 = 1$ and $s_x^2 = 1$. Standardizing $Y$ yields standardized coefficients of $X^{cmc}_{ij}$, $\Gamma_{10} = \gamma_{10} / s_y$, and of $X^{cm}_{j}$, $\Gamma_{01} = \gamma_{01} / s_y$. Moreover, standardizing $X$ gives $\tau_x^2 = \rho_x$ and $\sigma_x^2 = 1 - \rho_x$. Altogether, the standard error of the standardized cluster-mean-centered effect is

$SE(\Gamma_{10}) = \sqrt{\dfrac{[1 - \Gamma_{01}^2 \rho_x - (\Gamma_{10}^2 + \tau_1^2)(1 - \rho_x)]\,\{1 + [n\omega(1 - \rho_x) - 1]\rho_y\}}{J n (1 - \rho_x)}}. \quad (5.9)$

Compared to the formula with a fixed slope, this reparameterized formula with a random slope takes an additional input, the effect heterogeneity $\omega$, which also does not depend on the scale of the predictor or the outcome and is generalizable across studies.

5.2 Bayesian Procedure for Power Analysis

In these models, power denotes the probability of detecting a non-zero effect of the predictor if the effect exists. Specifically, with a level-2 predictor, the null hypothesis is $H_0\colon \Gamma_{01} = 0$; with a level-1 predictor, the null hypothesis is $H_0\colon \Gamma_{10} = 0$. In mathematical notation, power is defined as $p(\mathrm{Reject}\ H_0 \mid \theta)$, where $\theta$ is the vector of the design parameters for a specific model (e.g., $\theta = [\Gamma_{01}\ \rho_y]$ for a model with a level-2 predictor). Coefficients of the fixed effect parameters can be tested with t tests (Snijders and Bosker, 2012). The power function for a two-sided t test is

$p(\mathrm{Reject}\ H_0 \mid \theta, J, n) = p(t \geq t_{1-\alpha/2}; \lambda, \nu) + p(t \leq t_{\alpha/2}; \lambda, \nu), \quad (5.10)$

where $\nu$ is the degrees of freedom, $\lambda$ is the noncentrality parameter, and $t_{1-\alpha/2}$ and $t_{\alpha/2}$ are the critical values at the $1 - \alpha/2$ and $\alpha/2$ quantiles, respectively. Table 1 summarizes the degrees of freedom and noncentrality parameters for each of the models.

Table 1: Degrees of Freedom and Noncentrality Parameters for Two-Level Models.

Level-2 predictor: $\nu = J - 2$; $\lambda = \Gamma_{01} \sqrt{\dfrac{J n}{(1 - \Gamma_{01}^2)[1 + (n - 1)\rho_y]}}$; required inputs: $\Gamma_{01}$, $\rho_y$.

Level-1 predictor (fixed slope): $\nu = J n - 3$; $\lambda = \Gamma_{10} \sqrt{\dfrac{J n (1 - \rho_x)}{[1 - \Gamma_{01}^2 \rho_x - \Gamma_{10}^2 (1 - \rho_x)](1 - \rho_y)}}$; required inputs: $\Gamma_{01}$, $\Gamma_{10}$, $\rho_x$, $\rho_y$.

Level-1 predictor (random slope): $\nu = J n - 3$; $\lambda = \Gamma_{10} \sqrt{\dfrac{J n (1 - \rho_x)}{[1 - \Gamma_{01}^2 \rho_x - (\Gamma_{10}^2 + \tau_1^2)(1 - \rho_x)]\{1 + [n\omega(1 - \rho_x) - 1]\rho_y\}}}$; required inputs: $\Gamma_{01}$, $\Gamma_{10}$, $\rho_x$, $\rho_y$, $\omega$.

Note. $\nu$ = degrees of freedom. $\lambda$ = noncentrality parameter.
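As an illustration of how the entries in Table 1 feed into Equation 5.10, the R sketch below evaluates the two-sided noncentral-t power function for the level-2 predictor design. The design values are hypothetical, and the function is a simplified illustration rather than the dissertation's software.

```r
# Minimal sketch of Equation 5.10 for the level-2 predictor design
# (Table 1: nu = J - 2; lambda = Gamma01 * sqrt(Jn / ((1 - Gamma01^2) * (1 + (n - 1) * rho_y))))
power_l2 <- function(Gamma01, rho_y, J, n, alpha = .05) {
  nu     <- J - 2
  lambda <- Gamma01 * sqrt(J * n / ((1 - Gamma01^2) * (1 + (n - 1) * rho_y)))
  crit   <- qt(1 - alpha / 2, df = nu)
  # Two-sided power: P(t >= t_{1 - alpha/2}) + P(t <= t_{alpha/2}) under noncentrality lambda
  pt(crit, df = nu, ncp = lambda, lower.tail = FALSE) + pt(-crit, df = nu, ncp = lambda)
}

# Hypothetical design: 40 clusters of 20, Gamma01 = .30, ICC of the outcome = .20
power_l2(Gamma01 = .30, rho_y = .20, J = 40, n = 20)
```

The same template extends to the level-1 designs by substituting the corresponding $\nu$ and $\lambda$ from Table 1.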
5.2.1 Input Parameter Distributions

The Bayesian procedure utilizes the distributions of the input parameters, rather than selected point values, for power analysis. An input parameter distribution is a synthesis of historical information and prior beliefs. Within the Bayesian framework, researchers can incorporate historical information in the likelihood function and represent their prior beliefs in the prior distribution. The product of the likelihood function and the prior distribution yields the posterior distribution, which is the input parameter distribution.

As discussed in Chapter II, one approach involves directly specifying an input parameter distribution that approximates the true distribution of the parameter. For the standardized mean difference, for example, research has recommended approximating its distribution with a normal distribution (Pek and Park, 2019; Gillett, 1994; Du and Wang, 2016). For the two-level models, the input parameters can be classified into three types: (a) standardized fixed effect coefficients, (b) ICCs, and (c) effect heterogeneity. The next step is to determine appropriate distributions for the input parameters in the above-mentioned two-level models.

Like other standardized regression coefficients, the standardized fixed effect coefficients behave similarly to a correlation coefficient and have a range of [−1, 1]. Given the restricted range, Park and Pek (2022) have suggested employing a truncated normal distribution, a Beta distribution, or a uniform distribution for correlation and regression coefficients. Choosing a uniform prior over a range [a, b] indicates that one believes any value between a and b is equally likely to be the true standardized fixed effect. For example, one may select a uniform distribution over [0.2, 0.6] if one believes the standardized coefficient could plausibly take any value within this range. However, employing a noninformative uniform distribution that spans the theoretical range of [−1, 1] is strongly discouraged because it is an "unrealistic representation" of sample size planning (Pek and Park, 2019, p. 597). A Beta distribution can be considered to bound the standardized coefficient within [0, 1]. However, caution is advised in selecting a Beta distribution that has a mean near the boundaries and a large variance, as this may lead to a heavily skewed distribution that does not follow the typical distribution of a standardized coefficient. If historical estimates are available for reference, I recommend adopting a truncated normal distribution with a restricted range of [−1, 1] for the standardized fixed effect coefficients.

To visualize the shapes of the sampling distributions, I simulated 2,500 two-level data sets based on Model 5.7 and estimated the fixed effect coefficients with restricted maximum likelihood estimation using the lme4 R package (Bates et al., 2015). I fixed the number of clusters to 20 and manipulated the cluster size to be small or medium (5 or 20) and the true standardized fixed effect to be 0.2, 0.5, or 0.8. Figure 1 shows the resulting sampling distributions of the standardized fixed effects, overlaid with normal distributions (blue, thin lines) and Beta distributions (red, thick lines) whose means and standard deviations equal the sample means and standard deviations. The sampling distributions of the standardized fixed effects are approximately normal and overlap well with the normal distributions.
On the other hand, when the true standardized fixed effects (0.2 and 0.8) were near the boundaries of [0, 1], the Beta distributions were skewed and did not approximate the sampling distributions of the standardized fixed effects. Therefore, given sample estimates from historical studies, the recommended option for the input parameter distribution of a standardized fixed effect is a truncated normal distribution.

Figure 1: Sampling Distributions of Standardized Fixed Effect Coefficient.
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true standardized fixed effect coefficient. $\hat\Gamma_{10}$ = estimated standardized fixed effect coefficient.

Since the ICC is bounded within [0, 1] and typically follows a skewed distribution (Hedberg and Hedges, 2014; Bhat and Beretvas, 2022), previous studies have suggested employing a Beta distribution (Singh and Mukhopadhyay, 2016; Spiegelhalter, 2001; Wilson, 2023). An alternative approach uses a normal distribution with a restricted range above zero (Moerbeek and Teerenstra, 2015). As discussed in Section 4.4.3, the presence of singularity issues in maximum likelihood (ML) estimation can cause the sampling distribution of the ICC to more closely resemble a normal distribution with some degree of zero inflation. However, in the absence of the singularity issue, and theoretically, Beta distributions align with the sampling distributions of the ICC more closely than a normal distribution. In line with the typical recommendation (Singh and Mukhopadhyay, 2016; Spiegelhalter, 2001; Wilson, 2023), I will adopt a Beta distribution to construct the ICC distributions for power analysis.

In the meta-analysis literature, Biggerstaff and Tweedie (1997) have suggested using a Gamma distribution to approximate the distribution of the effect heterogeneity. To explore the use of Gamma distributions for effect heterogeneity, I simulated 2,500 two-level data sets based on Model 5.7 and estimated the effect heterogeneity, $\hat\omega = \hat\tau_1^2 / \hat\tau_0^2$, using lme4. Similarly, I fixed the number of clusters to 20 and manipulated the cluster size to be 5 or 20 and the true effect heterogeneity to be 0.2, 0.5, or 0.8. Figure 2 shows the sampling distributions overlaid with Gamma distributions whose modes and variances equal the sample modes and variances. Note that in the small cluster size conditions, the effect heterogeneity approached infinity in a few iterations as $\hat\tau_0^2$ approached zero due to the singularity issue. In these conditions, I trimmed 1% of the right tail before computing the sample variance. As shown in Figure 2, Gamma distributions reasonably approximate the sampling distribution of effect heterogeneity.

Figure 2: Sampling Distributions of Effect Heterogeneity.
Note: $J$ = number of clusters. $n$ = cluster size. $\omega$ = true effect heterogeneity. $\hat\omega$ = estimated effect heterogeneity.
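As a concrete illustration of these recommendations, the R sketch below constructs the three types of input parameter distributions from hypothetical historical estimates and standard errors: a truncated normal for a standardized fixed effect, a Beta for an ICC, and a Gamma for the effect heterogeneity. For simplicity it matches means and variances by the method of moments, whereas this chapter matches modes and variances for the Beta and Gamma distributions; all numeric values are made up for illustration, and the code is not the BACpowr implementation.

```r
# Truncated normal draws for a standardized fixed effect via the inverse-CDF method
rtrunc_norm <- function(n_draw, mean, sd, lower = -1, upper = 1) {
  p <- runif(n_draw, pnorm(lower, mean, sd), pnorm(upper, mean, sd))
  qnorm(p, mean, sd)
}

# Beta shape parameters matching a mean and standard deviation (method of moments)
beta_from_moments <- function(m, s) {
  k <- m * (1 - m) / s^2 - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

# Gamma shape and rate matching a mean and standard deviation (method of moments)
gamma_from_moments <- function(m, s) {
  c(shape = m^2 / s^2, rate = m / s^2)
}

set.seed(1)
gamma10 <- rtrunc_norm(10000, mean = 0.30, sd = 0.10)   # e.g., estimate .30, SE .10
bp      <- beta_from_moments(0.20, 0.06)                # e.g., ICC estimate .20, SE .06
rho_y   <- rbeta(10000, bp["shape1"], bp["shape2"])
gp      <- gamma_from_moments(0.30, 0.15)               # e.g., heterogeneity .30, SE .15
omega   <- rgamma(10000, shape = gp["shape"], rate = gp["rate"])
```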
For example, for a two-level study with a level-1 predictor that has random effects, $p(\theta) = p(\Gamma_{01})\,p(\Gamma_{10})\,p(\rho_y)\,p(\rho_x)\,p(\omega)$. The general formula of assurance is

$A(\mathrm{Power}) = \displaystyle\int_{\Theta_L} p(\theta)\, d\theta, \quad (5.12)$

where $\Theta_L$ is the parameter space of $\theta$ with power values at or above the intended level of power $L$, that is, $p(\mathrm{Reject}\ H_0 \mid \theta, J, n) \geq L$. In multilevel studies, the concepts of mean power and assurance maintain the same interpretations as they do in single-level studies. Mean power refers to the average probability of correctly detecting a non-zero fixed effect coefficient of interest (i.e., $\Gamma_{01}$ for a level-2 predictor and $\Gamma_{10}$ for a level-1 predictor). Assurance, on the other hand, represents the probability of achieving at least the intended level of power in the study.

5.3 Simulation Study

I performed three simulation studies to validate the proposed Bayesian power analysis approach for the three two-level designs with (a) a level-2 continuous predictor, (b) a level-1 predictor that has a fixed effect, and (c) a level-1 predictor that has a random effect. These simulation studies aimed to assess the ability of the proposed approach to attain the intended sample size planning goal, including .80 power and 80% assurance of achieving .80 power. I will also explore the effect of ignoring uncertainty in the input parameters on the sample size planning outcomes. In all three simulation studies, I generated 2,500 two-level data sets and analyzed the data using restricted maximum likelihood estimation with the lme4 package (Bates et al., 2015).

5.3.1 Design of Study 1

Per iteration, I generated historical two-level data with a level-2 continuous predictor based on Model 5.1. I fixed the grand intercept at zero and the number of clusters at 50. Three factors were manipulated: (a) the true level-2 fixed effect coefficient ($\Gamma_{01}$) to be .2, .5, or .8, (b) the true ICC of the outcome ($\rho_y$) to be .2, .5, or .8, and (c) the cluster size ($n$) to be small or large (5 or 50), yielding a 3 × 3 × 2 factorial design. I analyzed each two-level data set with Model 5.1 and obtained the estimates of the level-2 fixed effect ($\hat\Gamma_{01}$) and the ICC of the outcome ($\hat\rho_y$), along with their respective standard error estimates. The standard error estimates of the fixed effect were obtained from lme4, and the standard error estimates of the ICC were computed using Equation 4.7. (Note that Formula 4.7 was developed for ICC estimates in unconditional models without any predictors, also known as null models. As in Hedges et al. (2012) and Hedges and Hedberg (2013), this formula is used to approximate the standard error of ICC estimates in models conditional on predictors.) These estimates served as the reference values for planning a new study.

In the next step, I employed the conventional method and the proposed Bayesian method to determine the number of clusters for a new study with a cluster size of $n$. Using the conventional method, I utilized the parameter estimates from a simulated historical study to calculate power with Formula 5.10 and determine the number of clusters required to achieve .80 power. The Bayesian approach, on the other hand, involved constructing truncated normal distributions with means and standard deviations equal to $\hat\Gamma_{01}$ and $SE(\hat\Gamma_{01})$ for the fixed effect coefficient, and Beta distributions with modes and variances equal to $\hat\rho_y$ and $SE(\hat\rho_y)^2$. With these distributions as inputs, I calculated the mean power and assurance based on Equations 5.11 and 5.12, respectively, using numerical integration with the cubature R package (Narasimhan et al., 2023); a Monte Carlo sketch of this calculation is shown below. The number of clusters was then determined based on the goal of achieving .80 mean power or 80% assurance. Next, I computed the actual power for each number of clusters suggested by the two methods.
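The integrals in Equations 5.11 and 5.12 can also be approximated by Monte Carlo rather than numerical integration: draw parameter values from their input distributions, evaluate the power function for each draw, and average. The sketch below does this for the level-2 predictor design, reusing the hypothetical power_l2(), rtrunc_norm(), and beta_from_moments() helpers from the earlier sketches; the input values remain made up for illustration.

```r
# Monte Carlo approximation of mean power (Eq. 5.11) and assurance (Eq. 5.12)
set.seed(2024)
R       <- 10000
gamma01 <- rtrunc_norm(R, mean = 0.30, sd = 0.10)      # input distribution of Gamma01
bp      <- beta_from_moments(0.20, 0.06)               # input distribution of rho_y
rho_y   <- rbeta(R, bp["shape1"], bp["shape2"])

power_draws <- power_l2(Gamma01 = gamma01, rho_y = rho_y, J = 40, n = 20)

mean(power_draws)          # mean power: average power over the input distributions
mean(power_draws >= .80)   # assurance: probability of attaining at least .80 power
```

Increasing J until the mean power or assurance reaches its target mirrors, in spirit, the sample size search described for the simulation studies.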
The actual mean power is defined as the average power across the 2,500 iterations. Similarly, the actual assurance denotes the proportion of iterations in which the actual power of the suggested design was at or above .80.

5.3.2 Design of Study 2

In Study 2, I simulated historical two-level data with a level-1 predictor that has a fixed effect across clusters based on Model 5.4. Similar to Study 1, the grand intercept was fixed at zero, the number of clusters was fixed at 50, and, additionally, the true level-2 fixed effect ($\Gamma_{01}$) was set at 0.2. I manipulated four design factors: (a) the true level-1 fixed effect ($\Gamma_{10}$) to be 0.2, 0.5, or 0.8, (b) the true ICC of the outcome ($\rho_y$) to be .2 or .5, (c) the true ICC of the predictor ($\rho_x$) to be .2 or .5, and (d) the cluster size ($n$) to be small or large (5 or 50), resulting in a 3 × 2 × 2 × 2 factorial design.

The data analysis procedure was the same as in Study 1, with the following distinctions. In addition to $\hat\Gamma_{01}$ and $\hat\rho_y$, I also obtained the level-1 fixed effect estimate ($\hat\Gamma_{10}$) and the ICC estimate of the predictor ($\hat\rho_x$) based on Model 5.4, along with their standard error estimates. Similarly, the standard error of $\hat\Gamma_{10}$ was obtained from lme4, and the standard error of $\hat\rho_x$ was calculated using Equation 4.7. I specified truncated normal distributions for the two fixed effect coefficients and Beta distributions for the two ICC estimates. These input parameter distributions were then utilized in the Bayesian method to determine the number of clusters. Due to the inefficiency of multidimensional numerical integration involving more than three parameters, I instead drew 10,000 parameter values from their respective distributions, calculated power for each set of drawn values, and computed mean power and assurance across the 10,000 power values. The evaluation criteria (mean power and assurance) were the same as in Study 1.

5.3.3 Design of Study 3

The design of Study 3 was highly similar to that of Study 2, except that it involved the effect heterogeneity. Data were simulated according to Model 5.7. In this study, I additionally manipulated the true effect heterogeneity ($\omega$) to be 0.2 or 0.5, leading to a 3 × 2 × 2 × 2 × 2 factorial design. In a similar fashion, the simulated historical data were analyzed, and the parameters ($\hat\Gamma_{01}$, $\hat\Gamma_{10}$, $\hat\rho_y$, $\hat\rho_x$, $\hat\omega$) were estimated with Model 5.7. The standard error of the effect heterogeneity was estimated using the Delta method. I constructed the effect heterogeneity distributions with Gamma distributions whose modes and variances matched $\hat\omega$ and its estimated sampling variance. The procedures of power analysis and the evaluation criteria remained the same as in Study 2.

5.3.4 Results

Tables 2-5 summarize the mean estimates and empirical standard errors of the estimates for each parameter across the 2,500 iterations in Studies 1-3. The mean estimates of $\Gamma_{01}$, $\Gamma_{10}$, and $\rho_y$ were unbiased across conditions. However, the mean estimates of $\omega$ demonstrated a positive bias when the cluster size was small (n = 5) and were unbiased when the cluster size was large (n = 50). Figures 3-5 present the actual mean power for the two methods in the respective simulation studies.
Ignoring uncertainty in the input parameters, the conventional method resulted in a mean power lower than the intended level of .80 across all conditions in the three studies. This outcome suggests that the suggested design by the conventional method had a lower than .80 average probability of detecting the true fixed effect. Contrarily, aiming at .80 mean power, the proposed Bayesian method consistently achieved a mean power of close to or at least .80 across conditions in all three studies. 106 Table 2: Mean Estimates and Empirical Standard Errors of Study 1. Γˆ 01 ρˆy J n Γ01 ρy Mean SE Mean SE .2 0.21 0.08 .20 .06 .5 0.21 0.11 .49 .07 0.2 .8 0.21 0.13 .79 .04 .2 0.50 0.07 .20 .06 .5 0.51 0.10 .49 .07 0.5 .8 0.51 0.11 .79 .04 .2 0.80 0.05 .20 .06 .5 0.80 0.07 .49 .07 5 0.8 .8 0.80 0.08 .79 .04 .2 0.20 0.06 .20 .03 .5 0.20 0.10 .49 .05 0.2 .8 0.20 0.12 .79 .03 .2 0.50 0.06 .20 .03 .5 0.50 0.09 .49 .05 0.5 .8 0.50 0.11 .79 .03 .2 0.8 0.04 .20 .03 .5 0.8 0.06 .49 .05 50 50 0.8 .8 0.80 0.07 .79 .03 Note. J = number of clusters. n = cluster size. Γ01 = true fixed effect. ρy = true ICC of the outcome. Γˆ 01 = estimated fixed effect. ρˆy = estimated ICC of the outcome. 107 Table 3: Mean Estimates and Empirical Standard Errors of Study 2. Γˆ 01 Γˆ 10 ρˆy J n Γ01 Γ10 ρy ρx Mean SE Mean SE Mean SE .2 0.20 0.19 0.20 0.06 .20 .07 .2 .5 0.20 0.12 0.20 0.08 .20 .07 0.2 .2 0.20 0.24 0.2 0.05 .49 .07 .5 .5 0.20 0.15 0.20 0.06 .50 .07 .2 0.20 0.17 0.5 0.06 .20 .06 .2 .5 0.20 0.11 0.50 0.07 .20 .06 0.5 .2 0.20 0.21 0.50 0.05 .49 .07 .5 .5 0.20 0.14 0.50 0.06 .49 .07 .2 0.20 0.13 0.80 0.04 .20 .07 .2 .5 0.20 0.10 0.80 0.07 .20 .06 .2 0.19 0.17 0.80 0.03 .49 .07 5 0.8 .5 .5 0.20 0.13 0.80 0.05 .49 .07 .2 0.20 0.14 0.20 0.02 .20 .04 .2 .5 0.20 0.09 0.20 0.03 .20 .04 0.2 .2 0.21 0.22 0.20 0.02 .50 .05 .5 .5 0.20 0.14 0.20 0.02 .50 .05 .2 0.20 0.13 0.50 0.02 .20 .03 .2 .5 0.20 0.09 0.50 0.02 .20 .04 0.5 .2 0.21 0.21 0.50 0.01 .50 .05 .5 .5 0.20 0.13 0.50 0.02 .50 .05 .2 0.20 0.10 0.80 0.01 .20 .04 .2 .5 0.20 0.08 0.80 0.02 .20 .04 .2 0.20 0.16 0.80 0.01 .49 .05 50 50 0.2 0.8 .5 .5 0.20 0.11 0.80 0.02 .49 .05 Note. J = number of clusters. n = cluster size. Γ01 = true fixed effect of the cluster means of the predictor. Γ10 = true fixed effect of the cluster-mean-centered predictor. ρy = true ICC of the outcome. ρx = true ICC of the predictor. Γˆ 01 = estimated fixed effect of the cluster means of the predictor. Γˆ 01 = estimated fixed effect of the cluster-mean-centered predictor. ρˆy = estimated ICC of the outcome. 108 Table 4: Mean Estimates and Empirical Standard Errors of Study 3. 
Γˆ 01 Γˆ 10 ρˆy ωˆ J n Γ01 Γ10 ρy ρx ω Mean SE Mean SE Mean SE Mean SE .2 0.2 0.18 0.20 0.07 .20 .07 .34 2.16 .2 .5 0.20 0.18 0.20 0.07 .20 .07 .62 .69 .2 .2 0.20 0.12 0.20 0.08 .20 .07 .34 .43 .5 .5 0.20 0.11 0.20 0.09 .20 .07 .77 5.24 .2 0.20 0.24 0.20 0.07 .49 .07 .21 .12 .2 .5 0.21 0.23 0.20 0.08 .49 .07 .53 .22 .2 0.20 0.15 0.20 0.07 .49 .07 .22 .15 0.2 .5 .5 .5 0.20 0.14 0.20 0.09 .49 .07 .53 .24 .2 0.20 0.17 0.50 0.06 .20 .07 .29 .49 .2 .5 0.20 0.17 0.50 0.07 .20 .07 .76 6.18 .2 .2 0.20 0.11 0.50 0.08 .20 .07 .35 .41 .5 .5 0.20 0.11 0.50 0.08 .20 .07 .63 .70 .2 0.21 0.22 0.50 0.06 .49 .07 .21 .12 .2 .5 0.20 0.21 0.50 0.07 .50 .07 .53 .22 .2 0.20 0.14 0.50 0.07 .49 .07 .21 .15 0.5 .5 .5 .5 0.20 0.13 0.50 0.08 .49 .07 .54 .25 .2 0.20 0.13 0.80 0.05 .20 .07 .29 .55 .2 .5 0.20 0.13 0.80 0.05 .20 .07 .62 1.06 .2 .2 0.20 0.10 0.80 0.07 .20 .07 .35 .42 .5 .5 0.20 0.10 0.80 0.07 .20 .07 .71 2.15 .2 0.20 0.17 0.80 0.05 .49 .07 .22 .12 .2 .5 0.20 0.16 0.80 0.05 .49 .07 .54 .23 .2 0.20 0.12 0.80 0.06 .49 .07 .22 .15 50 5 0.2 0.8 .5 .5 .5 0.20 0.12 0.80 0.07 .49 .07 .53 .25 Note. J = number of clusters. n = cluster size. Γ01 = true fixed effect of the cluster means of the predictor. Γ10 = true fixed effect of the cluster-mean-centered predictor. ρy = true ICC of the outcome. ρx = true ICC of the predictor. ω = true effect heterogeneity. $Γˆ 01 = estimated fixed effect of the cluster means of the predictor. Γˆ 01 = estimated fixed effect of the cluster-mean-centered predictor. ρˆy = estimated ICC of the outcome. ωˆ = estimated effect heterogeneity. 109 Table 5: Mean Estimates and Empirical Standard Errors of Study 3 (cont.). Γˆ 01 Γˆ 10 ρˆy ωˆ J n Γ01 Γ10 ρy ρx ω Mean SE Mean SE Mean SE Mean SE .2 0.20 0.14 0.20 0.03 .20 .04 .21 .08 .2 .5 0.20 0.14 0.20 0.05 .20 .04 .52 .18 .2 .2 0.20 0.09 0.20 0.04 .20 .04 .21 .09 .5 .5 0.20 0.09 0.20 0.05 .20 .04 .53 .19 .2 0.20 0.22 0.20 0.04 .49 .05 .21 .07 .2 .5 0.19 0.21 0.20 0.06 .49 .05 .53 .16 .2 0.20 0.14 0.20 0.05 .49 .05 .21 .07 0.2 .5 .5 .5 0.20 0.13 0.20 0.07 .50 .05 .52 .17 .2 0.20 0.13 0.50 0.03 .20 .04 .21 .08 .2 .5 0.20 0.13 0.50 0.04 .20 .04 .52 .18 .2 .2 0.20 0.08 0.50 0.03 .20 .04 .21 .09 .5 .5 0.20 0.09 0.50 0.04 .20 .04 .52 .18 .2 0.21 0.19 0.50 0.04 .49 .05 .21 .07 .2 .5 0.20 0.19 0.50 0.06 .50 .05 .52 .16 .2 0.20 0.13 0.50 0.04 .49 .05 .21 .07 0.5 .5 .5 .5 0.20 0.12 0.50 0.07 .49 .05 .52 .16 .2 0.20 0.10 0.80 0.02 .20 .04 .21 .08 .2 .5 0.20 0.10 0.80 0.03 .20 .04 .52 .17 .2 .2 0.20 0.08 0.80 0.03 .20 .04 .21 .10 .5 .5 0.20 0.07 0.80 0.04 .20 .04 .53 .19 .2 0.20 0.15 0.80 0.03 .49 .05 .21 .07 .2 .5 0.20 0.14 0.80 0.05 .50 .05 .53 .16 .2 0.20 0.12 0.80 0.04 .50 .05 .21 .07 50 50 0.2 0.8 .5 .5 .5 0.20 0.11 0.80 0.06 .50 .05 .53 .16 Note. J = number of clusters. n = cluster size. Γ01 = true fixed effect of the cluster means of the predictor. Γ10 = true fixed effect of the cluster-mean-centered predictor. ρy = true ICC of the outcome. ρx = true ICC of the predictor. ω = true effect heterogeneity. $Γˆ 01 = estimated fixed effect of the cluster means of the predictor. Γˆ 01 = estimated fixed effect of the cluster-mean-centered predictor. ρˆy = estimated ICC of the outcome. ωˆ = estimated effect heterogeneity. 110 Figure 3: Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 1. Note: J = number of clusters. n = cluster size. Γ10 = true fixed effect of the predictor. ρy = true ICC of the outcome. 111 Figure 4: Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 2. 
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true fixed effect of the predictor. $\rho_y$ = true ICC of the outcome. $\rho_x$ = true ICC of the predictor.

Figure 5: Mean Power of the Conventional Method and the Bayesian Method That Aims at .80 Power in Study 3.
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true fixed effect of the predictor. $\rho_y$ = true ICC of the outcome. $\rho_x$ = true ICC of the predictor. $\omega$ = true effect heterogeneity.

The actual assurance for the two methods is summarized in Figures 6-8 for the three simulation studies, respectively. Consistent with the literature (e.g., Anderson et al., 2017), the conventional method, without addressing uncertainty, resulted in an assurance of about 50% across all conditions in the three studies. In other words, the conventional method suggested a design with .80 power about 50% of the time. On the other hand, the proposed Bayesian method with a goal of 80% assurance indeed achieved this goal across conditions in all simulation studies. These results support the validity of the proposed method in achieving the intended sample size planning goal.

Figure 6: Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 1.
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true fixed effect of the predictor. $\rho_y$ = true ICC of the outcome.

Figure 7: Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 2.
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true fixed effect of the predictor. $\rho_y$ = true ICC of the outcome. $\rho_x$ = true ICC of the predictor.

Figure 8: Assurance of the Conventional Method and the Bayesian Method That Aims at 80% Assurance in Study 3.
Note: $J$ = number of clusters. $n$ = cluster size. $\Gamma_{10}$ = true fixed effect of the predictor. $\rho_y$ = true ICC of the outcome. $\rho_x$ = true ICC of the predictor. $\omega$ = true effect heterogeneity.

5.4 Discussion

Power analysis for designing multilevel studies has been particularly challenging due to the complexity of the analysis and the difficulty of selecting input parameter values (Scherbaum and Pesner, 2018; Luo et al., 2021). Common approaches for multilevel power analysis include simulation-based approaches, which, however, involve a steep learning curve (Lane and Hennes, 2018). While formula-based approaches are available for some multilevel designs, selecting input parameter values is often difficult (Scherbaum and Pesner, 2018). In particular, the variance components at the between and within levels are unintuitive to define. To address this difficulty, this chapter introduces reparameterized formulas aimed at simplifying power analysis for three two-level designs. These formulas require input parameters that are easier to interpret and generalizable from historical studies. There is also a relatively large body of research on these parameters, such as large-scale studies or meta-analyses of ICC estimates (e.g., Hedges and Hedberg, 2013; Kivlighan et al., 2020; Glassman et al., 2015; Dong et al., 2016).

Another challenge with multilevel power analysis is the uncertainty in the input parameters. For example, numerous studies have raised concerns about the need to estimate the input parameters a priori (e.g., Snijders, 2001; Scherbaum and Pesner, 2018; Konstantopoulos, 2010). The accuracy of the power analysis also depends on the "quality of the estimates" of the input parameters (Scherbaum and Pesner, 2018, p. 331).
As shown in the present simulation results, ignoring uncertainty in the input parameters led to an average probability of detecting a fixed effect lower than the intended level across replications. Given sampling variability, the historical estimates can often be poor estimates of the true values. Taking these estimates as the best guesses can therefore result in a lower average probability of detecting an existing effect. To incorporate uncertainty in the input parameters, I extended the Bayesian procedure for power analysis to three multilevel designs. I explored options to approximate the distributions of the input parameters. The general recommendations are to employ a truncated normal distribution for a standardized fixed effect coefficient, a Beta distribution for an ICC, and a Gamma distribution for an effect heterogeneity. Although in this chapter I focused on directly constructing the input parameter distributions, the framework discussed in Chapter II also applies. Specifically, researchers can incorporate historical information in a likelihood function and prior beliefs about a parameter in a prior distribution. The product of the likelihood function and the prior distribution results in the posterior distribution, which serves as the input distribution for power analysis. The recommended distributions can serve as a guideline for choosing the likelihood function and prior distribution. The present simulation studies provided evidence of the validity of the proposed method in determining sample size for the three two-level designs.

5.4.1 Limitations and Future Directions

In the simulation study, I utilized the Delta method to compute the standard errors of the effect heterogeneity estimates based on the raw data. However, researchers often rely on reported estimates and statistics in historical studies, where raw data are inaccessible. Because the standard errors of the effect heterogeneity are rarely reported, further investigation is needed into how best to estimate the sampling variability of an effect heterogeneity estimate given the available statistics. Alternatively, the empirical standard error estimates of effect heterogeneity in Tables 4 and 5 can serve as a reference for determining the amount of uncertainty for an effect heterogeneity estimate.

While this chapter covered three common two-level designs with a continuous level-1 or level-2 predictor, multilevel analyses often include covariates at either or both levels. In the cluster-randomized trial and multisite randomized trial literature (i.e., with a binary level-1 or level-2 predictor), common input parameters include the covariate $R^2$, which quantifies the variance explained by level-1 or level-2 covariates. Incorporating a covariate can increase the power to detect a fixed effect and thus reduce the sample size requirement for a multilevel design (Raudenbush et al., 2007). Therefore, a future direction is to extend the standard error formulas presented in this chapter to also include the covariate $R^2$.

One limitation of formula-based approaches is their lack of flexibility in accommodating complex multilevel models. Snijders (2005) noted that there are no clear formulas for more general models that involve multiple correlated predictors, including some with fixed or random effects. For models with binary predictors (e.g., cluster randomized trials), recent advancements in closed-form formulas have included power analysis for cross-level moderation (Spybrook et al., 2016) and cross-level mediation (Kelcey et al., 2021).
Extending these closed-form formulas to continuous predictors could greatly benefit research in the behavioral and psychological sciences. In cases where closed-form formulas are unavailable, simulation-based approaches provide flexibility for power analysis with more complex designs. Resources such as tutorials (e.g., Arend and Schäfer, 2019) and software tools (e.g., mlmpower by Enders et al., 2023) for simulation-based approaches have become increasingly accessible. To address uncertainty, the proposed Bayesian procedure is also adaptable to these simulation-based approaches. In essence, the procedure entails using parameter distributions instead of fixed values for the simulations. Rather than assigning fixed values to the parameters, each iteration involves drawing a parameter value from the distribution, while the subsequent steps follow the original simulation approaches. By drawing variable parameter values from their respective distributions, the resulting power distribution and sample size planning outcomes account for the uncertainty.

CHAPTER VI
CONCLUSION

Scientific research involves accumulating knowledge and historical findings, which continuously update and shape our current understanding of phenomena. The proposed Bayesian procedure exercises this iterative process of updating current beliefs with historical information. With this procedure, researchers can integrate prior beliefs and historical findings into probability distributions for the input parameters of power analysis. Compared with a single guessed value, a probability distribution draws on accumulated knowledge to provide a richer understanding of the input parameters. By leveraging these probability distributions, researchers can make informed decisions about sample size that appropriately address uncertainty and bias.

The Bayesian procedure mirrors the conventional procedure for power analysis, with a key difference in the use of probability distributions. To conclude, Figure 1 is a flow chart of the Bayesian procedure. In line with the conventional procedure, Steps 1-2 involve the researchers' preparation work, including the collection of historical information. There are three common situations in power analysis, in which researchers have (a) historical information for reference and no strong beliefs about the true parameter value, (b) no historical information but some beliefs about the true parameter value, or (c) a combination of information and beliefs. In Step 3, the conventional procedure typically requires researchers to select a best-guess value to represent all sources of information. By contrast, the Bayesian procedure offers flexibility in combining information and beliefs across these situations. Given researchers' inputs of available information and beliefs, the R package BACpowr guides and handles the construction of the parameter distributions, as well as the determination of sample size according to the selected power analysis goal.

Figure 1: Bayesian Procedure for Classical Power Analysis

This dissertation offers three advancements in the Bayesian procedure for classical power analysis:
1. Addressing publication bias and uncertainty in effect size estimates;
2. Constructing intraclass correlation (ICC) distributions for multilevel power analysis based on meta-analytical results; and
3. Designing common multilevel designs with a continuous predictor, while addressing uncertainty.
In Chapter III, I proposed a publication-bias-and-uncertainty-adjusted Bayesian (PUB) method to account for both publication bias and uncertainty in power analysis. While the Bayesian method effectively addresses uncertainty, it initially assumed the absence of publication bias in the historical information collected in Step 2. Considering the known presence of publication bias in the literature, the PUB method adjusts the probability distribution to reflect the truncation of effect size values due to statistical nonsignificance. The results of the simulation study demonstrate that the PUB method adequately maintains the mean power and assurance at the intended level, providing support for its efficacy in handling publication bias.

In Step 3, the Bayesian procedure with random-effects meta-analysis (RMA) allows the incorporation of different sources of uncertainty into a probability distribution of effect size, leveraging information from multiple studies. While the focus of RMA has been on effect size (e.g., Du and Wang, 2016; Kontopantelis and Reeves, 2012a), in Chapter IV I extended this procedure to meta-analyze ICC estimates with models that accommodate the distinct features of the ICC. Specifically, I proposed Bayesian random-effects meta-analysis (BRMA) models that address the skewness of the ICC distribution and the theoretical limits of the ICC. In a simulation study, I found that both BRMA and standard random-effects meta-analysis (SRMA) performed well in pooling ICC estimates, despite the challenges posed by the distinct properties of the ICC. The outcome of BRMA yields the ICC distribution that can directly serve as the input for power analysis. Moreover, considering that SRMA may at times be preferable for meta-analyzing ICC estimates, I offered and discussed an alternative option for constructing the ICC distribution for power analysis based on the estimates from SRMA.

A priori power analyses for multilevel studies have rarely been performed due to the complexity of the analysis and the difficulty of selecting input parameter values (Scherbaum and Pesner, 2018). To address this challenge, I have redefined the standard error formulas for three common multilevel designs in terms of more interpretable parameters, which have been well studied in the literature (e.g., Hedges and Hedberg, 2013; Dong et al., 2016). Another challenge lies in the uncertainty in each of the parameters. As shown in the simulation study, neglecting uncertainty in the parameters could lead to reduced average power and assurance. To account for the uncertainty, I have expanded the Bayesian procedure for these multilevel designs. Specifically, I explored possible theoretical distributions to approximate the distribution of each input parameter and provided recommendations. These advancements enable researchers to address uncertainty for various types of designs and make better-informed decisions in sample size planning.

REFERENCES

Ahlen, J., Lenhard, F., & Ghaderi, A. (2015). Universal prevention for anxiety and depressive symptoms in children: A meta-analysis of randomized and cluster-randomized trials. The Journal of Primary Prevention, 36(6), 387–403. https://doi.org/10.1007/s10935-015-0405-4
Anderson, S. F. (2021). Using prior information to plan appropriately powered regression studies: A tutorial using BUCSS. Psychological Methods, 26(5), 513–526. https://doi.org/10.1037/met0000366
Anderson, S. F., & Kelley, K. (2020). BUCSS: Bias and uncertainty corrected sample size [Manual].
https://CRAN.R-project.org/package=BUCSS Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724 Anderson, S. F., & Maxwell, S. E. (2017). Addressing the “Replication Crisis”: Using original studies to design replication studies with appropriate statistical power. Multivariate Behavioral Research, 52(3), 305–324. https://doi.org/10.1080/00273171.2017.1289361 Arend, M. G., & Schafer, T. (2019). Statistical power in two-level models: A tutorial based on Monte ¨ Carlo simulation. Psychological Methods, 24(1), 1–19. https://doi.org/10.1037/met0000195 Baker, R., & Jackson, D. (2008). A new approach to outliers in meta-analysis. Health Care Management Science, 11(2), 121–131. https://doi.org/10.1007/s10729-007-9041-8 Bates, D., Machler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using ¨ lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01 Bayes, T., & Price, R. (1763). An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions of the Royal Society of London, 53, 370–418. https: //doi.org/10.1098/rstl.1763.0053 Beath, K. J. (2014). A finite mixture method for outlier detection and robustness in meta-analysis. Research Synthesis Methods, 5(4), 285–293. https://doi.org/10.1002/jrsm.1114 Bhat, B. H., & Beretvas, S. N. (2022). Meta-analytic pooling of intraclass correlation coefficient estimates. Multivariate Behavioral Research, 1–1. https://doi.org/10.1080/00273171.2021. 2009326 Biggerstaff, B. J., & Tweedie, R. L. (1997). Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Statistics in Medicine, 16(7), 753–768. https: //doi.org/10.1002/(SICI)1097-0258(19970415)16:7⟨753::AID-SIM494⟩3.0.CO;2-G Blazquez-Rinc ´ on, D., S ´ anchez-Meca, J., Botella, J., & Suero, M. (2023). Heterogeneity estimation ´ in meta-analysis of standardized mean differences when the distribution of random effects departs from normal: A Monte Carlo simulation study. BMC Medical Research Methodology, 23(1), 19. https://doi.org/10.1186/s12874-022-01809-0 Burkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. ¨ Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01 Campanha, C., Minati, L., Fregni, F., & Boggio, P. S. (2011). Responding to unfair offers made by a ˜ friend: Neuroelectrical activity changes in the anterior medial prefrontal cortex. The Journal of Neuroscience, 31(43), 15569–15574. https://doi.org/10.1523/JNEUROSCI.1253-11.2011 123 Chalmers, R. P., & Adkins, M. C. (2020). Writing effective and reliable Monte Carlo simulations with the SimDesign package. The Quantitative Methods for Psychology, 16(4), 248–280. https://doi.org/10.20982/tqmp.16.4.p248 Chambers, C. D., & Tzavella, L. (2021). The past, present and future of Registered Reports. Nature Human Behaviour, 6(1), 29–42. https://doi.org/10.1038/s41562-021-01193-7 Chhabra, H., Yavari, F., Nitsche, M. A., & MA, Y. (2023, December). Enhancing the efficacy of extinction via induction of late phase plasticity by spaced intervention. OSF. https://doi.org/ 10.17605/OSF.IO/7MH92 Chiang, J. J., Turiano, N. A., Mroczek, D. K., & Miller, G. E. (2018). 
Affective reactivity to daily stress and 20-year mortality risk in adults with chronic illness: Findings from the National Study of Daily Experiences. Health Psychology, 37(2), 170–178. https://doi.org/10.1037/ hea0000567 Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A., & Liu, J. (2013). A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78(4), 685– 709. https://doi.org/10.1007/s11336-013-9328-2 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). L. Erlbaum Associates. DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled clinical trials, 7(3), 177–188. Dong, N., & Maynard, R. (2013). PowerUp! : A tool for calculating minimum detectable effect sizes and minimum required sample sizes for experimental and quasi-experimental design studies. Journal of Research on Educational Effectiveness, 6(1), 24–67. https://doi.org/10.1080/ 19345747.2012.673143 Dong, N., Reinke, W. M., Herman, K. C., Bradshaw, C. P., & Murray, D. W. (2016). Meaningful effect sizes, intraclass correlations, and proportions of variance explained by covariates for planning two- and three-level cluster randomized trials of social and behavioral outcomes. Evaluation Review, 40(4), 334–377. https://doi.org/10.1177/0193841X16671283 Du, H., & Wang, L. (2016). A bayesian power analysis procedure considering uncertainty in effect size estimates from a meta-analysis. Multivariate Behavioral Research, 51(5), 589–605. https://doi.org/10.1080/00273171.2016.1191324 Enders, C. K., Keller, B. T., & Woller, M. P. (2023). A simple Monte Carlo method for estimating power in multilevel designs. Psychological Methods. https://doi.org/10.1037/met0000614 Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychological Methods, 12(2), 121–138. https://doi.org/10. 1037/1082-989X.12.2.121 Fan, W., & Bains, L. (2008). The effects of teacher instructional practice on kindergarten mathematics achievement: A multi-level national investigation. Fisher, R. A. (1970). Statistical methods for research workers. In Breakthroughs in statistics: Methodology and distribution (pp. 66–70). Springer. Fisher, Z., Tipton, E., & Zhipeng, H. (2017). Robumeta: Robust variance meta-regression. https : //CRAN.R-project.org/package=robumeta Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505. https://doi.org/10.1126/science.1255484 Gabry, J., & Ceˇ snovar, R. (2022). ˇ Cmdstanr: R interface to ’CmdStan’. 124 Gelman, A. (2018). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin, 44(1), 16–23. https://doi.org/10.1177/0146167217729162 Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/ 10.1177/1745691614551642 Gillett, R. (1994). An average power criterion for sample size estimation. The Statistician, 43(3), 389. https://doi.org/10.2307/2348574 Glassman, J. R., Potter, S. C., Baumler, E. R., & Coyle, K. K. (2015). Estimates of Intraclass Correlation Coefficients From Longitudinal Group-Randomized Trials of Adolescent HIV/STI/Pregnancy Prevention Programs. Health Education & Behavior, 42(4), 545–553. https://doi.org/10. 1177/1090198114568308 Hall, J. 
A., & Rosenthal, R. (2018). Choosing between random effects models in meta-analysis: Units of analysis and the generalizability of obtained results. Social and Personality Psychology Compass, 12(10), e12414. https://doi.org/10.1111/spc3.12414 Harrer, M., Cuijpers, P., A, F. T., & Ebert, D. D. (2021). Doing meta-analysis with R: A hands-on guide (1st ed.). Chapman & Hall/CRC Press. Hedberg, E. C., & Hedges, L. V. (2014). Reference values of within-district intraclass correlations of academic achievement by district characteristics: Results from a meta-analysis of districtspecific values. Evaluation Review, 38(6), 546–582. https://doi.org/10.1177/0193841X1455 4212 Hedges, L. V. (1984, Spring). Estimation of Effect Size under Nonrandom Sampling: The Effects of Censoring Studies Yielding Statistically Insignificant Mean Differences. Journal of Educational Statistics, 9(1), 61. https://doi.org/10.2307/1164832 Hedges, L. V., & Hedberg, E. C. (2013). Intraclass correlations and covariate outcome correlations for planning two- and three-level cluster-randomized experiments in education. Evaluation Review, 37(6), 445–489. https://doi.org/10.1177/0193841X14529126 Hedges, L. V., Hedberg, E. C., & Kuyper, A. M. (2012). The variance of intraclass correlations in three- and four-level models. Educational and Psychological Measurement, 72(6), 893–909. https://doi.org/10.1177/0013164412445193 Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39–65. https://doi. org/10.1002/jrsm.5 Higgins, J. P. T., Thompson, S. G., & Spiegelhalter, D. J. (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 137–159. https://doi.org/10.1111/j.1467-985X.2008.00552.x Hox, J. J., & Roberts, J. K. (Eds.). (2011). Handbook of advanced multilevel analysis. Routledge. OCLC: 705730688. Iyengar, S. S., & Lepper, M. R. (2000). When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology, 79(6), 995–1006. https://doi.org/ 10.1037/0022-3514.79.6.995 Jansen, J. P., Crawford, B., Bergman, G., & Stam, W. (2008). Bayesian meta-analysis of multiple treatment comparisons: An introduction to mixed treatment comparisons. Value in Health, 11(5), 956–964. https://doi.org/10.1111/j.1524-4733.2008.00347.x Kelcey, B., Xie, Y., Spybrook, J., & Dong, N. (2021). Power and Sample Size Determination for Multilevel Mediation in Three-Level Cluster-Randomized Trials. Multivariate Behavioral Research, 56(3), 496–513. https://doi.org/10.1080/00273171.2020.1738910 125 Kish, L. (1965). Survey sampling. J. Wiley. Kivlighan, D. M., Aloe, A. M., Adams, M. C., Garrison, Y. L., Obrecht, A., Ho, Y. C. S., Kim, J. Y. C., Hooley, I. W., Chan, L., & Deng, K. (2020). Does the group in group psychotherapy matter? A meta-analysis of the intraclass correlation coefficient in group treatment research. Journal of Consulting and Clinical Psychology, 88(4), 322–337. https://doi.org/10.1037/ccp0000474 Konishi, C., Hymel, S., Zumbo, B. D., & Zhen Li. (2010). Do school bullying and student—teacher relationships matter for academic achievement? A multilevel analysis. Canadian Journal of School Psychology, 25(1), 19–39. https://doi.org/10.1177/0829573509357550 Konstantopoulos, S. (2010). Power analysis in two-level unbalanced designs. The Journal of Experimental Education, 78(3), 291–317. 
https://doi.org/10.1080/00220970903292876 Kontopantelis, E., & Reeves, D. (2012a). Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: A simulation study. Statistical Methods in Medical Research, 21(4), 409–426. https://doi.org/10.1177/0962280210392008 Kontopantelis, E., & Reeves, D. (2012b). Performance of statistical methods for meta-analysis when true study effects are non-normally distributed: A comparison between DerSimonian–Laird and restricted maximum likelihood. Statistical Methods in Medical Research, 21(6), 657– 659. https://doi.org/10.1177/0962280211413451 Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146 Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (Edition 2). Academic Press. Lai, M. H. C., & Kwok, O.-m. (2015). Examining the rule of thumb of not using multilevel modeling: The “design effect smaller than two” rule. The Journal of Experimental Education, 83(3), 423–438. https://doi.org/10.1080/00220973.2014.907229 Lane, S. P., & Hennes, E. P. (2018). Power struggles: Estimating sample size for multilevel relationships research. Journal of Social and Personal Relationships, 35(1), 7–31. https://doi.org/10. 1177/0265407517710342 Lee, K. J., & Thompson, S. G. (2008). Flexible parametric models for random-effects distributions. Statistics in Medicine, 27(3), 418–434. https://doi.org/10.1002/sim.2897 Linn, M. C. (1985). Emergence and characterization of sex differences in spatial ability: A metaanalysis. Child Development, 56(6), 1479–1498. Liu, X., & Wang, L. (2019). Sample size planning for detecting mediation effects: A power analysis procedure considering uncertainty in effect size estimates. Multivariate Behavioral Research, 54(6), 822–839. https://doi.org/10.1080/00273171.2019.1593814 Luo, W., Li, H., Baek, E., Chen, S., Lam, K. H., & Semma, B. (2021). Reporting practice in multilevel modeling: A revisit after 10 years. Review of Educational Research, 91(3), 311–355. https: //doi.org/10.3102/0034654321991229 McNabb, C. B., & Murayama, K. (2021). Unnecessary reliance on multilevel modelling to analyse nested data in neuroscience: When a traditional summary-statistics approach suffices. Current Research in Neurobiology, 2, 100024. https://doi.org/10.1016/j.crneur.2021.100024 McShane, B. B., & Bockenholt, U. (2016). Planning sample sizes when effect sizes are uncertain: ¨ The power-calibrated effect size approach. Psychological Methods, 21(1), 47–60. https://doi. org/10.1037/met0000036 Meliksetian, A., Wolna, A., & Wodniecka, Z. (2023, December). Testing the magnitude of L2 aftereffect on stimuli sets with varying proportion of cognates and noncognates. OSF. https://doi. org/10.17605/OSF.IO/VR3XY 126 Moerbeek, M., & Teerenstra, S. (2015). Power analysis of trials with multilevel data. CRC Press. OCLC: 931884050. Moore, D. A. (2016). Preregister if you want to. American Psychologist, 71(3), 238–239. https://doi. org/10.1037/a0040195 Muthen, B. O., & Satorra, A. (1995). Complex sample data in structural equation modeling. ´ Sociological Methodology, 25, 267–316. https://doi.org/10.2307/271070 Narasimhan, B., Johnson, S. G., Hahn, T., Bouvier, A., & Kieu, K. (2023). ˆ Cubature: Adaptive multivariate integration over hypercubes. manual. https://CRAN.R-project.org/package=cubature Nosek, B. A., & Lakens, D. (2014). 
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Open Science Collaboration. (2017, February 21). Maximizing the reproducibility of your research. In S. O. Lilienfeld & I. D. Waldman (Eds.), Psychological science under scrutiny (1st ed., pp. 1–21). Wiley. https://doi.org/10.1002/9781119095910.ch1
Park, J., & Pek, J. (2019). HybridPower: An R package for a Bayesian-frequentist hybrid approach to power analysis. Multivariate Behavioral Research, 54(1), 151–152. https://doi.org/10.1080/00273171.2018.1557032
Park, J., & Pek, J. (2022). Conducting Bayesian-classical hybrid power analysis with R package Hybridpower. Multivariate Behavioral Research, 1–17. https://doi.org/10.1080/00273171.2022.2038056
Pek, J., & Park, J. (2019). Complexities in power analysis: Quantifying uncertainties with a Bayesian-classical hybrid approach. Psychological Methods, 24(5), 590–605. https://doi.org/10.1037/met0000208
Plummer, M. (2023). rjags: Bayesian graphical models using MCMC [R package manual]. https://CRAN.R-project.org/package=rjags
Pustejovsky, J. (2022). clubSandwich: Cluster-robust (sandwich) variance estimators with small-sample corrections [R package manual]. https://CRAN.R-project.org/package=clubSandwich
Pustejovsky, J., & Tipton, E. (2022). Meta-analysis with robust variance estimation: Expanding the range of working models. Prevention Science, 23(3), 425–438. https://doi.org/10.1007/s11121-021-01246-3
R Core Team. (2023). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Sage Publications.
Raudenbush, S. W., Martinez, A., & Spybrook, J. (2007). Strategies for improving precision in group-randomized experiments. Educational Evaluation and Policy Analysis, 29(1), 5–29. https://doi.org/10.3102/0162373707299460
Riley, R. D., Higgins, J. P. T., & Deeks, J. J. (2011). Interpretation of random effects meta-analyses. BMJ, 342(7804), 964–967. https://doi.org/10.1136/bmj.d549
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638
Rothstein, H., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Wiley.
Rubio-Aparicio, M., López-López, J. A., Sánchez-Meca, J., Marín-Martínez, F., Viechtbauer, W., & Van Den Noortgate, W. (2018). Estimation of an overall standardized mean difference in random-effects meta-analysis if the distribution of random effects departs from normal. Research Synthesis Methods, 9(3), 489–503. https://doi.org/10.1002/jrsm.1312
Sampaio, W. M., & Boggio, P. (2023, December). Homophily effect over perceived reputation on cooperation in the public goods. OSF. https://doi.org/10.17605/OSF.IO/89YQS
Sánchez-Meca, J., & Marín-Martínez, F. (2008). Confidence intervals for the overall effect size in random-effects meta-analysis. Psychological Methods, 13(1), 31–48. https://doi.org/10.1037/1082-989X.13.1.31
Sarkodie, S. K., Wason, J. M., & Grayling, M. J. (2023). A hybrid approach to comparing parallel-group and stepped-wedge cluster-randomized trials with a continuous primary outcome when there is uncertainty in the intra-cluster correlation. Clinical Trials, 20(1), 59–70. https://doi.org/10.1177/17407745221123507
Sattelmayer, L., Jan, M., & Tallent, T. (2023, December). Do symbolic policies affect support for costly policies? A survey experiment. OSF. https://doi.org/10.17605/OSF.IO/9X4NY
Saxbe, D. E., Edelstein, R. S., Lyden, H. M., Wardecker, B. M., Chopik, W. J., & Moors, A. C. (2017). Fathers' decline in testosterone and synchrony with partner testosterone during pregnancy predicts greater postpartum relationship investment. Hormones and Behavior, 90, 39–47. https://doi.org/10.1016/j.yhbeh.2016.07.005
Scherbaum, C. A., & Pesner, E. (2018). Power analysis for multilevel research. In S. E. Humphrey & J. M. LeBreton (Eds.), The handbook of multilevel theory, measurement, and analysis (pp. 329–352). American Psychological Association. https://doi.org/10.1037/0000115-015
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2021). Pre-registration: Why and how. Journal of Consumer Psychology, 31(1), 151–162. https://doi.org/10.1002/jcpy.1208
Singh, S. P., & Mukhopadhyay, S. (2016). Bayesian optimal cluster designs. Statistical Methodology, 32, 36–52. https://doi.org/10.1016/j.stamet.2016.02.002
Snijders, T. A. (2001). Sampling. In A. H. Leyland & H. Goldstein (Eds.), Multilevel modeling of health statistics (pp. 159–174). Wiley.
Snijders, T. A. (2005). Power and sample size in multilevel modeling, 8.
Snijders, T. A., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd ed.). Sage Publications.
Spiegelhalter, D. J. (2001). Bayesian methods for cluster randomized trials with continuous responses, 18.
Spiegelhalter, D. J., Abrams, K. R., & Myles, J. P. (2004). Randomised controlled trials. In Bayesian approaches to clinical trials and health care evaluation (pp. 182–249). Wiley.
Spiegelhalter, D. J., & Freedman, L. S. (1986). A predictive approach to selecting the size of a clinical trial, based on subjective clinical opinion. Statistics in Medicine, 5(1), 1–13. https://doi.org/10.1002/sim.4780050103
Spybrook, J., Kelcey, B., & Dong, N. (2016). Power for detecting treatment by moderator effects in two- and three-level cluster randomized trials. Journal of Educational and Behavioral Statistics, 41(6), 605–627. https://doi.org/10.3102/1076998616655442
Sutton, A., & Abrams, K. (2001). Bayesian methods in meta-analysis and evidence synthesis. Statistical Methods in Medical Research, 10(4), 277–303. https://doi.org/10.1191/096228001678227794
Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics - Theory and Methods, 25(7), 1595–1610. https://doi.org/10.1080/03610929608831787
Tse, W. W.-Y., & Lai, M. H. C. (2022). Sample size planning for two-level multisite randomized trials with a hybrid classical-Bayesian approach.
Tse, W. W.-Y., & Lai, M. H. C. (2023). Incorporating uncertainty in power analysis: A hybrid classical-Bayesian approach for designing two-level cluster randomized trials.
Ukoumunne, O. C. (2002). A comparison of confidence interval methods for the intraclass correlation coefficient in cluster randomized trials. Statistics in Medicine, 21(24), 3757–3774. https://doi.org/10.1002/sim.1330
Van Den Akker, O. R., Van Assen, M. A. L. M., Bakker, M., Elsherif, M., Wong, T. K., & Wicherts, J. M. (2023). Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods. https://doi.org/10.3758/s13428-023-02277-0
Van 't Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12. https://doi.org/10.1016/j.jesp.2016.03.004
Vasishth, S., Mertzen, D., Jäger, L. A., & Gelman, A. (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151–175. https://doi.org/10.1016/j.jml.2018.07.004
Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3), 261–293. https://doi.org/10.3102/10769986030003261
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3). https://doi.org/10.18637/jss.v036.i03
Viechtbauer, W., & López-López, J. A. (2022). Location-scale models for meta-analysis. Research Synthesis Methods, 13(6), 697–715. https://doi.org/10.1002/jrsm.1562
Williamson, S. F., Tishkovskaya, S. V., & Wilson, K. J. (2023, August 22). Hybrid sample size calculations for cluster randomised trials using assurance. arXiv: 2308.11278 [stat]. Retrieved March 21, 2024, from http://arxiv.org/abs/2308.11278
Wilson, K. J. (2023). Bayesian design and analysis of two-arm cluster randomized trials using assurance. Statistics in Medicine, 42(25), 4517–4531. https://doi.org/10.1002/sim.9871
APPENDIX A
Illustration of the Bayesian Power Analysis Procedure

library(metafor)

# Du & Wang's (2016) example ----------------------------------------------------
# Effect size estimates (Hedges' g values)
g <- c(.82, .86, -.08, .64, .57, .31, .67, .47, .36, .07, .81, .03, .29, .1,
       .83, .74, .9, .9, 1.12, .96, 1, 1, 1.11, 1.28, 1.06, 1.06, .91, 1.12, .7)
# Sample sizes of group 1
n1 <- c(237, 168, 30, 30, 46, 174, 42, 71, 74, 77, 72, 237, 658, 45, 191, 294,
        249, 202, 138, 70, 54, 123, 23, 125, 418, 435, 241, 102, 22)
# Sample sizes of group 2
n2 <- c(241, 155, 30, 30, 44, 173, 29, 58, 55, 59, 67, 232, 354, 48, 201, 296,
        244, 233, 130, 68, 56, 94, 28, 51, 370, 510, 424, 208, 95)
# Combine the group sizes together
n <- cbind(n1, n2)
# Planned sample size
nplan <- c(60, 60)

# Du & Wang's (2016) pas() function ---------------------------------------------
source("https://www3.nd.edu/~lwang4/power/pas.R")
du_wang_assurance <- pas(effectsize = g, nactual = n, random = TRUE,
                         nplan = nplan, cpower = 0.8, type = "d", nrep = 10,
                         adjust = TRUE)

# Proposed approach with meta-analytic results from `metafor` --------------------
## Helper functions
{
  # Power function for independent-sample t test
  pow_func <- function(d, n1, n2, alpha = .05) {
    df <- n1 + n2 - 2
    cv <- stats::qt(1 - alpha / 2, df)
    ncp <- d * sqrt(n1 * n2 / (n1 + n2))
    stats::pt(cv, df = df, ncp = ncp, lower.tail = FALSE) +
      stats::pt(-cv, df = df, ncp = ncp, lower.tail = TRUE)
  }
  # Mean power function
  mean_pow <- function(d, d_sd, n1, n2) {
    cubature::cuhre(
      function(x) pow_func(x, n1, n2) * stats::dnorm(x, d, d_sd),
      lowerLimit = -Inf, upperLimit = Inf
    )$integral
  }
  # Inverse power function
  inv_pow <- function(n1, n2, power = .80, alpha = .05) {
    df <- n1 + n2 - 2
    cv <- stats::qt(1 - alpha / 2, df)
    inv <- function(delta) {
      ncp <- delta * sqrt(n1 * n2 / (n1 + n2))
      stats::pt(cv, df, ncp, lower.tail = FALSE) +
        stats::pt(-cv, df, ncp, lower.tail = TRUE) - power
    }
    stats::uniroot(inv, c(0, 100))$root
  }
  # Assurance function
  assurance <- function(d, d_sd, n1, n2, power = .8, alpha = .05) {
    d_star <- inv_pow(n1, n2, power, alpha)
    pnorm(d_star, d, d_sd, lower.tail = FALSE) +
      pnorm(-d_star, d, d_sd, lower.tail = TRUE)
  }
}

# Perform random-effects meta-analysis with `metafor`
within_var <- (n1 + n2) / (n1 * n2) + g^2 / (2 * (n1 + n2))
res <- rma(g, within_var)
hat_mu <- res$b            # overall effect size
hat_tausq <- res$tau2      # between-study variance
hat_sigmasq <- res$se^2    # within-study variance

# Calculate mean power
proposed_mean_power <- mean_pow(
  d = hat_mu,
  d_sd = sqrt(hat_tausq + hat_sigmasq),
  n1 = nplan[1],
  n2 = nplan[2]
)
# Calculate assurance
proposed_assurance <- assurance(
  d = hat_mu,
  d_sd = sqrt(hat_tausq + hat_sigmasq),
  n1 = nplan[1],
  n2 = nplan[2]
)

# Comparison ----------------------------------------------------------------------
# Mean power
du_wang_assurance[2]   # Du & Wang (2016)
proposed_mean_power    # Proposed approach
# Assurance
du_wang_assurance[1]
proposed_assurance

APPENDIX B
An Illustrative Example of Meta-Analyzing ICC

Below is the R script for performing SRMA and BRMA for meta-analyzing ICC estimates of Grade 8 Mathematics Achievement, which were reported in Hedges and Hedberg (2013).
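For orientation, the SRMA fitted with rma.mv() in the script below corresponds to the usual normal-normal random-effects pooling model, written here in generic notation that may differ from the symbols used in the main text:

    \hat{\rho}_j \sim N(\rho_j, s_j^2), \quad \rho_j \sim N(\mu_\rho, \tau^2), \quad j = 1, \ldots, J,

where \hat{\rho}_j and s_j are state j's reported ICC estimate and its standard error, \mu_\rho is the overall ICC, and \tau^2 is the between-state variance. The script extracts three quantities from the fitted model: the pooled ICC, the between-state variance, and the squared standard error of the pooled ICC, which together feed into the power and assurance calculations.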
# Load libraries
library(metafor)
library(cmdstanr)

# Helper function to obtain the shape parameters of a Beta distribution
# (utility for converting a mean and standard deviation to Beta shape parameters)
beta_ab <- function(mu, sigma = NULL) {
  alpha <- mu * (mu * (1 - mu) / sigma^2 - 1)
  beta <- (1 - mu) * (mu * (1 - mu) / sigma^2 - 1)
  return(c(alpha, beta))
}

# ICC estimates of Grade 8 Mathematics Achievement (Hedges & Hedberg, 2013)
icc_dat <- data.frame(
  icc = c(.134, .202, .204, .398, .183, .119, .358, .259, .264, .037, .199),
  icc_se = c(.011, .011, .013, .013, .013, .009, .017, .014, .013, .005, .011),
  state = c("Arkansas", "Arizona", "Colorado", "Florida", "Kansas", "Kentucky",
            "Louisiana", "Massachusetts", "North Carolina", "West Virginia",
            "Wisconsin")
)

# Standard random-effects meta-analysis (SRMA)
fit_srma <- rma.mv(yi = icc, V = icc_se^2, random = list(~ 1 | state),
                   method = "REML", test = "t", data = icc_dat)
overall_icc_srma <- as.numeric(fit_srma$b)   # pooled ICC
tausq_srma <- fit_srma$sigma2                # between-state variance
sigmasq_srma <- as.numeric(fit_srma$vb)      # squared standard error of the pooled ICC
# SRMA estimates used in the calls below: pooled ICC, between-state variance,
# and squared standard error
srma_est_g8 <- c(overall_icc_srma, tausq_srma, sigmasq_srma)

# Bayesian random-effects meta-analysis (BRMA)
mod_beta <- cmdstan_model("meta_icc_beta.stan")
stan_dat <- list(
  J = nrow(icc_dat),
  y = icc_dat$icc,
  sigma = icc_dat$icc_se
)
capture.output(
  fit_brma <- mod_beta$sample(stan_dat, adapt_delta = .95, iter_sampling = 4000)
)
post_draws <- as.numeric(fit_brma$draws(variable = "theta_new"))
# Power at each posterior draw of the ICC; assumed here to be computed with
# hcbr::ep_crt2(), treating each draw as a fixed ICC value (rho_sd = 0)
pow_draws <- sapply(post_draws, function(rho) {
  hcbr::ep_crt2(J = 50, n = 50, delta = .5, delta_sd = 0, rho = rho, rho_sd = 0)
})

# Mean power
mean(pow_draws)                                                        # BRMA
hcbr::ep_crt2(J = 50, n = 50, delta = .5, delta_sd = 0,
              rho = srma_est_g8[1], rho_sd = sqrt(sum(srma_est_g8[2:3])))  # SRMA
# Assurance
mean(pow_draws > .8)                                                   # BRMA
hcbr::al_crt2(J = 50, n = 50, delta = .5, delta_sd = 0,
              rho = srma_est_g8[1], rho_sd = sqrt(sum(srma_est_g8[2:3])))  # SRMA

The Stan script, named meta_icc_beta.stan, for running BRMA is as follows.

data {
  int<lower=0> J;                   // number of studies
  array[J] real<lower=0> y;         // estimated ICC
  array[J] real<lower=0> sigma;     // s.e. of ICC
}
parameters {
  array[J] real<lower=0, upper=1> theta;
  real<lower=0, upper=1> mu;        // overall mean ICC
  real<lower=0> kappa;              // overall concentration
}
model {
  // use flat prior for mu
  y ~ normal(theta, sigma);               // observed ICCs are normal around the true ICCs
  theta ~ beta_proportion(mu, kappa);     // prior; Beta2 dist
  kappa ~ gamma(.01, .01);                // prior recommended by Kruschke
}
generated quantities {
  real<lower=0> tausq;
  tausq = mu * (1 - mu) / (kappa + 1);
  real<lower=0> theta_new;
  theta_new = beta_proportion_rng(mu, kappa);
}
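A note on the generated quantities block, included as a reading aid (this is standard Beta-distribution algebra rather than material from the main text): when a Beta distribution is parameterized by its mean \mu and concentration \kappa, so that the usual shape parameters are \alpha = \mu\kappa and \beta = (1 - \mu)\kappa, its variance is

    \mathrm{Var}(\theta) = \frac{\mu(1 - \mu)}{\kappa + 1}.

Accordingly, tausq in the Stan script is the between-state variance of the true ICCs implied by the fitted Beta2 distribution, and theta_new is a draw of a new state's true ICC from that distribution, which is what the power and assurance summaries above average over.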
Abstract
Power analysis requires a priori knowledge about the true values of input parameters, such as an effect size. Because ignoring uncertainty in these parameters can leave subsequent studies underpowered, previous studies have developed a Bayesian procedure that accounts for uncertainty in classical power analysis. This procedure enables researchers to integrate prior beliefs and historical information from various sources into probability distributions. By using these distributions, researchers can make sample size decisions that appropriately address uncertainty.
This dissertation introduces three advancements in the Bayesian procedure for classical power analysis. The existing Bayesian procedure has assumed that all historical findings are published, regardless of their statistical significance. Given the prevalence of publication bias in the behavioral and psychological sciences literature, I developed a Bayesian method that addresses both publication bias and uncertainty.
In addition to the effect size, power analysis for multilevel studies involves additional parameters, such as the intraclass correlation (ICC), which measures the strength of association among units within a cluster. Although the Bayesian procedure allows synthesized results to be integrated into the probability distribution, the focus to date has been on effect size estimates. Therefore, I extended the Bayesian procedure to meta-analyze ICC estimates and to use the resulting probability distribution for power analysis.
Although multilevel studies have grown in popularity, the existing Bayesian procedure has been developed only for single-level designs and a few multilevel designs with binary predictors (e.g., cluster randomized trials). To bridge this gap, I expanded the Bayesian procedure to cover power analysis for three two-level designs with continuous predictors. Together, these advancements offer researchers methodologies for addressing uncertainty and bias when designing various types of studies.