Disease risk estimation from case-control studies with sampling
(USC Thesis Other)
DISEASE RISK ESTIMATION FROM CASE-CONTROL STUDIES WITH SAMPLING

by Ge Wen

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BIOSTATISTICS)

May 2018

Copyright 2018 Ge Wen

Dedications

To my beloved family.

Acknowledgements

Life is a journey with ups and downs. My PhD years are one of the most important parts of it. I'm very grateful to all the people who have ever advised me, supported me, helped me, trusted me, believed in me and cared about me, by any means, at any time. Without you, I would not have traveled thus far in my life, and your names are forever carved in my heart.

Table of Contents

Dedications
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Case-Control Studies
  1.2 Disease Risk Estimation
  1.3 Motivation of the Study
  1.4 Summary of the Doctoral Work
Chapter 2: Review of Related Work
  2.1 Analysis Methods for Standard Case-Control Studies
    2.1.1 Contingency Tables
    2.1.2 Unconditional Logistic Regression
    2.1.3 Conditional Logistic Regression
  2.2 Disease Risk Estimation in Cohort Data
    2.2.1 The Definition of Disease Risk in Cohort Data
    2.2.2 Estimating Disease Risk in Incidence Cohort Data
    2.2.3 Estimating Disease Risk in Nested Case-control Data
Chapter 3: Methodology Development Background
  3.1 General Strategy
  3.2 Case-control Designs with Complex Sampling
    3.2.1 The Classical Case-control Study
    3.2.2 Case-control Studies with Complex Sampling Designs
    3.2.3 Examples of Case-control Designs with Complex Sampling
      3.2.3.1 Size Matching
      3.2.3.2 Independent Bernoulli Trial Sampling of Controls
      3.2.3.3 Case-base Sampling
      3.2.3.4 Counter Matching
  3.3 Conditional Logistic Analysis of Case-Control Studies with Complex Sampling
    3.3.1 Notation and Model
    3.3.2 Examples of Case-control Designs with Complex Sampling
      3.3.2.1 Size Matching
      3.3.2.2 Independent Bernoulli Trial Sampling of Controls
      3.3.2.3 Case-base Sampling
      3.3.2.4 Counter Matching
    3.3.3 Computation Methods on the Odds Ratio
      3.3.3.1 Exploit Log-linear Form
      3.3.3.2 Unconditional Logistic Regression
  3.4 The Connection between Case-control and Survey Sampling Designs
Chapter 4: New Estimation Methods for Baseline Odds and Risk
  4.1 General Notation
  4.2 A Natural Horvitz-Thompson Baseline Odds Estimator
  4.3 A Rao-Blackwellized Baseline Odds Estimator
    4.3.1 Rao-Blackwellized Estimator $\tilde{Y}_R$ for $Y_R$
    4.3.2 Computation of $\tilde{Y}_R$ and $\widehat{\mathrm{var}}[\tilde{Y}_R]$
    4.3.3 Rao-Blackwellized Baseline Odds Estimator in Size Matching Sampling Design
      4.3.3.1 Baseline Odds Coefficient $\hat{\alpha}$
      4.3.3.2 Variance Estimate for Baseline Odds Coefficient $\hat{\alpha}$
      4.3.3.3 Computation of the Baseline Odds and its Variance Estimate
  4.4 Disease Risk Estimation
    4.4.1 Point Estimator of Disease Risk
    4.4.2 Confidence Interval Estimator of Disease Risk
Chapter 5: Simulation Studies
  5.1 Simulation Setup
    5.1.1 Study Base Generation and Case-Control Sampling
    5.1.2 Simulation Settings
  5.2 Simulation Results
    5.2.1 Baseline Odds Coefficient Estimation
    5.2.2 Confidence Interval of Disease Risk
Chapter 6: Real Data Application: MEPEDS Study
  6.1 Motivation
  6.2 Background
    6.2.1 MEPEDS Study Design, Population and Methods
    6.2.2 Definition, Determination and Impact of Strabismus
    6.2.3 Prevalence and Risk Factors of Strabismus
  6.3 Analysis Methods
  6.4 Results
    6.4.1 Study Population
    6.4.2 Estimated Prevalence of Strabismus Associated with Maternal Smoking during Pregnancy
Chapter 7: Real Data Application: LALES Study
  7.1 Motivation
  7.2 Background
    7.2.1 LALES Study Design, Population and Methods
    7.2.2 Definition, Determination and Impact of Diabetic Retinopathy
    7.2.3 Prevalence, Incidence and Risk Factors of Diabetic Retinopathy
  7.3 Analysis Methods
  7.4 Results
    7.4.1 Study Population
    7.4.2 Four-year Incidence Rate of Diabetic Retinopathy Estimation
Chapter 8: Future Directions
  8.1 Future work
Bibliography
Appendix A: Detail Deduction of the Estimated Variance of $\tilde{Y}_R$
  A.1 Estimate the Variance of $\tilde{Y}_R$

List of Tables

3.1 Observed 2x2 table in classic case-control studies
3.2 Case-base Sampling Design with Binary Outcome and Exposure
3.3 General counter-matching with the case in sampling stratum j
5.1 Simulation results for 1000 trials, n = 1000, 1:1 size matching, baseline odds 0.01 ($\alpha = -4.605$), $Z \sim N(0, 1)$
5.2 Simulation results for 1000 trials, n = 1000, 1:1 size matching, baseline odds 0.1 ($\alpha = -2.303$), $Z \sim N(0, 1)$
5.3 Simulation results for baseline odds 0.2, $or_z = 1$
5.4 Simulation results for baseline odds 0.2, $or_z = 2$
5.5 Simulation results for baseline odds 0.2, $or_z = 5$
5.6 Simulation results for baseline odds 0.1, $or_z = 1$
5.7 Simulation results for baseline odds 0.1, $or_z = 2$
5.8 Simulation results for baseline odds 0.1, $or_z = 5$
5.9 Simulation results for baseline odds 0.05, $or_z = 1$
5.10 Simulation results for baseline odds 0.05, $or_z = 2$
5.11 Simulation results for baseline odds 0.05, $or_z = 5$
5.12 Simulation results for baseline odds 0.02, $or_z = 1$
5.13 Simulation results for baseline odds 0.02, $or_z = 2$
5.14 Simulation results for baseline odds 0.02, $or_z = 5$
5.15 Disease Risk Estimates for Simulated Exposure of B(0,1) under baseline odds 0.2 and $or_z = 2$
6.1 Strabismus Prevalence and Subtypes by Ethnicity in the Multi-Ethnic Pediatric Eye Disease Study
6.2 Strabismus Prevalence Stratified by Age (Months) in the Multi-Ethnic Pediatric Eye Disease Study
6.3 Characteristics of Children in the Analysis Sample by Ethnicity (n (%))
6.4 Point and Variance Estimation of Baseline Odds Coefficient in the Analysis Sample
6.5 Estimated Prevalence of Strabismus in Hispanic Children 6-72 months old
6.6 Estimated Prevalence of Strabismus in African American Children 6-72 months old
6.7 Estimated Prevalence of Strabismus in Non-Hispanic White Children 6-72 months old
7.1 Age- and Gender-Specific Prevalence (95% Confidence Interval) of Diabetic Retinopathy in Latinos with Definite Diabetes
7.2 Estimated Four-Year Incidence of Any Diabetic Retinopathy Stratified by Age and Duration of Diabetes at Baseline
7.3 Characteristics of Diabetic Participants at risk for Diabetic Retinopathy at baseline in the Analysis Sample (n (%))
7.4 Point and Variance Estimation of Baseline Odds Coefficient in the Analysis Sample
7.5 Estimated 4-year Incidence Rate of Diabetic Retinopathy by Gender
7.6 Estimated 4-year Incidence Rate of Diabetic Retinopathy at Different Age at Baseline
7.7 Estimated 4-year Incidence Rate of Diabetic Retinopathy at Different Lengths of Diabetic History at Baseline
7.8 Estimated Four-year Incidence Rate of Diabetic Retinopathy at Different Hemoglobin A1c Levels at Baseline

List of Figures

3.1 The classic case-control model.
3.2 Finite population model for case-control studies.

Abstract

The case-control design is useful for estimating relative risks arising from exposure distributions between diseased and non-diseased populations. Sampling can occur within an ongoing longitudinal cohort (incident case) or cross-sectional observational (prevalent case) study, to estimate relative risks cost-efficiently. Statistical methods for estimating relative risks from a case-control study are available, while those for the absolute disease probability associated with given exposure profiles still require development.
In this dissertation, I describe the methodology development for disease risk estimators and the corresponding variance estimators from survey-sampling-based case-control samples. I performed simulation studies to evaluate the proposed estimators and demonstrated their usefulness on real data from two large-scale projects. The new estimators were able to use samples economically and yield risk inference comparable to full-sample assessments. The methods are applicable to a wide range of epidemiologic studies to derive point and confidence interval estimates of baseline and comparator disease risks.

Chapter 1: Introduction

1.1 Case-Control Studies

Case-control study designs are widely used in epidemiology [2, 5]. Dating back to at least 1855, the prototype design was used by Whitehead et al. in the investigation of cholera and water consumption from the contaminated Broad Street pump [88]. In that study, Henry Whitehead interviewed the families of both cholera deaths and survivors about their pump water consumption. It is considered the first case-control study because Whitehead et al. employed the signature method of comparing cholera deaths (cases) with persons from the same community who were free of disease (controls), using exposure information ascertained from the interviews [60], with the aim of discovering a potential association between disease occurrence and exposure.

In the 1950s, several case-control studies investigating the relationship between smoking and lung cancer opened the era of the modern case-control study, which featured more refined and rigorous designs, execution and analyses [20, 40, 65, 90]. The case-control study design has been frequently used in investigations into rare and long-latency diseases, in cancer epidemiology, and in general exploratory research on potentially associated factors [12, 35]. Exposure factors may be either risk or protective factors, elevating or reducing the risk of the disease of interest.
A case-control study is able to quantify such risk associations and the associated factors individually and collectively.

Over the decades, significant scientific discoveries have arisen from case-control methods, including the discovery of the associations between cigarette smoking and lung cancer in the 1950s, diethylstilbestrol and vaginal adenocarcinoma in 1971, post-menopausal estrogens and endometrial cancer in 1975, tampon use and toxic shock syndrome in 1980, aspirin and Reye's syndrome in 1982, AIDS and sexual practices in 1985, and studies of vaccine effectiveness and of diet and cancer in the 1990s and 2000s [20, 30, 40, 54, 65, 67, 68, 73, 83, 90].

Compared to other research strategies, most notably cohort studies, the major advantages of a case-control study are that it requires fewer subjects, is relatively easy and fast to conduct, is inexpensive, and is well suited for exploratory investigations of certain diseases. It is therefore a commonly used tool to assess the risks of disease from environmental and industrial exposures, to conduct post-marketing surveillance of adverse or beneficial drug effects, to discover beneficial factors that prevent certain diseases in humans, and to carry out exploratory or confirmatory genetic studies.

1.2 Disease Risk Estimation

Risk estimation is among the main research objectives in clinical epidemiology: for example, to determine the rates of disease by person, place and time, i.e. estimating the absolute risk (incidence, cumulative incidence, etc.); to identify the risk factors for disease, i.e. estimating the relative risk (or odds ratio / rate ratio); and to develop approaches for disease prevention, i.e. estimating the attributable risk/fraction [2, 5].

For certain study designs, the absolute risk quantities are well defined and their estimation methods are well developed. For example, in a cohort study, a group of unaffected individuals is first identified and the exposure data are collected at baseline.
The subjects are then followed for a specified period of time, within which some of the subjects develop disease. In the closed cohort study design, absolute risk (incidence) usually refers to the incidence proportion, which is the number of new cases of a disease occurring in a specified time period divided by the number of individuals at risk of developing the disease during the same time. The relative risk is estimated by the ratio of the incidence proportion in the exposed group to that in the unexposed group. The statistical procedures to estimate them are well developed [14].

In case-control studies, the analogous concept to absolute risk is the disease occurrence in the current sample. We use the term "disease risk" to differentiate it from the absolute risk in longitudinal cohort studies. Estimation of disease risk with case-control studies is a non-trivial topic that has barely been elaborated on. Under the case-control scheme, researchers first identify affected and unaffected individuals, and then collect risk factor data retrospectively. Statistical methods to estimate the odds ratio, which is the ratio of the odds that cases were exposed to the odds that controls were exposed, are well known [12], and under the rare disease assumption the odds ratio provides a good estimate of the corresponding relative risk. However, statistical methods for estimating disease risk, and for characterizing how variable the estimate itself is, are not generally available [39]. The difficulty resides in the fixed disease status in case-control studies as compared to cohort studies.

Yet it is important and useful to estimate disease risk in case-control studies, even though it is a surprisingly neglected area of biostatistical research. Benichou and Gail [8] recognized the problem and studied methods for estimating the absolute risk (incidence) of developing disease given a set of covariate values over a specified time interval using case-control data nested within a cohort.
Langholz and Borgan [39] extended the methods for estimating relative mortality [11] to absolute risk (incidence) estimation from individually matched nested case-control studies. Nonetheless, no general methods currently exist for estimating disease risk (prevalence or cumulative incidence) from case-control studies, or for estimating the variance of such estimators. In particular, estimation theory lags behind for case-control studies based on a cross-sectional study. We developed such methods for estimating disease risk in general case-control designs and in complex sampling schemes to fulfill many practical needs. Our methodology development was practically motivated by, and has been applied to, the Multi-Ethnic Pediatric Eye Disease Study (MEPEDS) project [78] and the Los Angeles Latino Eye Study (LALES) project [76].

1.3 Motivation of the Study

It is often convenient to use prevalent cases rather than incident cases for case-control studies, typically when the study is based on an existing cross-sectional study [63]. In this setting, cases of disease are selected at one fixed point in time. Such designs may be useful when it is impractical to sample only incident cases, for example, in studies of congenital malformations (birth defects) [3, 21, 66]. When conditions hold under which the prevalence odds ratio in the population equals the incidence rate ratio, the odds ratio from a case-control study based on prevalent cases is an unbiased estimate of the incidence rate ratio.

Despite this, the prevalence design has been criticized for its inability to distinguish exposure effects on disease incidence from exposure associations with other aspects of disease, such as disease duration and migration. Bias caused by uncontrolled factors, such as temporal changes in the exposed or unexposed population or migration into or out of the prevalence pool, further complicates the issue [63].
In practice, there are fields of study in which prevalent cases are commonly used, such as studies of chronic conditions with ill-defined onset times and limited effects on mortality (for example obesity, Parkinson's disease and multiple sclerosis) and studies of health services utilization [31, 41, 44, 61, 74, 91].

In fact, one of the motivations of our methodology development is to use the available prevalent cases from the Multi-Ethnic Pediatric Eye Disease Study (MEPEDS) [78] to enable more efficient risk factor assessment using case-control designs. The MEPEDS study population consisted of four ethnic groups of children (African-American, Hispanic, Non-Hispanic White and Asian) aged 6 to 72 months in 100 census tracts in and around the cities of Inglewood, Riverside, Glendale and Monterey Park in California. A well-trained interview team conducted a door-to-door census of all dwelling units within targeted census tracts to identify eligible children (aged 5 to 70 months on the day of the household screening) whose parents consented to participate. After a brief in-home interview covering basic demographic information and history of known eye conditions, eligible children were scheduled for a comprehensive eye examination at the local MEPEDS examination center, performed by MEPEDS optometrists or ophthalmologists trained and certified using standardized protocols. A more detailed in-person interview with the child's parent was also administered at the local clinic to determine demographic characteristics, family history of eye diseases, the ocular and medical history of the participants, lifestyle, quality of life and so on. The MEPEDS study completed data collection on 9197 eligible children in total and is a valuable resource for studying various childhood eye conditions.

Despite the wealth of data already collected, when it comes to specific questions, the need for additional data may emerge.
For example, in MEPEDS, maternal smoking during pregnancy was found to be a significant risk factor for multiple childhood eye conditions including hyperopia, astigmatism, esotropia, and exotropia [10, 19, 51, 72]. This finding deserves further exploration: whether there is a dose relationship between cigarettes smoked and eye disease prevalence; whether quitting smoking before or during pregnancy modifies the risk; and whether the length of maternal smoking history, the age of starting smoking, or exposure to secondhand smoke affects disease risk. An additional smoking habit questionnaire would help answer these specific inquiries and advance understanding of the risks of maternal smoking on children's eye health. That calls for a case-control study sampled within MEPEDS as a fast and cost-effective means of follow-up assessment. As another example, it may be possible to assess genetic risks for eye disease using cost-efficient designs. Indeed, a case-control study based on MEPEDS prevalent cases can satisfy such needs and enhance our understanding of the effects of genetic predictors on the subject population at minimal additional cost.

To realize the full potential of case-control designs, we need to address our inability to estimate disease risk with general case-control studies. This stimulated us to propose several new methods in this dissertation to bridge the gap. These methods have been useful for our purpose of reanalyzing the MEPEDS project data and are generally applicable to other cross-sectional or cohort studies with similar research objectives.

1.4 Summary of the Doctoral Work

We briefly summarize our new methods here and leave the details to be developed throughout the following chapters. Our new methods to estimate disease risk using case-control data are based on the theory of survey sampling and the Rao-Blackwell theorem.
In summary, we first obtained an unbiased point estimator of disease risk using sampling theory for case-control studies [15], as well as an unbiased estimator of the point estimator's variance. We then obtained the Rao-Blackwell version of the unbiased estimator and of the variance estimator. The Rao-Blackwell theorem guarantees that the variance of the Rao-Blackwellized estimator is no larger than that of the original unbiased estimator, so the resulting estimator of disease risk in case-control studies remains unbiased while being at least as efficient. Its superior performance was verified by simulation studies and by comparison with other estimators in a variety of settings.

The most important advantage of our new methods is that they allow case-control studies to be used to estimate prevalence and cumulative incidence, which usually require large observational studies. For instance, given cases of a chronic condition in a large study, we would like to investigate the associations of the condition with genotype covariates or additional uncollected exposures. A thorough genotype screening or additional exposure data collection for the entire study base would be the most reliable approach, yet it is also cost prohibitive. In contrast, a case-control design requires additional follow-up of only the cases and a limited number of sampled controls, which makes it cost effective and easy to conduct. There have been difficulties in estimating disease risk (prevalence or cumulative incidence) within such a design, but with our new methods it is feasible and convenient. We demonstrated their usefulness by applying the methods to the MEPEDS and LALES projects.

Another significant benefit of the new approach is that it eliminates the bias incurred when using the odds ratio to estimate relative risk in certain case-control study designs. It is well known that, when the rare disease assumption does not hold, the odds ratio may seriously overestimate the true risk effect of an exposure [92], which constrains the application of a logistic model approach to case-control studies.
Various methods, either empirical or ad hoc, have been developed to remedy this drawback with only limited success [62, 64, 92]. Our proposed new methods for disease risk estimation provide a way to estimate the relative risk without any ad hoc conversion between odds ratios and relative risks. In fact, with the proposed approach, disease risk can be estimated with high precision for both exposed and unexposed subjects. In addition, the relative risk can be calculated without concern for the underlying incidence or prevalence rates.

The dissertation is organized as follows. In Chapter 1, we provide an introduction to the related epidemiological and biostatistical concepts and the theoretical and practical motives of this doctoral work. In Chapters 2 and 3, we formalize the problem and review the related work and statistical aspects to provide a backdrop for methodology development. In Chapter 4, we present the methodology development with technical details. In Chapter 5, we validate the methods with extensive simulation studies. In Chapters 6 and 7, using MEPEDS and LALES data, we show real data applications of the developed methods. In Chapter 8, we conclude the work with remarks on future directions.

Chapter 2: Review of Related Work

In this chapter, we review the related work regarding disease risk estimation in case-control studies. As discussed in the previous chapter, while the statistical methods for estimating the odds ratio (relative risk) are well known in standard case-control studies, the statistical methods for estimating disease risk (and how variable the estimate itself is) are not generally available. Estimation of absolute disease risk is well developed for incidence cohort data, from which we can construct nested case-control studies. Nonetheless, nested case-control studies are by nature based on incident cases, and therefore the estimation methods are different and not comparable.
In summary, there has been little work directly addressing baseline (absolute) risk estimation in case-control studies based on prevalent cases. We present the available methods for relative risk estimation in standard case-control studies and for absolute disease risk estimation in cohort studies.

2.1 Analysis Methods for Standard Case-Control Studies

2.1.1 Contingency Tables

Traditional methods for analyzing case-control studies are based on a grouping or cross-classification of the data. For a single exposure variable, a $2 \times 2$ or $2 \times K$ contingency table is tabulated, and statistical inference, including odds ratio and confidence interval estimates, is made based on the table. In the more realistic setting where confounding must be controlled for, a series of $2 \times 2$ or $2 \times K$ tables is constructed according to the strata of the confounding factor. A summary (or common) odds ratio and its variance can be easily calculated using the logit estimate [89], the maximum likelihood estimate, or the Mantel-Haenszel estimate [42, 43]. A test of homogeneity is also used to examine whether the odds ratio is constant from stratum to stratum and to determine whether the summary measure of association is valid for making inference.

Although the above techniques have served epidemiologists well in past decades, their limitations are also obvious. For simple scenarios, researchers can work closely with the data, tabulate them into the correct forms, calculate the estimates individually within strata as well as in summary, and perform an overall global test, a homogeneity test, a trend test, and others. But this is very difficult in more complicated settings, for example, when one has to control for many potentially confounding factors, study continuous rather than categorical variables, or analyze the joint effects of multiple risk factors.
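As a rough illustration of the table-based calculations described in this section, the following minimal Python sketch computes a crude odds ratio with a logit-based (Woolf-type) interval and a Mantel-Haenszel summary odds ratio. The counts in `strata` are hypothetical and chosen only so that both strata share an odds ratio of 4; the function names are illustrative and not taken from the dissertation.

```python
from math import exp, log, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    return (a * d) / (b * c)


def woolf_ci(a, b, c, d, z=1.96):
    """Approximate 95% interval from the logit variance 1/a + 1/b + 1/c + 1/d."""
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = log(odds_ratio(a, b, c, d))
    return exp(log_or - z * se), exp(log_or + z * se)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel summary odds ratio over a list of (a, b, c, d) strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts: one (a, b, c, d) table per level of a confounder.
strata = [(10, 20, 5, 40), (30, 15, 20, 40)]

# Collapsing over the confounder gives the crude (unadjusted) table.
pooled = tuple(sum(column) for column in zip(*strata))

crude_or = odds_ratio(*pooled)
mh_or = mantel_haenszel_or(strata)
ci_low, ci_high = woolf_ci(*pooled)

print(f"crude OR = {crude_or:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"Mantel-Haenszel OR = {mh_or:.3f}")
```

In this made-up example the crude odds ratio differs from the common stratum-specific value, which is the kind of confounding that the stratified summary is meant to remove.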
2.1.2 Unconditional Logistic Regression

Modern programming techniques and appropriate statistical software allow us to apply regression models to case-control study analysis, which alleviates the limitations of the traditional methods described above.

As in cross-sectional or cohort studies, the linear logistic model for binary outcomes can be used in case-control studies. Specifically, it models the logit transformation of the disease probability (the log odds) as a linear function of regression variables. Let $P$ denote the disease risk, so that $P/(1-P)$ denotes the disease odds, and let $x$ denote an exposure variable (yes = 1, no = 0). The logistic model can then be expressed as:

\[ \operatorname{logit} P = \log \frac{P}{1-P} = \alpha + \beta x \]

Hence the disease probabilities for exposed and unexposed subjects (the latter being the baseline disease risk) can be inferred as:

\[ P_{\text{exposed}} = \frac{\exp(\alpha + \beta)}{1 + \exp(\alpha + \beta)} \quad \text{and} \quad P_{\text{unexposed}} = \frac{\exp(\alpha)}{1 + \exp(\alpha)} \]

More generally, let $y$ denote the outcome status (1 = case, 0 = control) and let $\mathbf{x} = (x_1, \ldots, x_K)$ be a series of $K$ regression variables. The logistic model can then be written as:

\[ \operatorname{logit} \mathrm{pr}(y = 1 \mid \mathbf{x}) = \alpha + \sum_{i=1}^{K} \beta_i x_i \]

and

\[ \mathrm{pr}(y = 1 \mid \mathbf{x}) = \frac{\exp(\alpha + \sum_i \beta_i x_i)}{1 + \exp(\alpha + \sum_i \beta_i x_i)} \]

The model fits prospective studies well because the disease status is not known in advance or at the time exposure information is collected. That is, the exposure variables are fixed and the disease status is random. This is not true in case-control studies, where the design selects subjects based on known disease status and then collects the exposure data of the selected subjects. In other words, the disease outcome is fixed, while the exposure variables are collected retrospectively. However, if the inclusion or exclusion of each case or control is independent of the exposure variables, we can show that the above logistic model remains applicable to case-control studies.
Let $\pi_1$ denote the probability that a diseased person in the study base of size $n$ is selected as a case, $\pi_0$ the probability that a disease-free person is included in the same sample as a control, and $z$ the indicator of whether a person is selected for the sample (1 = selected, 0 = not). Then $\pi_1$ and $\pi_0$ can be expressed as:

\[ \pi_1 = \mathrm{pr}(z = 1 \mid y = 1) \quad \text{and} \quad \pi_0 = \mathrm{pr}(z = 1 \mid y = 0) \]

Using Bayes' theorem, we can construct the conditional probability that a person develops disease, given that he or she has covariate values $\mathbf{x}$ and is present in the case-control sample:

\[
\begin{aligned}
\mathrm{pr}(y = 1 \mid \mathbf{x}, z = 1)
&= \frac{\mathrm{pr}(y = 1, \mathbf{x}, z = 1)}{\mathrm{pr}(\mathbf{x}, z = 1)} \\
&= \frac{\mathrm{pr}(z = 1 \mid y = 1, \mathbf{x})\, \mathrm{pr}(y = 1 \mid \mathbf{x})\, \mathrm{pr}(\mathbf{x})}{\mathrm{pr}(z = 1, \mathbf{x} \mid y = 0)\, \mathrm{pr}(y = 0) + \mathrm{pr}(z = 1, \mathbf{x} \mid y = 1)\, \mathrm{pr}(y = 1)} \\
&= \frac{\mathrm{pr}(z = 1 \mid y = 1, \mathbf{x})\, \mathrm{pr}(y = 1 \mid \mathbf{x})\, \mathrm{pr}(\mathbf{x})}{\mathrm{pr}(z = 1 \mid \mathbf{x}, y = 0)\, \mathrm{pr}(\mathbf{x} \mid y = 0)\, \mathrm{pr}(y = 0) + \mathrm{pr}(z = 1 \mid \mathbf{x}, y = 1)\, \mathrm{pr}(\mathbf{x} \mid y = 1)\, \mathrm{pr}(y = 1)} \\
&= \frac{\mathrm{pr}(z = 1 \mid y = 1, \mathbf{x})\, \mathrm{pr}(y = 1 \mid \mathbf{x})}{\mathrm{pr}(z = 1 \mid \mathbf{x}, y = 0)\, \mathrm{pr}(y = 0 \mid \mathbf{x}) + \mathrm{pr}(z = 1 \mid \mathbf{x}, y = 1)\, \mathrm{pr}(y = 1 \mid \mathbf{x})} \\
&= \frac{\pi_1 \exp(\alpha + \sum_i \beta_i x_i)}{\pi_0 + \pi_1 \exp(\alpha + \sum_i \beta_i x_i)} \\
&= \frac{\exp(\alpha^* + \sum_i \beta_i x_i)}{1 + \exp(\alpha^* + \sum_i \beta_i x_i)}
\end{aligned}
\]

where $\alpha^* = \alpha + \log(\pi_1 / \pi_0)$. This shows that, despite the difference in sampling designs between cohort and case-control studies, the logistic regression model still estimates the same $\beta$s, although the intercept $\alpha^*$ differs from the log baseline odds $\alpha$. Notably, accurate estimation of the baseline odds using case-control studies is the primary interest of this work and will be described in detail in the methodology development chapter.

2.1.3 Conditional Logistic Regression

An alternative approach is to consider the disease status as fixed and the exposure variables as random, and in this way to construct the likelihood of the data using a conditional distribution. In most case-control studies, the disease status is known and the sizes of the case and control samples are also known. Suppose we have a sample of $n_1$ cases and $n_0$ controls ($n = n_1 + n_0$), with unordered exposure vectors $\mathbf{x}'_1, \mathbf{x}'_2, \ldots, \mathbf{x}'_n$, i.e. it is not specified which of the exposure vectors pertain to cases and which pertain to controls.
There will be $\binom{n}{n_1}$ ways to assign the exposure vectors to a group of size $n_1$ and a complementary group of size $n_0$. Conditional on the unordered set of exposure vectors, the probability of observing a particular exposure profile $\mathbf{x}_1,\ldots,\mathbf{x}_n$ for the case-control sample (where the first $n_1$ vectors belong to cases) is:

\[ \frac{\prod_{j=1}^{n_1}\mathrm{pr}(\mathbf{x}_j\mid y=1)\prod_{j=n_1+1}^{n}\mathrm{pr}(\mathbf{x}_j\mid y=0)}{\sum_{l=1}^{\binom{n}{n_1}}\prod_{j=1}^{n_1}\mathrm{pr}(\mathbf{x}'_{l,j}\mid y=1)\prod_{j=n_1+1}^{n}\mathrm{pr}(\mathbf{x}'_{l,j}\mid y=0)}. \]

The above likelihood is computable and can be reduced to

\[ \frac{\prod_{j=1}^{n_1}\exp\!\left(\sum_k\beta_k x_{jk}\right)}{\sum_{l}\prod_{j=1}^{n_1}\exp\!\left(\sum_k\beta_k x'_{l,jk}\right)}, \]

where $x_{jk}$ denotes the $k$th exposure variable for the $j$th subject and $x'_{l,jk}$ runs over all possible assignments of the exposure vectors to $n_1$ subjects out of the $n$ available.

The construction of the above likelihood fits the design of case-control studies well. (Given $n_1$ cases and $n_0$ controls, the conditioning event is the $n$ observed exposure profiles.) One important feature of this conditional likelihood is that it eliminates the log baseline odds parameter ($\alpha$) and depends only on the odds ratio parameters ($\beta$s). Thus this method alone does not allow estimation of the baseline odds, or of the baseline risk or odds of disease for each individual.

In individually matched case-control studies (case-control sets divided into very fine strata), conditional logistic regression is the correct method of analysis. The unconditional logistic regression model needs to estimate both the $\alpha$ and $\beta$ parameters, and in this scenario the number of $\alpha$ parameters may be of the same order of magnitude as the number of case-control strata and much greater than the number of $\beta$ parameters, which prohibits such estimation. Thus unconditional logistic regression models cannot yield unbiased estimates in such cases. Note that in this work we have so far analyzed unmatched case-control data based on finite population sampling. The extension to matched case-control data analysis is left to future efforts.
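The reduced conditional likelihood of Section 2.1.3 can be evaluated by direct enumeration when the sample is small. The following is a minimal sketch, not an efficient implementation; the function name `conditional_likelihood`, the toy exposure data and the value of $\beta$ are all hypothetical.

```python
import math
from itertools import combinations

def conditional_likelihood(x, case_idx, beta):
    """Conditional likelihood of the observed case set given the pooled exposures.

    x        : list of exposure vectors, one per subject in the case-control sample
    case_idx : indices of the subjects that are cases
    beta     : log odds ratio parameters (no intercept is needed)
    """
    def set_weight(idx):
        # prod_{j in idx} exp(sum_k beta_k * x_jk), accumulated on the log scale
        return math.exp(sum(b * xk for j in idx for b, xk in zip(beta, x[j])))

    n1 = len(case_idx)
    numerator = set_weight(case_idx)
    # Sum over every way of choosing n1 "cases" among the n subjects.
    denominator = sum(set_weight(s) for s in combinations(range(len(x)), n1))
    return numerator / denominator

# Hypothetical sample: 5 subjects with one binary exposure; subjects 0 and 2 are cases.
x = [[1], [0], [1], [0], [0]]
lik = conditional_likelihood(x, case_idx=(0, 2), beta=[math.log(3.0)])
```

Because every term in the numerator and denominator involves the same number of cases, any intercept would multiply all of them by the same factor and cancel, which is why the function takes no $\alpha$ argument. The enumeration grows as $\binom{n}{n_1}$, so this form is only practical for small sets.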
2.2 Disease Risk Estimation in Cohort Data

2.2.1 The Definition of Disease Risk in Cohort Data

In disease incidence cohort and nested case-control studies (with follow-up data), the absolute disease risk can be estimated as the probability of developing a given disease over a specified time interval, given a set of covariates. This quantity is also referred to as crude probability [47], crude incidence [38], cumulative incidence [27], cumulative incidence risk [56], absolute incidence risk [55], and absolute (cause-specific) risk [6, 7, 22]. For the accumulated risk up to age $t$ due to a specific cause $k$, the absolute disease risk is defined as:

\[ \pi((0,t];k) = \int_0^t S(u)\,h_k(u)\,du, \]

where $S$ is the survival function and $h_k$ is the cause-specific hazard. In a more general form with two competing hazards, the risk of disease for a subject who is free of the disease of interest at age $a$ and is diagnosed with the disease in a subsequent age interval $(a,t]$ is:

\[ \pi((a,t];h_1,h_2) = \int_a^t h_1(u)\exp\!\left\{-\int_a^u \{h_1(v)+h_2(v)\}\,dv\right\}du. \]

Many approaches have been proposed to estimate this quantity. Gail, Brinton and Byar [24] used population-based case-control data to estimate the probability that a woman with a given age and risk factors would develop breast cancer over a certain time interval. Aalen and Johansen (1978) [1], Gray (1988) [27], Matthews (1988) [45] and Keiding and Andersen (1989) [4] have also considered estimates of absolute disease risk from cohort data. Langholz and Borgan (1997) [39] extended their results [11] to estimate absolute disease risk in nested case-control studies. The variability and efficacy of the estimators have also been extensively discussed, as in Benichou and Gail (1995) [8], Gail and Pfeiffer (2005) [25], and Hosmer, Lemeshow, and May (2008) [33]. We will briefly review the methods proposed by Gail, Brinton and Byar (1989) [24] and Langholz and Borgan (1997) [39] in the following sections.
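The competing-risk definition of absolute risk in Section 2.2.1 can be checked numerically. The sketch below uses a simple trapezoidal rule and assumes constant hazards purely so that a closed form is available for comparison; the function name `absolute_risk` and the hazard values are hypothetical.

```python
import math

def absolute_risk(h1, h2, a, t, steps=20000):
    """Trapezoidal approximation of
    int_a^t h1(u) * exp(-int_a^u [h1(v) + h2(v)] dv) du."""
    du = (t - a) / steps
    cum_hazard = 0.0     # running value of int_a^u (h1 + h2)
    prev = h1(a)         # integrand at u = a, where the survival term is 1
    risk = 0.0
    for k in range(1, steps + 1):
        u0, u1 = a + (k - 1) * du, a + k * du
        cum_hazard += 0.5 * du * (h1(u0) + h2(u0) + h1(u1) + h2(u1))
        cur = h1(u1) * math.exp(-cum_hazard)
        risk += 0.5 * du * (prev + cur)
        prev = cur
    return risk

# Hypothetical constant hazards per year: disease of interest and competing causes.
lam1, lam2 = 0.002, 0.01
numeric = absolute_risk(lambda u: lam1, lambda u: lam2, a=50.0, t=60.0)

# With constant hazards the integral has the closed form below.
closed_form = lam1 / (lam1 + lam2) * (1.0 - math.exp(-(lam1 + lam2) * 10.0))
```

With age-dependent hazards the same function applies unchanged; only the two callables passed in need to be replaced.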
2.2.2 Estimating Disease Risk in Incidence Cohort Data

First, consider as an example the estimation of disease risk with competing risks in a breast cancer cohort. Breast cancer is the disease of interest and death from all other causes is the competing risk. The covariates $X$ include age at menarche, number of previous breast biopsies, age at first live birth and number of first-degree relatives with breast cancer. Assume that the breast cancer hazard is related to $X$ through the baseline hazard $h_1$ by:

\[ h_1(u;x) = h_1(u)\,rr(u;x), \]

where $rr$ is the relative risk for a woman with age $u$ and covariates $x$. The competing hazard $h_2$ is assumed to be independent of $X$. To estimate the disease risk with this model, one needs to estimate the baseline hazard $h_1$ and the relative risk $rr$. These quantities are estimable in cohort data following the methods in Benichou and Gail (1990a) [6].

However, when the covariate data are only available for the embedded case-control samples and not for the whole follow-up cohort, a different approach, due to Gail, Brinton and Byar [24], is needed. Using the definition of attributable risk and the estimates of Miettinen (1974) [56] and Bruzzi (1985) [13], Gail, Brinton and Byar estimated the baseline hazard for breast cancer from estimates of the composite hazard for the entire cohort and estimates of the relative risk from the case-control data. They used odds ratios as the estimates of relative risk. They arrived at the final formula:

\[ h_{1j} = h^{*}_{1j}\sum_{i=1}^{I}\rho_{ij}\,rr_{ij}^{-1}, \]

which is simplified as:

\[ \hat{h}_{1j} = \frac{1}{t_j}\sum_{i=1}^{I} n_{ij}\,rr_{ij}^{-1}(\hat{\beta}), \]

where $h^{*}_{1j}$ is the composite hazard, the summation is taken over all levels of $X$, and $\rho_{ij}$ is the proportion of women at level $i$ of $X$ among women who develop breast cancer at age $j$. The quantities $\rho_{ij}$, which are assumed to be constant over the age interval $j$, can be estimated from the cases. If one denotes by $\beta$ the parameters in the logistic model, then $rr_{ij}$ can be estimated as a function of $\hat{\beta}$, the MLE of $\beta$. For more details, see Sections 2.1 to 2.4 in Benichou and Gail [8].
2.2.3 Estimating Disease Risk in Nested Case-control Data

Langholz and Borgan proposed an alternative method to estimate the disease risk in a nested case-control setting. They considered possibly continuous, time-dependent exposure histories in estimating the risk of a disease in the absence or presence of competing risks. They assumed a base model similar to that of Benichou and Gail (1995) [8], but additionally included a vector of unknown but time-fixed exposure effects $\beta_0$:

\[ h_1(u;x) = h_1(u)\,rr(\beta_0;x(u)). \]

Therefore, the integrated hazard is:

\[ H_1(s,t;x) = \int_s^t rr(\beta_0;x(u))\,h_1(u)\,du. \]

This is the expected number of events during the time period $(s,t]$ per subject with covariate history $x(u)$. In this model, they assume an event can occur more than once during this interval. Estimation of the regression parameter $\beta_0$ in the integrated hazard expression is based on the partial likelihood:

\[ L(\beta) = \prod_{t_j}\left\{\frac{r(\beta;x_{i_j}(t_j))}{\sum_{l\in\tilde{R}(t_j)} r(\beta;x_l(t_j))}\right\}. \]

In this paper, the authors also addressed estimation when a competing risk is present. The absolute risk of disease $c1$ in the age interval $(s,t]$ in the presence of a competing risk $c2$, for a covariate history $x_0(u)$, $u\in(s,t]$, is:

\[ \pi(s,t;x_0) = \int_s^t S(s,u;x_0)\,h_1(u;x_0)\,du, \]

with

\[ S(s,u;x_0) = \exp\!\left\{-\int_s^u [h_1(v;x_0)+h_2(v)]\,dv\right\} \]

being the probability that neither $c1$ nor $c2$ occurs between $s$ and $u$. More details were presented in [8].

Chapter 3
Methodology Development Background

3.1 General Strategy

In a survey sampling process, probability-based samples are selected to make inferences about the target population. We assume a target population $R$ with subjects $\{1,\ldots,n\}$ and let $Y_1,\ldots,Y_n$ be some values for these subjects (their particular values do not matter). The population total of $Y$ can be calculated as $Y_R = \sum_{i\in R} Y_i$.
Then a probability-based sample $\tilde{R}$ is selected, and suppose that we have an unbiased estimator of the population total $Y_R$:

\[ E(Y_{\tilde{R}}) = E\Big[\sum_{i\in\tilde{R}} w_i(\tilde{R})\,Y_i\Big] = Y_R, \]

where $w_i(\tilde{R})$ are weights that can be constructed from the data according to the sampling plan. Specifically, $w_i(\tilde{R})$ will depend on the parameter of interest, denoted by $\theta_0$. Setting $Y_i\equiv 1$, we have:

\[ E(Y_{\tilde{R}}) = E\Big[\sum_{i\in\tilde{R}} w_i(\tilde{R};\theta_0)\,Y_i\Big] = \sum_{i\in R} 1 = n. \]

Using the method of moments, an estimator $\hat{\theta}$ can be obtained by solving

\[ \sum_{i\in\tilde{R}} w_i(\tilde{R};\hat{\theta}) = n. \tag{3.1} \]

3.2 Case-control Designs with Complex Sampling

3.2.1 The Classical Case-control Study

In classical case-control studies, cases (persons with the disease of interest) and controls (persons without the disease) are selected based on their disease status and are considered to represent their respective infinite populations. Their exposure information is then collected, and the potential relationship between the exposure and the disease is evaluated by comparing the subjects in the case group with the subjects in the control group. This approach is "retrospective" since the disease status is fixed for the sample and the covariates (or exposures) are random (see Figure 3.1).

Figure 3.1: The classic case-control model.

The analysis of classical case-control studies tests whether the proportions of exposed and unexposed subjects differ significantly between cases and controls (Table 3.1). The results are often expressed as the exposure odds ratio (OR):

\[ \mathrm{OR} = \frac{a/c}{b/d} = \frac{ad}{bc}. \]

              Exposed          Unexposed        Total
Disease       a                b                n_1 = a + b
No Disease    c                d                n_2 = c + d
Total         m_1 = a + c      m_2 = b + d      N = a + b + c + d

Table 3.1: Observed 2x2 table in classic case-control studies

This type of observational epidemiologic design and analysis is very useful for identifying the risk or protective factors that contribute to the disease of interest, with less expense, shorter duration, and smaller scope than cohort studies.
However, when disease risk is the parameter of greater interest, it cannot be estimated directly from a classical case-control study. Disease risk can be expressed as the disease probability in the exposed or unexposed group. If we borrow the data from the above 2x2 table (Table 3.1), the absolute disease risks (AR) in the exposed and unexposed groups might be expressed as:

\[ \text{AR in the exposed group} = \frac{a}{a+c} \quad\text{and}\quad \text{AR in the unexposed group} = \frac{b}{b+d}. \]

However, because the sample sizes in the case group and the control group are arbitrary, the above risk estimators are not accurate. For instance, if we simply double the sample size in the control group, the above disease risk estimates change into:

\[ \text{AR in the exposed group} = \frac{a}{a+2c} \quad\text{and}\quad \text{AR in the unexposed group} = \frac{b}{b+2d}. \]

Because of this feature of the case-control design, it is in general difficult to estimate the absolute disease risk in the exposed and unexposed groups accurately, even though under certain restricted conditions and under a rare disease assumption the odds may approach the risks.

3.2.2 Case-control Studies with Complex Sampling Designs

In contrast to classical case-control designs, case-control studies with alternative sampling designs are closer to a "prospective" method. In this setting, we have a finite population (the "study base"). We select all the cases and then sample controls from this study base. Cases can also be sampled, but for simplicity we illustrate the theory and examples here using all cases (see Figure 3.2). The likelihood for this case-control set is constructed from the probability of a set of cases given the case-control set. This probability is conditioned on the covariate values of the case-control set and treats the disease status as random, for which reason this kind of case-control design and the corresponding analytic methods can be called more "prospective" than classical designs.

Figure 3.2: Finite population model for case-control studies.
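The sensitivity of the naive risk estimates of Section 3.2.1 to the control sample size can be seen in a small sketch. The counts and the helper names `naive_risks` and `odds_ratio` are hypothetical and serve only to illustrate the argument.

```python
def naive_risks(a, b, c, d):
    """Naive 'risks' read off a case-control 2x2 table (cases a, b; controls c, d)."""
    return a / (a + c), b / (b + d)

def odds_ratio(a, b, c, d):
    """Exposure odds ratio ad / bc."""
    return (a * d) / (b * c)

# Hypothetical counts: 100 cases (40 exposed) and 100 controls (20 exposed).
a, b, c, d = 40, 60, 20, 80

before = (naive_risks(a, b, c, d), odds_ratio(a, b, c, d))
# Doubling the control sample doubles c and d but leaves the cases untouched.
after = (naive_risks(a, b, 2 * c, 2 * d), odds_ratio(a, b, 2 * c, 2 * d))
```

The naive risks shift with the arbitrary control sampling fraction, while the odds ratio is unchanged, which is why the odds ratio is estimable from the classical design and the absolute risk is not.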
3.2.3 Examples of Case-control Designs with Complex Sampling

3.2.3.1 Size Matching

Size matching is a very commonly used complex design. Controls are selected in numbers proportional to the number of cases. 1:m size matching means that $m$ controls per case are randomly sampled without replacement from all the controls in the full study base.

3.2.3.2 Independent Bernoulli Trial Sampling of Controls

This design is also called randomized recruitment of controls. The controls are included in the case-control sample according to the outcome of a Bernoulli trial, where the probability of inclusion may depend on the covariate information known for the subject.

3.2.3.3 Case-base Sampling

This design involves the selection of a simple random sample from the study base together with all the cases. Table 3.2 shows the structure of a case-base sample design. A fixed study base of finite size $N$ is defined; a case sample of size $c$ and a random sample of the study base of size $b$ (the base sample) are selected and classified by exposure status. The base sample is drawn regardless of disease status, and the case or control status of its members is determined afterwards. The base sample and the case sample can contain the same case subjects, and the final case-control set is the union of the subjects in the base sample and the case sample.

                                Exposed                  Unexposed                Total
Base Sample                     b_1                      b_0                      b = b_1 + b_0
  Cases*                        c*_1                     c*_0                     c* = c*_1 + c*_0
  Controls                      b_1 - c*_1               b_0 - c*_0               b - c*
Case Sample                     c_1                      c_0                      c = c_1 + c_0
Total in case-control set       b_1 + c_1 - c*_1         b_0 + c_0 - c*_0         b + c - c*
Total in study base             N_1                      N_0                      N = N_1 + N_0

* The same case subjects in both the base sample and the case sample.

Table 3.2: Case-base Sampling Design with Binary Outcome and Exposure.

3.2.3.4 Counter Matching

In a simple scenario with a dichotomous matching variable $C$ (two sampling strata), 1:m counter matching means that for each case, $m$ controls are selected without replacement from the sampling stratum of $C$ opposite to that of the case.
In a more general counter matching design, the counter-matching variable has $L$ sampling strata, and stratum $l$ contains $n_l$ subjects ($l\in\{1,2,\ldots,L\}$). If the case is from stratum $j$, then $m_l$ controls are sampled from each stratum $l\neq j$, and $m_j-1$ controls are sampled from stratum $j$. This yields the required $m_l$ subjects from each stratum (see Table 3.3).

Sampling stratum C_l            1      2      ...   j          ...   L      Total
Cases                           0      0      ...   1          ...   0      1
Controls                        m_1    m_2    ...   m_j - 1    ...   m_L    sum_l m_l - 1
Total in case-control set       m_1    m_2    ...   m_j        ...   m_L    sum_l m_l
Total in study base             n_1    n_2    ...   n_j        ...   n_L    sum_l n_l

Table 3.3: General counter-matching with the case in sampling stratum j.

3.3 Conditional Logistic Analysis of Case-Control Studies with Complex Sampling

The analysis methods for case-control studies with complex sampling were developed by Langholz and Goldstein (2001) based on a proportional odds (logistic) model. The likelihood for case-control data with general sampling was derived and illustrated in a number of control sampling designs. The following is a brief summary of the theory, which is also the basis of the methods for disease risk estimation proposed in a later chapter.

3.3.1 Notation and Model

As introduced in Section 3.2.2, we consider a finite population (or "study base") of $N$ subjects. Let $R=\{1,\ldots,N\}$ index the subjects, let $D_i$ indicate the disease status, and let $Z_i$ denote the covariate information for subject $i$. Assuming a proportional odds (logistic) model, the probability of disease is:

\[ \mathrm{pr}(D_i=1\mid Z_i) = \frac{\theta_0\,\varphi(Z_i;\beta_0)}{1+\theta_0\,\varphi(Z_i;\beta_0)} = \frac{\theta_0\varphi_i}{1+\theta_0\varphi_i}, \tag{3.2} \]

where $\theta_0$ is the baseline odds and $\varphi(Z_i;\beta_0)$ is the individual odds ratio associated with the covariate value $Z_i$. Let $D$ be the set of indices of the diseased subjects in the full study base.
Then the probability of observing a particular case set $\mathbf{d}$ under the proportional odds model can be written as:

\[ \mathrm{pr}(D=\mathbf{d}) = \prod_{i\in\mathbf{d}}\frac{\theta_0\varphi_i}{1+\theta_0\varphi_i}\prod_{j\in R\setminus\mathbf{d}}\frac{1}{1+\theta_0\varphi_j} = \theta_0^{|\mathbf{d}|}\,\varphi_{\mathbf{d}}\,Q_R, \]

where $\varphi_{\mathbf{d}}=\prod_{i\in\mathbf{d}}\varphi_i$ and $Q_R=\prod_{j\in R}\frac{1}{1+\theta_0\varphi_j}$. The likelihood for estimating $\theta_0$ and $\beta_0$ using the full study base is given by:

\[ L(\theta_0,\beta) = \theta_0^{|D|}\,\varphi_D(\beta)\,Q_R(\theta_0,\beta). \tag{3.3} \]

Since this likelihood requires information on all subjects in the full study base, it is not usually practical, especially when the study base size $N$ is very large. Thus we have developed methods to sample the study base. To simplify the problem, we include all the diseased "cases" and only sample the non-diseased "controls". We can construct the likelihood of the case set given the sampled case-control set $\tilde{R}$; this likelihood requires information only on the sampled subjects. Let $\pi(\mathbf{r}\mid\mathbf{s})$ denote the probability that $\mathbf{r}$ is the sampled case-control set given that $\mathbf{s}$ is the case set, which is also the basic information needed from the sampling design. Applying Bayes' theorem and equation (3.2), we have

\[ \mathrm{pr}(D=\mathbf{d}\mid\tilde{R}=\mathbf{r}) = \frac{\mathrm{pr}(\tilde{R}=\mathbf{r}\mid D=\mathbf{d})\,\mathrm{pr}(D=\mathbf{d})}{\sum_{\mathbf{s}\subseteq\mathbf{r}}\mathrm{pr}(\tilde{R}=\mathbf{r}\mid D=\mathbf{s})\,\mathrm{pr}(D=\mathbf{s})} = \frac{\theta_0^{|\mathbf{d}|}\,\varphi_{\mathbf{d}}\,\pi(\mathbf{r}\mid\mathbf{d})}{\sum_{\mathbf{s}\subseteq\mathbf{r}}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}\,\pi(\mathbf{r}\mid\mathbf{s})}, \]

and the likelihood for the sampled case-control set is

\[ L(\theta_0,\beta) = \frac{\theta_0^{|D|}\,\varphi_D(\beta)\,\pi(\tilde{R}\mid D)}{\sum_{\mathbf{s}\subseteq\tilde{R}}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\,\pi(\tilde{R}\mid\mathbf{s})}. \tag{3.4} \]

This likelihood is quite general, and we can substitute the specific probabilities $\pi(\mathbf{r}\mid\mathbf{d})$ (the "risk weights") for different sampling schemes.

3.3.2 Examples of Case-control Designs with Complex Sampling

In this section, we apply the general methods described above to a variety of case-control designs with sampling. It will be shown that this method can accommodate many interesting and even creative designs, which may be very useful under proper settings.

3.3.2.1 Size Matching

Suppose we have a finite study base of size $N$ with a total of $|\mathbf{d}|$ diseased subjects ("cases"). In a well-established $1:m$ size matching design, we randomly select $m|\mathbf{d}|$ controls from the $N-|\mathbf{d}|$ non-diseased subjects.
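Then the control selection probability (the probability that $\mathbf{r}$ is the sampled case-control set given that $\mathbf{d}$ is the case set) is:

\[ \pi(\mathbf{r}\mid\mathbf{d}) = \binom{N-|\mathbf{d}|}{m|\mathbf{d}|}^{-1}, \]

and the conditional likelihood of the observed case set among all possible sets of $|D|$ cases in the sampled set is:

\[ L(\theta_0,\beta) = \frac{\theta_0^{|D|}\,\varphi_D(\beta)\binom{N-|D|}{m|D|}^{-1}}{\sum_{\mathbf{s}\subseteq\tilde{R}:|\mathbf{s}|=|D|}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\binom{N-|D|}{m|D|}^{-1}} = \frac{\varphi_D(\beta)}{\sum_{\mathbf{s}\subseteq\tilde{R}:|\mathbf{s}|=|D|}\varphi_{\mathbf{s}}(\beta)}. \tag{3.5} \]

Because in this setting the likelihood is conditioned on case sets of equal size, the "risk weights" in the numerator and denominator cancel out, and so does the baseline odds parameter $\theta_0$. Using the outcome and covariate information from the subjects of the case-control set, we are able to estimate the odds ratio $\varphi$, but not the baseline odds parameter $\theta_0$.

3.3.2.2 Independent Bernoulli Trial Sampling of Controls

In some circumstances, we select the controls according to the outcome of a Bernoulli trial. We can adjust the success probability of the Bernoulli trial to obtain the desired number of controls. We can also adjust the inclusion probability for different subsets of subjects, e.g. according to different exposure levels such as smokers/non-smokers or female/male. Let $\rho_j$ be the probability of including subject $j$ in the case-control set if $j$ were a control. Then the control selection probability is given by

\[ \pi(\mathbf{r}\mid\mathbf{d}) = \prod_{j\in\mathbf{r}\setminus\mathbf{d}}\rho_j\prod_{j\in R\setminus\mathbf{r}}(1-\rho_j) = \prod_{j\in\mathbf{d}}\rho_j^{-1}\prod_{j\in\mathbf{r}}\rho_j\prod_{j\in R\setminus\mathbf{r}}(1-\rho_j), \]

and the conditional likelihood becomes

\[
\begin{aligned}
L(\theta_0,\beta)
&= \frac{\theta_0^{|D|}\,\varphi_D(\beta)\prod_{j\in D}\rho_j^{-1}\prod_{j\in\tilde{R}}\rho_j\prod_{j\in R\setminus\tilde{R}}(1-\rho_j)}{\sum_{\mathbf{s}\subseteq\tilde{R}}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\prod_{k\in\mathbf{s}}\rho_k^{-1}\prod_{k\in\tilde{R}}\rho_k\prod_{k\in R\setminus\tilde{R}}(1-\rho_k)} && (3.6)\\
&= \frac{\prod_{j\in D}[\rho_j^{-1}\theta_0\varphi_j(\beta)]}{\sum_{\mathbf{s}\subseteq\tilde{R}}\prod_{k\in\mathbf{s}}[\rho_k^{-1}\theta_0\varphi_k(\beta)]}
 = \frac{\prod_{j\in D}[\rho_j^{-1}\theta_0\varphi_j(\beta)]\,Q_{\tilde{R}}(\theta_0,\beta)}{\sum_{\mathbf{s}\subseteq\tilde{R}}\prod_{k\in\mathbf{s}}[\rho_k^{-1}\theta_0\varphi_k(\beta)]\,Q_{\tilde{R}}(\theta_0,\beta)} && (3.7)\\
&= \prod_{j\in D}[\rho_j^{-1}\theta_0\varphi_j(\beta)]\,Q_{\tilde{R}}(\theta_0,\beta), && (3.8)
\end{aligned}
\]

where $Q_{\tilde{R}}(\theta_0,\beta)=\prod_{k\in\tilde{R}}\{1+\rho_k^{-1}\theta_0\varphi_k(\beta)\}^{-1}$. In the likelihood (3.6), the factors common to $\tilde{R}$ and $R\setminus\tilde{R}$ cancel out. We incorporate the term $\rho_j^{-1}$ into the corresponding "odds" term, and rewrite the numerator and denominator to obtain (3.8).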
The denominator in (3.7) then equals one, because its terms define a probability distribution over all possible sets of cases for any $\theta_0$ and $\beta$. Under this distribution, the reconstructed odds ratio of having the outcome is $\rho_j^{-1}\varphi_j(\beta)$.

3.3.2.3 Case-base Sampling

As described in Table 3.2, in this design we randomly select $b$ subjects from the study base of size $N$, regardless of case or control status. All other cases in the study base that were not initially selected among the $b$ subjects are then added to the sample. Thus the final case-control set consists of the base sample of size $b$ together with the additional cases. Suppose we have $|\mathbf{d}|$ cases in the final sample. Then the control selection probability (the probability that $\mathbf{r}$ is the sampled case-control set given that $\mathbf{d}$ is the case set) is:

\[ \pi(\mathbf{r}\mid\mathbf{d}) = \frac{\binom{|\mathbf{d}|}{b-(|\mathbf{r}|-|\mathbf{d}|)}}{\binom{N}{b}}. \]

Substituting this probability into the general expression of the likelihood in (3.4), we obtain

\[ L(\theta_0,\beta) = \frac{\theta_0^{|D|}\,\varphi_D(\beta)\binom{|D|}{b-(|\tilde{R}|-|D|)}}{\sum_{\mathbf{s}\subseteq\tilde{R}}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\binom{|\mathbf{s}|}{b-(|\tilde{R}|-|\mathbf{s}|)}} = \frac{\theta_0^{|D|}\,\varphi_D(\beta)\binom{|D|}{b-(|\tilde{R}|-|D|)}}{\sum_{k=|\tilde{R}|-b}^{|\tilde{R}|}\sum_{\mathbf{s}\subseteq\tilde{R}:|\mathbf{s}|=k}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\binom{k}{b-(|\tilde{R}|-k)}}. \tag{3.9} \]

In this likelihood, $\theta_0$ does not cancel, which leads to an alternative approach for estimating the baseline odds $\theta_0$ and the odds ratio parameter $\beta$.

3.3.2.4 Counter Matching

As described in Table 3.3, in a general counter-matching (CM) design we assume a counter-matching variable $C$ with $L$ sampling strata. Table 3.3 shows the scenario of a single case. Assuming there is a total of $|\mathbf{d}|$ cases in the study base, for each stratum we fix a marginal total of $m_l|\mathbf{d}|$ subjects to be sampled. Suppose stratum $l$ contains $n_l$ subjects, of which $d_l$ are cases. We then randomly select $m_l|\mathbf{d}|-d_l$ controls out of the $n_l-d_l$ non-diseased subjects in each stratum.
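The control selection probability can be written as:

\[ \pi(\mathbf{r}\mid\mathbf{d}) = \left[\prod_{l=1}^{L}\binom{n_l-d_l}{m_l|\mathbf{d}|-d_l}\right]^{-1}. \]

Substituting this probability into the general expression of the likelihood in (3.4), we get:

\[
\begin{aligned}
L(\theta_0,\beta)
&= \frac{\theta_0^{|D|}\,\varphi_D(\beta)\left[\prod_{l=1}^{L}\binom{n_l-D_l}{m_l|D|-D_l}\right]^{-1}}{\sum_{\mathbf{s}\subseteq\tilde{R}:|\mathbf{s}|=|D|}\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\left[\prod_{l=1}^{L}\binom{n_l-s_l}{m_l|D|-s_l}\right]^{-1}} && (3.10)\\
&= \frac{\varphi_D(\beta)\left[\prod_{l=1}^{L}(n_l-D_l)(n_l-D_l-1)\cdots(m_l|D|-D_l+1)\right]^{-1}}{\sum_{\mathbf{s}\subseteq\tilde{R}:|\mathbf{s}|=|D|}\varphi_{\mathbf{s}}(\beta)\left[\prod_{l=1}^{L}(n_l-s_l)(n_l-s_l-1)\cdots(m_l|D|-s_l+1)\right]^{-1}},
\end{aligned}
\]

where $D_l$ and $s_l$ denote the numbers of cases in stratum $l$ for the case sets $D$ and $\mathbf{s}$, respectively.

3.3.3 Computation Methods on the Odds Ratio

Langholz and Goldstein (2001) provide three computational methods to estimate the odds ratio for case-control designs with complex sampling. Here we briefly introduce two of them.

3.3.3.1 Exploit Log-linear Form

Considering the conditional likelihood expressed in (3.4), we can rewrite it in exponential form when the odds ratio term is log-linear. We replace $\pi(\tilde{R}\mid\mathbf{s})$ with a more general term $w_{\tilde{R}}(\mathbf{s})$, called the risk weight, and assume there are $k$ covariates for each individual:

\[
\begin{aligned}
\theta_0^{|\mathbf{s}|}\,\varphi_{\mathbf{s}}(\beta)\,w_{\tilde{R}}(\mathbf{s})
&= w_{\tilde{R}}(\mathbf{s})\prod_{i\in\mathbf{s}}\left[\theta_0\exp(z_{1i}\beta_1+\cdots+z_{ki}\beta_k)\right]\\
&= \exp\left\{\log(w_{\tilde{R}}(\mathbf{s}))+|\mathbf{s}|\log(\theta_0)+\Big(\sum_{i\in\mathbf{s}}z_{1i}\Big)\beta_1+\cdots+\Big(\sum_{i\in\mathbf{s}}z_{ki}\Big)\beta_k\right\}.
\end{aligned}
\]

The above likelihood is equivalent to an individually matched conditional logistic likelihood, except that each candidate case set is treated as a single "pseudo" case whose covariate value equals the sum of the covariate values over the members of the set. The log risk weight of each set is added to each line of data as an offset.

3.3.3.2 Unconditional Logistic Regression

Unconditional logistic regression based on a "marginal" likelihood may also be valid for large samples. The marginal probability that $i$ is a case given that $i$ is a member of the case-control set is given by

\[ \mathrm{pr}(D_i=1\mid i\in\tilde{R}) = \frac{\pi_i^{-1}\theta_0\varphi_i}{1+\pi_i^{-1}\theta_0\varphi_i}, \tag{3.11} \]

where

\[ \pi_i = \mathrm{pr}(i\in\tilde{R}\mid D_i=0) = \sum_{\mathbf{s}\subset R:\,\mathbf{s}\not\ni i}\Big[\sum_{\mathbf{r}\subset R:\,\mathbf{r}\ni i}\pi(\mathbf{r}\mid\mathbf{s})\Big]\mathrm{pr}(D=\mathbf{s}\mid D_i=0) = E\Big[\Big(\sum_{\mathbf{r}\subset R:\,\mathbf{r}\ni i}\pi(\mathbf{r}\mid D)\Big)\,\Big|\,D_i=0\Big]. \]

To compute $\pi_i$, we need the information for all study base members, which is not available in most scenarios.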
Thus it is reasonable to use a method-of-moments type estimator based on the case-control set information only: $\hat{\pi}_i=\sum_{\mathbf{r}\subset R:\,\mathbf{r}\ni i}\pi(\mathbf{r}\mid D)$. By including $-\log(\hat{\pi}_i)$ as an offset in the unconditional logistic regression model, we can compute odds ratios as well as the baseline odds using standard unconditional logistic methods. Note that the variance estimates of the parameters will be more conservative (larger).

3.4 The Connection between Case-control and Survey Sampling Designs

In the previous sections, we reviewed the odds ratio estimation methods for classical case-control studies and case-control studies with complex sampling. None of these provides an accurate or valid estimate of the baseline odds or its variance. In order to compute the disease odds for each individual with a given covariate profile under the proportional odds model, the estimation of the baseline odds and its variance is a necessity. Here we propose a statistical approach derived from survey sampling.

Survey sampling describes the process of selecting a sample from a target population in order to conduct a survey and make inferences about the target population. Assume that the target population is $R$, consisting of subjects $\{1,\ldots,N\}$ with unknown values of interest $\{Y_1,\ldots,Y_N\}$. A basic sample survey problem is to sample the population, collect the values $Y_i$ on the sample, and then estimate the population sum $Y_R=\sum_{i\in R}Y_i$ or the population mean $\bar{Y}_R=N^{-1}\sum_{i\in R}Y_i$. The major purpose of survey sampling is to reduce the cost and/or the amount of work that it would take to survey the entire population. To some extent, this advantage is very similar to that of a case-control study, as both are able to paint the whole picture by sampling the parts of it that matter most.

A valid survey sampling design and an appropriate unbiased homogeneous linear (UHL) estimator are the key to making mathematically sound statistical inferences about the target population.
Let $\tilde{R}$ denote a random sample selected from the population $R$ with distribution $\pi$ over subsets of $R$, i.e. $\pi(\mathbf{r})$ denotes the probability of selecting $\mathbf{r}$ as the survey sample from $R$. We say $\tilde{R}$ (or $\pi$) is a survey sampling design. We define $Y_{\tilde{R}}$ to be a $\pi$-unbiased homogeneous linear ($\pi$-UHL) estimator if, for any $Y_1,\ldots,Y_N$, it satisfies:

\[ E[Y_{\tilde{R}}] = Y_R = \sum_{i\in R}Y_i \quad\text{(unbiased)}, \]

\[ Y_{\tilde{R}} = \sum_{i\in\tilde{R}}w_i(\tilde{R})\,Y_i \quad\text{where } w_i(\tilde{R})\ge 0 \quad\text{(homogeneous linear)}. \]

The survey problem can be easily connected to case-control designs with complex sampling. We use the same notation for both problems, even though the symbols refer to different objects in each. The target population $R$ in the survey sampling problem is the analog of the study base (a finite population) in the case-control design, and $\tilde{R}$ in survey sampling is the analog of the case-control sample set. More directly, let $(w,\pi)$ be a $\pi$-UHL survey sampling design. Then

\[ \pi(\mathbf{r}\mid i) = w_i(\mathbf{r})\,\pi(\mathbf{r}) \]

is a case-control design, where $\pi(\mathbf{r}\mid i)$ is the control selection probability given that $i$ is the case.

From the above formula bridging survey sampling and case-control studies, we see that the weights $w_i(\mathbf{r})$ in the $\pi$-UHL estimator can be computed using only the information collected from the sampled case-control set. Under our general assumption of a proportional odds (logistic) model, the weights can be computed using the data from the case-control set and depend on the baseline odds $\theta_0$:

\[ E[Y_{\tilde{R}}] = E\Big[\sum_{i\in\tilde{R}}w_i(\tilde{R};\theta_0)\,Y_i\Big] = Y_R = \sum_{i\in R}Y_i. \]

Since $\theta_0$ is the parameter of interest, we can set $Y_i\equiv 1$ (or another convenient value), so that

\[ E[Y_{\tilde{R}}] = E\Big[\sum_{i\in\tilde{R}}w_i(\tilde{R};\theta_0)\,Y_i\Big] = Y_R = \sum_{i\in R}Y_i = N. \]

Recall that the size of the study base $R$ is $N$. Thus a method-of-moments estimator $\hat{\theta}_0$ can be obtained by solving

\[ \sum_{i\in\tilde{R}}w_i(\tilde{R};\hat{\theta}_0) = N. \]

Its variance can be estimated using the Delta method. This is the general approach we propose to estimate the baseline odds $\theta_0$.
However, we note that the weights must be constructed so that they depend only on covariate information from the sample, although they may additionally depend on the odds ratio under a proportional odds model. The odds ratio will be estimated using the conditional logistic likelihood as described in the previous chapter.

Chapter 4
New Estimation Methods for Baseline Odds and Risk

4.1 General Notation

To be consistent with previous chapters, we consider a finite population $R$ (the "study base") of $N$ subjects. Let $R=\{1,\ldots,N\}$ index the subjects, let $D_i$ indicate the disease status, let $z_i$ denote the covariate information for subject $i$, and let $Y_i$ continue to denote the nuisance covariate whose total we want to estimate in the survey sampling problem. We select all the cases $D$ in the study base, and sample the controls from the remaining subjects in the study base according to a given control selection probability $\pi(\mathbf{r}\mid D)$. We also assume a proportional odds (logistic) model, and the disease probability is exactly the disease risk we want to estimate for each individual:

\[ \mathrm{pr}(D_i=1\mid z_i) = \frac{\theta_0\,\varphi(z_i;\beta)}{1+\theta_0\,\varphi(z_i;\beta)} = \frac{\theta_0\varphi_i}{1+\theta_0\varphi_i}, \tag{4.1} \]

where $\theta_0$ is the baseline odds and $\varphi(z_i;\beta)$ (or $\varphi_i$) is the individual odds ratio associated with the exposure value $z_i$, which can also be written in a log-linear form:

\[ \varphi(z_i;\beta) = \exp(z_i\beta). \]

4.2 A Natural Horvitz-Thompson Baseline Odds Estimator

Recall that the study base is $R=\{1,\ldots,N\}$ with characteristic values $\{Y_1,\ldots,Y_N\}$; we estimate the population sum $Y_R=\sum_{i\in R}Y_i$. Under the proportional odds model assumption, each individual has a disease probability of $p_i=\frac{\theta_0\varphi_i}{1+\theta_0\varphi_i}$. The disease occurrence in the study base can then be considered as a set of independent Bernoulli trials, $D_i\sim BT(p_i)$.

In survey sampling problems, a very general technique to estimate a population total is the Horvitz-Thompson estimator [16, 32].
The estimator applies inverse probability weighting (by the inverse of the inclusion probability) to each element selected for the sample. For the scenario described above, the natural Horvitz-Thompson estimator for $Y_R$ using cases only is given by

\[ Y_D = \sum_{i\in D}Y_i/p_i. \tag{4.2} \]

The expectation of $Y_D$ is equal to $Y_R$, i.e. $Y_D$ is unbiased for $Y_R$:

\[ E[Y_D] = E\Big[\sum_{i\in D}\frac{Y_i}{p_i}\Big] = E\Big[\sum_{i\in R}\frac{Y_i}{p_i}D_i\Big] = \sum_{i\in R}p_i\frac{Y_i}{p_i} = \sum_{i\in R}Y_i = Y_R. \]

Under the assumption that disease occurrence is independent across individuals, the variance of the Horvitz-Thompson estimator $Y_D$ is given by:

\[ \mathrm{var}[Y_D] = \mathrm{var}\Big(\sum_{i\in R}(Y_i/p_i)\,D_i\Big) = \sum_{i\in R}(Y_i^2/p_i^2)\,\mathrm{var}(D_i) = \sum_{i\in R}Y_i^2\,\frac{1-p_i}{p_i}. \tag{4.3} \]

An unbiased estimator of the variance of $Y_D$ is also readily obtained:

\[ \widehat{\mathrm{var}}[Y_D] = \sum_{i\in D}\frac{Y_i^2}{p_i}\,\frac{1-p_i}{p_i}. \]

Setting $Y_i\equiv 1$, we have

\[ E[Y_D] = E\Big[\sum_{i\in D}\frac{Y_i}{p_i}\Big] = E\Big[\sum_{i\in D}\frac{1}{p_i}\Big] = E\Big[\sum_{i\in D}\Big(1+\frac{1}{\theta_0\varphi_i}\Big)\Big] = N. \]

The method-of-moments estimator $\hat{\theta}_0$ can be obtained by solving $\sum_{i\in D}\big(1+\frac{1}{\hat{\theta}_0\varphi_i}\big)=N$:

\[ \hat{\theta}_0 = \frac{\sum_{i\in D}\frac{1}{\varphi_i}}{N-|D|}. \tag{4.4} \]

The variance of this estimator can be estimated using the Delta method. Suppose $X$ is a random variable with $E(X)=\mu$. For a given function $g$ such that $g'(\mu)$ exists and is non-zero, we approximately have:

\[ E(g(X))\approx g(\mu) \quad\text{and}\quad \mathrm{Var}(g(X))\approx[g'(\mu)]^2\,\mathrm{Var}(X). \]

Let $X=Y_D$ and $g(X)=\theta_0=\frac{\sum_{i\in D}\frac{1}{\varphi_i}}{Y_D-|D|}$. Then we have:

\[ \mathrm{Var}(\hat{\theta}_0) \approx \left[\left.\Big(\frac{\sum_{i\in D}\frac{1}{\varphi_i}}{Y_D-|D|}\Big)'\right|_{Y_D=N}\right]^2\mathrm{Var}[Y_D] = \left[\frac{\sum_{i\in D}\frac{1}{\varphi_i}}{(N-|D|)^2}\right]^2\mathrm{Var}[Y_D]. \]

Let $X=\hat{\theta}_0$ and $g(X)=\alpha=\log(\theta_0)$. Then we have:

\[ \mathrm{Var}(\hat{\alpha}) \approx \left[(\log(\theta_0))'\right]^2\mathrm{Var}[\hat{\theta}_0] = \Big(\frac{1}{\hat{\theta}_0}\Big)^2\mathrm{Var}[\hat{\theta}_0]. \]

4.3 A Rao-Blackwellized Baseline Odds Estimator

4.3.1 Rao-Blackwellized Estimator $Y_{\tilde{R}}$ for $Y_R$

Recall that $\tilde{R}$ indicates the sampled case-control set and $D$ indicates the case set ($D\subseteq\tilde{R}$), and that we want to estimate the population sum $Y_R=\sum_{i\in R}Y_i$.
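Before the Rao-Blackwellized version is developed, the cases-only estimator (4.4) of Section 4.2 can be tried on simulated data. This is a minimal sketch under strong assumptions: the odds ratios $\varphi_i$ are treated as known (in practice $\beta$ would be estimated by conditional logistic regression), and the study base size, baseline odds and odds ratio below are hypothetical values. The function name `ht_baseline_odds` is likewise only illustrative.

```python
import math
import random

def ht_baseline_odds(phi_cases, N):
    """Solve sum_{i in D} (1 + 1 / (theta * phi_i)) = N for theta, as in (4.4)."""
    return sum(1.0 / phi for phi in phi_cases) / (N - len(phi_cases))

rng = random.Random(0)                      # fixed seed for reproducibility
N, theta0, beta = 100_000, 0.02, math.log(2.0)   # hypothetical study base

phi_cases = []
for _ in range(N):
    z = rng.randint(0, 1)                   # binary exposure
    phi = math.exp(beta * z)                # individual odds ratio exp(z * beta)
    p = theta0 * phi / (1.0 + theta0 * phi) # disease probability under (4.1)
    if rng.random() < p:                    # D_i ~ Bernoulli(p_i)
        phi_cases.append(phi)

theta_hat = ht_baseline_odds(phi_cases, N)

# Delta-method variance of log(theta_hat), plugging theta_hat into var[Y_D] with Y_i = 1.
var_YD = sum((1.0 + 1.0 / (theta_hat * phi)) / (theta_hat * phi) for phi in phi_cases)
var_theta = (sum(1.0 / phi for phi in phi_cases) / (N - len(phi_cases)) ** 2) ** 2 * var_YD
var_alpha = var_theta / theta_hat ** 2
```

Only the cases and the study base size $N$ enter the estimator; no control data are used, which is what the Rao-Blackwell step in the next section improves upon.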
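Since $Y_D$ is an unbiased estimator for $Y_R$, a Rao-Blackwell estimator $Y_{\tilde R}$ can be obtained by taking the conditional expectation of the original estimator $Y_D$ given the case-control set $\tilde R$:

\[ Y_{\tilde R} = E[Y_D \mid \tilde R]. \tag{4.5} \]

That this estimator is also unbiased follows from the law of total expectation (i.e., the expected value of the conditional expected value of $X$ given $Y$ is the same as the expected value of $X$, where $X$ is an integrable random variable and $Y$ is any random variable, not necessarily integrable, on the same probability space). The proof of the law of total expectation in the discrete case is as follows:

\[
\begin{aligned}
E_Y\big[E_{X\mid Y}(X\mid Y)\big] &= E_Y\Big[\sum_x x\,P(X=x\mid Y)\Big] \\
&= \sum_y \Big[\sum_x x\,P(X=x\mid Y=y)\Big]\,P(Y=y) \\
&= \sum_y \sum_x x\,P(X=x\mid Y=y)\,P(Y=y) \\
&= \sum_x x \sum_y P(X=x\mid Y=y)\,P(Y=y) \\
&= \sum_x x \sum_y P(X=x,\,Y=y) \\
&= \sum_x x\,P(X=x) = E(X).
\end{aligned}
\]

The proof that the Rao-Blackwell estimator $Y_{\tilde R}$ is unbiased for $Y_R$ then follows in the same way:

\[
\begin{aligned}
E_{\tilde R}\big[E_{Y_D\mid\tilde R}(Y_D\mid\tilde R)\big] &= E_{\tilde R}\Big[\sum_{Y_d} Y_d\,P(Y_D=Y_d\mid\tilde R)\Big] \\
&= \sum_r \Big[\sum_{Y_d} Y_d\,P(Y_D=Y_d\mid\tilde R=r)\Big]\,P(\tilde R=r) \\
&= \sum_r \sum_{Y_d} Y_d\,P(Y_D=Y_d\mid\tilde R=r)\,P(\tilde R=r) \\
&= \sum_r \sum_{Y_d} Y_d\,P(Y_D=Y_d,\,\tilde R=r) \\
&= E(Y_D) = Y_R.
\end{aligned}
\]

Further, by the law of total variance,

\[ \mathrm{Var}[Y] = E_X\big[\mathrm{Var}[Y\mid X]\big] + \mathrm{Var}_X\big[E[Y\mid X]\big], \]

we have

\[ \mathrm{Var}[Y_D] = E_{\tilde R}\big[\mathrm{Var}[Y_D\mid\tilde R]\big] + \mathrm{Var}_{\tilde R}\big[E[Y_D\mid\tilde R]\big]. \]

Recalling the definition of $Y_{\tilde R}$ in Equation (4.5), it follows that, compared to $Y_D$, the Rao-Blackwell estimator $Y_{\tilde R}$ is an improved estimator for $Y_R$, as the variance of $Y_{\tilde R}$ is less than or equal to that of $Y_D$:

\[ \mathrm{var}[Y_{\tilde R}] = \mathrm{var}[Y_D] - E_{\tilde R}\big[\mathrm{var}[Y_D\mid\tilde R]\big] \le \mathrm{var}[Y_D]. \]

Considering a very general case-control sampling scheme from a finite population as described in Section 4.1, we can explicitly write out the general expression for the point estimator $Y_{\tilde R}$:

\[
\begin{aligned}
Y_{\tilde R} &= E[Y_D\mid\tilde R] = E\Big[\sum_{i\in R}(Y_i/p_i)\,D_i \,\Big|\, \tilde R\Big] \\
&= \sum_{i\in\tilde R}(Y_i/p_i)\,\mathrm{pr}(D_i=1\mid\tilde R) \\
&= \sum_{i\in\tilde R} Y_i\,\frac{\mathrm{pr}(D_i=1\mid\tilde R=r)}{\mathrm{pr}(D_i=1)} \\
&= \sum_{i\in\tilde R} Y_i\,\frac{\mathrm{pr}(D_i=1,\,\tilde R=r)}{\mathrm{pr}(\tilde R=r)\,\mathrm{pr}(D_i=1)} \\
&= \sum_{i\in\tilde R} Y_i\,\frac{\mathrm{pr}(\tilde R=r\mid D_i=1)}{\mathrm{pr}(\tilde R=r)} \\
&= \sum_{i\in\tilde R} Y_i\,\frac{\sum_{d\subset\tilde R\setminus\{i\}}\lambda_d\,q_{R\setminus\{i\}}\,\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R}\lambda_d\,q_R\,\pi(\tilde R\mid d)} \\
&= \sum_{i\in\tilde R} Y_i\,\frac{\sum_{d\subset\tilde R\setminus\{i\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i\})}{q_i\sum_{d\subset\tilde R}\lambda_d\,\pi(\tilde R\mid d)} \\
&= \sum_{i\in\tilde R}\frac{Y_i}{q_i}\,w_{i,\tilde R},
\end{aligned}
\tag{4.6}
\]

where

\[ w_{i,\tilde R} = \frac{\sum_{d\subset\tilde R\setminus\{i\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R}\lambda_d\,\pi(\tilde R\mid d)}. \]

Here $q_i = 1-p_i$, $q_R = \prod_{i\in R} q_i$, $\lambda_d = \prod_{j\in d} p_j/q_j$ is the product of the individual disease odds over the set $d$, and $\pi(\tilde R\mid d)$ is the probability of selecting the case-control set $\tilde R$ when the case set is $d$.

The variance of $Y_{\tilde R}$ is

\[
\begin{aligned}
\mathrm{var}[Y_{\tilde R}] &= \mathrm{var}\big[E[Y_D\mid\tilde R]\big] = \mathrm{var}[Y_D] - E\big[\mathrm{var}[Y_D\mid\tilde R]\big] \\
&= \sum_{i\in R} Y_i^2\,\frac{1-p_i}{p_i} - \sum_{r\subset R}\Big[\sum_{d\subset r} p_{d\mid r}\,(Y_d - Y_r)^2\Big]\,\mathrm{pr}(\tilde R=r),
\end{aligned}
\]

which can be unbiasedly estimated by

\[
\begin{aligned}
\widehat{\mathrm{var}}[Y_{\tilde R}] &= \widehat{\mathrm{var}}[Y_D] - \sum_{d\subset\tilde R} p_{d\mid\tilde R}\,(Y_d - Y_{\tilde R})^2 \\
&= \widehat{\mathrm{var}}[Y_D] - \sum_{i,j\in\tilde R}\frac{Y_i}{q_i}\,\frac{Y_j^T}{q_j}\,\big(w_{i,j,\tilde R} - w_{i,\tilde R}\,w_{j,\tilde R}\big),
\end{aligned}
\tag{4.7}
\]

where

\[
w_{i,j,\tilde R} =
\begin{cases}
w_{i,\tilde R}/\lambda_i & \text{if } i=j, \\[4pt]
\dfrac{\sum_{d\subset\tilde R\setminus\{i,j\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i,j\})}{\sum_{d\subset\tilde R}\lambda_d\,\pi(\tilde R\mid d)} & \text{if } i\ne j.
\end{cases}
\]

The detailed derivation of the above quantities can be found in Appendix A.

4.3.2 Computation of $Y_{\tilde R}$ and $\widehat{\mathrm{var}}[Y_{\tilde R}]$

For general designs and one-dimensional $X$, let

\[ C_0(X) = \sum_{d\subset\tilde R}\lambda_d\,\pi(\tilde R\mid d), \]

\[ C_1(X) = \sum_{i\in\tilde R} X_i \sum_{d\subset\tilde R\setminus\{i\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i\}) = \sum_{d\subset\tilde R}\lambda_d \sum_{i\in\tilde R\setminus d} X_i\,\pi(\tilde R\mid d\cup\{i\}), \]

and

\[ C_2(X) = \sum_{i\ne j\in\tilde R} X_i X_j^T \sum_{d\subset\tilde R\setminus\{i,j\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i,j\}) = \sum_{d\subset\tilde R}\lambda_d \sum_{i\ne j\in\tilde R\setminus d} X_i X_j^T\,\pi(\tilde R\mid d\cup\{i,j\}). \]

Then we have

\[ Y_{\tilde R} = \frac{C_1(Y/q)}{C_0} \tag{4.8} \]

and

\[ \widehat{\mathrm{var}}[Y_{\tilde R}] = \widehat{\mathrm{var}}[Y_D] - \left[\frac{C_1(Y^2/pq) + C_2(Y/q)}{C_0} - \left(\frac{C_1(Y/q)}{C_0}\right)^2\right]. \tag{4.9} \]

Since the estimator $Y_{\tilde R}$ can be written as a function of the baseline odds $\lambda_0$, we can set $Y_i$ to a convenient value such as $Y_i \equiv 1$ and solve for the method-of-moments estimator of the baseline odds $\lambda_0$. Because the control selection probabilities differ between case-control sampling designs, the explicit estimator of the baseline odds $\lambda_0$ is derived under specific designs.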
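The subset sums in Equations (4.6) and (4.8) can be illustrated by brute-force enumeration on a very small case-control set. The following is only a sketch under stated assumptions: the function names and the callable `sel_prob`, which stands in for $\pi(\tilde R\mid d)$, are hypothetical and not part of the thesis, and the enumeration is exponential in $|\tilde R|$, so it is useful only as a check.

```python
from itertools import combinations
from math import prod


def subsets(items):
    """Yield every subset of `items` (including the empty set) as a tuple."""
    items = list(items)
    for k in range(len(items) + 1):
        yield from combinations(items, k)


def rao_blackwell_total(y, p, sel_prob):
    """Brute-force Y_Rtilde = C_1(Y/q) / C_0, following Equation (4.8).

    y, p     : values Y_i and disease probabilities p_i for the members of R~
    sel_prob : hypothetical callable returning pi(R~ | d) for a frozenset d
               of member indices treated as the case set
    """
    members = range(len(y))
    q = [1.0 - p_i for p_i in p]
    odds = [p_i / q_i for p_i, q_i in zip(p, q)]

    def lam(d):
        # lambda_d: product of the individual odds over d (1 for the empty set)
        return prod(odds[j] for j in d)

    c0 = sum(lam(d) * sel_prob(frozenset(d)) for d in subsets(members))
    c1 = 0.0
    for i in members:
        others = [j for j in members if j != i]
        inner = sum(lam(d) * sel_prob(frozenset(d) | {i}) for d in subsets(others))
        c1 += y[i] / q[i] * inner
    return c1 / c0


y = [1.0, 2.0, 3.5, 0.5]
p = [0.10, 0.30, 0.20, 0.05]
print(rao_blackwell_total(y, p, lambda d: 1.0))
```

As a sanity check, when the whole study base is observed and `sel_prob` is identically 1, conditioning on $\tilde R$ carries no information, so the estimator should reproduce the total $\sum_i Y_i$ up to floating-point error.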
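4.3.3 Rao-Blackwellized Baseline Odds Estimator in Size Matching Sampling Design

4.3.3.1 Baseline Odds Coefficient $\hat\alpha$

We have the general expression of the unbiased "Rao-Blackwellized" estimator $Y_{\tilde R}$ (4.6):

\[ Y_{\tilde R} = \sum_{i\in\tilde R}\frac{Y_i}{q_i}\,w_{i,\tilde R}, \qquad\text{where}\qquad w_{i,\tilde R} = \frac{\sum_{d\subset\tilde R\setminus\{i\}}\lambda_d\,\pi(\tilde R\mid d\cup\{i\})}{\sum_{s\subset\tilde R}\lambda_s\,\pi(\tilde R\mid s)}. \]

In size matching, we have

\[ w_{i,|D|,\tilde R} = \frac{\sum_{d\subset\tilde R\setminus\{i\}:\,|d|=|D|-1}\varphi_d\,\pi(\tilde R\mid d\cup\{i\})}{\exp(\alpha)\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d\,\pi(\tilde R\mid d)} \]

and, with $Y_i \equiv 1$,

\[
\begin{aligned}
Y_{\tilde R}(\alpha) &= \sum_{i\in\tilde R}\frac{1}{q_i}\,w_{i,|D|,\tilde R} = \sum_{i\in\tilde R}\big(1+\exp(\alpha)\,\varphi_i\big)\,w_{i,|D|,\tilde R} \\
&= \sum_{i\in\tilde R}\big(\exp(-\alpha)+\varphi_i\big)\,\frac{\sum_{d\subset\tilde R\setminus\{i\}:\,|d|=|D|-1}\varphi_d\,\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d\,\pi(\tilde R\mid d)} \\
&= \exp(-\alpha)\,\frac{\sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d\sum_{i\in\tilde R\setminus d}\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d\,\pi(\tilde R\mid d)} + |D|.
\end{aligned}
\]

Thus, setting $Y_{\tilde R}(\hat\alpha) = N$ and solving for $\exp(\hat\alpha)$, and noting that under size matching the selection probability $\pi(\tilde R\mid d)$ is the same for every case set $d$ of size $|D|$ and therefore cancels, we have

\[
\exp(\hat\alpha) = (N-|D|)^{-1}\,\frac{\sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d\sum_{i\in\tilde R\setminus d}\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d\,\pi(\tilde R\mid d)}
= \frac{|\tilde R|-(|D|-1)}{N-|D|}\cdot\frac{\sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d}.
\]

The estimator for the baseline odds coefficient $\hat\alpha$ is therefore

\[ \hat\alpha = \log\left(\frac{|\tilde R|-(|D|-1)}{N-|D|}\cdot\frac{\sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d}\right). \tag{4.10} \]

4.3.3.2 Variance Estimate for Baseline Odds Coefficient $\hat\alpha$

Based on the delta method and a Taylor series approximation, we can obtain the variance estimator for the baseline odds coefficient $\alpha$. Suppose we have a function $U$ of the baseline odds coefficient $\alpha$ and the odds ratio coefficient $\beta$, with $E(U(\alpha,\beta)) = 0$, $E(\alpha) = \alpha_0$ and $E(\beta) = \beta_0$.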
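Then a first-order Taylor approximation gives

\[ U(\alpha,\beta) \approx U(\alpha_0,\beta_0) + \frac{dU(\alpha_0,\beta_0)}{d\alpha}\,(\alpha-\alpha_0) + \frac{dU(\alpha_0,\beta_0)}{d\beta}\,(\beta-\beta_0), \]

so that

\[ (\alpha-\alpha_0) \approx \left(\frac{dU(\alpha_0,\beta_0)}{d\alpha}\right)^{-1}\left[U(\alpha,\beta) - \frac{dU(\alpha_0,\beta_0)}{d\beta}\,(\beta-\beta_0)\right], \]

and

\[ \mathrm{var}[\alpha] = \left(\frac{dU(\alpha_0,\beta_0)}{d\alpha}\right)^{-2}\left[\mathrm{Var}[U] + \left(\frac{dU(\alpha_0,\beta_0)}{d\beta}\right)^{2}\mathrm{Var}[\beta]\right]. \]

If $\beta = (\beta_1,\ldots,\beta_p)$ is a vector-valued random variable with mean $\beta_0 = (\beta_{01},\ldots,\beta_{0p})$, then

\[ \mathrm{var}[\alpha] = \left(\frac{dU(\alpha_0,\beta_0)}{d\alpha}\right)^{-2}\left[\mathrm{Var}[U] + \left(\frac{dU(\alpha_0,\beta_0)}{d\beta}\right)^{T}\mathcal{I}^{-1}\left(\frac{dU(\alpha_0,\beta_0)}{d\beta}\right)\right], \]

with $\mathcal{I}^{-1}$ denoting the inverse information matrix for $\beta$ (i.e., its covariance matrix), which is estimated by

\[ \widehat{\mathrm{var}}[\hat\alpha] = \left(\frac{dU(\hat\alpha,\hat\beta)}{d\alpha}\right)^{-2}\left[\mathrm{Var}[U] + \left(\frac{dU(\hat\alpha,\hat\beta)}{d\beta}\right)^{T}\mathcal{I}^{-1}\left(\frac{dU(\hat\alpha,\hat\beta)}{d\beta}\right)\right]. \tag{4.11} \]

In the above FM scenario, recall that we have

\[ Y_{\tilde R}(\alpha) = \exp(-\alpha)\,\frac{\sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d\sum_{i\in\tilde R\setminus d}\pi(\tilde R\mid d\cup\{i\})}{\sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d\,\pi(\tilde R\mid d)} + |D|. \tag{4.12} \]

Replacing $U(\alpha,\beta)$ with $Y_{\tilde R}(\alpha) - N$, we have

\[ \frac{dU(\hat\alpha,\hat\beta)}{d\alpha} = \frac{d\hat Y_{\tilde R}}{d\alpha} = -\big(\hat Y_{\tilde R} - |D|\big) \]

and

\[ \frac{dU(\hat\alpha,\hat\beta)}{d\beta} = \frac{d\hat Y_{\tilde R}}{d\beta} = \exp(-\alpha)\left(\frac{T^{(1)}(\beta)\,S^{(0)}(\beta) - T^{(0)}(\beta)\,S^{(1)}(\beta)}{S^{(0)}(\beta)^2}\right), \]

where $T^{(0)}(\beta)$ is the numerator of the fraction in (4.12), $S^{(0)}(\beta)$ is its denominator, and $T^{(1)}$ and $S^{(1)}$ are the derivatives of $T^{(0)}$ and $S^{(0)}$ with respect to $\beta$. Thus the variance estimate for the baseline odds coefficient $\hat\alpha$ can be calculated by

\[ \widehat{\mathrm{var}}[\hat\alpha] = \left(\frac{d\hat Y_{\tilde R}}{d\alpha}\right)^{-2}\left[\mathrm{Var}[Y_{\tilde R}] + \left(\frac{d\hat Y_{\tilde R}}{d\beta}\right)^{2}\mathrm{Var}[\beta]\right]. \tag{4.13} \]

4.3.3.3 Computation of the Baseline Odds and its Variance Estimate

The computational difficulty is that the sums run over the subsets of the case-control set, and the number of terms grows very quickly as the size of the case-control set increases. However, by applying a recursive algorithm, each component in the expressions of the point and variance estimators of $\alpha$ can be computed. We let

\[ B_0(m, r) = \sum_{d\subset r:\,|d|=m}\varphi_d, \]

\[ B_1(m, r; \tilde Z) = \sum_{d\subset r:\,|d|=m}\Big(\sum_{i\in d}\tilde Z_i\Big)\varphi_d, \]

and

\[ B_2(m, r; \tilde Z) = \sum_{d\subset r:\,|d|=m}\Big(\sum_{i\in d}\tilde Z_i\Big)^{2}\varphi_d. \]

Then the following recursion formulas hold for $j \notin r$:

\[ B_0(m, r\cup\{j\}) = B_0(m, r) + \varphi_j\,B_0(m-1, r), \]

\[ B_1(m, r\cup\{j\}) = B_1(m, r) + \varphi_j\,B_1(m-1, r) + \tilde Z_j\,\varphi_j\,B_0(m-1, r), \]

and

\[ B_2(m, r\cup\{j\}) = B_2(m, r) + \varphi_j\,B_2(m-1, r) + \varphi_j\big[\tilde Z_j\,B_1(m-1, r)^T + B_1(m-1, r)\,\tilde Z_j^T\big] + \varphi_j\,\tilde Z_j^2\,B_0(m-1, r). \]

These formulas can be proved by splitting the outer sum into the sets that do not contain $j$ and those that contain $j$. It is also easy to show that, taking the derivative with respect to $\beta$,

\[ B_0(m, r)' = B_1(m, r; \tilde Z). \]

Using the above expressions, the point estimator can be written as

\[ \exp(\hat\alpha) = \frac{|\tilde R|-(|D|-1)}{N-|D|}\cdot\frac{B_0(|D|-1, \tilde R)}{B_0(|D|, \tilde R)}. \]

The components of the variance estimator can then be written as

\[ \frac{dU(\hat\alpha,\hat\beta)}{d\alpha} = \frac{d\hat Y_{\tilde R}}{d\alpha} = -\exp(-\alpha)\,\big(|\tilde R|-(|D|-1)\big)\,\frac{B_0(|D|-1, \tilde R)}{B_0(|D|, \tilde R)}, \]

\[
\begin{aligned}
\frac{dU(\hat\alpha,\hat\beta)}{d\beta} = \frac{d\hat Y_{\tilde R}}{d\beta} &= \exp(-\alpha)\left(\frac{T^{(1)}(\beta)\,S^{(0)}(\beta) - T^{(0)}(\beta)\,S^{(1)}(\beta)}{S^{(0)}(\beta)^2}\right) \\
&= \exp(-\alpha)\,\big(|\tilde R|-(|D|-1)\big)\left(\frac{B_1(|D|-1, \tilde R; \tilde Z)\,B_0(|D|, \tilde R) - B_0(|D|-1, \tilde R)\,B_1(|D|, \tilde R; \tilde Z)}{B_0(|D|, \tilde R)^2}\right),
\end{aligned}
\]

and

\[ \widehat{\mathrm{var}}[Y_{\tilde R}] = \widehat{\mathrm{var}}[Y_D] - \left[\frac{C_1(Y^2/pq) + C_2(Y/q)}{C_0} - \left(\frac{C_1(Y/q)}{C_0}\right)^2\right]. \]

In the size matching scenario, the components of $\widehat{\mathrm{var}}[Y_{\tilde R}]$ can be simplified by canceling out the control selection probabilities $\pi(\tilde R\mid d)$ and the baseline odds $\lambda_0$, and converting $\lambda_d$ into $\varphi_d$:

\[ C_0^{\varphi}(X) = \sum_{d\subset\tilde R:\,|d|=|D|}\varphi_d = B_0(|D|, \tilde R), \]

\[
\begin{aligned}
C_1^{\varphi}(X) &= \sum_{i\in\tilde R} X_i \sum_{d\subset\tilde R\setminus\{i\}:\,|d|=|D|-1}\varphi_d = \sum_{d\subset\tilde R:\,|d|=|D|-1}\varphi_d \sum_{i\in\tilde R\setminus d} X_i \\
&= \sum_{d\subset\tilde R:\,|d|=|D|-1}\Big(\sum_{i\in\tilde R} X_i - \sum_{i\in d} X_i\Big)\varphi_d \\
&= \Big(\sum_{i\in\tilde R} X_i\Big)\,B_0(|D|-1, \tilde R) - B_1(|D|-1, \tilde R; X),
\end{aligned}
\]

\[
\begin{aligned}
C_2^{\varphi}(X) &= \sum_{i\ne j\in\tilde R} X_i X_j^T \sum_{d\subset\tilde R\setminus\{i,j\}:\,|d|=|D|-2}\varphi_d = \sum_{d\subset\tilde R:\,|d|=|D|-2}\varphi_d \sum_{i\ne j\in\tilde R\setminus d} X_i X_j^T \\
&= \sum_{d\subset\tilde R:\,|d|=|D|-2}\Big(\sum_{i,j\in\tilde R\setminus d} X_i X_j^T - \sum_{i=j\in\tilde R\setminus d} X_i X_j^T\Big)\varphi_d \\
&= \sum_{d\subset\tilde R:\,|d|=|D|-2}\Big(\sum_{i,j\in\tilde R\setminus d} X_i X_j^T - \sum_{i\in\tilde R\setminus d} X_i^2\Big)\varphi_d \\
&= \sum_{d\subset\tilde R:\,|d|=|D|-2}\left(\Big(\sum_{i\in\tilde R\setminus d} X_i\Big)^{2} - \sum_{i\in\tilde R\setminus d} X_i^2\right)\varphi_d \\
&= \Big(\sum_{i\in\tilde R} X_i\Big)^{2} B_0(|D|-2, \tilde R) - \Big[\Big(\sum_{i\in\tilde R} X_i\Big)^{T} B_1(|D|-2, \tilde R; X) + B_1(|D|-2, \tilde R; X)^{T}\Big(\sum_{i\in\tilde R} X_i\Big)\Big] \\
&\quad + B_2(|D|-2, \tilde R; X) - \Big[\Big(\sum_{i\in\tilde R} X_i^2\Big) B_0(|D|-2, \tilde R) - B_1(|D|-2, \tilde R; X^2)\Big].
\end{aligned}
\]

4.4 Disease Risk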
Estimation

4.4.1 Point Estimator of Disease Risk

Assuming a proportional odds (logistic) model, the disease risk for each individual with a given exposure profile is the disease probability shown in Equation (4.1):

\[ \mathrm{pr}(D_i=1\mid z_i) = \frac{\lambda_0\,\varphi(z_i;\beta)}{1+\lambda_0\,\varphi(z_i;\beta)} = \frac{\lambda_0\,\varphi_i}{1+\lambda_0\,\varphi_i} = \frac{\exp(\alpha_0+z_i\beta_0)}{1+\exp(\alpha_0+z_i\beta_0)}, \tag{4.14} \]

where $\alpha_0$ and $\mathrm{var}(\alpha_0)$ can be estimated by $\hat\alpha$ and $\widehat{\mathrm{var}}(\hat\alpha)$, as stated in the previous sections, and $\beta_0$ and $\mathrm{var}(\beta_0)$ can be estimated by regular logistic regression. Hence, writing $\hat t = \hat\alpha + z_i\hat\beta$, we have the disease risk estimate for $p_i$:

\[ \hat p_i = \frac{\exp(\hat t)}{1+\exp(\hat t)}. \]

4.4.2 Confidence Interval Estimator of Disease Risk

Let $\hat t$ denote $(\hat\alpha + z_i\hat\beta)$ and $\mathrm{var}(\hat t)$ denote $\mathrm{var}(\hat\alpha + z_i\hat\beta)$. Then a $100(1-\alpha)\%$ confidence interval for $t = \alpha_0 + z_i\beta_0$ can be constructed as $\hat t \mp z_{1-\alpha/2}\sqrt{\mathrm{var}(\hat t)}$ under the normal approximation given by the Central Limit Theorem (in this subsection $\alpha$ in the confidence level denotes the significance level, not the baseline odds coefficient). Because Equation (4.14) is a monotonic function of $(\alpha_0 + z_i\beta_0)$, the $100(1-\alpha)\%$ confidence interval for $\mathrm{pr}(D_i=1\mid z_i)$ can be constructed by transforming the confidence interval for $t$, based on asymptotic theory (the law of large numbers and the delta method). The corresponding $100(1-\alpha)\%$ confidence interval is

\[ \left(\frac{\exp\big(\hat t - z_{1-\alpha/2}\sqrt{\mathrm{var}(\hat t)}\big)}{1+\exp\big(\hat t - z_{1-\alpha/2}\sqrt{\mathrm{var}(\hat t)}\big)},\ \frac{\exp\big(\hat t + z_{1-\alpha/2}\sqrt{\mathrm{var}(\hat t)}\big)}{1+\exp\big(\hat t + z_{1-\alpha/2}\sqrt{\mathrm{var}(\hat t)}\big)}\right). \]

Chapter 5 Simulation Studies

To assess the accuracy and robustness of our estimation methods, we conducted a series of simulations in realistic scenarios. In the simulations, we first simulated a large study based on pre-specified true disease risks, true relative risks associated with the exposure of each individual, and other design parameters. We then applied our analytical methods to the data, derived risk estimates and compared them to the pre-specified true values to evaluate performance.

5.1 Simulation Setup

5.1.1 Study Base Generation and Case-Control Sampling

We first generated a study base of $n$ individuals. We assigned an exposure value $Z_i$ to each individual according to an exposure distribution.
Considering the most common exposure distribution families, we simulated scenarios where the exposures $Z_i$ were drawn from normal and Bernoulli distributions. For each simulation, we also defined the baseline odds $\lambda_0$ and the exposure odds ratio. We determined the true disease probability $p_i$ for each individual under the proportional odds model (see Equation 4.1). We simulated each individual's disease status $D_i$ from independent Bernoulli trials with success probability $p_i$. Next, we sampled case and control subjects from the study base using practical sampling schemes and analyzed the sampled data to estimate the odds ratio, the baseline odds, their variances, the disease risks and their confidence intervals. We repeated the simulation process for a large number of trials, calculated the means of these estimates and compared them to the pre-specified true values.

5.1.2 Simulation Settings

With the study base size $n$ fixed at 1000 and controls randomly selected according to a fixed 1:1 or 1:2 size matching sampling design, we varied the following trial parameters in our simulation study:

(i) Exposure values were generated using a standard normal distribution $N(0, 1)$ or a Bernoulli trial with success probability in {0.1, 0.2, 0.5}.

(ii) The baseline odds $\lambda_0$ was chosen in {0.01, 0.02, 0.05, 0.1, 0.2} to cover a range of disease probabilities.

(iii) The odds ratio $or_z$ was varied to emulate different severity levels of the exposure effect on the disease. In the binary exposure scenario, $or_z$ was selected in {1, 2, 5}, and in the normally distributed exposure scenario, in {1, 1.189, 1.495, 2, 5}. The odds ratios 1.189 and 1.495 for the normal exposure were chosen such that the odds ratio comparing a person at the 97.5th percentile of exposure to a person at the 2.5th percentile is approximately 2 and 5, respectively.

The simulation results corresponding to these settings are presented and discussed below.
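The study base generation and sampling steps described above can be sketched as follows. This is an illustrative sketch, not the code used in the thesis: the function name and arguments are hypothetical, and size matching is assumed here to mean drawing $m$ controls per case by simple random sampling without replacement from the non-cases.

```python
import math
import random


def simulate_trial(n=1000, lambda0=0.1, odds_ratio=2.0, exposure="normal",
                   prevalence=0.2, m=1, seed=0):
    """Generate one study base and draw a 1:m size-matched case-control sample."""
    rng = random.Random(seed)
    beta = math.log(odds_ratio)

    # Exposure: standard normal, or Bernoulli with the given prevalence
    if exposure == "normal":
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    else:
        z = [1.0 if rng.random() < prevalence else 0.0 for _ in range(n)]

    # True disease probability under the proportional odds model
    p = []
    for z_i in z:
        odds = lambda0 * math.exp(beta * z_i)
        p.append(odds / (1.0 + odds))

    # Disease status from independent Bernoulli trials
    d = [1 if rng.random() < p_i else 0 for p_i in p]
    cases = [i for i in range(n) if d[i] == 1]
    noncases = [i for i in range(n) if d[i] == 0]

    # Assumed size matching: m controls per case, sampled without replacement
    n_controls = min(m * len(cases), len(noncases))
    controls = rng.sample(noncases, n_controls)
    return z, p, d, cases, controls


z, p, d, cases, controls = simulate_trial(lambda0=0.1, odds_ratio=2.0, m=2, seed=1)
print(len(cases), len(controls))
```

Repeating this call with different seeds and averaging the resulting estimates would mirror the iteration over trials described in Section 5.1.1.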
5.2 Simulation Results

5.2.1 Baseline Odds Coefficient Estimation

In tables 5.1 and 5.2, we show the baseline odds coefficients estimated by the proposed methods using both the true $\beta$ (left columns) and the $\hat\beta$ estimated from logistic regression analysis (right columns). In a real study, neither the baseline odds nor the odds ratio would be known, so they must be estimated simultaneously, as we did for the right columns. It can be seen from the right columns that both the point and variance estimates using the estimated $\hat\beta$ are very close to the estimates using the true $\beta$, as well as to the true baseline odds coefficient, which indicates that the estimation method is accurate.

Specifically, the results in table 5.1 were derived from 1000 simulated 1:1 size matched trials under scenarios of normal exposure $Z \sim N(0, 1)$ with baseline odds $\lambda_0 = 0.01$ ($\alpha = -4.605$) and exposure odds ratio $or_z = 1$, $1.189$, and $1.495$. Using the true $\beta$, the case only (CO) estimates $\hat\alpha_{co}$ are -4.664, -4.653, and -4.670 and the Rao-Blackwell (RB) estimates $\hat\alpha_{rb}$ are -4.664, -4.652 and -4.664. Using the estimated $\hat\beta$, the case only estimates $\hat\alpha_{co}$ are -4.676, -4.691, and -4.676 and the Rao-Blackwell estimates $\hat\alpha_{rb}$ are -4.646, -4.654 and -4.623. All these estimates differ from the true value by less than 2%, and the $\hat\alpha_{rb}$ estimates by less than 1.3%, which speaks for the good performance of our methods. Both estimators show a slight downward bias in all cases, and the bias of the Rao-Blackwell estimates is never larger than that of $\hat\alpha_{co}$. Both $\hat\alpha_{co}$ and $\hat\alpha_{rb}$ are efficient, with all coefficients of variation (CV) < 4%. In particular, the variances are 0.134, 0.134, and 0.148 for $\hat\alpha_{co}$ and 0.082, 0.082, and 0.089 for $\hat\alpha_{rb}$ using the true $\beta$; the variances are 0.175, 0.212, and 0.180 for $\hat\alpha_{co}$ and 0.085, 0.088, and 0.090 for $\hat\alpha_{rb}$ using the estimated $\hat\beta$. The $\hat\alpha_{rb}$ consistently has smaller variances and thus is more efficient than $\hat\alpha_{co}$, apparently because of the maximal conditional inference implied by the Rao-Blackwellization technique.
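The Rao-Blackwell estimates reported above follow Equation (4.10), whose subset sums $B_0(m, \tilde R)$ can be computed with the recursion of Section 4.3.3.3 rather than by enumeration. A minimal sketch is given below, assuming $\varphi_i = \exp(z_i\hat\beta)$ has already been evaluated for every member of the case-control set; the function names and the arguments `n_cases` (for $|D|$) and `n_base` (for $N$) are hypothetical.

```python
import math
from itertools import combinations


def b0_table(phi, m_max):
    """Return [B_0(0, r), ..., B_0(m_max, r)] for the set r with weights phi.

    Adds one member j at a time using
    B_0(m, r + {j}) = B_0(m, r) + phi_j * B_0(m - 1, r), with B_0(0, .) = 1.
    """
    b = [1.0] + [0.0] * m_max
    for phi_j in phi:
        # iterate downwards so b[m - 1] still refers to the set without j
        for m in range(m_max, 0, -1):
            b[m] += phi_j * b[m - 1]
    return b


def b0_brute(phi, m):
    """Direct enumeration of B_0(m, r); only feasible for small sets."""
    return sum(math.prod(c) for c in combinations(phi, m))


def alpha_hat_rb(phi, n_cases, n_base):
    """Baseline odds coefficient from Equation (4.10) under size matching."""
    b = b0_table(phi, n_cases)
    scale = (len(phi) - (n_cases - 1)) / (n_base - n_cases)
    return math.log(scale * b[n_cases - 1] / b[n_cases])


phi = [1.3, 0.7, 2.0, 1.1, 0.5, 0.9]
print(b0_table(phi, 3), [b0_brute(phi, m) for m in range(4)])
print(alpha_hat_rb([1.0] * 40, n_cases=20, n_base=1000), math.log(20 / 980))
```

The recursion needs on the order of $|\tilde R|\cdot|D|$ operations instead of a sum over all subsets. When every $\varphi_i = 1$ (no exposure effect), Equation (4.10) reduces algebraically to $\log\big(|D|/(N-|D|)\big)$, the log odds of disease in the study base, which is what the last line compares.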
Comparing the estimation results obtained with the true $\beta$ and with the estimated $\hat\beta$, the point estimates $\hat\alpha_{rb}$ differ by less than 1%. The variance estimates using the estimated $\hat\beta$ are a little larger, but still very close at the same decimal level. These simulation studies show that using the estimated $\hat\beta$ has little impact on the point or confidence interval estimates.

Estimation Method | OR | $\alpha$ | $\hat\alpha$ (true $\beta$) | $\widehat{\mathrm{Var}}[\hat\alpha]$ (true $\beta$) | $\hat\alpha$ (estimated $\hat\beta$) | $\widehat{\mathrm{Var}}[\hat\alpha]$ (estimated $\hat\beta$)
CO | 1 | -4.605 | -4.664 | 0.134 | -4.676 | 0.175
RB | 1 | -4.605 | -4.664 | 0.082 | -4.646 | 0.085
CO | 1.189 | -4.605 | -4.653 | 0.134 | -4.691 | 0.212
RB | 1.189 | -4.605 | -4.652 | 0.082 | -4.654 | 0.088
CO | 1.495 | -4.605 | -4.670 | 0.148 | -4.676 | 0.180
RB | 1.495 | -4.605 | -4.664 | 0.089 | -4.623 | 0.090
Table 5.1: Simulation results for 1000 trials, $n = 1000$, 1:1 size matching, $\lambda_0 = 0.01$ ($\alpha = -4.605$), $Z \sim N(0, 1)$

Similar results were observed in table 5.2, which was derived from 1000 simulated 1:1 size matched trials under scenarios of normal exposure $Z \sim N(0, 1)$ with baseline odds $\lambda_0 = 0.1$ ($\alpha = -2.303$) and exposure odds ratio $or_z = 1$, $1.189$, and $1.495$. Using the true $\beta$, the case only estimates $\hat\alpha_{co}$ are -2.312, -2.311, and -2.310 and the Rao-Blackwell estimates $\hat\alpha_{rb}$ are -2.312, -2.311, and -2.310. Using the estimated $\hat\beta$, the case only estimates $\hat\alpha_{co}$ are -2.317, -2.314, and -2.309 and the Rao-Blackwell estimates $\hat\alpha_{rb}$ are -2.314, -2.309 and -2.289. All these estimates differ from the true value by less than 1%, and all but one of the $\hat\alpha_{rb}$ estimates by less than 0.5%, which again speaks for the good performance of our methods. The $\hat\alpha_{co}$ estimates showed a consistent slight downward bias. Both $\hat\alpha_{co}$ and $\hat\alpha_{rb}$ are efficient, with all coefficients of variation (CV) < 1%. In particular, using the estimated $\hat\beta$ the variances are 0.013 for $\hat\alpha_{co}$ and 0.012 for $\hat\alpha_{rb}$ at each of the odds ratios $or_z = 1$, $1.189$, and $1.495$. Because of Rao-Blackwellization, the $\hat\alpha_{rb}$ consistently has smaller or equal variances and thus is slightly more efficient than $\hat\alpha_{co}$.
Both methods have good variance estimates, which are almost identical to the empirically derived variance. Comparing the estimation results obtained with the true $\beta$ and with the estimated $\hat\beta$, the point estimates $\hat\alpha_{rb}$ differ by less than 1%. The variance estimates using the estimated $\hat\beta$ are very close at the same decimal level. These simulation studies show that using the estimated $\hat\beta$ has little impact on the point or confidence interval estimates.

Estimation Method | OR | $\alpha$ | $\hat\alpha$ (true $\beta$) | $\widehat{\mathrm{Var}}[\hat\alpha]$ (true $\beta$) | $\hat\alpha$ (estimated $\hat\beta$) | $\widehat{\mathrm{Var}}[\hat\alpha]$ (estimated $\hat\beta$)
CO | 1 | -2.303 | -2.312 | 0.012 | -2.317 | 0.013
RB | 1 | -2.303 | -2.312 | 0.012 | -2.314 | 0.012
CO | 1.189 | -2.303 | -2.311 | 0.013 | -2.314 | 0.013
RB | 1.189 | -2.303 | -2.311 | 0.012 | -2.309 | 0.012
CO | 1.495 | -2.303 | -2.310 | 0.014 | -2.309 | 0.013
RB | 1.495 | -2.303 | -2.310 | 0.013 | -2.289 | 0.012
Table 5.2: Simulation results for 1000 trials, $n = 1000$, 1:1 size matching, $\lambda_0 = 0.1$ ($\alpha = -2.303$), $Z \sim N(0, 1)$

In the next series of simulation tables, totaling 192 cases, we extensively studied the performance of $\hat\alpha_{co}$ and $\hat\alpha_{rb}$ for the following combinations of parameters: (1) $\lambda_0$ in {0.2, 0.1, 0.05, 0.02}; (2) $or_z$ in {1, 2, 5}; (3) exposure distribution in {Bern(0.1), Bern(0.2), Bern(0.5), N(0, 1)}; and (4) 1:1 and 1:2 size matching ratios.
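The parameter grid listed above can be enumerated directly. The sketch below assumes that the total of 192 cases counts the two estimators (CO and RB) separately for each of the $4 \times 3 \times 4 \times 2 = 96$ parameter combinations; that reading is an inference from the tables rather than something stated explicitly, and the variable names are illustrative.

```python
from itertools import product

lambda0_values = [0.2, 0.1, 0.05, 0.02]
odds_ratios = [1, 2, 5]
exposures = ["Bern(0.1)", "Bern(0.2)", "Bern(0.5)", "N(0,1)"]
matching_ratios = ["1:1", "1:2"]
estimators = ["CO", "RB"]

# One scenario per combination of design parameters
scenarios = list(product(lambda0_values, odds_ratios, exposures, matching_ratios))

# One reported case per scenario and estimator (assumed counting convention)
cases = [(scenario, est) for scenario in scenarios for est in estimators]

print(len(scenarios), len(cases))
```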
Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -1.6094 -1.6075 0.0087 0.0000 0.0059 0.1407 0.2 1 B(0.1) 1:1 0.12% 5.80%
RB -1.6094 -1.6075 0.0083 0.0000 0.0059 0.1407 0.2 1 B(0.1) 1:1 0.12% 5.68%
CO -1.6094 -1.6074 0.0083 0.0000 -0.0090 0.1045 0.2 1 B(0.1) 1:2 0.12% 5.67%
RB -1.6094 -1.6074 0.0060 0.0000 -0.0090 0.1045 0.2 1 B(0.1) 1:2 0.12% 4.83%
CO -1.6094 -1.6068 0.0104 0.0000 -0.0004 0.0764 0.2 1 B(0.2) 1:1 0.16% 6.33%
RB -1.6094 -1.6068 0.0100 0.0000 -0.0004 0.0764 0.2 1 B(0.2) 1:1 0.16% 6.22%
CO -1.6094 -1.6067 0.0096 0.0000 -0.0080 0.0571 0.2 1 B(0.2) 1:2 0.17% 6.09%
RB -1.6094 -1.6067 0.0073 0.0000 -0.0080 0.0571 0.2 1 B(0.2) 1:2 0.17% 5.31%
CO -1.6094 -1.6040 0.0194 0.0000 -0.0060 0.0482 0.2 1 B(0.5) 1:1 0.34% 8.68%
RB -1.6094 -1.6040 0.0191 0.0000 -0.0060 0.0482 0.2 1 B(0.5) 1:1 0.34% 8.61%
CO -1.6094 -1.6042 0.0164 0.0000 -0.0079 0.0361 0.2 1 B(0.5) 1:2 0.33% 7.97%
RB -1.6094 -1.6042 0.0141 0.0000 -0.0079 0.0361 0.2 1 B(0.5) 1:2 0.33% 7.40%
CO -1.6094 -1.6124 0.0074 0.0000 0.0013 0.0122 0.2 1 N(0,1) 1:1 0.19% 5.33%
RB -1.6094 -1.6124 0.0071 0.0000 0.0013 0.0122 0.2 1 N(0,1) 1:1 0.19% 5.22%
CO -1.6094 -1.6140 0.0073 0.0000 0.0017 0.0091 0.2 1 N(0,1) 1:2 0.28% 5.30%
RB -1.6094 -1.6140 0.0051 0.0000 0.0017 0.0091 0.2 1 N(0,1) 1:2 0.28% 4.42%
Table 5.3: Simulation results for $\lambda_0 = 0.2$, $or_z = 1$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -1.6094 -1.6076 0.0080 0.6931 0.7156 0.1183 0.2 2 B(0.1) 1:1 0.12% 5.55%
RB -1.6094 -1.6076 0.0085 0.6931 0.7156 0.1183 0.2 2 B(0.1) 1:1 0.12% 5.74%
CO -1.6094 -1.6075 0.0077 0.6931 0.6975 0.0793 0.2 2 B(0.1) 1:2 0.12% 5.45%
RB -1.6094 -1.6075 0.0061 0.6931 0.6975 0.0793 0.2 2 B(0.1) 1:2 0.12% 4.87%
CO -1.6094 -1.6061 0.0089 0.6931 0.6928 0.0620 0.2 2 B(0.2) 1:1 0.21% 5.86%
RB -1.6094 -1.6061 0.0101 0.6931 0.6928 0.0620 0.2 2 B(0.2) 1:1 0.21% 6.25%
CO -1.6094 -1.6067 0.0083 0.6931 0.6893 0.0435 0.2 2 B(0.2) 1:2 0.17% 5.66%
RB -1.6094 -1.6067 0.0074 0.6931 0.6893 0.0435 0.2 2 B(0.2) 1:2 0.17% 5.35%
CO -1.6094 -1.6031 0.0142 0.6931 0.6877 0.0369 0.2 2 B(0.5) 1:1 0.39% 7.44%
RB -1.6094 -1.6031 0.0174 0.6931 0.6877 0.0369 0.2 2 B(0.5) 1:1 0.39% 8.23%
CO -1.6094 -1.6050 0.0123 0.6931 0.6892 0.0279 0.2 2 B(0.5) 1:2 0.28% 6.90%
RB -1.6094 -1.6050 0.0135 0.6931 0.6892 0.0279 0.2 2 B(0.5) 1:2 0.28% 7.23%
CO -1.6094 -1.6105 0.0101 0.6931 0.6928 0.0143 0.2 2 N(0,1) 1:1 0.07% 6.25%
RB -1.6094 -1.6093 0.0104 0.6931 0.6928 0.0143 0.2 2 N(0,1) 1:1 0.01% 6.34%
CO -1.6094 -1.6115 0.0100 0.6931 0.6910 0.0105 0.2 2 N(0,1) 1:2 0.13% 6.20%
RB -1.6094 -1.6099 0.0068 0.6931 0.6910 0.0105 0.2 2 N(0,1) 1:2 0.03% 5.11%
Table 5.4: Simulation results for $\lambda_0 = 0.2$, $or_z = 2$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -1.6094 -1.6072 0.0077 1.6094 1.6372 0.1201 0.2 5 B(0.1) 1:1 0.14% 5.44%
RB -1.6094 -1.6072 0.0111 1.6094 1.6372 0.1201 0.2 5 B(0.1) 1:1 0.14% 6.54%
CO -1.6094 -1.6074 0.0075 1.6094 1.6201 0.0719 0.2 5 B(0.1) 1:2 0.13% 5.38%
RB -1.6094 -1.6074 0.0067 1.6094 1.6201 0.0719 0.2 5 B(0.1) 1:2 0.13% 5.11%
CO -1.6094 -1.6069 0.0081 1.6094 1.6117 0.0564 0.2 5 B(0.2) 1:1 0.16% 5.61%
RB -1.6094 -1.6069 0.0128 1.6094 1.6117 0.0564 0.2 5 B(0.2) 1:1 0.16% 7.04%
CO -1.6094 -1.6073 0.0078 1.6094 1.6058 0.0367 0.2 5 B(0.2) 1:2 0.13% 5.49%
RB -1.6094 -1.6073 0.0081 1.6094 1.6058 0.0367 0.2 5 B(0.2) 1:2 0.13% 5.59%
CO -1.6094 -1.6038 0.0108 1.6094 1.6017 0.0288 0.2 5 B(0.5) 1:1 0.35% 6.48%
RB -1.6094 -1.6236 0.0164 1.6094 1.6017 0.0288 0.2 5 B(0.5) 1:1 0.88% 8.40%
CO -1.6094 -1.6231 0.0379 1.6094 1.6128 0.0238 0.2 5 N(0,1) 1:1 0.85% 11.99%
RB -1.6094 -1.6118 0.1942 1.6094 1.6128 0.0238 0.2 5 N(0,1) 1:1 0.14% 27.34%
Table 5.5: Simulation results for $\lambda_0 = 0.2$, $or_z = 5$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.3026 -2.2997 0.0152 0.0000 0.0022 0.2746 0.1 1 B(0.1) 1:1 0.13% 5.35%
RB -2.3026 -2.2997 0.0139 0.0000 0.0022 0.2746 0.1 1 B(0.1) 1:1 0.13% 5.13%
CO -2.3026 -2.3001 0.0144 0.0000 -0.0263 0.2015 0.1 1 B(0.1) 1:1 0.11% 5.22%
RB -2.3026 -2.3001 0.0106 0.0000 -0.0263 0.2015 0.1 1 B(0.1) 1:2 0.11% 4.47%
CO -2.3026 -2.2981 0.0182 0.0000 -0.0084 0.1425 0.1 1 B(0.2) 1:1 0.19% 5.87%
RB -2.3026 -2.2981 0.0170 0.0000 -0.0084 0.1425 0.1 1 B(0.2) 1:1 0.19% 5.68%
CO -2.3026 -2.2992 0.0167 0.0000 -0.0169 0.1065 0.1 1 B(0.2) 1:2 0.15% 5.61%
RB -2.3026 -2.2992 0.0129 0.0000 -0.0169 0.1065 0.1 1 B(0.2) 1:2 0.15% 4.94%
CO -2.3026 -2.2936 0.0350 0.0000 -0.0121 0.0887 0.1 1 B(0.5) 1:1 0.39% 8.15%
RB -2.3026 -2.2936 0.0339 0.0000 -0.0121 0.0887 0.1 1 B(0.5) 1:1 0.39% 8.03%
CO -2.3026 -2.2993 0.0291 0.0000 -0.0063 0.0665 0.1 1 B(0.5) 1:2 0.14% 7.42%
RB -2.3026 -2.2993 0.0254 0.0000 -0.0063 0.0665 0.1 1 B(0.5) 1:2 0.14% 6.94%
CO -2.3026 -2.3014 0.0127 0.0000 0.0011 0.0227 0.1 1 N(0,1) 1:1 0.05% 4.89%
RB -2.3026 -2.3015 0.0117 0.0000 0.0011 0.0227 0.1 1 N(0,1) 1:1 0.05% 4.71%
CO -2.3026 -2.3044 0.0124 0.0000 -0.0003 0.0169 0.1 1 N(0,1) 1:2 0.08% 4.84%
RB -2.3026 -2.3044 0.0089 0.0000 -0.0003 0.0169 0.1 1 N(0,1) 1:2 0.08% 4.09%
Table 5.6: Simulation results for $\lambda_0 = 0.1$, $or_z = 1$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.3026 -2.2996 0.0136 0.6931 0.7228 0.2127 0.1 2 B(0.1) 1:1 0.13% 5.07%
RB -2.3026 -2.2996 0.0141 0.6931 0.7228 0.2127 0.1 2 B(0.1) 1:1 0.13% 5.17%
CO -2.3026 -2.3000 0.0130 0.6931 0.6944 0.1395 0.1 2 B(0.1) 1:2 0.11% 4.96%
RB -2.3026 -2.3000 0.0107 0.6931 0.6944 0.1395 0.1 2 B(0.1) 1:2 0.11% 4.51%
CO -2.3026 -2.2989 0.0152 0.6931 0.6986 0.1087 0.1 2 B(0.2) 1:1 0.16% 5.36%
RB -2.3026 -2.2989 0.0169 0.6931 0.6986 0.1087 0.1 2 B(0.2) 1:1 0.16% 5.66%
CO -2.3026 -2.2989 0.0140 0.6931 0.6873 0.0760 0.1 2 B(0.2) 1:2 0.16% 5.15%
RB -2.3026 -2.2989 0.0131 0.6931 0.6873 0.0760 0.1 2 B(0.2) 1:2 0.16% 4.97%
CO -2.3026 -2.2967 0.0252 0.6931 0.6849 0.0655 0.1 2 B(0.5) 1:1 0.26% 6.91%
RB -2.3026 -2.2967 0.0305 0.6931 0.6849 0.0655 0.1 2 B(0.5) 1:1 0.26% 7.61%
CO -2.3026 -2.3002 0.0215 0.6931 0.6892 0.0499 0.1 2 B(0.5) 1:2 0.10% 6.37%
RB -2.3026 -2.3002 0.0244 0.6931 0.6892 0.0499 0.1 2 B(0.5) 1:2 0.10% 6.79%
CO -2.3026 -2.3095 0.0174 0.6931 0.7005 0.0249 0.1 2 N(0,1) 1:1 0.30% 5.71%
RB -2.3026 -2.3066 0.0159 0.6931 0.7005 0.0249 0.1 2 N(0,1) 1:1 0.17% 5.46%
CO -2.3026 -2.3116 0.0170 0.6931 0.6968 0.0183 0.1 2 N(0,1) 1:2 0.39% 5.63%
RB -2.3026 -2.3086 0.0118 0.6931 0.6968 0.0183 0.1 2 N(0,1) 1:2 0.26% 4.71%
Table 5.7: Simulation results for $\lambda_0 = 0.1$, $or_z = 2$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.3026 -2.2997 0.0128 1.6094 1.6575 0.1871 0.1 5 B(0.1) 1:1 0.12% 4.92%
RB -2.3026 -2.2997 0.0164 1.6094 1.6575 0.1871 0.1 5 B(0.1) 1:1 0.12% 5.57%
CO -2.3026 -2.2999 0.0124 1.6094 1.6248 0.1093 0.1 5 B(0.1) 1:2 0.12% 4.85%
RB -2.3026 -2.2999 0.0114 1.6094 1.6248 0.1093 0.1 5 B(0.1) 1:2 0.12% 4.65%
CO -2.3026 -2.2981 0.0135 1.6094 1.6127 0.0855 0.1 5 B(0.2) 1:1 0.19% 5.05%
RB -2.3026 -2.2981 0.0185 1.6094 1.6127 0.0855 0.1 5 B(0.2) 1:1 0.19% 5.93%
CO -2.3026 -2.2987 0.0128 1.6094 1.6053 0.0568 0.1 5 B(0.2) 1:2 0.17% 4.92%
RB -2.3026 -2.2987 0.0138 1.6094 1.6053 0.0568 0.1 5 B(0.2) 1:2 0.17% 5.10%
CO -2.3026 -2.2982 0.0184 1.6094 1.6049 0.0475 0.1 5 B(0.5) 1:1 0.19% 5.90%
RB -2.3026 -2.2982 0.0286 1.6094 1.6049 0.0475 0.1 5 B(0.5) 1:1 0.19% 7.36%
CO -2.3026 -2.2999 0.0166 1.6094 1.6064 0.0379 0.1 5 B(0.5) 1:2 0.12% 5.59%
CO -2.3026 -2.3241 0.0663 1.6094 1.6154 0.0351 0.1 5 N(0,1) 1:1 0.93% 11.08%
RB -2.3026 -2.3070 0.1377 1.6094 1.6154 0.0351 0.1 5 N(0,1) 1:1 0.19% 16.08%
CO -2.3026 -2.3230 0.0637 1.6094 1.6137 0.0255 0.1 5 N(0,1) 1:2 0.89% 10.86%
RB -2.3026 -2.3069 0.0418 1.6094 1.6137 0.0255 0.1 5 N(0,1) 1:2 0.19% 8.87%
Table 5.8: Simulation results for $\lambda_0 = 0.1$, $or_z = 5$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.9957 -2.9937 0.0349 0.0000 0.0051 0.2873 0.05 1 B(0.2) 1:1 0.07% 6.24%
RB -2.9957 -2.9937 0.0308 0.0000 0.0051 0.2873 0.05 1 B(0.2) 1:1 0.07% 5.86%
CO -2.9957 -2.9941 0.0318 0.0000 -0.0245 0.2125 0.05 1 B(0.2) 1:2 0.05% 5.95%
RB -2.9957 -2.9941 0.0242 0.0000 -0.0245 0.2125 0.05 1 B(0.2) 1:2 0.05% 5.20%
CO -2.9957 -2.9903 0.0669 0.0000 -0.0057 0.1711 0.05 1 B(0.5) 1:1 0.18% 8.65%
RB -2.9957 -2.9903 0.0636 0.0000 -0.0057 0.1711 0.05 1 B(0.5) 1:1 0.18% 8.43%
CO -2.9957 -2.9928 0.0556 0.0000 -0.0108 0.1284 0.05 1 B(0.5) 1:2 0.10% 7.88%
RB -2.9957 -2.9928 0.0488 0.0000 -0.0108 0.1284 0.05 1 B(0.5) 1:2 0.10% 7.38%
CO -2.9957 -3.0041 0.0246 0.0000 0.0053 0.0451 0.05 1 N(0,1) 1:1 0.28% 5.22%
RB -2.9957 -3.0041 0.0212 0.0000 0.0053 0.0451 0.05 1 N(0,1) 1:1 0.28% 4.84%
CO -2.9957 -3.0095 0.0237 0.0000 0.0030 0.0333 0.05 1 N(0,1) 1:2 0.46% 5.11%
RB -2.9957 -3.0095 0.0169 0.0000 0.0030 0.0333 0.05 1 N(0,1) 1:2 0.46% 4.32%
Table 5.9: Simulation results for $\lambda_0 = 0.05$, $or_z = 1$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.9957 -2.9947 0.0242 0.6931 0.6876 0.2724 0.05 2 B(0.1) 1:2 0.03% 5.19%
RB -2.9957 -2.9947 0.0201 0.6931 0.6876 0.2724 0.05 2 B(0.1) 1:2 0.03% 4.74%
CO -2.9957 -2.9910 0.0286 0.6931 0.6922 0.2057 0.05 2 B(0.2) 1:1 0.16% 5.65%
RB -2.9957 -2.9910 0.0307 0.6931 0.6922 0.2057 0.05 2 B(0.2) 1:1 0.16% 5.86%
CO -2.9957 -2.9934 0.0261 0.6931 0.6797 0.1430 0.05 2 B(0.2) 1:2 0.08% 5.40%
RB -2.9957 -2.9934 0.0247 0.6931 0.6797 0.1430 0.05 2 B(0.2) 1:2 0.08% 5.25%
CO -2.9957 -2.9895 0.0481 0.6931 0.6845 0.1235 0.05 2 B(0.5) 1:1 0.21% 7.34%
RB -2.9957 -2.9895 0.0576 0.6931 0.6845 0.1235 0.05 2 B(0.5) 1:1 0.21% 8.03%
CO -2.9957 -2.9955 0.0406 0.6931 0.6889 0.0943 0.05 2 B(0.5) 1:2 0.01% 6.72%
RB -2.9957 -2.9955 0.0469 0.6931 0.6889 0.0943 0.05 2 B(0.5) 1:2 0.01% 7.23%
CO -2.9957 -3.0059 0.0339 0.6931 0.7074 0.0468 0.05 2 N(0,1) 1:1 0.34% 6.12%
RB -2.9957 -3.0044 0.0278 0.6931 0.7074 0.0468 0.05 2 N(0,1) 1:1 0.29% 5.55%
CO -2.9957 -3.0095 0.0323 0.6931 0.7004 0.0339 0.05 2 N(0,1) 1:2 0.46% 5.98%
RB -2.9957 -3.0059 0.0224 0.6931 0.7004 0.0339 0.05 2 N(0,1) 1:2 0.34% 4.98%
Table 5.10: Simulation results for $\lambda_0 = 0.05$, $or_z = 2$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -2.9957 -2.9956 0.0226 1.6094 1.6423 0.1899 0.05 5 B(0.1) 1:2 0.01% 5.02%
RB -2.9957 -2.9956 0.0212 1.6094 1.6423 0.1899 0.05 5 B(0.1) 1:2 0.01% 4.86%
CO -2.9957 -2.9925 0.0245 1.6094 1.6172 0.1459 0.05 5 B(0.2) 1:1 0.11% 5.23%
RB -2.9957 -2.9925 0.0317 1.6094 1.6172 0.1459 0.05 5 B(0.2) 1:1 0.11% 5.95%
CO -2.9957 -2.9941 0.0231 1.6094 1.6076 0.0977 0.05 5 B(0.2) 1:2 0.05% 5.07%
RB -2.9957 -2.9941 0.0257 1.6094 1.6076 0.0977 0.05 5 B(0.2) 1:2 0.05% 5.35%
CO -2.9957 -2.9931 0.0342 1.6094 1.6017 0.0858 0.05 5 B(0.5) 1:1 0.09% 6.18%
RB -2.9957 -2.9931 0.0521 1.6094 1.6017 0.0858 0.05 5 B(0.5) 1:1 0.09% 7.62%
CO -2.9957 -2.9972 0.0306 1.6094 1.6073 0.0695 0.05 5 B(0.5) 1:2 0.05% 5.84%
RB -2.9957 -2.9972 0.0454 1.6094 1.6073 0.0695 0.05 5 B(0.5) 1:2 0.05% 7.11%
CO -2.9957 -3.0337 0.1382 1.6094 1.6230 0.0555 0.05 5 N(0,1) 1:1 1.27% 12.26%
RB -2.9957 -3.0067 0.1490 1.6094 1.6230 0.0555 0.05 5 N(0,1) 1:1 0.37% 12.84%
CO -2.9957 -3.0335 0.1260 1.6094 1.6136 0.0398 0.05 5 N(0,1) 1:2 1.26% 11.70%
RB -2.9957 -3.0063 0.0513 1.6094 1.6136 0.0398 0.05 5 N(0,1) 1:2 0.35% 7.53%
Table 5.11: Simulation results for $\lambda_0 = 0.05$, $or_z = 5$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -3.9120 -3.9085 0.1769 0.0000 -0.0315 0.4369 0.02 1 B(0.5) 1:1 0.09% 10.76%
RB -3.9120 -3.9085 0.1580 0.0000 -0.0315 0.4369 0.02 1 B(0.5) 1:1 0.09% 10.17%
CO -3.9120 -3.9251 0.1436 0.0000 -0.0250 0.3284 0.02 1 B(0.5) 1:2 0.33% 9.65%
RB -3.9120 -3.9251 0.1246 0.0000 -0.0250 0.3284 0.02 1 B(0.5) 1:2 0.33% 8.99%
CO -3.9120 -3.9362 0.0717 0.0000 0.0019 0.1242 0.02 1 N(0,1) 1:1 0.62% 6.80%
RB -3.9120 -3.9355 0.0491 0.0000 0.0019 0.1242 0.02 1 N(0,1) 1:1 0.60% 5.63%
CO -3.9120 -3.9495 0.0640 0.0000 0.0014 0.0888 0.02 1 N(0,1) 1:2 0.96% 6.41%
RB -3.9120 -3.9490 0.0435 0.0000 0.0014 0.0888 0.02 1 N(0,1) 1:2 0.94% 5.28%
Table 5.12: Simulation results for $\lambda_0 = 0.02$, $or_z = 1$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -3.9120 -3.9198 0.0655 0.6931 0.6789 0.3649 0.02 2 B(0.2) 1:2 0.20% 6.53%
RB -3.9120 -3.9198 0.0609 0.6931 0.6789 0.3649 0.02 2 B(0.2) 1:2 0.20% 6.29%
CO -3.9120 -3.9113 0.1234 0.6931 0.6830 0.3061 0.02 2 B(0.5) 1:1 0.02% 8.98%
RB -3.9120 -3.9113 0.1441 0.6931 0.6830 0.3061 0.02 2 B(0.5) 1:1 0.02% 9.71%
CO -3.9120 -3.9273 0.1027 0.6931 0.6982 0.2352 0.02 2 B(0.5) 1:2 0.39% 8.16%
RB -3.9120 -3.9273 0.1207 0.6931 0.6982 0.2352 0.02 2 B(0.5) 1:2 0.39% 8.85%
CO -3.9120 -3.9555 0.1149 0.6931 0.7446 0.1249 0.02 2 N(0,1) 1:1 1.11% 8.57%
RB -3.9120 -3.9512 0.0681 0.6931 0.7446 0.1249 0.02 2 N(0,1) 1:1 1.00% 6.60%
CO -3.9120 -3.9629 0.0970 0.6931 0.7221 0.0866 0.02 2 N(0,1) 1:2 1.30% 7.86%
RB -3.9120 -3.9548 0.0584 0.6931 0.7221 0.0866 0.02 2 N(0,1) 1:2 1.09% 6.11%
Table 5.13: Simulation results for $\lambda_0 = 0.02$, $or_z = 2$

Method $\alpha$ $\hat\alpha$ $\widehat{\mathrm{Var}}[\hat\alpha]$ $\beta$ $\hat\beta$ $\widehat{\mathrm{Var}}[\hat\beta]$ $\lambda_0$ $or_z$ Exposure m ratio bias% se%
CO -3.9120 -3.9186 0.0555 1.6094 1.6044 0.2256 0.02 5 B(0.2) 1:2 0.17% 6.01%
RB -3.9120 -3.9186 0.0634 1.6094 1.6044 0.2256 0.02 5 B(0.2) 1:2 0.17% 6.43%
CO -3.9120 -3.9219 0.0863 1.6094 1.6147 0.2068 0.02 5 B(0.5) 1:1 0.25% 7.49%
RB -3.9120 -3.9219 0.1300 1.6094 1.6147 0.2068 0.02 5 B(0.5) 1:1 0.25% 9.19%
CO -3.9120 -3.9265 0.0769 1.6094 1.6146 0.1706 0.02 5 B(0.5) 1:2 0.37% 7.06%
RB -3.9120 -3.9265 0.1169 1.6094 1.6146 0.1706 0.02 5 B(0.5) 1:2 0.37% 8.71%
CO -3.9120 -4.0216 4.7854 1.6094 1.6598 0.1171 0.02 5 N(0,1) 1:1 2.80% 54.40%
RB -3.9120 -3.9655 1.0410 1.6094 1.6598 0.1171 0.02 5 N(0,1) 1:1 1.37% 25.73%
CO -3.9120 -4.0174 1.1822 1.6094 1.6409 0.0818 0.02 5 N(0,1) 1:2 2.69% 27.06%
RB -3.9120 -3.9578 0.1063 1.6094 1.6409 0.0818 0.02 5 N(0,1) 1:2 1.17% 8.24%
Table 5.14: Simulation results for $\lambda_0 = 0.02$, $or_z = 5$

We found estimates in 160 of these 192 cases. Within these estimates, the results were generally good for both the Case Only Estimator (CO) and the Rao-Blackwell Estimator (RB). For example, for $\lambda_0 = 0.2$ and $or_z = 1$, 2 or 5 (tables 5.3, 5.4 and 5.5), the proposed methods were able to find point and variance estimates for all the above-mentioned scenarios.
The error of $\hat\alpha$ relative to the true value was as low as 0.01% for standard normally distributed exposures, and as low as 0.12% for binary exposures. As the binary exposure prevalence went up, or as fewer controls were available, the error increased as well, but stayed small (< 1%). This indicates that our proposed methods can be applied successfully across a reasonable range of scenarios with varied disease density levels, disease odds ratios, and exposure distributions. As long as a case-control sample of adequate size was obtained, the proposed methods gave accurate estimates of baseline odds and disease risks. Comparing the estimates $\hat\alpha_{co}$ and $\hat\alpha_{rb}$ across all 160 returned cases, we found that the relative estimation bias was below 1% in most cases and never exceeded 3%. The bias was both downward and upward, which is consistent with the estimators being approximately unbiased. The bias was generally smaller for Bernoulli exposures than for standard normal exposures. The relative standard errors of the estimates, except for a few cases, were at most around 10%, regardless of the simulated scenario. The performance also changes with the size matching ratio, which suggests a trade-off in maximizing inference for any specific set of design parameters, such as trial size, exposure type and others. For example, in tables 5.3 to 5.5, for binary exposures, compared to 1-to-1 size matching, estimation errors were smaller in 1-to-2 matched experiments when the exposure odds ratio OR >= 2. The errors were larger otherwise. In contrast, for normal exposures, the observation is reversed. This means that trying different size matching schemes in the experiment design may help to achieve the best inference for a given exposure and conferred risk level. In particular, we can increase the match ratio if the conferred risk is high and the exposure is Bernoulli, or decrease the match ratio if the conferred risk is high and the exposure is normal.
The same conclusion generally holds true for the cases we studied at $\lambda_0 = 0.1$. We also observed that in all the completed cases where $\hat\alpha$ estimates were available, their variance estimates were also available and reasonably small, meaning the overall estimation is robust. For the other 32 cases without returned estimates, we observed that the proposed methods encounter difficulty if the number of cases in the sample is too small. For example, the smallest $\lambda_0$ that has valid estimates for all tested scenarios was 0.02 for a study base of 1000 participants. If the number of cases went lower, neither $\hat\alpha$ nor $\hat\beta$ nor their variance estimators could be computed. The requirement was the same for the exposure variable distribution. If the exposure distribution was extreme, for example B(0.005) for a study base of 1000 participants, the methods stopped working as well. However, in rare disease or rare exposure scenarios, this method could still work if the study base is large enough to obtain a case-control sample of moderate size.

5.2.2 Confidence Interval of Disease Risk

Using table 5.4 from the previous section, which has $\lambda_0 = 0.2$ and OR = 2, as an example, we estimated the disease risk associated with every listed exposure value together with the associated confidence intervals and report them in table 5.15. The true values are $p_{(\text{exposure}=0)} = \frac{0.2}{1+0.2} = 0.1667$ and $p_{(\text{exposure}=1)} = \frac{0.2\times 2}{1+0.2\times 2} = 0.2857$. Taking a Bernoulli exposure of B(0.1) and 1-to-1 matching as an example, the estimated disease risk and confidence interval for an exposed person are 0.2907 (95% CI: 0.1696-0.4513) for case only estimation and 0.2907 (95% CI: 0.1694-0.4516) for Rao-Blackwell estimation. The numbers for an unexposed person are 0.1669 (95% CI: 0.1439-0.1927) and 0.1669 (95% CI: 0.1433-0.1936). The CIs from both estimators captured the respective true $p$ values.
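The interval construction of Section 4.4.2 can be sketched numerically as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes that $\hat\alpha$ and $\hat\beta$ are treated as uncorrelated, so that $\mathrm{var}(\hat t) = \mathrm{var}(\hat\alpha) + z^2\,\mathrm{var}(\hat\beta)$; the text does not state how the covariance term was handled.

```python
import math
from statistics import NormalDist


def expit(t):
    """Inverse logit, mapping the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def risk_with_ci(alpha_hat, var_alpha, beta_hat, var_beta, z, level=0.95):
    """Point estimate and confidence interval for pr(D = 1 | z).

    Assumption of this sketch: no covariance between alpha_hat and beta_hat,
    so var(t_hat) = var_alpha + z**2 * var_beta.
    """
    t_hat = alpha_hat + z * beta_hat
    se = math.sqrt(var_alpha + z ** 2 * var_beta)
    crit = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    return expit(t_hat), expit(t_hat - crit * se), expit(t_hat + crit * se)


# Rounded CO inputs for B(0.1), 1:1 matching, taken from table 5.4
print(risk_with_ci(-1.6076, 0.0080, 0.7156, 0.1183, z=1))
print(risk_with_ci(-1.6076, 0.0080, 0.7156, 0.1183, z=0))
```

With these rounded inputs the sketch gives values that agree with the case only figures quoted above for the exposed and unexposed person to about three decimals, which suggests, but does not confirm, that the reported intervals were computed without a covariance term.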
The CI bounds (lower and upper limit) were generally tight, although further improvements could be made (see future directions). The same conclusions were drawn when we investigated other scenarios (data not shown).

Method | Size M. Ratio | Baseline odds coef. (true) | Baseline odds coef. (est.) | Var. of baseline odds coef. (est.) | Odds ratio coef. (true) | Odds ratio coef. (est.) | Var. of odds ratio coef. (est.) | Exposure Distr. | Exposure V. | Estimated p | 95% CI lower | 95% CI upper
CO | 1:1 | -1.6094 | -1.6076 | 0.008 | 0.6931 | 0.7156 | 0.1183 | B(0.1) | 0 | 0.1669 | 0.1439 | 0.1927
RB | 1:1 | -1.6094 | -1.6076 | 0.0085 | 0.6931 | 0.7156 | 0.1183 | B(0.1) | 0 | 0.1669 | 0.1433 | 0.1936
CO | 1:1 | -1.6094 | -1.6076 | 0.008 | 0.6931 | 0.7156 | 0.1183 | B(0.1) | 1 | 0.2907 | 0.1696 | 0.4513
RB | 1:1 | -1.6094 | -1.6076 | 0.0085 | 0.6931 | 0.7156 | 0.1183 | B(0.1) | 1 | 0.2907 | 0.1694 | 0.4516
CO | 1:2 | -1.6094 | -1.6075 | 0.0077 | 0.6931 | 0.6975 | 0.0793 | B(0.1) | 0 | 0.1669 | 0.1444 | 0.1922
RB | 1:2 | -1.6094 | -1.6075 | 0.0061 | 0.6931 | 0.6975 | 0.0793 | B(0.1) | 0 | 0.1669 | 0.1467 | 0.1893
CO | 1:2 | -1.6094 | -1.6075 | 0.0077 | 0.6931 | 0.6975 | 0.0793 | B(0.1) | 1 | 0.2870 | 0.1842 | 0.4178
RB | 1:2 | -1.6094 | -1.6075 | 0.0061 | 0.6931 | 0.6975 | 0.0793 | B(0.1) | 1 | 0.2870 | 0.1850 | 0.4165
CO | 1:1 | -1.6094 | -1.6061 | 0.0089 | 0.6931 | 0.6928 | 0.062 | B(0.2) | 0 | 0.1671 | 0.1429 | 0.1945
RB | 1:1 | -1.6094 | -1.6061 | 0.0101 | 0.6931 | 0.6928 | 0.062 | B(0.2) | 0 | 0.1671 | 0.1415 | 0.1964
CO | 1:1 | -1.6094 | -1.6061 | 0.0089 | 0.6931 | 0.6928 | 0.062 | B(0.2) | 1 | 0.2863 | 0.1923 | 0.4034
RB | 1:1 | -1.6094 | -1.6061 | 0.0101 | 0.6931 | 0.6928 | 0.062 | B(0.2) | 1 | 0.2863 | 0.1916 | 0.4044
CO | 1:2 | -1.6094 | -1.6067 | 0.0083 | 0.6931 | 0.6893 | 0.0435 | B(0.2) | 0 | 0.1670 | 0.1437 | 0.1934
RB | 1:2 | -1.6094 | -1.6067 | 0.0074 | 0.6931 | 0.6893 | 0.0435 | B(0.2) | 0 | 0.1670 | 0.1449 | 0.1918
CO | 1:2 | -1.6094 | -1.6067 | 0.0083 | 0.6931 | 0.6893 | 0.0435 | B(0.2) | 1 | 0.2855 | 0.2037 | 0.3843
RB | 1:2 | -1.6094 | -1.6067 | 0.0074 | 0.6931 | 0.6893 | 0.0435 | B(0.2) | 1 | 0.2855 | 0.2043 | 0.3834
CO | 1:1 | -1.6094 | -1.6031 | 0.0142 | 0.6931 | 0.6877 | 0.0369 | B(0.5) | 0 | 0.1675 | 0.1374 | 0.2027
RB | 1:1 | -1.6094 | -1.6031 | 0.0174 | 0.6931 | 0.6877 | 0.0369 | B(0.5) | 0 | 0.1675 | 0.1345 | 0.2068
CO | 1:1 | -1.6094 | -1.6031 | 0.0142 | 0.6931 | 0.6877 | 0.0369 | B(0.5) | 1 | 0.2859 | 0.2045 | 0.3841
RB | 1:1 | -1.6094 | -1.6031 | 0.0174 | 0.6931 | 0.6877 | 0.0369 | B(0.5) | 1 | 0.2859 | 0.2023 | 0.3873
CO | 1:2 | -1.6094 | -1.605 | 0.0123 | 0.6931 | 0.6892 | 0.0279 | B(0.5) | 0 | 0.1673 | 0.1391 | 0.1998
RB | 1:2 | -1.6094 | -1.605 | 0.0135 | 0.6931 | 0.6892 | 0.0279 | B(0.5) | 0 | 0.1673 | 0.1379 | 0.2014
CO | 1:2 | -1.6094 | -1.605 | 0.0123 | 0.6931 | 0.6892 | 0.0279 | B(0.5) | 1 | 0.2858 | 0.2127 | 0.3722
RB | 1:2 | -1.6094 | -1.605 | 0.0135 | 0.6931 | 0.6892 | 0.0279 | B(0.5) | 1 | 0.2858 | 0.2117 | 0.3736
CO | 1:1 | -1.6094 | -1.6105 | 0.0101 | 0.6931 | 0.6928 | 0.0143 | N(0,1) | 0 | 0.1665 | 0.1409 | 0.1957
RB | 1:1 | -1.6094 | -1.6093 | 0.0104 | 0.6931 | 0.6928 | 0.0143 | N(0,1) | 0 | 0.1667 | 0.1407 | 0.1963
CO | 1:1 | -1.6094 | -1.6105 | 0.0101 | 0.6931 | 0.6928 | 0.0143 | N(0,1) | 1 | 0.2854 | 0.2273 | 0.3517
RB | 1:1 | -1.6094 | -1.6093 | 0.0104 | 0.6931 | 0.6928 | 0.0143 | N(0,1) | 1 | 0.2857 | 0.2271 | 0.3524
CO | 1:2 | -1.6094 | -1.6115 | 0.01 | 0.6931 | 0.691 | 0.0105 | N(0,1) | 0 | 0.1664 | 0.1409 | 0.1954
RB | 1:2 | -1.6094 | -1.6099 | 0.0068 | 0.6931 | 0.691 | 0.0105 | N(0,1) | 0 | 0.1666 | 0.1454 | 0.1903
CO | 1:2 | -1.6094 | -1.6115 | 0.01 | 0.6931 | 0.691 | 0.0105 | N(0,1) | 1 | 0.2849 | 0.2313 | 0.3453
RB | 1:2 | -1.6094 | -1.6099 | 0.0068 | 0.6931 | 0.691 | 0.0105 | N(0,1) | 1 | 0.2852 | 0.2356 | 0.3405

Column names Size M., Distr. and V. are abbreviations for Size Matching, Distribution and Values, respectively. CO: Case-only Estimator; RB: Rao-Blackwell Estimator.

Table 5.15: Disease Risk Estimates for Simulated Exposure of B(0,1) under Baseline Odds = 0.2 and OR_z = 2.

Chapter 6: Real Data Application: MEPEDS Study

6.1 Motivation

The Multi-Ethnic Pediatric Eye Disease Study (MEPEDS) is a multi-ethnic, population-based study of preschool children in Los Angeles and Riverside Counties in California [78]. MEPEDS provided population-based information with regard to the prevalence and risk factors of various vision disorders including refractive error (myopia, hyperopia, astigmatism), strabismus and amblyopia in minority children [9, 17, 18, 23, 28, 29, 37, 48-50, 52, 53, 69-71, 79, 82, 84-87]. It improved our understanding of the magnitude and causes of ocular disease problems in population-based minority samples of African-American, Latino, Non-Hispanic White and Asian children in the United States.
MEPEDS also studied selected demographic, biological and behavioral risk factors associated with these diseases and the consequences of these diseases from a health-related quality of life perspective. Despite the rich information collected, open questions remained. For example, are there additional risk factors, lifestyle factors, or even genetic factors that we want to explore but that were not collected in the original study design? Are there aspects that were collected in the original study but were not comprehensively measured? For a population-based sample involving over 9000 children, it is not feasible or cost-effective to revisit everyone to collect additional information or carry out ancillary studies. For this purpose, a case-control study with a smaller number of subjects is a more feasible and reliable study design.

With small case-control studies, it is common to estimate the odds ratio associated with certain exposures, in order to draw inference on disease risk with regard to those exposures. However, due to the limited size of a case-control sample, and because it is sampled by disease status, it has not been possible to estimate the disease risk associated with certain exposures, such as the disease probability in the MEPEDS cross-sectional design in different ethnic groups, age groups, or groups with specific characteristics. Our motivation was therefore to develop statistical estimators and apply them to actual data to estimate, without bias, the baseline odds as well as the disease risk using a case-control study, with results comparable to estimates from a large population-based sample. We chose strabismus as the disease in this chapter to illustrate how our methods work in a real data example.

6.2 Background

6.2.1 MEPEDS Study Design, Population and Methods

MEPEDS (Multi-Ethnic Pediatric Eye Disease Study) is a multi-ethnic, population-based study of preschool children in Los Angeles and Riverside Counties in California.
The study design and sampling plan have been described elsewhere [78]. Briefly, the study was designed to establish the prevalence of common ocular conditions in a population-based sample of African-American, Asian, Hispanic, and non-Hispanic white children; to identify risk factors associated with these conditions; and to explore the relationship between physical and psychosocial functioning and the prevalence and severity of various ocular disorders.

Eligible MEPEDS participants were children aged 6 to 72 months living in 1 of 100 selected census tracts in Los Angeles County in and around the cities of Alhambra, Inglewood, and Glendale, and in the city of Riverside in Riverside County, California. The overall participation rate for eligible children was 80%; specifically, 9172 children completed comprehensive eye examinations of 11519 eligible children identified by a door-to-door census. Participants were recruited, and clinical examinations and parental interviews were completed, from 2003 to 2011.

All children aged 5 to 70 months on the day of the door-to-door household screening who were residents of one of the study-selected census tracts were eligible for participation. Informed consent was obtained from the parent or legal guardian (referred to as 'parent' hereafter) of each eligible child and followed by a brief in-home interview that included basic demographic information and a medical history including self-reported eye conditions and stereopsis testing. Eligible children were then scheduled for a comprehensive eye examination and in-depth parental interview at the local MEPEDS clinic.

The comprehensive eye examination was conducted by optometrists or ophthalmologists who were trained and certified using standardized study protocols.
The examinations, described in detail previously [78], included monocular distance optotype visual acuity (VA) testing for children aged 30 months or older, evaluation of ocular alignment, anterior segment and dilated fundus evaluations, and measurement of refractive error under cycloplegic conditions. The protocol for cycloplegia was 2 drops of 1% cyclopentolate (0.5% for <12-month-olds) administered 5 minutes apart followed by autorefraction using the Retinomax Autorefractor (Right Manufacturing, Virginia Beach, VA) at least 30 minutes after the last drop. When Retinomax confidence ratings were not ≥ 8 in both eyes despite 3 attempts, cycloplegic retinoscopy was performed. Noncycloplegic retinoscopy was performed if parents did not allow cycloplegic eye drops.

6.2.2 Definition, Determination and Impact of Strabismus

Strabismus is a common childhood ocular disorder in which the eyes do not align with each other when looking at an object. It was assessed in MEPEDS using the unilateral cover (cover/uncover) test (for which Krimsky testing was substituted only if the child would not permit cover testing) and the alternate cover and prism test, at distance and near fixation, without correction and with optical correction, if worn. Transient misalignment after alternate cover testing was not designated as strabismus unless confirmed by a repeat unilateral cover test. Hirschberg testing was used when cover testing could not be performed. Strabismus was defined as constant or intermittent heterotropia of any magnitude at distance or near fixation. Children tested at only 1 fixation distance and found to be without strabismus were considered non-strabismic. Strabismus was classified according to the horizontal direction (esotropia, exotropia) of the tropia or as vertical strabismus if there was no horizontal tropia, according to the primary direction.

Strabismus may interfere with normal binocular depth perception and thereby hinder normal physical functioning.
Further, visibly noticeable strabismus may have a negative impact on a child's self-image, with potentially significant psychosocial consequences [26, 34, 36, 57-59].

MEPEDS is one of the first studies to provide a direct measure of the likelihood of physical and psychosocial consequences in a population-based pediatric population [84]. A parental interview was administered by a study interviewer to the parent of each child at the time of the clinical examination, or subsequently by telephone if the child was accompanied to the exam by a person other than a parent. This interview consisted of sociodemographic, medical, and ocular history items, and the PedsQL, a 23-item instrument designed to measure core dimensions of pediatric general health-related quality of life (GHRQOL). It covered physical functioning, and psychosocial composite functioning including emotional, social, and school functioning.

In MEPEDS, we found that strabismus was associated with significantly worse GHRQOL, including physical, emotional, social, and school functioning as measured by the PedsQL. The associations existed even after controlling for gender, age, ethnicity, and family income level. The negative association of strabismus with GHRQOL was seen both for esotropia and exotropia, and both for intermittent and constant strabismus.
The apparent associations were not explained by parents' knowledge of strabismus diagnosis prior to their child's examination, or by the association of strabismus with other systemic diseases that could themselves influence quality of life. Even after restricting the analysis to children without a history of co-morbidities or a diagnosis of strabismus prior to the clinical examination, a significant relationship of strabismus with worse physical and psychosocial health remained (p < 0.05 for all).

6.2.3 Prevalence and Risk Factors of Strabismus

In the MEPEDS study, a total of 3003 Hispanic, 3005 African American, 1514 non-Hispanic white and 1522 Asian children were included in the analysis of strabismus. (Only four Hispanic, two African American, and three Asian children were excluded from the strabismus analysis because of an inability to complete all testing.) The prevalence of strabismus was 2.4% (95% CI, 1.9 - 3.0) in Hispanic, 2.5% (95% CI, 2.0 - 3.1) in African American, 3.24% (95% CI, 2.40 - 4.26) in non-Hispanic white and 3.55% (95% CI, 2.68 - 4.60) in Asian children. Exotropia was more common than esotropia in Hispanic/Latino (1.5% vs 0.9%), African American (1.4% vs 1.1%) and Asian children (2.1% vs 1.4%), yet esotropia (2.3%) was more common than exotropia (0.7%) among non-Hispanic white children. There was no significant difference in prevalence of strabismus between males and females.

Prevalence (%) (95% confidence interval) (n) | Hispanic/Latino (n = 3003) | African American (n = 3005) | Non-Hispanic White (n = 1514) | Asian (n = 1522)
Any strabismus | 2.4 (1.9-3.0) (73) | 2.5 (2.0-3.1) (76) | 3.2 (2.4-4.3) (49) | 3.5 (2.7-4.6) (54)
Esotropia | 0.9 (0.5-1.2) (26) | 1.1 (0.7-1.5) (33) | 2.3 (1.6-3.2) (35) | 1.4 (0.9-2.1) (21)
Exotropia | 1.5 (1.0-1.9) (44) | 1.4 (0.9-1.8) (41) | 0.7 (0.4-1.3) (11) | 2.1 (1.4-3.0) (32)

Table 6.1: Strabismus Prevalence and Subtypes by Ethnicity in the Multi-Ethnic Pediatric Eye Disease Study
When the results were stratified into 6 age groups, the proportion having strabismus significantly increased with age in all four ethnic groups (P < 0.05, trend test for all). The prevalence ranged from 2.0% among 6- to 11-month-olds to 2.8% among 60- to 72-month-olds in Hispanic children, from 1.1% among 6- to 11-month-olds to 3.9% among 60- to 72-month-olds in African American children, from 2.4% among 6- to 11-month-olds to 4.6% among 60- to 72-month-olds in non-Hispanic White children, and from 1.5% among 6- to 11-month-olds to 5.3% among 60- to 72-month-olds in Asian children.

Age (mos): Prevalence (%) (95% confidence interval) (n) | Hispanic/Latino (n = 3003) | African American (n = 3005) | Non-Hispanic White (n = 1514) | Asian (n = 1522)
6-11 | 2.0 (0.4-3.6) (6) | 1.1 (0.0-2.3) (3) | 2.4 (0.5-7.0) (3) | 1.5 (0.2-5.2) (2)
12-23 | 2.2 (1.0-3.5) (12) | 1.3 (0.3-2.2) (7) | 1.7 (0.5-4.4) (4) | 0.4 (0.01-2.3) (1)
24-35 | 0.9 (0.1-1.6) (5) | 2.2 (1.0-3.4) (12) | 1.6 (0.4-4.0) (4) | 3.8 (1.8-6.9) (10)
36-47 | 2.6 (1.3-4.0) (14) | 1.9 (0.7-3.0) (10) | 4.0 (2.1-6.8) (12) | 2.2 (0.8-4.7) (6)
48-59 | 3.9 (2.3-5.5) (21) | 4.2 (2.5-5.9) (23) | 4.1 (2.0-7.2) (11) | 6.3 (3.8-9.7) (18)
60-72 | 2.8 (1.4-4.2) (15) | 3.9 (2.2-5.5) (21) | 4.6 (2.6-7.5) (15) | 5.3 (3.1-8.3) (17)

Table 6.2: Strabismus Prevalence Stratified by Age (Months) in the Multi-Ethnic Pediatric Eye Disease Study

A pooled risk factor analysis for strabismus was conducted by combining MEPEDS and Baltimore Pediatric Eye Disease Study data.
After adjustment for the other variables in the multivariate analysis, the following were identified as significant independent indicators of a greater risk for esotropia: gestational age younger than 33 weeks (OR=4.43), active maternal smoking during pregnancy (OR=2.04), age range of 48 to 72 months (OR>7.94 relative to reference age group of 6 to 11 months), SE anisometropia of 1.00 D or more (OR=2.03 relative to reference level of < 0.50 D), and SE hyperopia starting at the 2.00 to less than 3.00 D level (OR=6.38-122.24 for different levels of hyperopia, relative to reference level of 0.00 to < +1.00 D). In the multivariate analysis for exotropia, active maternal smoking during pregnancy (OR=2.88), gestational age less than 33 weeks (OR=2.48), female sex (OR=1.62), bilateral astigmatism of 1.50 D or more (OR=1.49-5.88 relative to reference level of < 0.50 D), and J0 anisometropia of at least 0.25 D (OR=2.01-2.63 relative to reference level of < 0.25 D) were identified as independent indicators of a greater risk after adjustment for the other variables.

6.3 Analysis Methods

We chose strabismus as the disease, and children with strabismus (either esotropia or exotropia) were chosen to be the cases. Controls were randomly selected from the rest of the study base using the size matching sampling method. Exposure information was available on each subject. Smoking during pregnancy was chosen to be the exposure, since it was a binary variable and had been shown to be an independent risk factor for both esotropia and exotropia. The baseline odds coefficient and disease risk were estimated and compared to a full study base analysis.

6.4 Results

6.4.1 Study Population

Out of 9197 children in the MEPEDS, 7963 children were included in this analysis. 1212 children were excluded due to missing data with regard to the question of maternal smoking; an additional 3 children were excluded as their ethnicity was not clearly categorized into the four ethnic groups.
In this analysis sample, there were 2957 Hispanic children, 2625 African American children, 1081 Non-Hispanic White and 1300 Asian children. There were more males than females in each ethnic group. The percentages of maternal smoking during pregnancy also differed across ethnic groups: 2.3% in Hispanic, 9.8% in African American, 2.0% in Non-Hispanic White, and 0.5% in Asian children. Prevalence of strabismus was 2.4% for Hispanic and African American children, and higher in Non-Hispanic White (3.0%) and Asian children (3.9%).

Characteristic | Hispanic or Latino (n=2957) | African American (n=2625) | Non-Hispanic White (n=1081) | Asian (n=1300) | Total (n=7963)
Gender | | | | |
Male | 1516 (51.3) | 1332 (50.7) | 581 (53.8) | 676 (52.0) | 4105 (51.6)
Female | 1441 (48.7) | 1293 (49.3) | 500 (46.3) | 624 (48.0) | 3858 (48.5)
Age | | | | |
6-11 months | 293 (9.9) | 263 (10.0) | 98 (9.1) | 125 (9.6) | 779 (9.8)
12-23 months | 530 (17.9) | 498 (19.0) | 168 (15.5) | 217 (16.7) | 1413 (17.7)
24-35 months | 566 (19.1) | 476 (18.1) | 174 (16.1) | 217 (16.7) | 1433 (18.0)
36-47 months | 520 (17.6) | 451 (17.2) | 221 (20.4) | 231 (17.8) | 1423 (17.9)
48-59 months | 526 (17.8) | 473 (18.0) | 190 (17.6) | 237 (18.2) | 1426 (17.9)
60-72 months | 522 (17.7) | 464 (17.7) | 230 (21.3) | 273 (21.0) | 1489 (18.7)
Maternal Smoking during Pregnancy | | | | |
Yes | 69 (2.3) | 257 (9.8) | 22 (2.0) | 7 (0.5) | 355 (4.5)
No | 2888 (97.7) | 2368 (90.2) | 1059 (98.0) | 1293 (99.5) | 7608 (95.5)
Strabismus | | | | |
Yes | 72 (2.4) | 64 (2.4) | 32 (3.0) | 51 (3.9) | 219 (2.8)
No | 2885 (97.6) | 2561 (97.6) | 1049 (97.0) | 1249 (96.1) | 7744 (97.2)

Table 6.3: Characteristics of Children in the Analysis Sample by Ethnicity (n (%))

6.4.2 Estimated Prevalence of Strabismus Associated with Maternal Smoking during Pregnancy

We constructed case-control samples from the analysis sample using 1-to-1 size matching. The Case-only estimator (CO) and Rao-Blackwell estimator (RB) were used to estimate the baseline odds coefficient and its variance. Odds ratios were estimated using conditional logistic regression.
For comparison with the case-control sample, the full study base was also analyzed with unconditional logistic regression. We analyzed the risk of strabismus associated with maternal smoking during pregnancy in each ethnicity. Table 6.4 shows the estimates of the baseline odds coefficient, the odds ratio coefficient and their variances under the different estimation methods. With a substantially smaller sample size (144 vs 2957, 128 vs 2625, 64 vs 1081), we obtained a very close estimate of the baseline odds coefficient as well as a close estimate of its variance. This demonstrates the accuracy and efficiency of the proposed estimators.

Sampling Method | Estimation Method | N | n cases | n controls | Baseline odds coef. (est.) | Var. of baseline odds coef. (est.) | Baseline odds (est.) | Odds ratio coef. (est.) | Var. of odds ratio coef. (est.) | OR_z (est.)
Hispanic/Latino | | | | | | | | | |
Full Sample | UL | 2957 | 72 | 2885 | -3.7250 | 0.0151 | 0.0241 | 0.9369 | 0.2805 | 2.5520
1:1 Size Matching | CO | 144 | 72 | 72 | -3.7195 | 0.0152 | 0.0242 | 0.8627 | 0.9292 | 2.3695
1:1 Size Matching | RB | 144 | 72 | 72 | -3.7195 | 0.0145 | 0.0242 | 0.8627 | 0.9292 | 2.3695
African American | | | | | | | | | |
Full Sample | UL | 2625 | 64 | 2561 | -3.8162 | 0.0201 | 0.0220 | 0.8840 | 0.1011 | 2.4205
1:1 Size Matching | CO | 128 | 64 | 64 | -3.8155 | 0.0199 | 0.0220 | 0.9654 | 0.3120 | 2.6260
1:1 Size Matching | RB | 128 | 64 | 64 | -3.8155 | 0.0206 | 0.0220 | 0.9654 | 0.3120 | 2.6260
Non-Hispanic White | | | | | | | | | |
Full Sample | UL | 1081 | 32 | 1049 | -3.5351 | 0.0343 | 0.0292 | 1.2326 | 0.5843 | 3.4300
1:1 Size Matching | CO | 64 | 32 | 32 | -3.5126 | 0.0350 | 0.0298 | 0.5180 | 1.4145 | 1.6787
1:1 Size Matching | RB | 64 | 32 | 32 | -3.5126 | 0.0311 | 0.0298 | 0.5180 | 1.4145 | 1.6787

UL: Unconditional Logistic Regression; CO: Case-only Estimator; RB: Rao-Blackwell Estimator.

Table 6.4: Point and Variance Estimation of Baseline Odds Coefficient in the Analysis Sample

Maternal Smoking during Pregnancy | Sampling Method | Estimation Method | N | Estimated Prevalence of Strabismus | 95% CI Lower Limit | 95% CI Upper Limit
Yes | Full Sample | UL | 2957 | 5.80% | 2.08% | 15.15%
Yes | 1:1 Size Matching | CO | 144 | 5.43% | 0.85% | 27.85%
Yes | 1:1 Size Matching | RB | 144 | 5.43% | 0.85% | 27.83%
No | Full Sample | UL | 2957 | 2.35% | 1.86% | 2.98%
No | 1:1 Size Matching | CO | 144 | 2.37% | 1.87% | 2.99%
No | 1:1 Size Matching | RB | 144 | 2.37% | 1.88% | 2.98%

Table 6.5: Estimated Prevalence of Strabismus in Hispanic Children 6-72 months old

Using the CO method, the estimated prevalence of strabismus was 5.43% (95% CI, 0.85%-27.85%) for Hispanic children with maternal smoking during pregnancy and 2.37% (95% CI, 1.87%-2.99%) for Hispanic children without maternal smoking. The estimates from the RB method were almost identical to those from the CO method. The full sample analysis estimated the prevalence of strabismus with and without maternal smoking as 5.80% (95% CI, 2.08%-15.15%) and 2.35% (95% CI, 1.86%-2.98%), respectively (Table 6.5).

Maternal Smoking during Pregnancy | Sampling Method | Estimation Method | N | Estimated Prevalence of Strabismus | 95% CI Lower Limit | 95% CI Upper Limit
Yes | Full Sample | UL | 2625 | 5.06% | 2.62% | 9.53%
Yes | 1:1 Size Matching | CO | 128 | 5.47% | 1.84% | 15.18%
Yes | 1:1 Size Matching | RB | 128 | 5.47% | 1.83% | 15.19%
No | Full Sample | UL | 2625 | 2.15% | 1.64% | 2.82%
No | 1:1 Size Matching | CO | 128 | 2.16% | 1.64% | 2.82%
No | 1:1 Size Matching | RB | 128 | 2.16% | 1.63% | 2.84%

Table 6.6: Estimated Prevalence of Strabismus in African American Children 6-72 months old

Using the CO method, the estimated prevalence of strabismus was 5.47% (95% CI, 1.84%-15.18%) for African American children with maternal smoking during pregnancy and 2.16% (95% CI, 1.64%-2.82%) for African American children without maternal smoking. The estimates from the RB method were almost identical to those from the CO method.
The full sample analysis estimated the prevalence of strabismus with and without maternal smoking as 5.06% (95% CI, 2.62%-9.53%) and 2.15% (95% CI, 1.64%-2.82%), respectively (Table 6.6).

Using the CO method, the estimated prevalence of strabismus was 4.77% (95% CI, 0.47%-34.64%) for Non-Hispanic White children with maternal smoking during pregnancy and 2.90% (95% CI, 2.02%-4.12%) for Non-Hispanic White children without maternal smoking. The estimates from the RB method were almost identical to those from the CO method. The full sample analysis estimated the prevalence of strabismus with and without maternal smoking as 9.09% (95% CI, 2.10%-31.84%) and 2.83% (95% CI, 1.99%-4.02%), respectively (Table 6.7).

Maternal Smoking during Pregnancy | Sampling Method | Estimation Method | N | Estimated Prevalence of Strabismus | 95% CI Lower Limit | 95% CI Upper Limit
Yes | Full Sample | UL | 1081 | 9.09% | 2.10% | 31.84%
Yes | 1:1 Size Matching | CO | 64 | 4.77% | 0.47% | 34.64%
Yes | 1:1 Size Matching | RB | 64 | 4.77% | 0.47% | 34.57%
No | Full Sample | UL | 1081 | 2.83% | 1.99% | 4.02%
No | 1:1 Size Matching | CO | 64 | 2.90% | 2.02% | 4.12%
No | 1:1 Size Matching | RB | 64 | 2.90% | 2.07% | 4.04%

Table 6.7: Estimated Prevalence of Strabismus in Non-Hispanic White Children 6-72 months old

Across the three ethnic groups analyzed, both the CO and RB methods produced point estimates very close to the full study base estimates, using a much smaller case-control sample. The one exception was the estimated disease risk in exposed Non-Hispanic White children, which was much further from the full study base estimate. This was because the number of cases in the Non-Hispanic White study base was too limited (only 32 cases), which affected the accuracy of the proposed methods. The estimates could not be computed for Asian children, because only 7 parents in the full study sample smoked during pregnancy.
Chapter 7: Real Data Application: LALES Study

7.1 Motivation

In the previous chapter, we applied the proposed method to obtain prevalence estimates for MEPEDS, a cross-sectional study. The disease risk in that setting is equivalent in theory to the disease prevalence in each stratum. However, disease risk is more appropriate when referring to disease occurrence/incidence. In this chapter, we applied the proposed method to a longitudinal study, estimated the cumulative incidence of disease using sampled cases and controls, and validated the estimates by comparing them to the estimates from the full study base.

7.2 Background

7.2.1 LALES Study Design, Population and Methods

The Los Angeles Latino Eye Study (LALES) [76] was a population-based longitudinal study designed to estimate the prevalence of cataract, glaucoma, diabetic retinopathy and age-related maculopathy, blindness and visual impairment among Latinos. It also explored the association of various risk factors with ocular disease and studied the impact of ocular disease on quality of life and visual functioning, the utility of eye health and utilization of eye care services. With a follow-up study of the participants from baseline, LALES was able to estimate the incidence and progression of diabetic retinopathy, glaucoma, cataract and age-related maculopathy, and assess the association of various risk factors with the incidence and progression of ocular disease, as well as evaluate the impact of incidence and progression of ocular disease on vision-specific quality of life in a well-characterized cohort.

The LALES baseline study was conducted during 2000-2003. The study population consisted of self-identified Latinos, aged 40 years or older, living in the city of La Puente, California.
Six census tracts of La Puente were selected specifically to allow generalizability of the LALES data to other Mexican-Americans in the United States, because their demographic and socioeconomic characteristics are similar to those of the Latino population in Los Angeles County, California, and in the United States. Details of the study design, sampling plan, and baseline data are reported elsewhere. In brief, all eligible individuals from all dwelling units within the 6 census tracts were identified and invited for a detailed clinical and eye examination, which was performed using standardized protocols by trained ophthalmologists and technicians at the LALES local eye examination center.

The LALES 4-year follow-up was conducted during 2004-2008. All living Latinos who were examined previously in the baseline study were invited to participate in a home interview and a comprehensive clinical examination, including measurement of visual acuity, refraction, visual field, intraocular pressure, fundus and optic disc photographs, hemoglobin A1c, and blood glucose. Similar questionnaire and examination procedures were used for both baseline and follow-up studies.

At baseline, 7789 eligible participants were identified, and 6357 (82%) of them completed the clinical examination (which included 11 participants who did not complete an in-home interview). At the 4-year follow-up study, 6100 out of 6357 were identified as living eligible participants, and 4658 (76%) completed the follow-up examination.

7.2.2 Definition, Determination and Impact of Diabetic Retinopathy

Diabetic retinopathy (DR) is one of the most common complications of diabetes and is a leading cause of visual impairment and blindness among adults in the United States (US) [75]. The growing number of adults impacted by diabetes and complications of diabetes worldwide underscores the importance of accurate documentation of the burden of DR.
In the United States, Latinos have a high prevalence of diabetes mellitus, and thus have a proportionately higher number of persons with diabetes than other racial or ethnic groups and are at risk of experiencing complications associated with diabetes, including retinopathy.

In LALES, diabetic retinopathy was one of the primary outcomes to be assessed at both baseline and follow-up [77, 80, 81]. A participant was considered to have definite diabetes mellitus if any of the following criteria were met: (1) the participant had a history of diabetes and was being treated with oral hypoglycemic medications, insulin, or diet alone; (2) the participant's hemoglobin A1c was measured at 7.0% or higher; or (3) the participant had a random blood glucose of 200 mg/100 ml or higher. The diabetes was considered to be type I if the participant was younger than 30 years of age when diagnosed with diabetes and was receiving insulin therapy. Otherwise, the diabetes was considered to be type II.

Diabetic retinopathy was defined as retinopathy in persons with definite diabetes mellitus. Grading protocols for DR were modifications of the Early Treatment Diabetic Retinopathy Study adaptation of the modified Airlie House classification of DR. For each eye, the maximum grade in any of the 7 standard photographic fields was determined for each of the lesions. Eyes were graded according to the following criteria: (1) no DR (levels 10 through 13), or (2) any DR (levels 14 through 85). Diabetic retinopathy was then classified as (1) non-proliferative DR (NPDR), mild (levels 14 through 20), moderate (levels 31 through 43), or severe (levels 47 through 53); or (2) proliferative DR (PDR; levels 60 through 85).

LALES showed that more severe DR was associated with worse health-related quality of life (HRQOL) scores on all of the SF-12 and NEI-VFQ-25 sub-scales [46].
SF-12 is a standard US norm-based instrument to assess general health quality of life, including the Physical Component Summary (PCS) and Mental Component Summary (MCS). The NEI-VFQ-25 is a disease-targeted set of measures designed to complement SF-12 by focusing on aspects of HRQOL particularly relevant to visually impaired adults, regardless of the cause of visual disability. The NEI-VFQ-25 is composed of 12 vision-targeted scales: general vision, general health, near and distance vision activities, ocular pain, vision-related social function, vision-related role function, vision-related mental health, vision-related dependency, driving difficulties, color vision, and peripheral vision. The domains with the most significant impact were vision-related daily activities, dependency, and mental health. Persons with bilateral moderate NPDR had the most substantial decrease in quality of life compared with those with less severe DR. The prevention of incident DR and, more importantly, its progression from unilateral to bilateral NPDR is likely to be an important goal in the management of individuals with DM.

7.2.3 Prevalence, Incidence and Risk Factors of Diabetic Retinopathy

The prevalence of diabetic retinopathy has been reported by LALES [81]. Of 6357 eligible participants in the LALES baseline study, 1263 participants (19.9%) had definite diabetes. Gradable fundus photographs were obtained from at least 1 eye for 1217 (96%) of those with definite diabetes. The age- and gender-specific prevalence of DR for the 1217 participants with definite diabetes and gradable fundus photographs was presented in [81]. In persons with definite diabetes, the prevalence of any DR was 46.9%. There was an increase in prevalence of any DR (including PDR and ME) with age (from those 40 to 49 years of age to those 70 to 79 years of age; P=0.05). A decrease in prevalence of any DR was noted in older persons (80 or more years of age).
There was no gender-specific difference in the prevalence of any DR (P=0.09, age-adjusted). After adjusting for non-participation (using a method of direct standardization) and for missing or ungradable photographs, no difference was found between the adjusted and unadjusted rates (adjusted rates for any DR, 45.4%; for mild NPDR, 14.9%; for moderate NPDR, 20.7%; for severe NPDR, 4.3%; and for PDR, 5.6%) (see Table 7.1).

Age Group (yrs)* | 40-49 (n=275) | 50-59 (n=406) | 60-69 (n=335) | 70-79 (n=171) | 80+ (n=30) | Total (n=1217)
Any DR | 40.7 (35.0,46.6) | 46.8 (41.8,51.5) | 48.7 (43.0,53.7) | 55.0 (47.5,62.4) | 40.0 (22.5,57.5) | 46.9 (44.0,49.6)

Gender** | Female (n=685) | Male (n=532)
Any DR | 44.8 (40.9,48.4) | 49.6 (45.4,53.9)

* Association of any DR and age (P=0.05)
** Association of any DR and gender (P=0.09, age-adjusted)

Table 7.1: Age- and Gender-Specific Prevalence (95% Confidence Interval) of Diabetic Retinopathy in Latinos with Definite Diabetes

In the LALES follow-up study four years later, 6100 out of the 6357 participants examined at baseline were identified as living eligible participants and invited for the follow-up, and 4658 (76%) completed the follow-up examination. Of these, 904 had definite diabetes at baseline (69 of 904 (7.6%) were newly diagnosed, and 835 of 904 (92.4%) were previously diagnosed), and 775 had gradable fundus photographs in the same eye at baseline and at follow-up. Of these 775 diabetic participants, 404 were at risk of developing any retinopathy, and 324 with DR at baseline were at risk for progression of DR.

The 4-year incidence of diabetic retinopathy in the first eye was 28.2% for the diabetic participants. Age-specific incidence ranged from 37.5% in the 40-to-49-year age group to 23.5% in the 70-or-more-year age group. There was an overall inverse relationship between age and DR incidence (P<.01), with the 40-to-49 age group having the highest incidence.
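The text does not say how the intervals around these incidence figures were computed. As an illustration only, the sketch below assumes a normal-approximation (Wald) interval for a proportion; the function name `wald_proportion_ci` is hypothetical. Applied to the 40-to-49 age group (incidence 37.5%, n = 96 in Table 7.2), it gives limits that round to the tabulated 27.8% and 47.2%.

```python
import math
from statistics import NormalDist


def wald_proportion_ci(p_hat, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: this is one common construction and may not be the one used
    for the published LALES intervals.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # binomial standard error
    return p_hat - z * se, p_hat + z * se


# 4-year incidence in the 40-49 age group: 37.5% of n = 96 at-risk participants.
lower, upper = wald_proportion_ci(0.375, 96)
print(round(100 * lower, 1), round(100 * upper, 1))
```

A Wald interval can extend below 0 or above 1 when the proportion is close to the boundary or n is small, which is one reason other constructions are sometimes preferred for rare outcomes.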
There was a significant increase in the incidence of DR with increasing duration of diabetes (P<.001), from 17.3% in the newly diagnosed to 41.9% in diabetics with 15 or more years of duration (see Table 7.2).

Incidence in 1st Eye

| Age Group (yrs)* | n | % (95% CI) |
|---|---|---|
| 40-49 | 96 | 37.5 (27.8, 47.2) |
| 50-59 | 146 | 30.1 (22.7, 37.6) |
| 60-69 | 111 | 19.8 (12.4, 27.2) |
| 70+ | 51 | 23.5 (11.9, 35.2) |
| Overall | 404 | 28.2 (23.8, 32.6) |

| Duration of diabetes (yrs)** | n | % (95% CI) |
|---|---|---|
| New | 139 | 17.3 (11.0, 23.6) |
| 1 to 4 | 124 | 27.4 (19.6, 35.3) |
| 5 to 9 | 67 | 31.3 (20.2, 42.5) |
| 10 to 14 | 43 | 51.2 (36.2, 66.1) |
| ≥15 | 31 | 41.9 (24.6, 59.3) |
| Overall | 404 | 28.2 (23.8, 32.6) |

* P of trend (P=0.01)
** P of trend (P<0.001)

Table 7.2: Estimated Four-Year Incidence of Any Diabetic Retinopathy Stratified by Age and Duration of Diabetes at Baseline

Univariate and multivariate associations of various risk factors with any DR have also been reported for the LALES baseline data [80]. Stepwise logistic regression analysis was used to determine the risk factors independently associated with having DR. The order in which variables were selected into the final stepwise model was taken as their order of importance. The independently associated risk factors were longer duration of diabetes (OR=1.08, 95% CI=(1.06-1.11)), elevated glycosylated hemoglobin (OR=1.22, 95% CI=(1.13-1.31)), elevated systolic blood pressure (OR=1.26, 95% CI=(1.08-1.47)), male gender (OR=1.50, 95% CI=(1.13-2.01)), and treatment with insulin (OR=1.60, 95% CI=(1.12-2.30)).

The length of diabetes was reported to be one of the most significant independent risk factors for DR in the LALES baseline and follow-up studies [77, 80, 81]. As a population-based study, LALES was able to estimate the prevalence of DR stratified by length-of-diabetes group, or to estimate the prevalence of DR as a function of length of diabetes.

7.3 Analysis Methods

We chose diabetic retinopathy as the disease of interest. We used the LALES follow-up data to study the outcome, 4-year incidence of DR.
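Because the outcome is a simple proportion, the overall estimate in Table 7.2 can be sanity-checked from counts reported in this chapter (114 incident cases among the 404 participants at risk). The sketch below assumes a normal-approximation (Wald) interval; the source does not state which interval was used, and the function name `wald_ci` is only illustrative.

```python
import math

def wald_ci(events, n, z=1.96):
    """Return the proportion and its normal-approximation (Wald) interval."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return p, p - z * se, p + z * se

# Overall row of Table 7.2: 114 incident cases among 404 at risk
p, lo, hi = wald_ci(114, 404)
print(f"{100 * p:.1f}% ({100 * lo:.1f}, {100 * hi:.1f})")  # 28.2% (23.8, 32.6)
```

Under this assumption the printed values agree with the overall row of Table 7.2, which suggests (but does not prove) that the tabulated intervals were obtained the same way.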
In contrast to the analyses above, which used the full study base, we constructed a case-control sample consisting of all the cases in the study base and controls selected at random according to size matching control selection schemes. The biological, demographic and ocular risk factors identified above were studied, and the disease risk (disease incidence) associated with these factors was estimated. Results from the case-control sample were compared with those from the full study base analysis.

7.4 Results

7.4.1 Study Population

Of the 6357 LALES participants at baseline, 4658 (76%) completed the follow-up examination, and 904 of these 4658 participants had definite diabetes at baseline. A total of 404 diabetic participants with no diabetic retinopathy in either eye at baseline were at risk of developing diabetic retinopathy and therefore constituted the analysis sample in this chapter.

In this study base of 404 participants, 38.6% were males and 61.4% were females. At baseline, 23.8% were aged 40-49 years, 36.1% were 50-59, 27.5% were 60-69, 10.4% were 70-79, and only 2.2% were 80 or older. Most were newly diagnosed or had a short history of diabetes at baseline (34.4% newly diagnosed, 30.7% with a history of 4 years or less), and 41.6% had glycosylated hemoglobin between 7% and 9%. Table 7.3 shows the characteristics of the analysis sample at baseline.

7.4.2 Four-year Incidence Rate of Diabetic Retinopathy Estimation

We constructed case-control samples from the analysis sample using 1:1 Size Matching. The case-only (CO) estimator and the Rao-Blackwellized (RB) estimator were used to estimate the baseline odds coefficient and its variance. Odds ratios were estimated using conditional logistic regression. For comparison with the 1:1 Size Matching sample, the full study base was also analyzed with unconditional logistic regression.
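The construction of the case-control sample can be sketched as follows. This is only an illustration under a stated assumption: it treats 1:1 Size Matching as keeping every case and drawing an equal number of controls by simple random sampling without replacement from the non-cases. The exact scheme is defined elsewhere in the thesis, and the function name and integer IDs here are hypothetical.

```python
import random

def draw_case_control_sample(cases, noncases, controls_per_case=1, seed=None):
    """Keep every case; draw controls at random, without replacement, from the non-cases."""
    rng = random.Random(seed)
    n_controls = controls_per_case * len(cases)
    if n_controls > len(noncases):
        raise ValueError("not enough non-cases to sample from")
    controls = rng.sample(noncases, n_controls)
    return list(cases), controls

# Hypothetical IDs standing in for the 114 incident cases and 290 non-cases
case_ids = list(range(114))
noncase_ids = list(range(114, 404))
cases, controls = draw_case_control_sample(case_ids, noncase_ids, seed=2018)
print(len(cases), len(controls))  # 114 114
```

The resulting sample has 228 subjects, matching the N reported for the 1:1 Size Matching rows in the tables of this section.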
Table 7.4 shows the estimates of the baseline odds coefficient and the odds ratio coefficient, together with their variances, under the different sampling and estimation methods.

Developed any Diabetic Retinopathy in 4 years

| Characteristics at Baseline | No | Yes | Total |
|---|---|---|---|
| Gender | | | |
| Male | 106 (36.6) | 50 (43.9) | 156 (38.6) |
| Female | 184 (63.4) | 64 (56.1) | 248 (61.4) |
| Age (yrs) | | | |
| Mean ± SD | 58.11 ± SD | 55.18 ± 9.96 | 52.28 ± 9.95 |
| 40-49 | 60 (20.7) | 36 (31.6) | 96 (23.8) |
| 50-59 | 102 (35.2) | 44 (38.6) | 146 (36.1) |
| 60-69 | 89 (30.7) | 22 (19.3) | 111 (27.5) |
| 70-79 | 34 (11.7) | 8 (7.0) | 42 (10.4) |
| 80+ | 5 (1.7) | 4 (3.5) | 9 (2.2) |
| Length of Diabetic History (yrs) | | | |
| Mean ± SD | 4.03 ± 6.60 | 7.21 ± 8.79 | 4.92 ± 7.41 |
| New | 115 (39.7) | 24 (21.0) | 139 (34.4) |
| 1-4 | 90 (31.0) | 34 (29.8) | 124 (30.7) |
| 5-9 | 46 (15.9) | 21 (18.4) | 67 (16.6) |
| 10-14 | 21 (7.2) | 22 (19.3) | 43 (10.6) |
| 15+ | 18 (6.2) | 13 (11.4) | 31 (7.7) |
| Glycosylated Hemoglobin | | | |
| Mean ± SD | 7.88 ± 1.82 | 8.89 ± 2.0 | 8.17 ± 1.93 |
| <7.0% | 97 (33.7) | 21 (18.9) | 118 (29.6) |
| 7.0%-9.0% | 123 (42.7) | 43 (38.7) | 166 (41.6) |
| >9.0% | 68 (23.6) | 47 (42.3) | 115 (28.8) |
| Total | 290 (71.8) | 114 (28.2) | 404 |

Table 7.3: Characteristics of Diabetic Participants at Risk for Diabetic Retinopathy at Baseline in the Analysis Sample (n (%))

For disease risk estimation associated with gender, age, length of diabetic history, and glycosylated hemoglobin, Table 7.4 shows that, for all four exposure variables, the estimates of the baseline odds coefficient from CO and RB were very close to those from the full study base. The variances of the baseline odds coefficient from the CO and RB methods were slightly larger than the variance estimated from the full study base. Considering the much smaller size of the case-control sample compared with the full study base, the benefit of the CO and RB methods is clear.
| Exposure | Sampling Method | Estimation Method | N | n cases | n controls | Baseline odds coef. | Var | exp(coef.) | Odds ratio coef. | Var | OR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | Full Sample | UL | 404 | 114 | 290 | -1.0561 | 0.0211 | 0.3478 | 0.3046 | 0.0505 | 1.3561 |
| | 1:1 Size Matching | CO | 228 | 114 | 114 | -1.0556 | 0.0223 | 0.3480 | 0.3109 | 0.0734 | 1.3646 |
| | 1:1 Size Matching | RB | 228 | 114 | 114 | -1.0556 | 0.0237 | 0.3480 | 0.3109 | 0.0734 | 1.3646 |
| Age (yrs) | Full Sample | UL | 404 | 114 | 290 | -0.9542 | 0.0127 | 0.3851 | -0.3066 | 0.0134 | 0.7360 |
| | 1:1 Size Matching | CO | 228 | 114 | 114 | -0.9472 | 0.0134 | 0.3878 | -0.2947 | 0.0184 | 0.7447 |
| | 1:1 Size Matching | RB | 228 | 114 | 114 | -0.9515 | 0.0128 | 0.3862 | -0.2947 | 0.0184 | 0.7447 |
| Length of Diabetic History (yrs) | Full Sample | UL | 404 | 114 | 290 | -0.9636 | 0.0128 | 0.3815 | 0.4062 | 0.0135 | 1.5012 |
| | 1:1 Size Matching | CO | 228 | 114 | 114 | -0.9759 | 0.0144 | 0.3769 | 0.4917 | 0.0283 | 1.6351 |
| | 1:1 Size Matching | RB | 228 | 114 | 114 | -0.9615 | 0.2304 | 0.3823 | 0.4917 | 0.0283 | 1.6351 |
| Glycosylated Hemoglobin | Full Sample | UL | 404 | 114 | 290 | -0.9856 | 0.0134 | 0.3732 | 0.5145 | 0.0126 | 1.6728 |
| | 1:1 Size Matching | CO | 228 | 114 | 114 | -0.9909 | 0.0152 | 0.3712 | 0.5293 | 0.0200 | 1.6977 |
| | 1:1 Size Matching | RB | 228 | 114 | 114 | -0.9840 | 0.0172 | 0.3738 | 0.5293 | 0.0200 | 1.6977 |

Table 7.4: Point and Variance Estimation of Baseline Odds Coefficient in the Analysis Sample

Tables 7.5 to 7.8 report the estimated four-year cumulative incidence rate of DR associated with different levels of each exposure variable. The estimated four-year cumulative incidence rate of DR in females was 25.81% (95% CI, 20.62% - 31.80%) from the CO method. The estimated rate in males was higher, 32.20% (95% CI, 20.57% - 46.55%) from the CO method. Age, length of diabetic history, and glycosylated hemoglobin were each modeled as a standardized continuous variable, and several meaningful values were selected at which to estimate the associated four-year cumulative incidence rates. For example, at a baseline age of 45 years, the estimated four-year cumulative incidence rate of DR was 35.82% (95% CI, 27.24% - 45.40%) from the CO method. As age increases, the associated risk decreases.
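The link between the baseline odds coefficient in Table 7.4 and the incidence reported for the reference gender level (females) can be illustrated with a short calculation. The sketch below assumes that risk is obtained as odds/(1 + odds) and that the interval is a Wald interval built on the log-odds scale and then back-transformed; the source does not state the interval construction, and the helper names are hypothetical.

```python
import math

def expit(x):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def risk_with_ci(coef, var, z=1.96):
    """Risk and back-transformed Wald interval from a log-odds estimate and its variance."""
    half_width = z * math.sqrt(var)
    return expit(coef), expit(coef - half_width), expit(coef + half_width)

# CO estimate of the baseline odds coefficient for gender (Table 7.4)
risk, lower, upper = risk_with_ci(-1.0556, 0.0223)
print(f"{100 * risk:.2f}% ({100 * lower:.2f}%, {100 * upper:.2f}%)")
```

With the rounded inputs from Table 7.4, the result agrees with the CO figures quoted for females (25.81%, 20.62% to 31.80%) to within rounding. The same calculation for males would also need the covariance between the two coefficients, which is not tabulated.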
At a baseline age of 75 years, the estimated cumulative incidence was 18.66% (95% CI, 11.95% - 27.95%) from the CO method. This finding is consistent with previously published incidence rates from LALES; the estimates were very close. The estimated 4-year incidence rate of DR for a newly diagnosed diabetic participant was 21.37% (95% CI, 16.47% - 27.26%) from the CO method, whereas for a participant with a 15-year history of diabetes the estimated rate was 42.38% (95% CI, 30.72% - 54.96%). For a participant with glycosylated hemoglobin of 7% at baseline, the estimated 4-year incidence of DR was 21.22% (95% CI, 16.72% - 26.56%) from the CO method. The estimates from the RB method were very close to those from the CO method. All the results (see tables below) were consistent with previous publications and with the full study sample analysis, which further demonstrates the unbiasedness and efficiency of the estimators.

| Gender | Sampling Method | Estimation Method | N | Estimated 4-year Incidence of DR | 95% CI Lower Limit | 95% CI Upper Limit |
|---|---|---|---|---|---|---|
| Female | Full Sample | UL | 404 | 25.81% | 20.74% | 31.61% |
| | 1:1 Size Matching | CO | 228 | 25.81% | 20.62% | 31.80% |
| | 1:1 Size Matching | RB | 228 | 25.81% | 20.47% | 31.99% |
| Male | Full Sample | UL | 404 | 32.05% | 21.83% | 44.35% |
| | 1:1 Size Matching | CO | 228 | 32.20% | 20.57% | 46.55% |
| | 1:1 Size Matching | RB | 228 | 32.20% | 20.50% | 46.66% |

Table 7.5: Estimated 4-year Incidence Rate of Diabetic Retinopathy by Gender

| Age (yrs) | Sampling Method | Estimation Method | N | Estimated 4-year Incidence of DR | 95% CI Lower Limit | 95% CI Upper Limit |
|---|---|---|---|---|---|---|
| 45 | Full Sample | UL | 404 | 35.99% | 28.24% | 44.54% |
| | 1:1 Size Matching | CO | 228 | 35.82% | 27.24% | 45.40% |
| | 1:1 Size Matching | RB | 228 | 35.72% | 27.21% | 45.23% |
| 55 | Full Sample | UL | 404 | 29.24% | 24.77% | 34.14% |
| | 1:1 Size Matching | CO | 228 | 29.33% | 24.71% | 34.41% |
| | 1:1 Size Matching | RB | 228 | 29.24% | 24.72% | 34.21% |
| 65 | Full Sample | UL | 404 | 23.29% | 18.63% | 28.71% |
| | 1:1 Size Matching | CO | 228 | 23.58% | 18.51% | 29.54% |
| | 1:1 Size Matching | RB | 228 | 23.50% | 18.50% | 29.38% |
| 75 | Full Sample | UL | 404 | 18.24% | 12.34% | 26.12% |
| | 1:1 Size Matching | CO | 228 | 18.66% | 11.95% | 27.95% |
| | 1:1 Size Matching | RB | 228 | 18.60% | 11.92% | 27.83% |

Table 7.6: Estimated 4-year Incidence Rate of Diabetic Retinopathy at Different Ages at Baseline

| Length of Diabetic History (yrs) | Sampling Method | Estimation Method | N | Estimated 4-year Incidence of DR | 95% CI Lower Limit | 95% CI Upper Limit |
|---|---|---|---|---|---|---|
| New | Full Sample | UL | 404 | 22.56% | 18.21% | 27.59% |
| | 1:1 Size Matching | CO | 228 | 21.37% | 16.47% | 27.26% |
| | 1:1 Size Matching | RB | 228 | 21.61% | 9.50% | 42.01% |
| 2 | Full Sample | UL | 404 | 24.53% | 20.37% | 29.23% |
| | 1:1 Size Matching | CO | 228 | 23.69% | 19.18% | 28.88% |
| | 1:1 Size Matching | RB | 228 | 23.95% | 10.86% | 44.88% |
| 4 | Full Sample | UL | 404 | 26.62% | 22.48% | 31.21% |
| | 1:1 Size Matching | CO | 228 | 26.17% | 21.83% | 31.04% |
| | 1:1 Size Matching | RB | 228 | 26.45% | 12.30% | 47.98% |
| 8 | Full Sample | UL | 404 | 31.11% | 26.19% | 36.51% |
| | 1:1 Size Matching | CO | 228 | 31.61% | 26.04% | 37.76% |
| | 1:1 Size Matching | RB | 228 | 31.92% | 15.34% | 54.82% |
| 12 | Full Sample | UL | 404 | 36.00% | 29.18% | 43.42% |
| | 1:1 Size Matching | CO | 228 | 37.61% | 28.92% | 47.17% |
| | 1:1 Size Matching | RB | 228 | 37.95% | 18.48% | 62.25% |
| 15 | Full Sample | UL | 404 | 39.87% | 31.17% | 49.25% |
| | 1:1 Size Matching | CO | 228 | 42.38% | 30.72% | 54.96% |
| | 1:1 Size Matching | RB | 228 | 42.73% | 20.83% | 67.91% |

Table 7.7: Estimated 4-year Incidence Rate of Diabetic Retinopathy at Different Lengths of Diabetic History at Baseline

| Glycosylated Hemoglobin | Sampling Method | Estimation Method | N | Estimated 4-year Incidence of DR | 95% CI Lower Limit | 95% CI Upper Limit |
|---|---|---|---|---|---|---|
| 7% | Full Sample | UL | 404 | 21.46% | 17.36% | 26.23% |
| | 1:1 Size Matching | CO | 228 | 21.22% | 16.72% | 26.56% |
| | 1:1 Size Matching | RB | 228 | 21.34% | 16.64% | 26.94% |
| 8% | Full Sample | UL | 404 | 26.30% | 22.13% | 30.94% |
| | 1:1 Size Matching | CO | 228 | 26.17% | 21.76% | 31.13% |
| | 1:1 Size Matching | RB | 228 | 26.31% | 21.62% | 31.60% |
| 9% | Full Sample | UL | 404 | 31.79% | 26.71% | 37.34% |
| | 1:1 Size Matching | CO | 228 | 31.81% | 26.27% | 37.93% |
| | 1:1 Size Matching | RB | 228 | 31.96% | 26.14% | 38.41% |
| 10% | Full Sample | UL | 404 | 37.84% | 30.90% | 45.31% |
| | 1:1 Size Matching | CO | 228 | 38.04% | 30.04% | 46.75% |
| | 1:1 Size Matching | RB | 228 | 38.21% | 29.97% | 47.18% |
| 12% | Full Sample | UL | 404 | 50.93% | 38.81% | 62.95% |
| | 1:1 Size Matching | CO | 228 | 51.53% | 36.80% | 66.01% |
| | 1:1 Size Matching | RB | 228 | 51.71% | 36.82% | 66.30% |

Table 7.8: Estimated Four-year Incidence Rate of Diabetic Retinopathy at Different Hemoglobin A1c Levels at Baseline

Chapter 8

Future Directions

8.1 Future work

In this work, I derived, benchmarked and applied two disease risk estimation methods for case-control studies. These methods worked well in simulated scenarios and in two real datasets arising from cross-sectional and longitudinal epidemiological studies. In the process of formulating the ideas, developing the methods, implementing the software and analyzing the real data, I also identified several directions for improving these methods.

Developing point estimator and variance estimator for additional case sampling. In the current work, the estimators were based on taking all the cases in the study base and sampling the controls. They worked well when the study base contained a moderate number of cases. When the study base becomes very large, however, the number of cases may become too large, which diminishes the advantages of a case-control sample design. Case sampling is therefore necessary and will extend the methods to more real study settings.

Improving method performance in more extreme scenarios. The simulation study showed that some extreme situations, such as low availability of case samples, are currently challenging for the proposed methods. One potential way to address this issue is to use resampling techniques, such as the bootstrap, to generate an empirically derived mock trial with a larger study base and a larger number of cases. Unbiased point inference could be based on this larger trial, which would allow us to estimate disease risks that could not otherwise be estimated. We could also adjust the confidence intervals derived from this augmented study base back to the real study base using the asymptotic relationship between sample size and variance.

Applying methods in more general datasets.
Although we proposed the methods for disease risk estimation with a case-control design, they can in fact be applied to any generic case-control type of data derived from observational studies. This has significant implications, because it means the methods could be applied generally to biomedical observational data and to similar data from other fields, including internet surveys, business analytics, and government data released through agencies' open data initiatives. The only requirement is that the exposure variable can be modeled and delineated. In the future, I will actively seek to apply the methods to other fields of study.

Implementing user friendly open source software. One thing that would help to popularize these methods is user-friendly statistical software with carefully documented use cases. I am implementing these methods in an open source R package, with detailed documentation and active demos. The preliminary code is already available through a Bitbucket open source repository. The plan is to complete the coding and documentation and to release this software package to statistical researchers through a formal publication in the near future.

Other improvements. There are other important aspects, in particular long-term ones, whose solutions require additional dedicated theoretical research. These include how to extend the methods to multivariate settings, to non-logistic models, and to exposures with free-form distributions. These improvements, if made, will enable the proposed methods to accommodate more flexible data schemes and modeling techniques.

Bibliography

[1] O. O. Aalen and S. Johansen. An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scandinavian Journal of Statistics, pages 141-150, 1978.
[2] W. Ahrens and I. Pigeot. Handbook of Epidemiology. Springer London, Limited, 2005.
[3] J. Ananijevic-Pandey, M. Jarebinski, B. Kastratovic, H. Vlajinac, Z. Radojkovic, and D. Brankovic.
Case-control study of congenital malformations. European Journal of Epidemiology, 8(6):871-874, Nov. 1992.
[4] P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer Science & Business Media, 2012.
[5] P. Armitage and T. Colton. Encyclopedia of Biostatistics: 8-Volume Set. Wiley, Apr. 2005.
[6] J. Benichou and M. H. Gail. Estimates of absolute cause-specific risk in cohort studies. Biometrics, pages 813-826, 1990.
[7] J. Benichou and M. H. Gail. Variance calculations and confidence intervals for estimates of the attributable risk based on logistic models. Biometrics, pages 991-1003, 1990.
[8] J. Benichou and M. H. Gail. Methods of inference for estimates of absolute risk derived from population-based case-control studies. Biometrics, 51(1):182-194, Mar. 1995.
[9] M. S. Borchert, R. Varma, S. A. Cotter, K. Tarczy-Hornoch, R. McKean-Cowdin, J. H. Lin, G. Wen, S. P. Azen, M. Torres, J. M. Tielsch, et al. Risk factors for hyperopia and myopia in preschool children: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(10):1966-1973, 2011.
[10] M. S. Borchert, R. Varma, S. A. Cotter, K. Tarczy-Hornoch, R. McKean-Cowdin, J. H. Lin, G. Wen, S. P. Azen, M. Torres, J. M. Tielsch, D. S. Friedman, M. X. Repka, J. Katz, J. Ibironke, and L. Giordano. Risk factors for hyperopia and myopia in preschool children: The Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(10):1966-1973, Oct. 2011.
[11] O. Borgan and B. Langholz. Nonparametric estimation of relative mortality from nested case-control studies. Biometrics, 49(2):593-602, June 1993.
[12] N. E. Breslow, N. E. Day, and W. Davis.
Statistical Methods in Cancer Research. Vol. 1, The Analysis of Case-Control Studies. N. E. Breslow & N. E. Day; technical editor for IARC, W. Davis. http://apps.who.int/iris/handle/10665/38121, 1980. 338 p.
[13] P. Bruzzi, S. B. Green, D. P. Byar, L. A. Brinton, and C. Schairer. Estimating the population attributable risk for multiple risk factors using case-control data. American Journal of Epidemiology, 122(5):904-914, 1985.
[14] B. Kestenbaum. Epidemiology and Biostatistics: An Introduction to Clinical Research. Springer, 2009.
[15] A. Chaudhuri and H. Stenger. Survey Sampling: Theory and Methods, Second Edition. CRC Press, Dec. 2010.
[16] W. G. Cochran. Sampling Techniques. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 3rd edition, 1977.
[17] S. Cotter, R. Varma, K. Tarczy-Hornoch, R. McKean-Cowdin, J. Lin, G. Wen, J. Wei, M. Borchert, S. Azen, M. Torres, et al. Joint Writing Committee for the Multi-Ethnic Pediatric Eye Disease Study and the Baltimore Pediatric Eye Disease Study Groups. Risk factors associated with childhood strabismus: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2251-2261, 2011.
[18] S. A. Cotter, R. Varma, K. Tarczy-Hornoch, R. McKean-Cowdin, J. Lin, G. Wen, J. Wei, M. Borchert, S. P. Azen, M. Torres, et al. Risk factors associated with childhood strabismus: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2251-2261, 2011.
[19] S. A. Cotter, R. Varma, K. Tarczy-Hornoch, R. McKean-Cowdin, J. Lin, G. Wen, J. Wei, M. Borchert, S. P. Azen, M. Torres, J. M. Tielsch, D. S. Friedman, M. X. Repka, J. Katz, J. Ibironke, and L. Giordano. Risk factors associated with childhood strabismus: The Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2251-2261, Nov. 2011.
[20] R. Doll and A. B. Hill. Smoking and carcinoma of the lung.
BMJ, 2(4682):739-748, Sept. 1950.
[21] M. M. Dorsch, R. K. R. Scragg, A. J. McMichael, P. A. Baghurst, and K. F. Dyer. Congenital malformations and maternal drinking water supply in rural South Australia: A case-control study. American Journal of Epidemiology, 119(4):473-486, Apr. 1984.
[22] W. D. Dupont. Converting relative risks to absolute risks: a graphical approach. Statistics in Medicine, 8(6):641-651, 1989.
[23] A. Fozailoff, K. Tarczy-Hornoch, S. Cotter, G. Wen, J. Lin, M. Borchert, S. Azen, R. Varma, W. C. for the MEPEDS Study Group, et al. Prevalence of astigmatism in 6- to 72-month-old African American and Hispanic children: the Multi-Ethnic Pediatric Eye Disease Study. Ophthalmology, 118(2):284-293, 2011.
[24] M. H. Gail, L. A. Brinton, D. P. Byar, D. K. Corle, S. B. Green, C. Schairer, and J. J. Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. JNCI: Journal of the National Cancer Institute, 81(24):1879-1886, 1989.
[25] M. H. Gail and R. M. Pfeiffer. On criteria for evaluating models of absolute risk. Biostatistics, 6(2):227-239, 2005.
[26] M. J. Goff, A. W. Suhr, J. A. Ward, J. K. Croley, and M. A. O'Hara. Effect of adult strabismus on ratings of official U.S. Army photographs. 10(5):400-403.
[27] R. J. Gray. A class of k-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics, pages 1141-1154, 1988.
[28] M.-E. P. E. D. S. Group et al. Prevalence of myopia and hyperopia in 6- to 72-month-old African American and Hispanic children: the Multi-Ethnic Pediatric Eye Disease Study. Ophthalmology, 117(1):140-147, 2010.
[29] B. P. Hammond, G. Wen, R. Varma, and K. Tarczy-Hornoch. Anisometropic refractive errors occur more frequently in left eyes than right. Journal of American Association for Pediatric Ophthalmology and Strabismus {JAAPOS}, 16(1):e17, 2012.
[30] A. L. Herbst, H. Ulfelder, and D. C. Poskanzer. Adenocarcinoma of the vagina.
New England Journal of Medicine, 284(16):878-881, 1971. PMID: 5549830.
[31] C. Hertzman, M. Wiens, D. Bowering, B. Snow, and D. Calne. Parkinson's disease: A case-control study of occupational and environmental risk factors. American Journal of Industrial Medicine, 17(3):349-355, 1990.
[32] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663-685, Dec. 1952.
[33] D. W. Hosmer, S. Lemeshow, and S. May. Model development. Applied Survival Analysis: Regression Modeling of Time-to-Event Data, Second Edition, pages 132-168, 2008.
[34] S. Jackson, R. A. Harrad, M. Morris, and N. Rumsey. The psychosocial benefits of corrective surgery for adults with strabismus. 90(7):883-888.
[35] J. J. Schlesselman. Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, Jan. 1982.
[36] H. A. Johns, R. E. Manny, K. D. Fern, and Y.-S. Hu. The effect of strabismus on a young child's selection of a playmate. 25(5):400-407.
[37] J. S. Kim, G. Wen, K. Tarczy-Hornoch, R. McKean-Cowdin, M. Borchert, S. Cotter, M. Torres, R. Varma, M.-E. P. E. D. S. Group, et al. Racial/ethnic differences in axial length in preschool children: Multi-Ethnic Pediatric Eye Disease Study (MEPEDS). Investigative Ophthalmology & Visual Science, 52(14):6329-6329, 2011.
[38] E. L. Korn and F. J. Dorey. Applications of crude incidence curves. Statistics in Medicine, 11(6):813-829, 1992.
[39] B. Langholz and O. Borgan. Estimation of absolute risk from nested case-control data. Biometrics, 53(2):767, June 1997.
[40] M. L. Levin and H. Goldstein. Cancer and tobacco smoking: A preliminary report. Journal of the American Medical Association, 143(4):336-338, May 1950.
[41] H. H. Liou, M. C. Tsai, C. J. Chen, J. S. Jeng, Y. C. Chang, S. Y. Chen, and R. C. Chen. Environmental risk factors and Parkinson's disease: a case-control study in Taiwan. Neurology, 48(6):1583-1588, June 1997.
[42] N. Mantel and W.
Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4):719-748, 1959.
[43] N. Mantel and W. Haenszel. Statistical aspects of the analysis of data from retrospective studies of disease. The Challenge of Epidemiology: Issues and Selected Readings, 1(1):533-553, 2004.
[44] D. W. Mapel and J. S. Hurley. Health care utilization in chronic obstructive pulmonary disease: A case-control study in a health maintenance organization. Archives of Internal Medicine, 160(17):2653-2658, Sept. 2000.
[45] P. Matthews. Covering problems for Brownian motion on spheres. The Annals of Probability, pages 189-199, 1988.
[46] K. Mazhar, R. Varma, F. Choudhury, R. McKean-Cowdin, C. J. Shtir, and S. P. Azen. Severity of diabetic retinopathy and health-related quality of life. 118(4):649-655.
[47] H. McKean. The design and the analysis of scientific experiments, 1968.
[48] R. McKean-Cowdin, S. A. Cotter, K. Tarczy-Hornoch, G. Wen, J. Kim, M. Borchert, R. Varma, M.-E. P. E. D. S. Group, et al. Prevalence of amblyopia or strabismus in Asian and non-Hispanic white preschool children: Multi-Ethnic Pediatric Eye Disease Study. Ophthalmology, 120(10):2117-2124, 2013.
[49] R. McKean-Cowdin, R. Varma, S. Cotter, K. Tarczy-Hornoch, M. Borchert, J. Lin, G. Wen, S. Azen, M. Torres, J. Tielsch, et al. Joint Writing Committee for the Multi-Ethnic Pediatric Eye Disease Study and the Baltimore Pediatric Eye Disease Study Groups. Risk factors for astigmatism in preschool children: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(10):1974-81, 2011.
[50] R. McKean-Cowdin, R. Varma, S. A. Cotter, K. Tarczy-Hornoch, M. S. Borchert, J. H. Lin, G. Wen, S. P. Azen, M. Torres, J. M. Tielsch, et al. Risk factors for astigmatism in preschool children: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(10):1974-1981, 2011.
[51] R.
McKean-Cowdin, R. Varma, S. A. Cotter, K. Tarczy-Hornoch, M. S. Borchert, J. H. Lin, G. Wen, S. P. Azen, M. Torres, J. M. Tielsch, D. S. Friedman, M. X. Repka, J. Katz, J. Ibironke, and L. Giordano. Risk factors for astigmatism in preschool children: The Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(10):1974-1981, Oct. 2011.
[52] R. McKean-Cowdin, R. Varma, G. Wen, K. Tarczy-Hornoch, S. Cotter, M. S. Borchert, J. M. Tielsch, D. S. Friedman, M. X. Repka, and J. Katz. Risk factors for astigmatism in a population-based study of children. The Multiethnic Pediatric Eye Disease and the Baltimore Pediatric Eye Disease Studies. Investigative Ophthalmology & Visual Science, 52(14):3056-3056, 2011.
[53] R. McKean-Cowdin, G. Wen, C. Hsu, M. Torres, R. Klein, S. P. Azen, and R. Varma. Prevalence of diabetic retinopathy in adult Chinese Americans: The Chinese American Eye Study (CHES). Investigative Ophthalmology & Visual Science, 55(13):5340-5340, 2014.
[54] L. McKusick, W. Horstman, and T. J. Coates. AIDS and sexual behavior reported by gay men in San Francisco. American Journal of Public Health, 75(5):493-496, May 1985. PMID: 3985236 PMCID: PMC1646285.
[55] O. Miettinen. Estimability and estimation in case-referent studies. American Journal of Epidemiology, 103(2):226-235, 1976.
[56] O. S. Miettinen. Proportion of disease caused or prevented by a given exposure, trait or intervention. American Journal of Epidemiology, 99(5):325-332, 1974.
[57] S. M. Mojon-Azzi and D. S. Mojon. Strabismus and employment: the opinion of headhunters. 87(7):784-788.
[58] S. M. Mojon-Azzi, W. Potnik, and D. S. Mojon. Opinions of dating agents about strabismic subjects' ability to find a partner. 92(6):765-769.
[59] B. A. Nelson, K. B. Gunton, J. N. Lasker, L. B. Nelson, and L. A. Drohan. The psychosocial aspects of strabismus in teenagers and adults and the impact of surgical correction. 12(1):72-76.e1.
[60] N. Paneth, E. Susser, and M.
Susser. Origins and early development of the case-control study: part 1, early evolution. Sozial- und Präventivmedizin, 47(5):282-288, Nov. 2002.
[61] A. Poonawala, S. P. Nair, and P. J. Thuluvath. Prevalence of obesity and diabetes in patients with cryptogenic cirrhosis: A case-control study. Hepatology, 32(4):689-692, 2000.
[62] A. S. Robbins, S. Y. Chao, and V. P. Fonseca. What's the relative risk? A method to directly estimate risk ratios in cohort studies of common outcomes. Annals of Epidemiology, 12(7):452-454, Oct. 2002.
[63] K. J. Rothman. Modern Epidemiology. Lippincott Williams & Wilkins, 2008.
[64] C. O. Schmidt and T. Kohlmann. When to use the odds ratio or the relative risk? International Journal of Public Health, 53(3):165-167, June 2008.
[65] R. Schrek, L. A. Baker, G. P. Ballard, and S. Dolgoff. Tobacco smoking as an etiologic factor in disease. I. Cancer. Cancer Research, 10(1):49-58, Jan. 1950. PMID: 15398042.
[66] L. E. Sever, E. S. Gilbert, N. A. Hessol, and J. M. McIntyre. A case-control study of congenital malformations and occupational exposure to low-level ionizing radiation. American Journal of Epidemiology, 127(2):226-242, Feb. 1988.
[67] K. N. Shands, G. P. Schmid, B. B. Dan, D. Blum, R. J. Guidotti, N. T. Hargrett, R. L. Anderson, D. L. Hill, C. V. Broome, J. D. Band, and D. W. Fraser. Toxic-shock syndrome in menstruating women. New England Journal of Medicine, 303(25):1436-1442, Dec. 1980.
[68] D. C. Smith, R. Prentice, D. J. Thompson, and W. L. Herrmann. Association of exogenous estrogen and endometrial carcinoma. New England Journal of Medicine, 293(23):1164-1167, Dec. 1975.
[69] K. Tarczy-Hornoch, S. A. Cotter, M. Borchert, R. McKean-Cowdin, J. Lin, G. Wen, J. Kim, R. Varma, M.-E. P. E. D. S. Group, et al. Prevalence and causes of visual impairment in Asian and non-Hispanic white preschool children: Multi-Ethnic Pediatric Eye Disease Study. Ophthalmology, 120(6):1220-1226, 2013.
[70] K. Tarczy-Hornoch, R. Varma, S. Cotter, R.
McKean-Cowdin, J. Lin, M. Borchert, M. Torres, G. Wen, S. Azen, J. Tielsch, et al. Joint Writing Committee for the Multi-Ethnic Pediatric Eye Disease Study and the Baltimore Pediatric Eye Disease Study Groups. Risk factors for decreased visual acuity in preschool children: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2262-2273, 2011.
[71] K. Tarczy-Hornoch, R. Varma, S. A. Cotter, R. McKean-Cowdin, J. H. Lin, M. S. Borchert, M. Torres, G. Wen, S. P. Azen, J. M. Tielsch, et al. Risk factors for decreased visual acuity in preschool children: the Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2262-2273, 2011.
[72] K. Tarczy-Hornoch, R. Varma, S. A. Cotter, R. McKean-Cowdin, J. H. Lin, M. S. Borchert, M. Torres, G. Wen, S. P. Azen, J. M. Tielsch, D. S. Friedman, M. X. Repka, J. Katz, J. Ibironke, and L. Giordano. Risk factors for decreased visual acuity in preschool children: The Multi-Ethnic Pediatric Eye Disease and Baltimore Pediatric Eye Disease Studies. Ophthalmology, 118(11):2262-2273, Nov. 2011.
[73] C. L. Trotter, N. J. Andrews, E. B. Kaczmarski, E. Miller, and M. E. Ramsay. Effectiveness of meningococcal serogroup C conjugate vaccine 4 years after introduction. The Lancet, 364(9431):365-367, July 2004.
[74] I. A. F. van der Mei. Past exposure to sun, skin phenotype, and risk of multiple sclerosis: case-control study. BMJ, 327(7410):316-0, Aug. 2003.
[75] R. Varma. Diabetic retinopathy: Challenges and future directions. 141(3):539-541.
[76] R. Varma. The Los Angeles Latino Eye Study: design, methods, and baseline data. 111(6):1121-1131.
[77] R. Varma, F. Choudhury, R. Klein, J. Chung, M. Torres, and S. P. Azen. Four-year incidence and progression of diabetic retinopathy and macular edema: The Los Angeles Latino Eye Study. 149(5):752-761.e3.
[78] R. Varma, J. Deneen, S. Cotter, S. H. Paz, S. P. Azen, K. Tarczy-Hornoch, P.
Zhao, and Multi-Ethnic Pediatric Eye Disease Study Group. The Multi-Ethnic Pediatric Eye Disease Study: design and methods. Ophthalmic Epidemiology, 13(4):253-262, Aug. 2006. PMID: 16877284.
[79] R. Varma, J. S. Kim, B. S. Burkemper, G. Wen, M. Torres, C. Hsu, F. Choudhury, S. P. Azen, and R. McKean-Cowdin. Prevalence and causes of visual impairment and blindness in Chinese American adults: the Chinese American Eye Study. JAMA Ophthalmology, 134(7):785-793, 2016.
[80] R. Varma, G. L. Macias, M. Torres, R. Klein, F. Y. Peña, and S. P. Azen. Biologic risk factors associated with diabetic retinopathy. 114(7):1332-1340.
[81] R. Varma, M. Torres, F. Peña, R. Klein, and S. P. Azen. Prevalence of diabetic retinopathy in adult Latinos. 111(7):1298-1306.
[82] R. Varma, G. Wen, X. Jiang, C. Hsu, M. Torres, R. Klein, S. P. Azen, and R. McKean-Cowdin. Prevalence of diabetic retinopathy in adult Chinese American individuals: The Chinese American Eye Study. JAMA Ophthalmology, 134(5):563-569, 2016.
[83] G. W. Aspirin and Reye's syndrome. American Journal of Diseases of Children, 136(11):971-972, Nov. 1982.
[84] G. Wen, R. McKean, S. Azen, K. Tarczy-Hornoch, S. Cotter, M. Borchert, M. Torres, R. Varma, M. Group, et al. Health-related quality of life in preschool children with amblyopia and strabismus. Investigative Ophthalmology & Visual Science, 50(13):4686-4686, 2009.
[85] G. Wen, R. McKean-Cowdin, K. Tarczy-Hornoch, S. A. Cotter, M. S. Borchert, M. Torres, S. P. Azen, R. Varma, M. ethnic Pediatric Eye Disease Study Group, et al. The association of axial length with age and gender in preschool children. Investigative Ophthalmology & Visual Science, 52(14):6325-6325, 2011.
[86] G. Wen, R. McKean-Cowdin, R. Varma, K. Tarczy-Hornoch, S. A. Cotter, M. Borchert, S. Azen, M. ethnic Pediatric Eye Disease Study Group, et al. General health-related quality of life in preschool children with strabismus or amblyopia. Ophthalmology, 118(3):574-580, 2011.
[87] G. Wen, K.
Tarczy-Hornoch, R. McKean-Cowdin, S. A. Cotter, M. Borchert, J. Lin, J. Kim, R. Varma, M.-E. P. E. D. S. Group, et al. Prevalence of myopia, hyperopia, and astig- matism in non-hispanic white and asian children: multi-ethnic pediatric eye disease study. Ophthalmology, 120(10):2109{2116, 2013. [88] H. Whitehead. Report for the st. james parish cholera inquiry committee. Technical report, London: Churchill, 1855. [89] B. Woolf et al. On estimating the relation between blood group and disease. Ann Hum Genet, 19(4):251{253, 1955. [90] G. E. WYNDER EL. Tobacco smoking as a possible etiologic factor in bronchiogenic carci- noma: A study of six hundred and eighty-four proved cases. Journal of the American Medical Association, 143(4):329{336, May 1950. [91] S. Yusuf, S. Hawken, S. unpuu, L. Bautista, M. G. Franzosi, P. Commerford, C. C. Lang, Z. Rumboldt, C. L. Onen, L. Lisheng, S. Tanomsup, P. Wangai Jr, F. Razak, A. M. Sharma, and S. S. Anand. Obesity and the risk of myocardial infarction in 27000 participants from 52 countries: a case-control study. The Lancet, 366(9497):1640{1649, 2005. [92] Y. K. Zhang J. What's the relative risk?: A method of correcting the odds ratio in cohort studies of common outcomes. JAMA, 280(19):1690{1691, Nov. 1998. 
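Appendix A

Detailed Derivation of the Estimated Variance of $Y_{\tilde R}$

A.1 Estimate the Variance of $Y_{\tilde R}$

$$
\widehat{\mathrm{var}}[Y_{\tilde R}] = \widehat{\mathrm{var}}[Y_D] - \sum_{d \subset \tilde R} p_{d|\tilde R}\,(Y_d - Y_{\tilde R})^2
$$

We have

$$
p_{d|\tilde R} = \frac{\rho_d\,\pi(\tilde R|d)}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)}, \qquad
Y_d = \sum_{i \in d} \frac{Y_i}{p_i}, \qquad
Y_{\tilde R} = \sum_{i \in \tilde R} \frac{Y_i}{q_i}\, w_{i,\tilde R},
$$

where

$$
w_{i,\tilde R} = \frac{\sum_{d \subset \tilde R \setminus \{i\}} \rho_d\,\pi(\tilde R|d \cup \{i\})}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} .
$$

Throughout, $\rho_i = p_i/q_i$, so that $\rho_{d \cup \{i\}} = \rho_d\,\rho_i$; this is the source of the factor $p_i/q_i$ that appears when the order of summation is exchanged.

Then

$$
\begin{aligned}
\sum_{d \subset \tilde R} p_{d|\tilde R}\,(Y_d)^2
&= \Bigg( \sum_{d \subset \tilde R} \rho_d\,\pi(\tilde R|d) \Bigg( \sum_{i \in d} \Big(\frac{Y_i}{p_i}\Big)^2 + \sum_{i,j \in d:\, i \neq j} \frac{Y_i}{p_i}\frac{Y_j}{p_j} \Bigg) \Bigg) \frac{1}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \\
&= \Bigg[ \sum_{i \in \tilde R} \Big(\frac{Y_i}{p_i}\Big)^2 \Big(\frac{p_i}{q_i}\Big) \sum_{d \subset \tilde R \setminus \{i\}} \rho_d\,\pi(\tilde R|d \cup \{i\}) \\
&\qquad + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{p_i}\frac{Y_j}{p_j}\frac{p_i}{q_i}\frac{p_j}{q_j} \sum_{d \subset \tilde R \setminus \{i,j\}} \rho_d\,\pi(\tilde R|d \cup \{i,j\}) \Bigg] \frac{1}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \\
&= \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i}\Big)^2 \Big(\frac{1}{\rho_i}\Big) \frac{\sum_{d \subset \tilde R \setminus \{i\}} \rho_d\,\pi(\tilde R|d \cup \{i\})}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)}
 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j} \frac{\sum_{d \subset \tilde R \setminus \{i,j\}} \rho_d\,\pi(\tilde R|d \cup \{i,j\})}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \\
&= \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i}\Big)^2 \Big(\frac{1}{\rho_i}\Big) w_{i,\tilde R} + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,j,\tilde R} .
\end{aligned}
$$

And

$$
\begin{aligned}
\sum_{d \subset \tilde R} p_{d|\tilde R}\,(Y_{\tilde R})^2
&= \Bigg[ \sum_{d \subset \tilde R} \rho_d\,\pi(\tilde R|d) \Bigg( \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i} w_{i,\tilde R}\Big)^2 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,\tilde R}\, w_{j,\tilde R} \Bigg) \Bigg] \frac{1}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \\
&= \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i} w_{i,\tilde R}\Big)^2 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,\tilde R}\, w_{j,\tilde R} .
\end{aligned}
$$

And

$$
\begin{aligned}
\sum_{d \subset \tilde R} p_{d|\tilde R}\,(-2\,Y_d\,Y_{\tilde R})
&= \frac{-2}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \sum_{d \subset \tilde R} \rho_d\,\pi(\tilde R|d) \Bigg( \sum_{i \in d} \frac{Y_i}{p_i} \Bigg) \Bigg( \sum_{i \in \tilde R} \frac{Y_i}{q_i} w_{i,\tilde R} \Bigg) \\
&= \frac{-2}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} \Bigg( \sum_{i \in \tilde R} \frac{Y_i}{q_i} w_{i,\tilde R} \Bigg) \Bigg( \sum_{i \in \tilde R} \frac{Y_i}{p_i}\frac{p_i}{q_i} \sum_{d \subset \tilde R \setminus \{i\}} \rho_d\,\pi(\tilde R|d \cup \{i\}) \Bigg) \\
&= (-2) \Bigg[ \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i} w_{i,\tilde R}\Big)^2 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,\tilde R}\, w_{j,\tilde R} \Bigg] .
\end{aligned}
$$

Thus

$$
\begin{aligned}
\sum_{d \subset \tilde R} p_{d|\tilde R}\,(Y_d - Y_{\tilde R})^2
&= \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i}\Big)^2 \Big(\frac{1}{\rho_i}\Big) w_{i,\tilde R} + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,j,\tilde R} \\
&\qquad + \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i} w_{i,\tilde R}\Big)^2 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,\tilde R}\, w_{j,\tilde R} \\
&\qquad + (-2) \Bigg[ \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i} w_{i,\tilde R}\Big)^2 + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j}\, w_{i,\tilde R}\, w_{j,\tilde R} \Bigg] \\
&= \sum_{i \in \tilde R} \Big(\frac{Y_i}{q_i}\Big)^2 \Big[ \Big(\frac{1}{\rho_i}\Big) w_{i,\tilde R} - w_{i,\tilde R}^2 \Big] + \sum_{i,j \in \tilde R:\, i \neq j} \frac{Y_i}{q_i}\frac{Y_j}{q_j} \big[ w_{i,j,\tilde R} - w_{i,\tilde R}\, w_{j,\tilde R} \big],
\end{aligned}
$$

where

$$
w_{i,\tilde R} = \frac{\sum_{d \subset \tilde R \setminus \{i\}} \rho_d\,\pi(\tilde R|d \cup \{i\})}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)}, \qquad
w_{i,j,\tilde R} = \frac{\sum_{d \subset \tilde R \setminus \{i,j\}} \rho_d\,\pi(\tilde R|d \cup \{i,j\})}{\sum_{s \subset \tilde R} \rho_s\,\pi(\tilde R|s)} .
$$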
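The closed form at the end of the derivation can be checked numerically on a toy case-control set. The sketch below is only an illustration under stated assumptions, not the dissertation's implementation: it takes the odds factor of a case set $d$ to be the product of $p_i/q_i$ over $i \in d$ with $q_i = 1 - p_i$, lets $d$ range over every subset of $\tilde R$ (including the empty set), and uses a made-up sampling probability `pi(R, d)`. The function names, the toy values of `Y` and `p`, and `toy_pi` are all hypothetical.

```python
from itertools import combinations
from math import isclose, prod


def subsets(items):
    """Yield every subset of `items` (including the empty set) as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def conditional_variance_direct(Y, p, pi):
    """sum_d p_{d|R} (Y_d - Y_R)^2, enumerating every candidate case set d."""
    R = frozenset(Y)
    rho = {i: p[i] / (1 - p[i]) for i in R}  # assumed odds p_i / q_i
    weight = {d: prod(rho[i] for i in d) * pi(R, d) for d in subsets(R)}
    total = sum(weight.values())
    Y_d = {d: sum(Y[i] / p[i] for i in d) for d in weight}
    Y_R = sum(weight[d] / total * Y_d[d] for d in weight)
    return sum(weight[d] / total * (Y_d[d] - Y_R) ** 2 for d in weight)


def conditional_variance_closed_form(Y, p, pi):
    """Same quantity via the weights w_{i,R} and w_{i,j,R} of the derivation."""
    R = frozenset(Y)
    q = {i: 1 - p[i] for i in R}
    rho = {i: p[i] / q[i] for i in R}

    def rho_set(d):
        return prod(rho[i] for i in d)

    total = sum(rho_set(s) * pi(R, s) for s in subsets(R))

    def w(fixed):
        # Sum over d in R \ fixed of rho_d * pi(R | d u fixed), normalised.
        fixed = frozenset(fixed)
        return sum(rho_set(d) * pi(R, d | fixed) for d in subsets(R - fixed)) / total

    w1 = {i: w([i]) for i in R}
    diagonal = sum((Y[i] / q[i]) ** 2 * (w1[i] / rho[i] - w1[i] ** 2) for i in R)
    cross = sum(
        (Y[i] / q[i]) * (Y[j] / q[j]) * (w([i, j]) - w1[i] * w1[j])
        for i in R for j in R if i != j
    )
    return diagonal + cross


def toy_pi(R, d):
    """Hypothetical sampling probability; any positive function works here."""
    return 1.0 / (1 + len(d))


Y = {1: 2.0, 2: 0.5, 3: 1.5, 4: 3.0}  # toy outcome values
p = {1: 0.2, 2: 0.4, 3: 0.1, 4: 0.3}  # toy disease probabilities
direct = conditional_variance_direct(Y, p, toy_pi)
closed = conditional_variance_closed_form(Y, p, toy_pi)
print(direct, closed)
```

The enumeration grows as $2^{|\tilde R|}$, so this brute-force check is only practical for small sets; its purpose is to confirm that the two expressions agree, not to serve as an estimator for real data.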
Abstract
The case-control design is useful for estimating relative risks by comparing exposure distributions between diseased and non-diseased populations. Cases and controls can be sampled within an ongoing longitudinal cohort (incident cases) or a cross-sectional observational study (prevalent cases) to estimate relative risks cost-efficiently. Statistical methods for estimating relative risks from a case-control study are available, whereas methods for estimating the absolute disease probability associated with given exposure profiles still require development.
In this dissertation, I describe the development of disease risk estimators and corresponding variance estimators for survey-sampling-based case-control samples. I performed simulation studies to evaluate the proposed estimators and demonstrated their usefulness on real data from two large-scale projects. The new estimators used samples economically and yielded risk inference comparable to full-sample assessments. The methods are applicable to a wide range of epidemiologic studies for deriving point and confidence interval estimates of baseline and comparator disease risks.
Asset Metadata
Creator
Wen, Ge (author)
Core Title
Disease risk estimation from case-control studies with sampling
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
05/02/2018
Defense Date
08/28/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
case-control,OAI-PMH Harvest,risk estimation,sampling
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Gauderman, James (committee chair), Jiang, Xuejuan (committee member), Mack, Wendy (committee member), McKean, Roberta (committee member)
Creator Email
dove1000@gmail.com,dovewg@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-498906
Unique identifier
UC11268535
Identifier
etd-WenGe-6301.pdf (filename),usctheses-c40-498906 (legacy record id)
Legacy Identifier
etd-WenGe-6301.pdf
Dmrecord
498906
Document Type
Dissertation
Rights
Wen, Ge
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA