Application of a two-stage case-control sampling design based on a surrogate measure of exposure
(USC Thesis Other)
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
Application of a Two-Stage Case-Control Sampling Design Based on a Surrogate Measure of Exposure

by Lisa Thurston

A Thesis Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE (Biometry)

May 1997

©1997 Lisa Naomi Thurston

UMI Number: 1384925. UMI Microform 1384925. Copyright 1997, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. UMI, 300 North Zeeb Road, Ann Arbor, MI 48103

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This thesis, written by Lisa Thurston under the direction of her Thesis Committee, and approved by all its members, has been presented to and accepted by the Dean of The Graduate School, in partial fulfillment of the requirements for the degree of Master of Science.

Abstract

Langholz and Goldstein (1996) first presented a method for sampling from a matched case-control study based on a correlate measure of an exposure of interest. This correlate is available on subjects in the (first stage) case-control study, and the second stage sample is made up of the n matched sets that are most discordant in the correlate measure, where n is determined by the power requirements. In this paper, we apply this two-stage sampling method to a case-control study of acute myelogenous leukemia and exposure to radiation from radiography.
Since the exposure is available on the entire case-control study, we were able to compare the performance of the new design to that of the full case-control study and to a study made up of randomly selected matched sets. Potential gains in efficiency and some practical complications were observed in this application. It is concluded that this two-stage sampling design is best used in an initially large case-control study.

Table of Contents

Abstract
List of Tables and Figures
Introduction
Methods
  Original Study Methods
  Two-Stage AML Study Methods
Results
  M4/M5 AML
  All Types of AML excluding known M4/M5
Discussion
Conclusions
Appendix 1: Background
  Nested Case-Control Studies
  Risk Set Sampling
  Survival Regression
  Parameter Estimation
  Analysis of Risk Set Sampled Data
  Counting Process
  Swedish Study
  Top-Discordancy Design and Analysis
References

List of Tables and Figures

Table 1: Logistic Regression on Full Case-Control Study: Risk of Developing Type M4/M5 AML due to Radiation Exposure (according to interview data).
Table 2: Odds Ratios from Subset Analysis: Radiation Exposure According to Chart Review.
Table 3: Logistic Regression: Risk of Developing AML (excluding known type M4/M5) due to Radiation Exposure (according to interview data).
Table 4: Relative Risk Estimates from Subset Analysis: Radiation Exposure According to Chart Review (All Types of AML excluding M4/M5).
Table 5: Number of Procedures According to Interview for Case Control Pairs (Subset of 75 and Full Case Control Study Totals)
Table 6: Empirical Efficiency Measured By Standard Error Of The Trend Parameter Estimate.
Table 7: Power of Design II (picking case-control sets based on Z variability) versus Design I (random sampling of case-control sets) for rejecting the null hypothesis H0: bx = 0, two-sided α = 0.05, by the number of matched sets and the relative risks per standard deviation in Z and X.
Figure 1: Graphical Representation of Risk Sets.

Introduction

Large epidemiological studies are generally very expensive, in terms of both time and money, to set up and maintain. Once an investment has been made in planning a study, all data subsequently collected should be used as efficiently and thoroughly as possible. Therefore, when doing a case-control study, researchers should try to collect as much relevant information as possible on initial contact with the subjects. However, a balance must always be maintained between cost and utility of data collected. In particular, some data are very simple to obtain and some are more difficult or expensive to obtain: compare, for example, a subject's knowledge of their HIV infection status with the result of an HIV RNA PCR blood test. Although ideally an attempt is made initially to be as thorough as possible, there are situations where researchers would like to go back and collect more exposure information on the subjects in a study. In some of these situations it might be more efficient to collect more information only for a subset of the original case-control study.

Nested case-control studies can be more efficient than standard case-control studies because they are conducted within larger cohort studies and can take advantage of previously collected exposure and disease information. The methods for analyzing the results of this sampling design have been thoroughly reviewed and widely applied. An analogous situation to the nested case-control study could also occur within a population-based matched case-control study.
This is the situation that will be discussed in this thesis. For example, investigators may wish to collect additional information on a subset of study subjects to test new hypotheses, perhaps based on analysis of the case-control study itself or based on hypotheses generated from other sources. Specifically, after collecting all of the data, investigators may find that there is some indication of a relationship between an exposure and the outcome of interest, but the chosen exposure measurement is not accurate or representative enough to adequately measure this relationship. Alternatively, other researchers may wish to take advantage of the accumulated data to test a new hypothesis. In these cases it may be prohibitively expensive to collect more data on all the subjects in the original case-control study, so taking a subsample of the original sample would be more practical. Both this subset of a case-control study and the nested case-control design are examples of two-stage sampling methods.

Langholz and Goldstein (1996) encountered a situation where investigators wished to extract more information from a large existing case-control study. This was a study initiated by the National Institute of Occupational Health in Sweden (Appendix I - Swedish Study). Cost issues motivated the investigation of a two-stage sampling model, which would make the most efficient use of the available data. In order to investigate the possibility of an association between low-frequency electromagnetic fields (EMF) in the workplace and development of leukemia and brain tumors, researchers at the National Institute of Occupational Health undertook a population-based case-control study (Floderus et al., 1993). Using standard case-control methods, investigators found that
there was an increasing trend in risk of leukemia with increasing EMF exposure. Investigators at the U.S. National Institute for Occupational Safety and Health (NIOSH) then proposed a biological mechanism of carcinogenesis for EMF radiation and were interested in testing this hypothesis using the Swedish case-control study group. However, a different EMF measurement, correlated with that used in the original study, was required. Therefore the question posed was “How can EMF measurements from the original Swedish study be used to advantage in the proposed follow-up study?”

In this thesis a second, more recent, application of a two-stage sampling design is studied. The initial case-control study investigated exposure to X-ray radiation and subsequent development of Acute Myelogenous Leukemia (AML). The data gathered from this study were in the final stages of analysis when they were made available to us by Drs. Preston-Martin and Pogoda. They had previously determined a dose-response relationship between X-ray exposure and type M4 and M5 AML. This X-ray radiation exposure was measured in two ways: by number of procedures and by estimated dosage (in millirads), which was calculated from the typical exposure for each of those procedures. These two measures were each obtained from two different sources: by interview and by chart review. Therefore, in the end, there were four separate, but related, measures of X-ray radiation exposure. Because the chart reviewed data were relatively expensive and time-consuming to gather, we ask the question: “Could the interview data provide information to collect a less expensive
How can we use the interview measures to advantage in determining our second stage sample of the case-control study?” The purpose of this paper is to illustrate the two-stage sampling design proposed by Langholz and Goldstein, using the AML study. This design attempts to accomplish the data collection of a second stage sample with minimal cost and time commitment by using the available information in the original case-control study. I will show that this sampling design can be used and that the results can then be analyzed in a simple straight forward manner. I will apply this sampling method to the AML study described and compare the results from the entire case-control study to the results from other selected two-stage sampling methods. Methods The background information provided in Appendix I describes a method that has been explored theoretically by Langholz and Goldstein but has not been evaluated in a real study with actual data. Original Study Methods We obtained a set of the variables from a Study of Low Dose Radiation and Drugs in the Etiology of Acute Myelogenous Leukemia from Dr. Preston-Martin at the University of 4 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Southern California. This is an ongoing case-control study of adult-onset acute myeloid leukemia that focuses on the effects of radiography. Cases of AML were found using the population-based Los Angeles Cancer Surveillance Program (CSP). Approximately half of all cases were interviewed in person, the remaining by proxy. Healthy population controls were matched to cases on birthyear, race, gender, and neighborhood of residence. The primary aim of this study was to investigate whether the development of AML is related to exposure to medical x-rays in the 10 years prior to diagnosis. 
After an eligible case was identified by CSP personnel, the physician and the patient were contacted to request permission to include the patient in the study and to get updated information. The case, or a proxy respondent (limited to the surviving spouse or another adult who had lived in the case's household for any 6 of the 10 years preceding the case's leukemia diagnosis), was contacted by phone. A questionnaire (one for personal interviews, one for proxies) was administered in the form of an interview. The questionnaires were designed to elicit information regarding diagnostic x-ray procedures to specified body sites during the 10 years prior to the diagnosis date, location of medical records, occupational exposures, family history and other medical history. Probes were used to prompt recall. Questions about experiences that are difficult to remember accurately were asked in several ways. A similar procedure was followed for the control.

These reported x-ray procedures were then further investigated. Medical records were obtained from each facility or doctor from which the subject reported receiving treatment. These charts were reviewed for the same information as requested in the questionnaires. Published dosimetry surveys on radiography were used to derive a model to estimate bone marrow dose from each type of radiographic procedure (Shleien et al., 1978). Various surveys were also conducted regarding radiographic equipment and facility retake practices. With this information, estimates of the bone marrow dose and of the accompanying uncertainty were derived for each examination. By combining these dosage estimates, an estimate of total exposure (in millirads) for the past 10 years could be made based on the number and type of procedures reported.
Two-Stage AML Study Methods

For this thesis the outcome of interest (AML) was separated into two dependent variables: development of type M4/M5 AML, and development of non-M4/M5 or unknown-FAB AML. The covariates included in the analysis were the self-reported number of x-ray procedures in the past 10 years according to interview and, calculated from this, the dosage of radiation received in millirads (based on the derived model); and the number of x-ray procedures in the past 10 years according to chart review and, calculated from this, the dosage of radiation received. One key assumption was that if the number of procedures according to chart review was missing and the number of procedures according to interview was zero, the number of chart reviewed procedures was also taken to be zero.

In the original case-control study, described above, dosages were calculated for all possible subjects based on review of all of the available charts. Using the methodology described by Langholz and Goldstein (1996) we explore an alternative approach to this typical case-control study design. The general sequence followed in the application of their two-stage top-discordancy design would be:
1) Collect self-reported number of procedures.
2) Analyze the relationship between AML and the number of procedures.
3) Select a second stage sample using the self-reported information and collect additional dosage information based on chart review for only this subset of the original case-control study.

In the application of Langholz and Goldstein’s two-stage sampling design to the AML study, the number of procedures according to the interview would be considered the “surrogate measure” (see Appendix I) and the dosage in millirads according to chart review would be the “true measure”.
Because the latter, chart reviewed data, is an expensive piece of information to gather, in practice one might choose to analyze only a subset of the entire case-control study in order to decrease costs. For this analysis a sample of 75 case-control pairs was selected for the second stage sample. According to the top-discordancy method, the 75 case-control pairs with the most discordant values for self-reported number of procedures were the ones selected. For comparison a random sample of 75 pairs was also taken. This random sampling was repeated 10 times to get an “average” expected performance. Analysis of both types of second stage samples was done using standard conditional logistic regression for matched case-control data (Langholz and Goldstein 1996, and Appendix I).

In order to measure the agreement between the surrogate and the true measure, Pearson and Spearman correlation statistics were calculated between the self-reported number of procedures and the radiation dosage exposure calculated from the number of procedures according to chart review.

Initially the data were analyzed as if all that was available was the interview data (the surrogate measure). We investigated the relationship between radiation exposure, measured as number of procedures, and development of type M4/M5 AML using logistic regression. Logistic regression was then performed on the entire case-control study, the top-discordancy second stage case-control sample and each of the 10 random samples. The parameter estimates from the ten random sample model fittings were averaged and the antilog was taken to get the estimated odds ratio, a summary measure of risk.
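The selection and analysis steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the thesis: the pair layout (`z_case`, `z_control` for the surrogate) and all function names are hypothetical, the model has a single continuous covariate, and the fit relies on the standard result that, for 1:1 matched pairs, the conditional likelihood contribution of a pair depends only on the case-minus-control covariate difference.

```python
import math
import random


def select_top_discordant(pairs, n):
    """Return the n pairs whose surrogate values differ most between case and control.

    Each pair is a dict with hypothetical keys "z_case" and "z_control".
    """
    return sorted(pairs, key=lambda p: abs(p["z_case"] - p["z_control"]),
                  reverse=True)[:n]


def select_random(pairs, n, rng):
    """Return a simple random sample of n pairs (the comparison design)."""
    return rng.sample(pairs, n)


def fit_matched_pairs(diffs, tol=1e-10, max_iter=50):
    """Conditional logistic regression for 1:1 matched pairs, one covariate.

    diffs holds x_case - x_control for each pair. Each pair contributes
    1 / (1 + exp(-beta * d)) to the conditional likelihood, which is
    maximized here by Newton-Raphson. Returns (beta, standard_error).
    Separation (for example, all differences of one sign) is not handled.
    """
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + math.exp(-beta * d))
            score += d * (1.0 - p)          # first derivative of the log likelihood
            info += d * d * p * (1.0 - p)   # minus the second derivative
        if info == 0.0:
            raise ValueError("no discordant pairs: beta is not estimable")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / math.sqrt(info)


def pooled_odds_ratio(betas):
    """Average log odds ratios from repeated samples, then take the antilog."""
    return math.exp(sum(betas) / len(betas))


if __name__ == "__main__":
    rng = random.Random(0)
    # Entirely made-up surrogate values, for illustration only.
    pairs = [{"id": i, "z_case": rng.randint(0, 20), "z_control": rng.randint(0, 20)}
             for i in range(30)]
    top = select_top_discordant(pairs, 10)
    rand = select_random(pairs, 10, rng)
    print(len(top), len(rand))
```

Pairs with equal covariate values in case and control contribute nothing to the fit, which is why a design that favours discordant pairs can carry more information per pair collected.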
The log likelihood values from the ten fitted models were averaged, and this average value was then used to get a p-value from the chi-square distribution for the trend and homogeneity tests. For all samples, the dependent variable used was type M4/M5 AML, and the covariates of interest were the four measures of radiation exposure: interview and chart reviewed number of procedures and radiation dosages. The logistic regression results (patterns and trends) from the full case-control study and the two second stage samples were compared.

Following the initial analysis we also decided to explore the possibility that there is actually an increased risk of other types of AML (non-M4/M5 and missing FAB) with increased exposure to x-ray procedures. It was considered possible that we had not been able to see this relationship initially because number of procedures according to interview is not an accurate measure of radiation exposure. Therefore we followed the same procedure described above to analyze another subset of the original case-control study, where the dependent variable was development of non-M4/M5 AML or missing FAB. We examined the entire sample, then a randomly selected subset, and then a subset made up of the 75 case-control pairs with the most discrepant surrogate measurements (number of procedures according to interview data). Note that, as before, for the randomly selected sample, 10 random samples were actually selected and their results were averaged to give the results shown.

Results

M4/M5 AML

Once we limited this analysis to type M4/M5 AML we ended up with a total sample size of 106 case-control pairs as the first stage case-control study.
This small sample size had an effect on our results, as described in the Discussion section. The results of the initial analysis with the first stage case-control study, assuming chart review data are still unknown, are shown in Table 1. There is a dose-response relationship between risk of developing type M4/M5 AML and both self-reported number of procedures (p=0.012) and the corresponding dosage (p=0.006).

Table 1: Logistic Regression on Full Case-Control Study: Risk of Developing Type M4/M5 AML due to Radiation Exposure (according to interview data).

| Number of Procedures | Relative Risk | P-Value | Dosage (millirads) | Relative Risk | P-Value |
|---|---|---|---|---|---|
| 0 | 1.0 | - | 0 | 1.0 | - |
| 1 to 10 | 1.2 | 0.686 | 1 to 90 | 1.3 | 0.474 |
| 11 to 15 | 2.5 | 0.040 | 91 to 700 | 2.3 | 0.178 |
| >15 | 2.8 | 0.025 | >700 | 5.5 | 0.024 |
| Trend | 1.7 | 0.012 | Trend | 1.5 | 0.006 |

The results for the logistic regression using the second stage “nested” samples are shown in Table 2. The Pearson correlation between number of procedures according to interview and millirad dosage according to chart review was only 0.236 and the Spearman correlation was 0.501. These measures indicate that the order would be maintained if one sorted by the corresponding covariates, but there was little linear correlation between the two measures.

Table 2: Odds Ratios from Subset Analysis: Radiation Exposure According to Chart Review.

| Variable | Top Discordant Pairs (N=75) | Randomly Selected Pairs* (N=75) | Full Study |
|---|---|---|---|
| # Procedures: 0 | 1.00 | 1.00 | 1.00 |
| # Procedures: 1-10 | 1.98 | 1.30 | 1.42 |
| # Procedures: 11-15 | 2.24 | 1.44 | 1.59 |
| # Procedures: >15 | 4.46 | 3.16 | 3.19 |
| Homogeneity p-value | 0.304 | 0.504 | 0.476 |
| Trend p-value | 0.080 | 0.227 | 0.136 |
| Dosage (millirads): 0 | 1.00 | 1.00 | 1.00 |
| Dosage (millirads): 1-90 | 1.82 | 1.62 | 1.69 |
| Dosage (millirads): 91-700 | 1.96 | 1.77 | 2.10 |
| Dosage (millirads): >700 | 4.10 | 3.74 | 3.64 |
| Homogeneity p-value | 0.080 | 0.091 | 0.051 |
| Trend p-value | 0.017 | 0.021 | 0.008 |

*Statistics from 10 random samples averaged together and p-values determined from the averages.
The increasing trend (Table 1) found for relative risk of AML with increasing number of procedures, according to interviews, was also found with chart reviewed data. The results found using the entire case-control study sample are more closely mirrored by the randomly selected second stage sample than by the top-discordant sample. However, we must keep in mind that there is inherently more variability in the results presented because they are the combination of 10 separate random samples. The relative risks are slightly higher for each stratum, defined by number of procedures, in the top discordant pair subsample than in both the whole sample and the randomly selected subsample. The dose-response relationship is evident but not significant for the first and both second stage samples. Given the original sample size, these differences are small and may be attributed to random differences in the data.

Examining the chart reviewed radiation dosages, as measured by millirads, shows that the evidence of a dose-response relationship is statistically significant in all samples. The results are more consistent among the three samples (whole sample, randomly selected sub-sample, top discordant pair sub-sample) than seen for interview data. The p-value for trend is smallest (i.e. most significant) for the whole case-control sample and next smallest for the sub-sample selected from the most discordant case-control pairs.

All Types of AML excluding known M4/M5

For this group from within the original case-control study, the Pearson correlation between number of procedures according to interview and millirad dosage according to chart review was 0.346 and the Spearman correlation was 0.532. These measurements lead to similar conclusions as those drawn for the type M4/M5 group, i.e.
the order would be maintained if you sorted by the corresponding covariates, but there was little linear correlation between the two.

Looking at the initial case-control study, the pattern of increasing relative risk that was found in the M4/M5 group for the surrogate measure (x-ray exposure measured by number of procedures according to interview) is not found here. In fact, looking at the millirad dosage measurement in Table 3, there appears to be a decreasing trend in relative risk of developing non-M4/M5 AML with increasing exposure to radiation. These opposing trends in the two groups may wash out when the entire case-control study sample is examined together.

Table 3: Logistic Regression: Risk of Developing AML (excluding known type M4/M5) due to Radiation Exposure (according to interview data).

| Number of Procedures | Relative Risk | P-Value | Dosage (millirads) | Relative Risk | P-Value |
|---|---|---|---|---|---|
| 0 | 1.0 | - | 0 | 1.0 | - |
| 1 to 10 | 0.71 | 0.157 | 1 to 90 | 0.99 | 0.961 |
| 11 to 15 | 0.35 | 0.002 | 91 to 700 | 0.68 | 0.121 |
| >15 | 1.24 | 0.647 | >700 | 0.56 | 0.028 |
| Homogeneity | - | 0.005 | Homogeneity | - | 0.042 |
| Trend | 0.83 | 0.122 | Trend | 0.81 | 0.006 |

Looking at the chart reviewed data from the second stage samples, there was no excess relative risk (>1) for any of the parameters. There is no evidence of any relationship between radiation exposure and non-M4/M5/unknown type AML. There was an unexpected decrease in the odds ratio for the 11 to 15 procedures stratum, where the relative risk is 0.075; this occurs because of the scarcity of data in this group. This scarcity of data is shown in detail in the Discussion section (Table 5).

Table 4: Relative Risk Estimates from Subset Analysis: Radiation Exposure According to Chart Review (All Types of AML excluding M4/M5).
| Variable | Top Discordant Pairs (N=75) | Randomly Selected Pairs* (N=75) | Full Study (N=294) |
|---|---|---|---|
| # Procedures: 0 | 1.00 | 1.00 | 1.00 |
| # Procedures: 1-10 | 0.43 | 0.51 | 0.62 |
| # Procedures: 11-15 | 0.08 | 0.31 | 0.34 |
| # Procedures: >15 | 0.73 | 0.76 | 0.79 |
| Trend p-value | 0.444 | 0.757 | 0.349 |
| Homogeneity p-value | 0.009 | 0.222 | 0.065 |
| Dosage (millirads): 0 | 1.00 | 1.00 | 1.00 |
| Dosage (millirads): 1-90 | 0.66 | 0.68 | 0.90 |
| Dosage (millirads): 91-700 | 0.30 | 0.38 | 0.58 |
| Dosage (millirads): >700 | 0.64 | 0.54 | 0.69 |
| Trend p-value | 0.323 | 0.098 | 0.089 |
| Homogeneity p-value | 0.236 | 0.181 | 0.195 |

*Statistics from 10 random samples averaged together and p-values determined from the averages.

Discussion

There were a few extreme outliers in the AML data set. For example, one of the cases reported 61 procedures but only two procedures were found by chart review. There were four subjects with data like this. The potential problems created by these outliers may become exacerbated when we apply this two-stage sampling scheme. When we select the top discordant case-control pairs for our smaller subsample, it is very likely that the subjects with extreme values will be selected. Therefore you would end up with a smaller overall sample and a higher percentage of your data points with extremely unusual values. Fortunately, in this study, when the analysis was repeated excluding and including these outliers, the results for the sub-sample were not greatly influenced by these case-control pairs, indicating that there is no information bias. That is, reporting and procedural errors were the same for cases and controls, given the chart dosage values. This observation indicates that, generally, two-stage sampling based on a factor likely to be subject to selection bias should be avoided.

A related problem that may have an effect on the analysis of this AML data is the fact that not only are the interviews not objective, but neither are the medical charts.
The goal would be to add information in the second stage of this two-stage sampling design, but because charts often cannot be found, we could end up with less information for our true measure of exposure.

As seen in this study, a subjective and an objective measure of the same value may not necessarily be highly correlated. The Pearson correlation between number of procedures according to interview and millirad dosage according to chart review was only 0.236 and the Spearman correlation was 0.501. These measures indicate that the order was maintained but there was little linear correlation between the two measures. This low level of correlation may have had more of an effect on the power of our sub-sample because the size of the initial case-control sample was small. In order for this two-stage sampling method to be efficient, the surrogate and the true measure need to be correlated. The more highly correlated the two measures are, the more power is gained by using the two-stage sampling design. This factor should be investigated, or at least considered, before proceeding.

In this case-control study there were many missing values for chart reviewed information. This second, true measure was more difficult to obtain, and for some subjects it turned out to be impossible. In the M4/M5 subgroup, 30% of the subjects that had interview-based data did not have chart review-based data. When we selected the sub-sample of 75 top-discordant pairs, 22 of the pairs had missing chart data for either the case or the control, so they contribute no information to the results. You will not know ahead of time which of your case-control subjects will have missing data, so this further shrinks the size of your subsample. This would be an issue for any two-stage sampling method.
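The contrast between the Pearson and Spearman correlations reported earlier in this section (0.236 versus 0.501) can be illustrated with a small, self-contained sketch. The data below are invented purely for illustration and are not taken from the AML study; the function names are likewise hypothetical. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical values: the ordering agrees perfectly, but one very large
    # dose pulls the linear (Pearson) correlation down.
    procedures = [1, 2, 3, 4, 5]
    dose = [1, 2, 3, 4, 100]
    print(pearson(procedures, dose), spearman(procedures, dose))
```

In this toy example the two measures are ordered identically, so the rank correlation is at its maximum while the linear correlation is noticeably lower, which is the same qualitative pattern as in the study data.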
This issue of missing data reappeared in the analysis of the non-M4/M5/missing FAB subgroup. We saw in Table 4 that there was an unexpected decrease in the relative risk for the 11 to 15 procedures stratum, which occurs because of the scarcity of data in this group. In the full case-control study there are few cases and controls in the 11-15 procedures group, and consequently, for the top-discordancy subset of 75, there are combinations of strata within the 11-15 procedures group (case or control, or both, have 11-15 procedures) that contain no observations. Table 5 shows the cross-classification of case and control values for number of procedures according to chart review.

Table 5: Number of Procedures According to Interview for Case Control Pairs (Subset of 75 and Full Case Control Study Totals). Entries are Top Discordancy Subset Count (Full Study Count).

| Control \ Case | 0 | 1-10 | 11-15 | >15 | Total |
|---|---|---|---|---|---|
| 0 | 0 (4) | 3 (21) | 0 (0) | 1 (1) | 4 (26) |
| 1-10 | 4 (28) | 30 (155) | 2 (13) | 12 (21) | 48 (218) |
| 11-15 | 3 (4) | 10 (18) | 0 (1) | 0 (3) | 13 (25) |
| >15 | 3 (4) | 10 (15) | 0 (0) | 2 (4) | 10 (24) |
| Total | 10 (40) | 49 (211) | 2 (14) | 14 (29) | 75 (294) |

We can use the larger subset of the entire case-control study (i.e. the non-M4/M5/unknown subset) to look for gains in efficiency realized by using the top-discordant sampling method as compared to random sub-sampling. (Only one sample was used for this analysis.) The standard error of the parameter estimate was used as a measure of efficiency. That is, the goal would be to have smaller standard errors and therefore narrower confidence intervals for the estimates of risk. For this data the standard errors are somewhat high to start with compared to the parameter estimates themselves, but of the
Remarkably, to achieve with random sampling the standard errors obtained with the top-discordant sampling method, we would have had to sample twice as many pairs (N = 150 versus 75). This would have a significant effect on the cost of a study.

Table 6: Empirical Efficiency Measured by Standard Error of the Trend Parameter Estimate

                                                 Randomly selected   Randomly selected   Top-discordant     Full study
                                                 pairs (150)         pairs (75)          pairs (75)         (294)
Parameter                                        Estimate   SE       Estimate   SE       Estimate   SE      Estimate   SE
Number of procedures according to chart review   -0.13      0.17     -0.16      0.26     -0.13      0.17    -0.11      0.12
Dosage (millirads) according to chart review     -0.17      0.12     -0.23      0.18     -0.14      0.14    -0.15      0.09

Bias can easily be introduced into case-control studies when the explanatory covariate is not objective. For example, a case may have been thinking more about possible exposures or causes of the illness, whereas a control will not have gone through these thought processes. Pogoda, in her analysis of the original AML case-control study, determined that agreement between interview and chart data was slightly higher for controls than for cases, but both tended to underestimate dosage from charts. We investigated the possibility that recall bias was a problem in the interview data by examining case versus control self-reports for both the M4/M5 and non-M4/M5 groups. To examine this bias we performed linear regression using the chart-reviewed covariate as the independent variable and the self-report covariate as the dependent variable, first for cases and then for controls. We compared the slope from each regression, as well as the scatter plots with the resulting fitted line, between the cases and controls. We also examined the intraclass (cases versus controls) correlations between interview and chart data.
Using only the intraclass correlation measurements for the non-M4/M5 group, we found more agreement between chart and interview data among the controls than among the cases. However, the plots and regressions show no obvious patterns to indicate bias. In the M4/M5 group there was no consistent pattern indicating bias in either the regression or the intraclass correlation measurements. The small amount of bias found should not have greatly influenced the conclusions from the logistic regression.

Conclusions

Overall, there is a definite gain in efficiency in using the two-stage top-discordancy sampling design over the two-stage random sampling design. This desirable effect is seen when the initial case-control study sample is large. This gain in efficiency could potentially have large consequences in terms of cost. In the two examples presented in this paper, a random sub-sample twice as large would have been required to achieve the same power or efficiency as found using the top-discordancy sampling design. It is easy to imagine how this could save a significant amount of time and money.

There are also potential complications that were illustrated in the data presented in this paper. The gain in efficiency was not great when the sample size was very small and/or there were many missing values in the data (two paths to the same conclusion). The realistic possibility of these complications arising must be considered prior to applying the sampling method. It is important that the surrogate measure and the true measure used are somewhat correlated for this design to be effective. However, if the first-stage sample is very large we do not need very high correlation to see a gain in efficiency in using the top-discordancy design.
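The role of first-stage sample size can be illustrated with a small simulation. This is only a sketch under stated assumptions: the surrogate values for cases and controls are drawn as independent standard normal variables (a hypothetical null setting, not the AML data), and the function name is illustrative. It shows that, for a fixed second-stage size N, the N most discordant pairs are more extreme on the surrogate when they are drawn from a larger first-stage sample.

```python
import numpy as np

def mean_top_discordance(n_pairs, top_n, rng):
    """Mean |Z_case - Z_control| among the top_n most discordant of n_pairs simulated pairs."""
    z_case = rng.standard_normal(n_pairs)
    z_control = rng.standard_normal(n_pairs)
    diff = np.abs(z_case - z_control)
    return float(np.sort(diff)[-top_n:].mean())

rng = np.random.default_rng(1)
small = mean_top_discordance(100, 75, rng)    # 75 pairs chosen from 100
large = mean_top_discordance(1000, 75, rng)   # 75 pairs chosen from 1000
print(f"first stage 100 pairs: {small:.2f}; first stage 1000 pairs: {large:.2f}")
```

How much of this extra surrogate discordance carries over to the true measure depends on the correlation between the two, which the simulation does not model.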
To understand the relationship between sample size and the correlation between surrogate and true measures, consider looking at many case-control studies, for each of which you are considering using this top-discordancy design. If the sample size increases and the correlation stays the same, the top N discordant pairs will have more highly discordant values for the surrogate measure, and therefore the design will yield more power. If the sample size stays the same and the measures are less correlated, then sampling the top N discordant pairs will not yield a change in power from one study to the next. Therefore, if a case-control study is reasonably large, we do not need high correlation between the surrogate and the true measure to realize a gain in efficiency by using the top-discordancy sampling design.

In our example data for the entire case-control study, excluding the M4/M5 subtype, we did see a gain in efficiency for the two-stage design over the random sub-sampling method. Although there is no apparent relationship between the outcome of interest (AML) and either the surrogate or the true measure (millirad dosage), there is still a cost advantage to using the top-discordancy pair design. For the same cost, this design gives confidence intervals that are narrower than those given by random sub-sampling. So, although we do not necessarily see any evidence of a relationship between radiation exposure and development of non-M4/M5/unknown AML, we can put tighter limits on the magnitude of the effect.

As mentioned in the Discussion, bias may easily be introduced into case-control studies when the explanatory covariate is not objective. These bias issues could be exacerbated by this two-stage study design, because we are using the "first-stage", possibly biased, measure to determine our "second-stage" sample.
Thus we would recommend caution in using this design if the surrogate measure is subject to significant information bias. For the AML study we found evidence of a small amount of bias in the non-M4/M5 group. For both the M4/M5 and non-M4/M5 subsets, the small amount of bias would not have affected the results of the logistic regression in the second-stage samples. It is plausible that bias could be a problem in other case-control studies using interview data.

This two-stage top-discordancy design does not create a significant or worthwhile gain in efficiency in all settings, as mentioned above. An ideal setting for this design would be a large case-control study with a surrogate measure of another "true" measure that is difficult and/or expensive to collect. These two measures, true and surrogate, would have to be correlated. Ideally we would also like both of the measures to be objective. For example, one could design a large case-control study, collect some relatively inexpensive data on each case-control pair, and then later go back and collect the more expensive data only on the second-stage sample.

Some future questions to be explored in regard to the design include the analysis of confounders. Would the same gain in efficiency found for the explanatory variable of interest also apply to the confounders? Would multiple-variable analysis including these confounders find the same gain in efficiency for the covariates of interest? Also, as implied earlier, it would be desirable to repeat the analysis from this thesis using a larger case-control study.
Appendix 1: Background

Nested Case-Control Studies

Because the two-stage sampling design to be described grew from the nested case-control design, I will first describe the nested case-control design in detail. The nested case-control study begins with a cohort study. In the classic cohort study (Ernster, 1994), one follows a group of subjects (called the cohort), who are initially disease-free, over time. Associated with each subject is a covariate history, which may include factors that are fixed over time or factors that are time dependent. Disease incidence is compared across covariates to determine risk factors for disease development. Unfortunately (in terms of simplicity of analysis), subjects do not all necessarily enter the cohort study at the same time. Each subject enters the study at some entry time, is then at risk of developing the disease of interest for some finite time period, and then exits the study at some exit time. There are two basic ways the subject may exit the study: the subject either contracts the disease of interest (fails) or is considered censored (death due to another cause, loss to follow-up, or the study ends and the subject has not yet contracted the disease of interest).

For the nested case-control design, cases are identified from this original cohort, and then for each case a specified number of controls (say m-1) are selected from among those in the cohort who have not developed the disease by the time of disease occurrence in the case, but who are at risk of developing the disease. The efficiency of this design relative to the full cohort for testing for an association between a single factor and disease can be measured as (m-1)/m (Breslow and Patton, 1979). Therefore we can see a gain in efficiency over the typical case-control design.
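The (m-1)/m efficiency quoted above is easy to tabulate. The short sketch below (the function name is illustrative) evaluates it for sampled risk sets of size m = 2 to 6, that is, one to five controls per case.

```python
def nested_cc_relative_efficiency(m):
    """Efficiency relative to the full cohort with m - 1 controls per case: (m - 1) / m."""
    if m < 2:
        raise ValueError("need at least one control per case (m >= 2)")
    return (m - 1) / m

efficiency = {m: nested_cc_relative_efficiency(m) for m in range(2, 7)}
for m, eff in efficiency.items():
    print(f"{m - 1} control(s) per case: {eff:.2f}")
```

One control per case gives 50% of the full-cohort efficiency and four controls give 80%, so additional controls beyond the first few add relatively little.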
Generally, nested case-control studies are done because of the difficulty and/or expense of collecting exposure and other data for a large population. As statistical analyses of these nested case-control designs become more complex and costs increase, it becomes more beneficial to customize the sampling design of the nested case-control study using the available data in the larger cohort study. Intuitively, sampling designs that use previously available exposure information to draw a sub-sample from the overall cohort should be more efficient than designs based on random sampling of the cohort. This leads to the creation of two-stage sampling designs like the one applied in this thesis.

The expansion of nested case-control sampling theory to risk set sampling theory has recently been studied. Langholz and Goldstein have proposed a two-stage nested sampling design for case-control studies that makes use of prior information gained from the "first-stage" sample to improve efficiency. In contrast to the nested case-control design, in their design the first-stage sample was a randomly sampled case-control study. They developed this methodology in the context of a Swedish occupational study introduced earlier. A detailed summary of this study (the Swedish case-control study of electromagnetic fields and cancer) follows the theoretical discussion in this Background section.

Risk Set Sampling

All case-control studies may be viewed as nested case-control studies where the initial cohort is very large, for example the population of Los Angeles County. In the view proposed by the Cox proportional hazards model, in a cohort study at each failure time a risk set (see Figure 1) is formed that includes the case (the failure at that failure time) and all controls (any other cohort members who are at risk at the failure time).
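The construction of risk sets, and the sampling of m-1 controls from each one, can be sketched in a few lines. The cohort below is hypothetical, and the convention that a subject is at risk on the interval (entry, exit] is an assumption made for this illustration.

```python
import random

# Hypothetical cohort: (subject id, entry time, exit time, failed at exit?)
cohort = [
    ("a", 0.0, 5.0, True),
    ("b", 1.0, 8.0, False),
    ("c", 0.5, 3.0, True),
    ("d", 2.0, 9.0, False),
    ("e", 4.0, 7.0, True),
]

def risk_set(cohort, failure_time):
    """Ids of everyone under observation (entered, not yet exited) at failure_time."""
    return {sid for sid, entry, exit_, _ in cohort if entry < failure_time <= exit_}

def sampled_risk_set(cohort, case_id, failure_time, m, rng):
    """The case plus m - 1 controls drawn without replacement from the rest of the risk set."""
    controls = sorted(risk_set(cohort, failure_time) - {case_id})
    return {case_id} | set(rng.sample(controls, m - 1))

rng = random.Random(0)
for sid, entry, exit_, failed in cohort:
    if failed:
        full = risk_set(cohort, exit_)
        sampled = sampled_risk_set(cohort, sid, exit_, 2, rng)
        print(sid, sorted(full), sorted(sampled))
```

Note that subject "a" serves as a potential control for the earlier failure of "c" and later becomes a case; this reuse of subjects across risk sets is what distinguishes risk set sampling from a simple split into cases and non-cases.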
A sampled risk set of size m is a subset of the risk set that contains the case and the corresponding m-1 sampled controls. Risk set sampling designs are intrinsically related to the semi-parametric estimation methods for parameters in the Cox proportional hazards model that are used in the analysis of full cohort data. Each case-control pair or stratum in a case-control study can be viewed simply as one risk set. When there are matching factors, the sampled risk set will be a subset of the risk set defined by the failure time and any matching factors. This corresponds to stratification of the Cox model. The approach that organizes cohort data by risk sets, and therefore leads to data that look just like a matched case-control study, was followed by Goldstein and Langholz (1996). This led them to believe there may be a straightforward way to analyze the data resulting from a two-stage sampling design. The method that they proposed will be the culmination of this Background section.

Survival Regression

To explain the Cox proportional hazards model we can start with the concepts of expected disease rates, relative risk measures, and hazard models (Mack, 1996). Note that the observed number of events in a cohort study is assumed to follow a Poisson process. This is a random process describing the probability of events occurring over time, and it assumes that the number of events occurring in one time period is independent of the number of events occurring in another time period. Therefore the Poisson probability for the number of events, d, in time period t is:

\[ P(d) = \frac{e^{-\lambda n} (\lambda n)^{d}}{d!} \]

where
$\lambda$ = P(event occurring per unit person-year)
$n$ = person-years observed

Some further definitions used are:
Risk = probability of a disease event occurring by some time T = P[event occurs in (0,T)] = F(T)

S(T) = survival probability = P[event has not occurred by T] = 1 - F(T):

\[ S(T) = \exp\left( -\int_0^T \lambda(t)\, dt \right) \]

Event density = P[event occurs in (t, t+dt)] = dF(t) = f(t) dt

$\lambda(t)$ = hazard rate = instantaneous rate of disease, the probability that an individual subject fails in the time from t to t+dt given that the subject has not yet failed at time t
= P[event occurs in (t, t+dt), given it has not occurred by t]
= P[event occurs in (t, t+dt)] / P[event has not occurred by t]:

\[ \lambda(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{1 - F(t)} = -\frac{S'(t)}{S(t)} \]

Cumulative hazard:

\[ \Lambda(T) = \int_0^T \lambda(t)\, dt \]

In the special case of a constant hazard over all time t, i.e. the risk of disease does not depend on time ($\lambda(t) = \lambda$ for all t):

\[ S(T) = \exp\left( -\int_0^T \lambda\, dt \right) = e^{-\lambda T}, \qquad f(t) = \lambda(t) S(t) = \lambda e^{-\lambda t} \]

To model survival data we might use Poisson regression, which models disease rates as a function of a vector of covariates, z. For
k = exposure level
j = stratum level
the multiplicative model is of the form:

\[ \lambda_{jk} = B_j r_k \]

where
$B_j$ = the baseline rate in the unexposed in stratum j (nuisance variable)
$r_k$ = the relative risk for exposure level k

Note that this simple main-effects model assumes $r_k$ is constant across all strata j.

The problem with Poisson regression is its key assumption that the same disease rate holds for all individuals in a cell (exposure level k and stratum j). Two regression techniques that identify variables that distinguish disease risks between individuals in a group are logistic and survival regression. The advantages of survival regression are that period of risk, censored survival time data, and time-dependent exposures are all taken into account for each subject.
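The definitions above fit together in a way that is easy to check numerically in the constant-hazard special case. The sketch below uses a hypothetical rate of 0.02 events per person-year and a 10-year follow-up (values chosen only for illustration) and confirms that f(t)/S(t) returns the hazard and that the risk F(T) equals 1 - exp(-hazard * T).

```python
import math

def survival(hazard, t):
    """S(t) = exp(-hazard * t) when the hazard is constant."""
    return math.exp(-hazard * t)

def density(hazard, t):
    """f(t) = hazard * S(t)."""
    return hazard * survival(hazard, t)

lam, T = 0.02, 10.0              # hypothetical rate per person-year and follow-up time
risk = 1.0 - survival(lam, T)    # F(T), the probability of an event by time T
recovered_hazard = density(lam, T) / survival(lam, T)

print(f"risk by T = {risk:.4f}, hazard recovered from f/S = {recovered_hazard:.4f}")
```

For small values of hazard * T the risk is close to hazard * T itself (here 0.18 versus 0.20), which is why rates and risks are nearly interchangeable for rare diseases over short periods.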
Survival regression models express instantaneous rates, $\lambda(t)$, as a function of time and explanatory exposure variables. The instantaneous rate, or hazard rate, $\lambda(t)$ is expressed as a continuous function of time. A baseline rate, $\lambda_0(t)$, is assumed, and the explanatory exposure variables $z(t)$ are then modeled as they modify disease rates. Regression parameters, $\beta$, are estimated in the presence of the nuisance parameters, $\lambda_0(t)$. The general multiplicative version of this survival model would be:

\[ \lambda(t, z(t)) = \lambda_0(t)\, r(z(t), \beta) \]

Parameter Estimation

To determine parameter estimates, which is usually the whole point of the analysis, we need to determine likelihood functions for survival-time data. The following definitions are needed to explain the use of likelihood functions.

$i$ = subject (i = 1, ..., N)
$t_i$ = observed time in view for subject i (could be time to disease of interest or time to censoring)
$T = \sum_i t_i$ = total person-years in view for the entire cohort
$D_i$ = indicator of failure (disease/death) for subject i (= 1 if subject fails, 0 otherwise)
$D = \sum_i D_i$ = total failures (deaths) from the cause of interest in the entire cohort
$N$ = total number of subjects in the cohort
$S = N - D$ = number of subjects that "survived" (i.e. did not fail)
$Y_i(t)$ = in-view indicator (= 1 if subject i is at risk at time t)
$x_i = (x_{i1}, \ldots, x_{ip})$ = explanatory variables (covariates) for subject i
$\lambda(t)$ = hazard rate = P[failure occurs in (t, t+dt), given it has not occurred by t]

According to the survival regression model, define the hazard rate (proportional hazards model) for subject i as:

\[ \lambda(t_i, x_i) = \lambda_0(t_i, \alpha)\, r(x_i, \beta) \]

where, as mentioned earlier,
$\lambda_0(t_i, \alpha)$ = baseline hazard rate, which is a function of $t_i$ and $\alpha$ only;
$r(x_i, \beta)$ = a relative risk function of $x_i$ and $\beta$. Note that the $x_i$ may also be time-dependent: $x_i(t)$. (Note that $r(x_i, \beta)$ is often taken to be $e^{\beta' x_i}$.)

The individual contributions to the survival likelihood can be written as:

\[ L_i = \lambda(t_i, x_i)^{D_i}\, S(t_i, x_i) = \left[ \lambda_0(t_i) \exp(\beta' x_i) \right]^{D_i} \exp\left( -\int Y_i(u)\, \lambda_0(u) \exp(\beta' x_i)\, du \right) \]

At this point, to determine parameter estimates we would maximize the likelihood with respect to $\beta$ and $\lambda_0(t)$. Unfortunately, our real interest is in estimating the relative risk parameters $\beta$ rather than in obtaining the individual baseline rate estimates $\lambda_0(t)$. The question becomes what we can do to deal with the nuisance parameters $\lambda_0(t)$. Cox (1972) suggested eliminating $\lambda_0(t)$ completely by using the "partial likelihood".

At time $t_i$, we observe exactly one failure, in subject i, out of the risk set $R_i$ (described earlier) of subjects at risk at $t_i$. Given the subjects at risk in $R_i$ and that only one subject failed, why is it that this particular subject failed rather than one of the others? The instantaneous probability of subject j failing at time $t_i$ is equal to $\lambda(t_i, x_j)$. Therefore, the conditional probability that this particular subject failed, given that one of the subjects at risk in $R_i$ failed, is:

\[ \frac{\lambda(t_i, x_i)}{\sum_{j \in R_i} \lambda(t_i, x_j)} \]

Therefore the likelihood function contribution at time $t_i$ becomes

\[ L_i = \frac{\lambda_0(t_i)\, r(x_i, \beta)}{\sum_{j \in R_i} \lambda_0(t_i)\, r(x_j, \beta)} \]

and the nuisance function $\lambda_0$ in the numerator and denominator cancels to give simply:

\[ L_i(\beta) = \frac{r(x_i, \beta)}{\sum_{j \in R_i} r(x_j, \beta)} \]

The product of these individual likelihoods was called a "partial likelihood", or a conditional logistic likelihood, by Cox (1972). Note that we do not have a true conditional likelihood, because the individual $L_i$'s are not independent, so we could not take their product to get a total likelihood. (The subjects in $R_i$ at risk at $t_i$ depend upon who has survived from earlier risk sets.)
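The partial likelihood described above can be sketched directly from its definition. The example below is a minimal illustration under stated assumptions: a single fixed covariate, the relative risk function r(x) = exp(beta * x), all subjects entering at time 0, no tied failure times, and a small hypothetical data set. It is not an efficient or general implementation.

```python
import math

# Hypothetical data: (observed time, failed?, covariate x); no tied failure times.
subjects = [
    (2.0, True, 1.0),
    (3.0, False, 0.0),
    (5.0, True, 0.0),
    (7.0, True, 1.0),
    (8.0, False, 0.0),
]

def log_partial_likelihood(beta, subjects):
    """Sum over failures of log[ r(x_i) / sum_{j in R_i} r(x_j) ], with r(x) = exp(beta * x)."""
    total = 0.0
    for t_i, failed, x_i in subjects:
        if not failed:
            continue  # censored subjects contribute only through others' risk sets
        # Risk set R_i: everyone still under observation at t_i (all enter at time 0 here).
        denom = sum(math.exp(beta * x_j) for t_j, _, x_j in subjects if t_j >= t_i)
        total += beta * x_i - math.log(denom)
    return total

for beta in (0.0, 0.5, 1.0):
    print(beta, round(log_partial_likelihood(beta, subjects), 4))
```

The baseline hazard never appears in the function, which is the cancellation described in the text. At beta = 0 every member of a risk set is equally likely to be the failure, so each contribution is one over the size of the risk set.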
Analysis of Risk Set Sampled Data

With the concept of risk sets as case-control sets, the idea of sampling controls seems natural, by analogy to case-control study methodology. A random sample of a relatively small number of controls from a risk set would be an efficient way to obtain a sample from the cohort. In the appendix to Liddell et al. (1977) it is concluded that the correct way to analyze these sampled data is to view the samples as a case-control study and use the conditional logistic likelihood as described above. With this method, the case is weighted the same as the controls, so that the denominator (for the above individual contribution to the partial likelihood) is:

\[ r_{\text{case}} + \sum_{k \in \text{controls}} r_k = \sum_{k \in \tilde{R}} r_k \]

where $\tilde{R}$ indicates the sampled risk set at time T (the time of failure of the case), as opposed to $R_i$, the entire risk set at time T, and $r_k$ is the relative risk associated with the covariate history for subject k at time T (the same as $r(x_k, \beta)$ from above). The likelihood based on this unweighted denominator was shown to have a partial likelihood interpretation in the same spirit as in the full cohort situation (Oakes, 1981).

Counting Process

According to Goldstein and Langholz (1996), a formal treatment of risk set sampling is based on specifying an appropriate intensity process (a generalization of the hazard model). They describe a counting process approach based on Andersen and Gill's formulation of the Cox model (Andersen and Gill, 1982). A counting process $N_i(t)$, which "counts" failures for subject i, and a corresponding intensity process $\lambda_i(t)$ are associated with each subject in the cohort. In order to accommodate sampling from the risk sets, they define the counting processes $N_{i,r}(t)$, which record occurrences where subject i fails and the set of subjects r serves as the sampled risk set.
The intensity processes corresponding to $N_{i,r}(t)$ take the form

\[ \lambda_{i,r}(t) = \lambda_i(t)\, \pi_t(r \mid i) \]

where $\pi_t(r \mid i)$ is the probability of choosing r as the sampled risk set if subject i were to fail at time t. Since m-1 controls are randomly sampled, without replacement, from the n-1 controls in the risk set at time t,

\[ \pi_t(r \mid i) = \binom{n-1}{m-1}^{-1} \]

for subsets of the risk set of size m that contain i. Similar to the conditional likelihood approach for the full cohort, the conditional probability that subject i fails, given that one of the subjects k in the sampled risk set $\tilde{R}$ fails, is

\[ \frac{r_i\, \pi_t(\tilde{R} \mid i)}{\sum_{k \in \tilde{R}} r_k\, \pi_t(\tilde{R} \mid k)} = \frac{r_i}{\sum_{k \in \tilde{R}} r_k} \]

by canceling the common factor $\binom{n-1}{m-1}^{-1}$ out of both numerator and denominator. Hence the partial likelihood that results from taking products of the above factors over all failure times is the same as the case-control likelihood based on the aforementioned unweighted denominator. The counting process formulation of simple nested case-control sampling provided a probabilistic model for formally establishing the properties of the maximum partial likelihood estimator.

Other methods for sampling the risk set can be accommodated by substituting the appropriate sampling probabilities $\pi_t(r \mid i)$ into the above formula. This innovation provides an analysis method for new sampling designs where controls are sampled in a "non-representative" way. In particular, we could apply this method to analyzing the results from the sampling method proposed in this paper. A likelihood contribution from each sampled risk set is of the form

\[ \frac{\pi_t(r \mid \text{case})\, r_{\text{case}}}{\sum_{k \in r} \pi_t(r \mid k)\, r_k} = \frac{r_{\text{case}}\, w_{\text{case}}(r)}{\sum_{k \in r} r_k\, w_k(r)} \]

where the $w_k$ are subject (and sampled risk set) specific weights chosen to be convenient multiples of the $\pi_t(r \mid k)$ probabilities from the intensity.
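The weighted likelihood contribution above can be written out for a single sampled risk set. In this sketch the relative risks and weights are hypothetical numbers chosen for illustration. It shows that when every member of the sampled set has the same sampling probability the weights cancel and the usual unweighted contribution is recovered, whereas unequal weights change the contribution.

```python
def weighted_contribution(case, risks, weights):
    """r_case * w_case / sum_k (r_k * w_k), summed over the sampled risk set."""
    denom = sum(risks[k] * weights[k] for k in risks)
    return risks[case] * weights[case] / denom

risks = {"case": 2.0, "ctl1": 1.0, "ctl2": 1.5}    # hypothetical relative risks r_k
equal = {k: 0.25 for k in risks}                   # same sampling probability for all members
unequal = {"case": 1.0, "ctl1": 2.0, "ctl2": 2.0}  # hypothetical non-representative design

unweighted = risks["case"] / sum(risks.values())
print(round(weighted_contribution("case", risks, equal), 4), round(unweighted, 4))
print(round(weighted_contribution("case", risks, unequal), 4))
```

In standard conditional logistic regression software, weights of this kind are usually supplied on the log scale as an offset term; that detail is a common convention rather than something stated in the text.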
The partial likelihood obtained by taking products of the above factors over all failure times has the usual basic properties of a likelihood. Therefore standard matched case-control analysis (conditional logistic regression) can be used for analysis of risk set sampled data, treating the w as risk weights or "offsets" in the model.

Swedish Study

The example that originally motivated the investigation of this risk set sampling method by Langholz and Goldstein took place at the National Institute of Occupational Health in Sweden. In order to investigate the possibility of an association between low-frequency electromagnetic fields (EMF) in the workplace and development of leukemia and brain tumors, researchers undertook a population-based case-control study (Floderus et al., 1993). The underlying cohort is the male population of Sweden between 1983 and 1987. Incident cases of leukemia and brain tumors were identified through the Swedish Cancer Registry. Controls were identified using the Swedish Census of 1980. For each case, two controls were randomly sampled from the risk set formed by those who were born in the year of birth of the case and were alive at the time of the study. A questionnaire was administered to each subject, asking about factors that researchers hypothesized were related to these cancers. These were treated as confounders in the analysis. Occupational EMF exposure was estimated for the job performed for the longest period of time during the 10 years immediately prior to diagnosis of the case. This involved taking Gaussmeter measurements at over 1000 workplaces where subjects in this study group were employed. Using standard case-control study analysis methods, investigators found that there was an increasing trend in risk of leukemia with increasing EMF exposure.

Investigators at the National Institute for Occupational Safety and Health (NIOSH) in the U.S. later proposed a biological mechanism of carcinogenesis for EMF radiation and were interested in testing this hypothesis using the Swedish case-control study group. However, a different EMF measurement, correlated with that used in the original study, was required. It would be prohibitively expensive to obtain this new measurement for all of the subjects in the original study. These researchers wanted to collect additional information on the minimum number of subjects needed to determine, with some certainty, whether the new measure was a good predictor of leukemia risk.

Top-Discordancy Design and Analysis

To apply the two-stage sampling design, we would consider the initial case-control study to be the first-stage sample (theoretically from an overall, incompletely described cohort). Further, the first-stage sample measurement Z is considered the surrogate measure for the second-stage true measurement X. There is no additional information in Z over that in X, and the underlying model is

\[ \lambda(t, x) = \lambda_0(t)\, e^{\beta x} \]

One obvious second-stage sampling design would be to randomly sample matched sets from the original case-control study (Design I). This method, however, makes no use of the previously gained knowledge regarding the Z measurements. Intuitively, the original Z measurements could be used to advantage in the second-stage sampling. Matched sets that have large variation in Z will tend to have large variation in X. These sets will therefore be more informative for assessing the effect of X. In this example the information (negative second derivative of the log partial likelihood) contribution from each matched set, based on an assumed hazard model in Z, is used as a measure of that set's variability in Z.
This should result in maximizing the inverse of the variance of the estimated parameters. In 1:1 matched sets the absolute difference in Z between the members of the pair could simply be used (Design II). Nothing in the design specifies that, for the restricted sample, the case value of Z must be greater than the corresponding control's value of Z or vice versa, so no bias is introduced.

Goldstein and Langholz (1996) applied the above theory to the Swedish study data. The subjects chosen for the NIOSH occupational sub-study are referred to as the second-stage sample. The first-stage Swedish sample EMF measurement Z was the surrogate measure for the second-stage true EMF measurement X.

Langholz and Goldstein have shown that unweighted partial likelihood methods can be used to analyze the results of this two-stage sampling design. The way to apply this analysis is not immediately obvious. In particular, how do we account for the sets that are not sampled? It turns out that the partial likelihood contribution from single-subject sampled risk sets (i.e. those that consist of only the case) is identically one. Since they contribute nothing to estimation of $\beta$, they may be dropped from the sample altogether. Thus, for the purpose of developing an estimation method, we may characterize these two-stage procedures by saying that the sample consists of the included sets plus sets that consist of only the case (the rejected sets). For the two designs we are considering here, $\pi_t(r \mid i)$ is then a distribution over sets r of size |r| = m and the singleton set {i}.

To show that the partial likelihoods for Design II are based on the usual unweighted partial likelihood (i.e. $w_j = 1$), define $V_r(\hat{\beta}_Z)$ to be the information contribution from the set r, computed using a model for the surrogate measure Z, and let $C_i(\kappa)$ be the number of sets of size m in the full risk set R containing i such that $V_r(\hat{\beta}_Z) > \kappa$. Then the distribution over sets in the risk set, for the design where 1) m-1 controls are randomly sampled and 2) the set is included in the final sample if the above constraint holds, is given by

\[ \pi(r \mid i) = \binom{n-1}{m-1}^{-1} I\left( V_r(\hat{\beta}_Z) > \kappa,\ i \in r,\ |r| = m \right) + \left[ 1 - C_i(\kappa) \binom{n-1}{m-1}^{-1} \right] I\left( r = \{i\} \right) \]

Thus, for included sets, each member of the sampled set has the same value of $\pi$, yielding the unweighted partial likelihood contribution. The parameter $\kappa$ would be chosen based on the power or cost requirements of the study.

Goldstein and Langholz (1996) also described how to use asymptotic theory to compute second-stage power and sample sizes. For the EMF measures in the Swedish second-stage study, they assumed that Z and X are mean and standard deviation normalized and have a joint bivariate normal distribution. With $\beta_Z$ the limiting value of $\hat{\beta}_Z$ under the model, one can show that the correlation between Z and X is $\beta_Z / \beta_X$ (Xiang and Langholz, 1995). Hence power and sample size can be parameterized in terms of surrogate and true relative risks. Table 7 gives the power by sample size for the new design (Design II) for a few $e^{\beta_Z}$ values near the relative risk observed in the original study. For comparison, the power when randomly sampling sets (Design I) is also given. To illustrate the difference in power, notice (from Table 7) that to achieve approximately 80% power with the randomly selected sets, when $e^{\beta_Z} = 1.6$ and $e^{\beta_X} = 2$, approximately 50 matched sets are required. Correspondingly, when the new sampling plan is used only 25 matched sets are required. We can conclude from the power calculations that it is worth doing the "Top Z Discordancy" two-stage sampling.
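For 1:1 matched sets, the Design II selection rule described above reduces to ranking pairs by the absolute difference in the surrogate Z. A minimal sketch follows; the pair identifiers and Z values are hypothetical, and the function name is illustrative.

```python
def top_discordant_pairs(pairs, n_select):
    """Return the n_select pairs with the largest |Z_case - Z_control|.

    pairs: list of (pair_id, z_case, z_control). Only the absolute difference is
    used, so the rule does not favor pairs in which the case has the larger Z.
    """
    ranked = sorted(pairs, key=lambda p: abs(p[1] - p[2]), reverse=True)
    return ranked[:n_select]

# Hypothetical first-stage pairs: (pair id, surrogate for case, surrogate for control).
pairs = [("p1", 3, 2), ("p2", 0, 12), ("p3", 9, 1), ("p4", 4, 4), ("p5", 16, 5)]
selected = [pid for pid, _, _ in top_discordant_pairs(pairs, 3)]
print(selected)
```

Both directions of discordance are eligible: here one selected pair has the control with the larger Z and two have the case with the larger Z. The true measure X would then be collected only for the selected pairs.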
Table 7: Power of Design II (picking case-control sets based on Z variability) versus Design I (random sampling of case-control sets) for rejecting the null hypothesis H0: βX = 0, two-sided α = 0.05, by the number of matched sets and the relative risks per standard deviation in Z and X.

                        Number of matched sets*
exp(βZ)   exp(βX)   20 (40)   25 (50)   30 (60)   35 (70)   40 (80)   45 (90)   50 (100)
Design II
1.5       1.8       .67       .75       .80       .85       .88       .90       .92
          2.0       .68       .76       .82       .86       .89       .92       .93
          2.5       .68       .77       .83       .88       .91       .94       .96
1.6       1.8       .72       .80       .85       .89       .92       .94       .96
          2.0       .72       .80       .85       .89       .92       .94       .96
          2.5       .72       .80       .85       .89       .93       .95       .97
1.7       1.8       .77       .85       .89       .93       .94       .95       .98
          2.0       .77       .85       .89       .93       .94       .95       .98
          2.5
Design I
          1.8       .36       .44       .51       .57       .62       .68       .72
          2.0       .44       .52       .60       .66       .72       .77       .81
          2.5       .55       .65       .73       .79       .84       .88       .91

* Numbers in parentheses are the total number of subjects in the second-stage sample.

References

Borgan O., Goldstein L., and Langholz B. (1995). Methods for the Analysis of Sampled Cohort Data in the Cox Proportional Hazards Model. The Annals of Statistics, 23, 1749-1778.

Breslow N. and Patton J. (1979). Case-control analysis of cohort studies. In Energy and Health (ed. N. Breslow and A. Whittemore), pp. 226-42. Philadelphia: SIAM Institute for Mathematics and Society, SIAM.

Cox D.R. (1975). Partial Likelihood. Biometrika, 62, 269-76.

Ernster V. (1994). Nested Case-Control Studies. Preventive Medicine, 23, 587-590.

Langholz B. and Goldstein L. (1996). Risk Set Sampling in Epidemiologic Cohort Studies. Statistical Science, 11, 35-53.

Liddell F., McDonald J., and Thomas D. (1977). Methods of cohort analysis: Appraisal by application to asbestos miners. Journal of the Royal Statistical Society A, 140, 469-91.

Shleien B., Tucker T.T., and Johnson D.W. (1978). The mean active bone marrow dose to the adult population of the United States from diagnostic radiology. Health Physics, 34, 587.

Xiang A. and Langholz B. (1995). Comparison of case-control to full cohort analyses when covariates are omitted from the model. Technical Report 108, USC Department of Preventive Medicine, Biostatistics Division, Los Angeles.

Figure 1: Graphical Representation of Risk Sets. Hypothetical cohort. Each line represents a subject's time on study. Legend: • Failure; | At Risk. Horizontal axis: Time.
Asset Metadata
Creator: Thurston, Lisa Naomi (author)
Core Title: Application of a two-stage case-control sampling design based on a surrogate measure of exposure
School: Graduate School
Degree: Master of Science
Degree Program: Biometry
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: biology, biostatistics, health sciences, oncology, OAI-PMH Harvest, statistics
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: [illegible] (committee chair), [illegible] (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-15192
Unique identifier: UC11337837
Identifier: 1384925.pdf (filename), usctheses-c16-15192 (legacy record id)
Legacy Identifier: 1384925.pdf
Dmrecord: 15192
Document Type: Thesis
Rights: Thurston, Lisa Naomi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA
Tags: biology, biostatistics, health sciences, oncology