EVALUATION OF THE ACCURACY AND RELIABILITY OF
SELF-REPORTED BREAST, CERVICAL AND OVARIAN CANCER
INCIDENCE IN A LARGE POPULATION-BASED COHORT
OF NATIVE CALIFORNIA TWINS
by
Xinyun Lian
------------------------------------------------------------
A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(APPLIED BIOSTATISTICS AND EPIDEMIOLOGY)
May 2006
Copyright 2006 Xinyun Lian
UMI Number: 1437575
UMI Microform 1437575
Copyright 2006 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
ACKNOWLEDGEMENTS
I would like to thank Dr. Myles Gordon Cockburn of the USC Department of
Preventive Medicine for providing the CSP and CCR records and the self-reported
questionnaire data. I would like to express my gratitude to Dr. Ann Hamilton and Dr.
Wendy Mack for their helpful comments, suggestions and editing throughout the
course of my research and in the preparation of this manuscript. My special thanks
go to Dr. Cockburn for his inspiration and guidance throughout my thesis experience
and his assistance with the preparation of this manuscript. Special thanks also go to Dr.
Carolyn M. Ervin for the gift of friendship and encouragement and her professional
support. I am very grateful to my husband and my family for their persistent and
generous support.
TABLE OF CONTENTS
Acknowledgements
List of Tables
Abstract
I. Introduction
II. Background
A. Twins and Cancer Studies
B. Methods of Ascertainment of Twin Subjects
C. Methods of Cancer Ascertainment
III. Methods and Materials
A. Data Sources
B. Study Sample
C. Generating Working Datasets
D. Creation of New Variables
E. Statistical Analysis
IV. Results
A. The Distribution of Cancer Cases in Cancer Registry
B. The Distribution of Cancer Cases in Self-reported Questionnaire
C. The Distribution of Cancer Cases in the Working Dataset
D. Stratify the Working Dataset
E. Accuracy Measurement
V. Discussion
VI. Conclusion
Bibliography
LIST OF TABLES
Table 1: Female Cancer Distribution in CSP Dataset
Table 2: Female Cancer Distribution in CCR Dataset
Table 2a: Multiple Occurrences of Female Cancer in the CCR
Table 3: Female Cancer Distribution in the Unified Cancer Registry
Table 3a: The Distribution of Agreement for Female Cancer between the CSP and CCR
Table 4: Female Cancer Distribution in the Self-reported Questionnaire and Proxy Dataset
Table 4a: Multiple Cancer Distribution in Female Self-reported Questionnaire
Table 5: Multiple Cancer Distribution in Female Proxy Dataset
Table 6: The Distribution of Female Cancers in the Self-reported Questionnaire and Proxy Dataset
Table 7: The Frequency of Three Cancers in the Unified Cancer Registry and Self-reported Questionnaire
Table 8: The Frequency of Three Cancers in the Unified Cancer Registry and Self-Respondents with Proxy Dataset
Table 9: Distribution of Breast Cancer in UCR-SQ Dataset by Year of Diagnosis When Known
Table 9a: Distribution of Cervical Cancer in UCR-SQ Dataset by Year of Diagnosis When Known
Table 9b: Distribution of Ovarian Cancer in UCR-SQ Dataset by Year of Diagnosis When Known
Table 10: Distribution of Breast Cancer in Self-reported Questionnaire (Proxy Responses only)
Table 10a: Distribution of Cervical Cancer in Self-reported Questionnaire (Proxy Responses only)
Table 10b: Distribution of Ovarian Cancer in Self-reported Questionnaire (Proxy Responses only)
Table 11: Distribution of Breast Cancer in Combined Self-Proxy Dataset
Table 11a: Distribution of Cervical Cancer in Combined Self-Proxy Dataset
Table 11b: Distribution of Ovarian Cancer in Combined Self-Proxy Dataset
Table 12a: Sensitivity, Specificity, PPV and Agreement for Self-reported Breast Cancer with UCR as the Gold Standard
Table 12b: Sensitivity, Specificity, and PPV for Self and Proxy-reported Breast Cancer with UCR as Gold Standard
Table 13a: Sensitivity, Specificity, PPV and Agreement for Self-reported Cervical Cancer with UCR as the Gold Standard
Table 13b: Sensitivity, Specificity, and PPV for Self and Proxy-reported Cervical Cancer with UCR as Gold Standard
Table 14a: Sensitivity, Specificity, PPV and Agreement for Self-reported Ovarian Cancer with UCR as the Gold Standard
Table 14b: Sensitivity, Specificity, and PPV for Self and Proxy-reported Ovarian Cancer with UCR as Gold Standard
Table 15: Summary of Sensitivity of Self-reported Three Female Cancers in the UCR-SQ Dataset, by Site
Table 16: Summary of Sensitivity of Self-reported Three Female Cancers in the California Twin Study Including Self and Proxy Reports, by Site
Table 17: Summary of Specificity of Self-reported Three Female Cancers in the UCR-SQ Dataset, by Site
Table 18: Summary of Specificity of Self-reported Three Female Cancers in the California Twin Study Including Self and Proxy Reports, by Site
Table 19: Agreements of Self-respondent Female Cancers Excluding Proxy Respondent in the California Twin Study, by Site
Table 20: Agreements of Self-respondent Female Cancers Including Proxy Respondent in the California Twin Study, by Site
ABSTRACT
In this validation study, we used the unified cancer registry as the gold standard to verify
self-reported female breast, cervical and ovarian cancer in the California Twin
Program. We found that the sensitivity for breast cancer was 58.08%, for cervical cancer
40.88%, and for ovarian cancer 39.39%. The specificity for breast cancer was
99.55%, for cervical cancer 98.37%, and for ovarian cancer 99.53%. Between the
self-reported questionnaire and the unified cancer registry, the agreement corrected
for chance was 0.5379 for breast cancer, 0.1621 for cervical cancer and 0.1397 for
ovarian cancer; and the observed agreement was 0.9922 for breast cancer, 0.9810 for
cervical cancer, and 0.9946 for ovarian cancer. Based on the confidence intervals,
none of these results were significant. We also incorporated the proxy data in this
validation study; our results showed that the proxy data did not improve the accuracy of
self-reported female cancer.
I. Introduction
In epidemiologic studies, history of cancer and other chronic disease is routinely
ascertained through a self-reported checklist of conditions (Desai, Bruce et al. 2001).
For a variety of reasons, self-reported disease outcomes are frequently used without
verification in epidemiologic research. One such reason is the difficulty of verifying
responses in studies with large samples and limited funds (Parikh-Patel, Allen et al.
2003).
Although a number of studies have examined agreement between self-reported
outcomes and medical records (Paganini-Hill and Ross 1982; Tretli, Lund-Larsen et
al. 1982; Colditz, Martin et al. 1986; Harlow and Linet 1989; Linet, Harlow et al.
1989; Nevitt, Cummings et al. 1992; Paganini-Hill and Chao 1993), few have
verified self-reported cancers with cancer registry data (Schrijvers, Stronks et al.
1994; Berthier, Grosclaude et al. 1997; Kerber and Slattery 1997; Bergmann, Calle et
al. 1998; Desai, Bruce et al. 2001). To date, we have not seen the validation of self-
reported cancer incidence with a cancer registry among twin populations.
Since twins are fully or partially matched on genetic determinants, share a common
childhood environment, and can often describe the relative (twin vs. co-twin)
differences in their past experience (Hamilton and Mack 2000), twins offer great
advantages as subjects for the studies of the role of genetics and environment in the
development of disease (Martin, Boomsma et al. 1997).
Our large population-based California twin cohort study provides data which can be
used to conduct a variety of research, such as cross-sectional analyses of multiple
health-related conditions and exposures, traditional genetic studies using twin pairs
and nested exposure-control or case-control studies (Cockburn, Collett et al. 2001).
Given the widespread use of self-reports in epidemiologic analysis and prevalence
estimation (Desai, Bruce et al. 2001), evaluation of the validity of self-reported
cancer incidence in our twin cohort data with cancer registry data is very important.
It will help us to better understand the patterns of inaccurate reporting regarding a
subject’s cancer history. For example, in a case-control study, misclassification of
disease status based on self-report is possible. Some subjects may report a breast
mass as breast cancer, or an abnormal pap smear as cervical cancer. Such
misclassification of disease status would impact nested case-control studies using the
California twin cohort. It could cause over- or underestimation of the number of cancer
cases, which would bias estimates of relative risk. To the extent that misclassification of
disease status is related to exposure, estimates of relative risk will be biased towards
or away from the null value of 1.
The objective of this study is to validate the self-reported female breast cancers,
cervical cancers and ovarian cancers within the California twin cohort by using the
unified cancer registry data as the “gold standard”. The unified cancer registry is a
dataset that has been created from the combination of two cancer registry databases:
the California Cancer Registry (CCR) and the Los Angeles Cancer Surveillance
Program (CSP). The sensitivity, specificity, and agreement will be the indicators of
accuracy in our twin cohort validation study.
II. Background
A. Twins and Cancer Studies
Twins share a common family background and many childhood environmental
exposures and experiences (Hrubec and Robinette 1984; Martin, Boomsma et al.
1997). Identical, or monozygotic (MZ), twins originate from one fertilized egg and
thus have the same genotype, sharing 100% of their genes (Harris 1997). In
comparison, dizygotic (DZ) twins do not originate from a single fertilized egg and,
like ordinary siblings, share on average 50% of their genes. Twins provide
opportunities for evaluating genetic and environmental factors in pairs of subjects of
the same age who are either genetically identical or closely related. When analyzing
groups of twins, higher disease concordance for monozygotic twins compared to
dizygotic twins indicates genetic effects on disease. However, if the concordance is
comparable for monozygotic twins and dizygotic twins, environmental effects may
be indicated.
With regard to cancer research, much attention has been focused on individuals in
the general population. However, because twins are either genetically identical or
more closely related than unrelated members of the general population, they are
closely matched in terms of genetic determinants and childhood exposures. Each twin can also provide
information about himself or herself as well as their co-twin (Hamilton and Mack
2000) regarding their previous disease and exposure experience. In addition,
including the co-twin data as proxy data can increase the sample size, which is
especially helpful when subgroup analyses are required (Nelson, Longstreth et al.
1990).
Cancer can be the result of a series of genetic changes (Harris 1997). Some of these
are inherited from a parent, and others are induced by environmental exposures. To
accurately explore the cause of cancer, it is crucial to accurately report cancer status.
Validating the accuracy of cancer in self-reported twin questionnaires is critical in
properly evaluating the causes of cancer in twin studies.
B. Methods of ascertainment of twin subjects
There are several ways to recruit twin subjects for research purposes such as:
population-based samples, volunteer-based samples and individual case studies
(Hawkes 1997). Each of these methods of twin ascertainment has its own inherent
advantages and disadvantages, which are described below.
B.1 Population-based ascertainment
Population-based ascertainment is a systematic method of recruiting twin subjects
from the whole twin population of one or more specific areas during a defined period.
Population-based ascertainment of subjects has minimal bias, and the biases
associated with sampling can be assessed (Hawkes 1997). However, collecting
population-based study subjects is time-consuming and very expensive. In this
study, we have the opportunity to assess a population-based twin cohort.
B.2 Volunteer-based ascertainment
Another popular but less systematic method of subject ascertainment is volunteer-
based ascertainment. Researchers may use hospital records, examine disease indices,
or recruit subjects through community resources such as societies or twin clubs. This
method is easier, quicker and less costly compared to the population-based method.
On the other hand, this method may induce sampling bias by recruiting more
interested or more severely diseased twins and cause diseased cases to be
overrepresented or overestimated (Hawkes 1997).
B.3 Individual case studies
Individual case studies of discordant or concordant twin pairs may also be valuable
in an etiologic study. For instance, if a series of MZ twins reared apart
(MZA) were found to be concordant for a disease more frequently than chance alone
would predict, this could be interpreted as indicating genetic effects in the
etiology of that disease.
C. Methods of cancer ascertainment
There are several methods by which cancer occurrence can be ascertained in a twin
study. These include: medical records, a cancer registry, self report, and proxy
report.
C.1 Medical records
Cancer patients can be identified through review of the medical records of study
subjects. Because patient information is confidential, special permission
and procedures are needed to access all of a patient’s medical records from multiple
sources, which is difficult to do in practice.
C.2 Cancer registry
A cancer registry is another useful source of cancer data. A registry is usually
mandated by legislation and is considered complete and accurate. By linking the cancer registry
database to a self-reported questionnaire using patient ID, name, SSN, etc., the
cancer cases in the self-reported questionnaire can be verified. This method can also
identify cases in the registry that were not self-reported in the questionnaire and
self-reports of cancers that were not identified in the cancer registry.
C.3 Self-reported cancers
Subjects can provide information about their own cancer status as well as their co-
twin’s through questionnaire or survey. Participation of both twins could increase
both disease and exposure data reliability, as one twin’s response can be confirmed
by their co-twin’s response (Hamilton and Mack 2000).
C.4 Proxy report of cancers
Because not every twin completed the questionnaire, two types of
respondents exist in this study database. Double respondents were pairs
in which both twins returned their questionnaires; single
respondents were pairs in which only one twin returned the questionnaire. A
proxy response was built from the single respondent’s report about their
non-responding co-twin. Unlike self-reports, proxy reports did not include
cancer diagnosis dates or residential information. However, proxy reports did include
information about cancer status and gender for non-respondents. Including proxy
respondents would therefore increase the available sample size. This is useful, in
particular, for subgroup analyses and for analyses of rare diseases such as cancer
(Hamilton and Mack 2000).
III. Methods and Materials
A. Data Sources
Three population-based data resources were used in this study: the California Twin
Cohort (for self-reported disease and exposure information) and two cancer
registries: the Los Angeles Cancer Surveillance Program (CSP) and the California
Cancer Registry (CCR). For this study the information available in the two cancer
registries has been combined into a dataset called the ‘unified cancer’ database. Each
dataset involved in the present study is described below.
A.1 Self-Reported cancer from the California Twin Questionnaire (SQ)
The large population-based cohort of native Californian twins was established
in 1990. To facilitate studies of the role of genetics and environment in the
development of disease, a 16-page mailed risk factor questionnaire
asked about basic demographic characteristics (age, gender, educational level,
occupational background, marital status), perceived zygosity (Kasriel and Eaves
1976), growth and development, reproductive history, use of medical services,
dietary preference, disease experience (including cancer occurrence), and
lifestyle choices such as smoking, alcohol consumption, exercise, and sun exposure. To
date, over 52,283 twins, who were recruited similarly in 1991 and 2000, have
completed and returned this questionnaire and are assessed in the present study.
More detailed information regarding this data resource can be found elsewhere
(Cockburn, Hamilton et al. 2001). The 16 page questionnaire is also available at
http://twins.usc.edu/questionnaire.
A.2 The Los Angeles Cancer Registry Data (CSP)
The Los Angeles Cancer Surveillance Program (CSP) is the population-based cancer
registry for Los Angeles County and has been in existence since 1972. The CSP was
initially a portion of the National Cancer Viral Program. Through the voluntary
contributions of institutions, hospitals, medical laboratories, and clinics, the CSP was
able to compile data on newly diagnosed cancer cases within Los Angeles County.
This system allows for complete treatment and demographic data to be recorded on
incident cancer cases occurring in Los Angeles County. The CSP is one of the most
notable cancer registries in the world, as its database allows scientific examination of
the etiology of cancer, as well as its demographic patterns. By the year 2002, the
CSP had grown to a size of 1.2 million cancer records, with roughly 35,000 incident
cancers accrued annually (Deapen & Cockburn 2003).
A.3 The California Cancer Registry Data (CCR)
The CCR is a population-based cancer registry for California. Since 1948, beginning
with a small number of hospitals, California has had consistent collection of cancer
data. This practice continued until all counties in California were included, which
occurred in 1988. The CCR provides a valuable cancer resource, as it helps to
identify areas where early detection, patient education, and various related cancer
information can be obtained. The CCR provides detailed information, including
type of cancer, demographics, treatment, and survival. By the year
2002, the CCR had more than 1.3 million cancer cases. In addition, 121,000 incident
cancer cases are added annually (Yost 2002).
A.4 The Unified Cancer Registry (UCR)
The Los Angeles Cancer Surveillance Program (CSP) has monitored cancer from
1972 to the present (2005), but it covers only one county of California. The California
Cancer Registry (CCR) currently covers all the counties in California but was not
established until 1988. In the present study, we combined the information from these two
datasets, the CCR and the CSP, and called the new database the unified cancer
registry (UCR). The UCR was used as the gold standard in this study to evaluate the
validity of the self-reported cancers from the California twin questionnaire.
B. Study sample
B.1 Cancer Surveillance Program Linkage (CSP dataset)
In this study, subjects were obtained through database linkage between the Los
Angeles County cancer registry (the CSP) and the California Twin Program for the
period January 1972 to December 2000. This dataset is referred to as the “CSP dataset”.
B.2 California Cancer Registry Linkage (CCR dataset)
Subjects were obtained from the CCR by linking the California state cancer
registry to the California Twin Program for the period January 1988 to December 2000 to
identify matched subjects. Because multiple cancers may exist for the same
subject, a subject could have several records, one for each cancer occurrence and
site. To facilitate analysis, a single record for each individual
was created containing all the information from these multiple records,
including multiple cancer occurrences by site and recurrences of the same cancer.
A multiple occurrence was defined as a subject having more than one kind of
cancer. For example, one subject might have breast cancer, cervical cancer and
ovarian cancer. A recurrence was defined as a subject having the
same cancer at different times, for example breast cancer in 1988 and again in
1999. Thus a subject with occurrences of breast, cervical and ovarian
cancer, as well as a recurrence of breast cancer, would have all of this information
in one subject record. Duplicate records were then removed from the dataset. This
is referred to as the “CCR dataset”.
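The record consolidation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the tuple layout (subject ID, site, diagnosis year) and the subject IDs are hypothetical.

```python
from collections import defaultdict


def collapse_subject_records(rows):
    """Collapse one-row-per-cancer registry rows into one record per subject.

    `rows` is an iterable of (subject_id, site, diagnosis_year) tuples.
    This layout is an assumption for illustration, not the registry's format.
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for subject_id, site, year in rows:
        by_subject[subject_id][site].append(year)

    records = {}
    for subject_id, sites in by_subject.items():
        # Sorting a set of years removes exact duplicate rows for the same event.
        site_years = {site: sorted(set(years)) for site, years in sites.items()}
        records[subject_id] = {
            "sites": site_years,
            # Multiple occurrence: more than one kind of cancer.
            "multiple_occurrence": len(site_years) > 1,
            # Recurrence: the same cancer diagnosed at different times.
            "recurrent_sites": sorted(
                site for site, years in site_years.items() if len(years) > 1
            ),
        }
    return records


example_rows = [
    ("T001", "breast", 1988),
    ("T001", "breast", 1999),
    ("T001", "cervical", 1992),
    ("T001", "ovarian", 1995),
    ("T002", "breast", 1990),
    ("T002", "breast", 1990),  # exact duplicate row
]
example_records = collapse_subject_records(example_rows)
```

Here subject T001 ends up with a single record flagged for both multiple occurrence and breast cancer recurrence, while the duplicate row for T002 collapses to one breast cancer entry.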
B.3 Self-reported questionnaire dataset
Twins reported the occurrence of cancers in the self-reported questionnaire database.
However, the date of diagnosis of cancer, which is an important factor in the present
study, was not initially available. For some cases, the age of cancer diagnosis was
available from non-scannable data written in the questionnaire. After identifying
those subjects who had reported having breast cancer, cervical cancer, or ovarian
cancer, a separate case dataset was made with their study numbers. The hard copies
of the original questionnaires were reviewed, and the age of diagnosis was manually
entered into a dataset. If no age at cancer diagnosis was provided or the age was unknown,
then “9999” was entered for age. If the age information was listed as “not sure” or
“erased”, then “99” was entered for age. After the manually entered data were
rechecked, the dataset was merged back into the original self-reported database by
subject ID. Where necessary, the age at cancer diagnosis and the date of birth were used to
calculate the diagnosis date. After removal of duplicate records, each unique subject ID
with multiple cancer occurrences was retained. This is referred to as the “SQ
dataset”. Note that each subject in this dataset might have more than one record,
covering multiple occurrences or recurrences. The method used for the SQ dataset is
similar to that used for the CCR dataset.
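As an illustration of how the missing-age codes interact with the calculation of a diagnosis date, the following sketch derives an approximate diagnosis year from the date of birth and the age at diagnosis. The function name is hypothetical, and adding the age to the birth year is an assumed approximation (the true year may be one later, depending on the birthday); the thesis does not state the exact rule used.

```python
from datetime import date

# Codes entered for unusable ages, as described in the text.
AGE_MISSING = 9999   # no age provided, or age unknown
AGE_UNCERTAIN = 99   # age listed as "not sure" or "erased"


def estimate_diagnosis_year(birth_date, age_at_diagnosis):
    """Return an approximate diagnosis year, or None if the age is a missing code."""
    if age_at_diagnosis in (AGE_MISSING, AGE_UNCERTAIN):
        return None
    # Assumed approximation: birth year plus reported age at diagnosis.
    return birth_date.year + age_at_diagnosis


example_year = estimate_diagnosis_year(date(1940, 6, 15), 48)
```

One consequence of this coding is that a genuine age of 99 cannot be distinguished from the “not sure” code, so such a value is treated as missing in this sketch.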
B.4 Proxy dataset
In order to increase the sample size of self-reported questionnaires and collect more
complete information about the cancer status of non-respondent twins, we generated
‘proxy twin’ data. A proxy response for a non-respondent twin was created
from the information provided by the corresponding single-respondent
twin in the self-reported questionnaire. A new proxy ID and sequence number were
created for the non-respondent based on those of the single respondent. In addition,
the gender and cancer status of the non-respondent twin were available in the single
respondent’s questionnaire and were used to make the proxy record for the
non-respondent twin. However, the single respondent’s questionnaire did not
include the cancer diagnosis date, the age at cancer diagnosis, or the state/city
of residence of the non-respondent twin. This dataset is referred to as the
“PQ dataset”.
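The construction of proxy records might look like the sketch below. All field names, the pair/twin numbering and the proxy ID format are hypothetical; they stand in for whatever identifiers the twin database actually uses.

```python
def build_proxy_records(responses):
    """Create proxy records for non-responding co-twins of single respondents.

    `responses` maps (pair_id, twin_no) to a questionnaire dict with the
    hypothetical keys "cotwin_sex" and "cotwin_cancers".
    """
    proxies = []
    for (pair_id, twin_no), answers in responses.items():
        cotwin_no = 2 if twin_no == 1 else 1
        if (pair_id, cotwin_no) in responses:
            # Double respondents: the co-twin answered directly, no proxy needed.
            continue
        proxies.append({
            "proxy_id": f"{pair_id}-{cotwin_no}P",  # assumed ID scheme
            "sex": answers["cotwin_sex"],
            "cancers": set(answers["cotwin_cancers"]),
            "diagnosis_year": None,  # not asked about the co-twin
            "residence": None,       # not asked about the co-twin
            "source": "proxy",
        })
    return proxies


example_responses = {
    ("P01", 1): {"cotwin_sex": "F", "cotwin_cancers": ["breast"]},
    ("P02", 1): {"cotwin_sex": "F", "cotwin_cancers": []},
    ("P02", 2): {"cotwin_sex": "F", "cotwin_cancers": []},
}
example_proxies = build_proxy_records(example_responses)
```

Only pair P01 yields a proxy record here, because both twins of pair P02 returned a questionnaire.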
C. Generating working datasets
C.1 The Unified Cancer Registry (UCR) Dataset
The unified cancer registry (UCR) dataset was created by merging the
CSP and CCR datasets by subject ID. Datasets were merged as follows: unique
CCR subject records were retained; cancers recorded after 1988 with records in both
the CSP and CCR had CSP information appended to their unique record; and subject
records in the CSP for cancers that occurred before 1988 were retained as unique records.
Because the CSP covers only cancer cases within Los Angeles County but has records
from before 1988, whereas the CCR covers all of California from 1988 onward, this
match-merge was done to increase the number of cases in the unified cancer registry. The
final dataset is referred to as the unified cancer registry (UCR) dataset.
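A simplified sketch of such a match-merge is given below. It takes the union of the two registries by subject ID and does not reproduce the year-based rules described above; the dictionary layout and subject IDs are assumptions made for illustration.

```python
def build_unified_registry(csp, ccr):
    """Merge two registry extracts by subject ID.

    `csp` and `ccr` map subject_id to a set of (site, diagnosis_year) tuples.
    This layout is hypothetical and the merge is a plain union.
    """
    unified = {}
    for subject_id in set(csp) | set(ccr):
        unified[subject_id] = {
            # A cancer recorded in both registries appears once in the union.
            "cancers": set(ccr.get(subject_id, set())) | set(csp.get(subject_id, set())),
            "in_csp": subject_id in csp,
            "in_ccr": subject_id in ccr,
        }
    return unified


example_csp = {"T1": {("breast", 1980)}, "T2": {("breast", 1992)}}
example_ccr = {"T2": {("breast", 1992), ("ovarian", 1996)}, "T3": {("cervical", 1990)}}
example_ucr = build_unified_registry(example_csp, example_ccr)
```

In this example T1 contributes a pre-1988 Los Angeles case known only to the CSP, T3 a case known only to the CCR, and T2 a case found in both sources, which is why the merged registry holds more subjects than either source alone.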
C.2 Self-Proxy Questionnaire (SPQ)
The self-proxy questionnaire (SPQ) dataset was generated by merging proxy data
(PQ) with original self-reported questionnaire data (SQ) by subject ID. This was
referred to as the “SPQ dataset”.
C.3 Unified Cancer Registry – Self-Reported Questionnaire (UCR-SQ) Dataset
The unified cancer registry-self-reported questionnaire (UCR-SQ) dataset was created by
merging the unified cancer registry data and the self-reported questionnaire data by
subject ID. The merged dataset was the analysis database used for
accuracy testing. We removed subjects who had records only in the unified cancer
registry and did not exist in the self-reported questionnaire dataset, because they
were subjects who did not return the questionnaire (or whose twin did not
return the questionnaire). The remaining subjects were used in the final
analyses of accuracy. This is referred to as the UCR-SQ dataset.
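The merge and the removal of registry-only subjects can be sketched as below. The data layout (subject ID mapped to a set of cancer sites) and the function names are hypothetical; the point is only that the questionnaire defines the analysis population and the registry supplies the reference status.

```python
def link_questionnaire_to_registry(questionnaire, registry, site):
    """Return {subject_id: (self_reported, in_registry)} for one cancer site.

    Only questionnaire subjects are kept, so subjects who appear solely in
    the registry are dropped. Both inputs map subject_id to a set of sites.
    """
    linked = {}
    for subject_id, reported_sites in questionnaire.items():
        registry_sites = registry.get(subject_id, set())
        linked[subject_id] = (site in reported_sites, site in registry_sites)
    return linked


def two_by_two(linked):
    """Tally the linked pairs into TP, FP, FN and TN counts."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for self_reported, in_registry in linked.values():
        if self_reported and in_registry:
            counts["tp"] += 1
        elif self_reported:
            counts["fp"] += 1
        elif in_registry:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


example_questionnaire = {"T1": {"breast"}, "T2": set(), "T4": {"cervical"}}
example_registry = {"T1": {"breast"}, "T2": {"breast"}, "T3": {"breast"}}
example_linked = link_questionnaire_to_registry(
    example_questionnaire, example_registry, "breast"
)
example_counts = two_by_two(example_linked)
```

Subject T3 has a registry record but no questionnaire, so it does not enter the tally.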
C.4 Unified Cancer Registry – Self-Proxy Questionnaire (UCR-SPQ) Dataset
The unified cancer registry-self-proxy questionnaire (UCR-SPQ) dataset was created
by merging the unified cancer registry data and the self-proxy questionnaire data by
subject ID. This merged dataset gave us an
additional database for testing how accuracy changes compared to
the self-report dataset alone. We removed subjects who had records only in the
unified cancer registry and did not exist in the self-proxy questionnaire dataset,
because these were subjects who did not return their questionnaire (or whose
twin did not return it). The remaining subjects were used in the final analyses to
test the accuracy of the self-proxy dataset. This is referred to as the UCR-SPQ dataset.
D. Creation of New Variables
D.1 Diagnostic date
The cancer diagnostic date was derived from the age at cancer diagnosis and the date of
birth in the self-reported questionnaire. Comparing the diagnostic date from the
self-reported questionnaire dataset with the years covered by the unified cancer registry
dataset may explain some of the disagreements between the self-reported questionnaire and
the unified cancer registry. For instance, if a patient had cancer in 1970, they would
not be expected to have a record in the unified cancer registry, since neither the CSP
nor the CCR had been established in that year: the CSP was established in 1972
and the CCR in 1988. Similarly, if a patient had cancer in late 2000, it
might not be in the unified cancer registry either, as the registry might
not yet have received the report.
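The coverage reasoning above can be expressed as a small check. This is a sketch under stated assumptions: the function name is hypothetical, the subject is assumed to have lived in California at diagnosis, and the reporting lag for cases diagnosed in late 2000 is not modeled.

```python
CSP_START = 1972    # Los Angeles County coverage begins
CCR_START = 1988    # statewide coverage begins
LINKAGE_END = 2000  # last year included in the registry linkage


def registry_could_have_recorded(diagnosis_year, lived_in_la_county=False):
    """Return True if a diagnosis year falls within a period covered by the
    CSP or CCR linkage, assuming California residence at diagnosis."""
    if diagnosis_year is None or diagnosis_year > LINKAGE_END:
        return False
    if diagnosis_year >= CCR_START:
        return True
    # Before 1988 only Los Angeles County cases could be captured (by the CSP).
    return lived_in_la_county and diagnosis_year >= CSP_START


example_flags = [
    registry_could_have_recorded(1970, lived_in_la_county=True),
    registry_could_have_recorded(1980, lived_in_la_county=True),
    registry_could_have_recorded(1980, lived_in_la_county=False),
    registry_could_have_recorded(1995),
]
```

A self-reported cancer for which this check is false would be a disagreement that the registry coverage alone can explain.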
D.2 Residence
State name and state code or county name and county code were available in the
self-reported questionnaire. The residence variable was created from these data. If the
state name was “CA” or the state code was “06”, then residence was coded as
Californian. If the county name was “Los Angeles” or the county code was “037”, then
residence was coded as LA. The residence variable might also help to explain some
unmatched cases between the unified cancer registry and the self-reported
questionnaire. For example, if a patient moved out of Los Angeles County or
California before answering the questionnaire, they would not be expected to have
records in the self-reported questionnaire database, but they might have a cancer
recorded by the CSP or CCR. On the other hand, if subjects moved into Los Angeles
County or California during or after this survey, they would be expected to have their
cancer status in the self-reported questionnaire, but they might not have records in
either the CSP or CCR, since their cancer was diagnosed outside of California.
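The residence coding rule can be written compactly as shown below. The function name is hypothetical, and two details are assumptions not stated in the text: the Los Angeles rule is checked first, and subjects matching neither rule are returned as None.

```python
def code_residence(state_name=None, state_code=None,
                   county_name=None, county_code=None):
    """Code residence as "LA", "Californian" or None from questionnaire fields."""
    # Assumed precedence: the more specific Los Angeles County rule is checked first.
    if county_name == "Los Angeles" or county_code == "037":
        return "LA"
    if state_name == "CA" or state_code == "06":
        return "Californian"
    return None  # assumed handling of residence outside California


example_codes = [
    code_residence(state_name="CA", county_name="Los Angeles"),
    code_residence(state_code="06", county_code="073"),
    code_residence(state_name="NV"),
]
```

Codes are compared as strings so that leading zeros, as in “06” and “037”, are preserved.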
E. Statistical analysis
In this analysis, we used sensitivity and specificity to evaluate the accuracy (validity)
and reliability of self-reported cancer incidence among females reporting breast
cancer, cervical cancer or ovarian cancer, using the unified cancer registry as the gold
standard (Rapoport, Teres et al. 1990). We also used the kappa statistic to assess the
agreement between self-reported breast, cervical and ovarian
cancer and the unified cancer registry (Fleiss 1981).
E.1 Sensitivity and Specificity (Validity)
In this study, sensitivity was defined as the percentage of subjects with a study
cancer in the unified cancer registry who also reported that cancer themselves.
Specificity was defined as the percentage of subjects without a study cancer
in the unified cancer registry who also reported no such cancer. The true
positives (TP) were defined as those who had breast cancer, cervical cancer or
ovarian cancer according to the unified cancer registry and self-reported that
cancer. The false positives (FP) were those who did not have breast cancer, cervical
cancer or ovarian cancer according to the unified cancer registry but self-reported
that they had these types of cancer. Subjects who did have breast cancer, cervical cancer
or ovarian cancer based on the unified cancer registry but did not self-report these
cancers were defined as false negatives (FN). Subjects who did not have breast
cancer, cervical cancer or ovarian cancer based on the unified cancer registry and
also were negative for these cancers according to the self-report were defined as true
negatives (TN). Using these definitions and notations, sensitivity and specificity
were calculated with the following formulas:
                         The unified cancer registry (‘Gold Standard’)
Self-reported cancer     Cancers found in registry     Cancers not found in registry
Yes                      True Positive (TP)            False Positive (FP)
No                       False Negative (FN)           True Negative (TN)

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Positive predictive value (PPV) = TP / (TP + FP)
Negative predictive value (NPV) = TN / (TN + FN)
The positive predictive value represents the proportion of self-reported cancer cases
that were verified by the gold standard, which is the
unified cancer registry in this study; the negative predictive value represents the
proportion of self-reported non-cancer cases that were verified as non-cancer cases
by the same gold standard.
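The four measures defined above can be computed directly from the cells of the 2x2 table, as in the sketch below. The function name is hypothetical and the counts in the example are invented for illustration; they are not results from this study.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Compute sensitivity, specificity, PPV and NPV from 2x2 table counts.

    Self-report is the test and the unified cancer registry is the gold
    standard. No guard against empty rows or columns is included.
    """
    return {
        "sensitivity": tp / (tp + fn),  # registry cases that were self-reported
        "specificity": tn / (tn + fp),  # registry non-cases reporting no cancer
        "ppv": tp / (tp + fp),          # self-reported cases found in the registry
        "npv": tn / (tn + fn),          # self-reported non-cases absent from registry
    }


# Illustrative counts only (not study data).
example_measures = accuracy_measures(tp=40, fp=10, fn=20, tn=930)
```

With these illustrative counts the sensitivity is 40/60 and the specificity is 930/940, showing how a rare outcome can combine a modest sensitivity with a very high specificity.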
E.2 Agreement (Reliability)
Sensitivity and specificity are useful when the ‘truth’ is known, and the UCR was
considered to represent the truth for these calculations. However, it is possible that a
self-reported cancer was not reported in the registry if it occurred outside of
California or if the name linkage between the respondent and the registry was not
made correctly. When both sources of information can be considered to be correct, it
is of interest to determine the agreement between them. Reliability (or agreement)
between the self-reported female cancers from the questionnaire and those reported
in the UCR was evaluated by calculating the Kappa statistic as a measure of
agreement beyond chance between these two sources.
Kappa represents the difference between the observed degree of agreement and the
degree of agreement expected to occur by chance, relative to the maximum possible
agreement beyond chance. The expected frequency of agreement between the
self-reported cancer and that indicated in the unified cancer registry is the
expected frequency of both sources indicating that the subject has the cancer, plus the
expected frequency of both sources indicating that the subject does not have the cancer. The
formula is as follows:
k = [(Observed frequency of agreement) - (Expected frequency of agreement)] / [(Total observed) - (Expected frequency of agreement)]
Positive values of kappa suggest the agreement is beyond what would be expected by
chance, while negative values indicate that the agreement is less than what
would occur by chance. If there is only chance agreement between two
classifications, then the value of kappa is zero. If there is perfect agreement, then the
value of kappa is one. To evaluate the strength of agreement measured with the
kappa statistic, we used the classification system suggested by Landis and
Lepkowski (Landis, Lepkowski et al. 1982) (kappa values <0.40 represent poor to
fair agreement, 0.40-0.60 moderate agreement, 0.60-0.80 substantial agreement, and
0.80-1.00 almost perfect agreement).
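The kappa formula and the strength-of-agreement categories above can be sketched as follows. The function names and the example counts are hypothetical. The cited ranges share their end points (0.40, 0.60, 0.80), so assigning a boundary value to the higher category is an assumption of this sketch.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Kappa for a 2x2 table, using frequencies as in the formula above."""
    total = tp + fp + fn + tn
    observed = tp + tn  # observed frequency of agreement
    # Expected frequency of agreement under chance: product of the matching
    # row and column totals, divided by the grand total, for both diagonal cells.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total
    return (observed - expected) / (total - expected)


def strength_of_agreement(kappa):
    """Map kappa to the categories quoted in the text (boundaries assumed)."""
    if kappa < 0.40:
        return "poor to fair"
    if kappa < 0.60:
        return "moderate"
    if kappa < 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical counts, not taken from the study data.
kappa = cohen_kappa(tp=80, fp=20, fn=20, tn=880)
label = strength_of_agreement(kappa)
print(f"kappa = {kappa:.4f} ({label})")
```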
Although the kappa statistic can be used to measure accuracy and indicates the
degree of agreement between two sources of data, good agreement in a kappa
statistic does not necessarily imply high sensitivity and specificity. The kappa
statistic provides little information about where the disagreement lies (Fleiss 1981).
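A small numerical illustration of this point, using hypothetical counts that are not taken from the study: when the condition is rare, kappa can fall in the "substantial" range even though half of the true cases are missed.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Kappa for a 2x2 table computed from cell frequencies."""
    total = tp + fp + fn + tn
    observed = tp + tn
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total
    return (observed - expected) / (total - expected)


# Hypothetical rare-disease table: 100 true cases among 10,000 subjects,
# half of which are self-reported, and no false positive reports.
tp, fp, fn, tn = 50, 0, 50, 9900

sensitivity = tp / (tp + fn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(f"sensitivity = {sensitivity:.2f}, kappa = {kappa:.4f}")
```

Here kappa summarizes both kinds of disagreement in a single number, so it cannot show that all of the disagreement comes from missed cases; the sensitivity and specificity must be inspected for that.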
IV. Results
A. The distribution of cancer cases in cancer registry
A.1 Cancer Surveillance Program (CSP)
A total of 832 subjects in the twin cohort were identified from the CSP dataset.
Of these, 367 (44.11%) were females and 465 (55.89%) were males. Among the 367 female
subjects, there were 112 (30.52%) breast cancer cases, 85 (23.16%) cervical cancer
cases and 14 (3.81%) ovarian cancer cases; the remaining 156 (42.51%) subjects
had other cancers (see Table 1).
Table 1. Female Cancer Distribution in CSP Dataset
Cancer Frequency Percent
Breast 112 30.52
Cervix 85 23.16
Ovary 14 3.81
Others 156 42.51
Total 367 100.00
A.2 California Cancer Registry (CCR)
A total of 5,232 subjects were identified from linking the California State Cancer
Registry to the California Twin Program from January 1988 to December 2000.
Multiple records existed in this dataset, for two reasons. The first was multiple
records of the same cancer in one person due to recurrences of that cancer; in this
case, we retained the record with the earliest diagnostic date in the CCR dataset.
The second was the occurrence of different cancers in one person; in this case, we
kept all of the records for that subject. For instance, recurrent breast cancers
were counted only once in the breast cancer group, while cancers of both the breast
and cervix were counted once in the breast cancer group and once in the cervical
cancer group. Among the 5,232 subjects from the CCR, 2,299 (43.90%) were female and
2,933 (56.10%) were male. Among the 2,299 female subjects, there were 784 (33.25%)
breast cancer cases, 359 (15.22%) cervical cancer cases and 90 (3.82%) cases of
ovarian cancer; the remaining 1,125 (47.71%) cases were of other cancers. Because of
multiple occurrences of different cancers, the total number of female cancer records
in the CCR dataset was 2,358, and the percentages above are based on this total (see
Table 2).
Table 2. Female Cancer Distribution in the CCR Dataset
Cancer Frequency Percent
Breast 784 33.25
Cervix 359 15.22
Ovary 90 3.82
Others 1,125 47.71
Total 2,358 100
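The rule described above for handling multiple CCR records (keep only the earliest record of a recurring cancer, but keep every distinct cancer for a subject) can be sketched as follows. The field names and the example records are hypothetical; the actual layout of the CCR linkage file is not described in this thesis.

```python
from datetime import date


def deduplicate_records(records):
    """Keep the earliest-diagnosed record per (subject, cancer site) pair."""
    earliest = {}
    for rec in records:
        key = (rec["subject_id"], rec["site"])
        current = earliest.get(key)
        if current is None or rec["diagnosis_date"] < current["diagnosis_date"]:
            earliest[key] = rec
    return list(earliest.values())


# Hypothetical records: subject 1 has a recurrent breast cancer and a
# separate cervical cancer; subject 2 has a single ovarian cancer.
records = [
    {"subject_id": 1, "site": "breast", "diagnosis_date": date(1995, 3, 1)},
    {"subject_id": 1, "site": "breast", "diagnosis_date": date(1990, 6, 15)},
    {"subject_id": 1, "site": "cervix", "diagnosis_date": date(1992, 1, 10)},
    {"subject_id": 2, "site": "ovary", "diagnosis_date": date(1999, 9, 30)},
]

kept = deduplicate_records(records)
for rec in kept:
    print(rec["subject_id"], rec["site"], rec["diagnosis_date"])
```

After this step a subject contributes at most one record per cancer site, which is why the number of records (2,358) can exceed the number of subjects (2,299).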
A.2.1 Multiple Occurrences of Different Cancers in CCR
Among the 2,299 individual female subjects, 58 had multiple different cancers. Of
these, 5 subjects had both breast cancer and cervical cancer; 3 had both breast
cancer and ovarian cancer; 30 had both breast cancer and another cancer; 1 had both
cervical cancer and ovarian cancer; 13 had both cervical cancer and another cancer;
5 had both ovarian cancer and another cancer; and 1 had breast cancer, cervical
cancer and another cancer. No subject was registered with breast, cervical and
ovarian cancer together (see Table 2a).
Table 2a. Multiple occurrences of female cancer in the CCR
Type of Cancer  One cancer  Two cancers  Three cancers  Total occurrences in 2,299 subjects (2,358)
Breast cancer  745  0  0  745
Cervical cancer  339  0  0  339
Ovary cancer  81  0  0  81
Other cancer  1,076  0  0  1,076
Breast & Cervix  0  5  0  10
Breast & Ovary  0  3  0  6
Breast & Others  0  30  0  60
Cervix & Ovary  0  1  0  2
Cervix & Other  0  13  0  26
Ovary & Other  0  5  0  10
Breast & Cervix & Others  0  0  1  3
Total subjects (2,299)  2,241  57  1  2,358
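The relationship between the number of subjects and the number of cancer records in Table 2a can be checked with a short calculation: each subject contributes one record per distinct cancer. The sketch below only re-adds the counts shown in the table; the variable names are illustrative.

```python
# (number of subjects, distinct cancers per subject) for each row of Table 2a
table_2a = {
    "Breast": (745, 1),
    "Cervix": (339, 1),
    "Ovary": (81, 1),
    "Other": (1076, 1),
    "Breast & Cervix": (5, 2),
    "Breast & Ovary": (3, 2),
    "Breast & Others": (30, 2),
    "Cervix & Ovary": (1, 2),
    "Cervix & Other": (13, 2),
    "Ovary & Other": (5, 2),
    "Breast & Cervix & Others": (1, 3),
}

total_subjects = sum(n for n, _ in table_2a.values())
total_records = sum(n * k for n, k in table_2a.values())
print(total_subjects, total_records)
```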
A.3 The unified cancer registry dataset (UCR)
There were a total of 5,379 unique subjects in the combined cancer registry dataset
after merging the CSP dataset with the CCR dataset. Of these 5,379 subjects, 2,370
(44.06%) were female and 3,009 (55.94%) were male. Among the 2,370 female
subjects, there were 804 (33.95%) cases of breast cancer; 377 (15.92%) cases of
cervical cancer and 94 (3.97%) cases of ovarian cancer. The remaining 1,158 cancers
(46.71%) consisted of other cancers. There were 2,433 records in total in the unified
cancer registry dataset because of multiple cancers occurring in the same subject (see
Table 3).
Table 3. Female cancer distribution in the unified cancer registry
Source Breast Cervix Ovary Others Total
CSP 112 85 14 156 367
CCR 784 359 90 1,125 2,358
CSP and CCR 92 67 10 123 292
CSP or CCR 804 377 94 1,158 2,433
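The last row of Table 3 follows from the first three by inclusion-exclusion: records found in either registry equal the CSP records plus the CCR records minus those found in both. A minimal check of that arithmetic, using only the counts in the table:

```python
csp = {"Breast": 112, "Cervix": 85, "Ovary": 14, "Others": 156}
ccr = {"Breast": 784, "Cervix": 359, "Ovary": 90, "Others": 1125}
both = {"Breast": 92, "Cervix": 67, "Ovary": 10, "Others": 123}

# Records in CSP or CCR = CSP + CCR - (CSP and CCR)
either = {site: csp[site] + ccr[site] - both[site] for site in csp}
total_either = sum(either.values())
print(either, total_either)
```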
There were 13 disagreements of cancer type between the CSP and CCR.
Disagreements included: 4 patients recorded as breast cancer in the CSP, but
recorded as other cancer in the CCR; 1 patient recorded as breast cancer in the CSP,
but recorded as ovarian cancer in the CCR; 3 patients recorded as cervical cancer in
the CSP, but recorded as breast cancer in the CCR; 2 patients recorded as cervical
cancer in the CSP, but recorded as other cancer in the CCR; 1 patient recorded as
ovarian cancer in the CSP, but recorded as breast cancer in the CCR; 1 patient
recorded as other cancer in the CSP, but recorded as breast cancer in the CCR; and 1
patient recorded as other cancer in the CSP, but recorded as cervical cancer in the
CCR (see Table 3a).
Table 3a The distribution of agreement for female cancer between the CSP and CCR
Data Source Cancer in CCR
Cancer in CSP Breast Cervix Ovary Other Not recorded Total
Breast 92 0 1 4 19 116
Cervix 3 67 0 2 16 88
Ovary 1 0 10 0 3 14
Other 1 1 0 123 33 158
Not recorded 687 291 79 996 0 2,053
Total 784 359 90 1,125 71 2,358
B. The distribution of cancer cases in the self-reported questionnaire
B.1 Self respondents (SQ)
A total of 52,283 subjects completed the self-reported questionnaire in this study.
Among these 52,283 subjects, 29,113(55.68%) were female and 23,170 (44.32%)
were male. There were a total of 22,597 (43.22%) single respondents, whose co-twin
did not complete the questionnaire. There were 29,686 (56.78%) double
respondents, where both twin and co-twin responded. For these 52,283 subjects,
24,226 (46.34%) subjects completed the questionnaire during 1990-1992 and 28,057
(53.66%) subjects completed the questionnaire from January 1998 to December
2000.
Among the 29,113 female respondents, 262 (0.9%) self-reported breast cancer,
528 (1.81%) self-reported cervical cancer and 152 (0.52%) self-reported ovarian
cancer. Because 37 subjects reported more than one of these three cancers, the total
number of female records in the self-reported dataset was 29,151, reflecting
multiple occurrences of different cancers (see Table 4).
Table 4. Female cancer distribution in the self-reported questionnaire and proxy dataset
Data Source  Breast Cancer  Cervical Cancer  Ovary Cancer  No Breast, Cervix or Ovary Cancer  Total Sample
Self-Reported  262 (0.9%)  528 (1.82%)  152 (0.52%)  28,209 (96.89%)  29,151
Proxy data  116 (1.12%)  168 (1.62%)  89 (0.86%)  10,041 (96.71%)  10,414
B.1.2 Multiple occurrences of different cancers in the self respondent dataset
There were 37 subjects who reported multiple cancers. Among them, 11 subjects had
breast cancer and cervical cancer; 5 subjects had breast cancer and ovarian cancer; 20
subjects had cervical cancer and ovarian cancer; and 1 subject had all three cancers
(breast, cervix and ovary). This resulted in 29,151 records in the self-reported dataset
(see Table 4a).
Table 4a. Multiple cancer distribution in female self-reported questionnaire
Type of Multiple Occurrence
Type of Cancer  One time  Two times  Three times  Total (29,151)
Breast cancer  245  0  0  245
Cervical cancer  496  0  0  496
Ovary cancer  126  0  0  126
No breast, cervical or ovarian cancer  28,209  0  0  28,209
Breast & Cervix  0  11  0  22
Breast & Ovary  0  5  0  10
Cervix & Ovary  0  20  0  40
Breast & Cervix & Ovary  0  0  1  3
Total Subjects  29,076  36  1  29,113
B.2 Proxy respondent dataset (PQ)
A proxy respondent dataset of 22,597 subjects was created from the original 22,597
single respondent twins who reported on their co-twins in the self-reported
questionnaire. Of these 22,597 new subjects, 10,383 (45.95%) were female and 12,214
(54.05%) were male. Among the 10,383 female subjects, 116 (1.12%) were reported by
their co-twin to have breast cancer, 168 (1.62%) were reported to have cervical
cancer and 89 (0.86%) were reported to have ovarian cancer. The remaining 10,041
(96.71%) did not have any of these three cancers reported by their co-twin. Because
25 subjects had multiple occurrences of different cancers, the total number of female
records in the proxy dataset was 10,414 (see Table 4).
B.2.1 Multiple occurrences of different cancers in the proxy respondent dataset
Of the 10,383 proxy subjects, 25 had multiple occurrences of the different cancers of
interest. Among these 25 subjects, 3 had breast cancer and cervical cancer; 6 had
breast cancer and ovarian cancer; 10 had cervical cancer and ovarian cancer; and 6
were reported to have all three cancers (breast, cervix and ovary). This resulted in
10,414 records in the proxy dataset (see Table 5).
Table 5. Multiple cancer distribution in female proxy dataset
Type of Multiple Occurrence
Type of Cancer  One time  Two times  Three times  Total Sample (10,414)
Breast cancer  101  0  0  101
Cervical cancer  149  0  0  149
Ovary cancer  67  0  0  67
No breast, cervical or ovarian cancer  10,041  0  0  10,041
Breast & Cervix  0  3  0  6
Breast & Ovary  0  6  0  12
Cervix & Ovary  0  10  0  20
Breast & Cervix & Ovary  0  0  6  18
Total Subjects  10,358  19  6  10,383
B.3 Self respondent and Proxy respondent Dataset (SPQ)
A total of 74,880 subjects were obtained for the self-proxy dataset by merging the
self-reported questionnaire dataset and the proxy dataset and removing 7 duplicate
subjects. Among these 74,880 subjects, 39,493 (52.75%) were females. Among these
females, there were 378 (0.96%) who had breast cancer, 696 (1.76%) who had
cervical cancer, 241 (0.61%) who had ovarian cancer and 38,247 (96.85%) who did
not have any of these three cancers. Because multiple cancers were reported in the
self-reported or proxy data, there were a total of 39,562 records in the self-proxy
dataset (see Table 6).
Table 6. The distribution of female cancers in the self-reported questionnaire and proxy datasets
Data Source  Breast  Cervix  Ovary  Others  Total (39,562)
Self-reported only  262  528  152  28,209  29,151
Proxy only  116  168  89  10,041  10,414
Self-report and Proxy  0  0  0  0  0
Self or proxy  378  696  241  38,247  39,562
C. The distribution of cancer cases in the working dataset
C.1 The unified cancer registry-self respondent without proxy dataset (UCR-SQ)
A total of 55,420 subjects were obtained for the unified cancer registry-self
respondent without proxy dataset by merging the unified cancer registry and the self
respondent without proxy dataset. Of these, 29,113 (52.53%) were female; among them,
133 (0.46%) had breast cancer both in the unified cancer registry and by self-report,
56 (0.19%) had registry and self-reported cervical cancer, and 13 (0.04%) had
registry and self-reported ovarian cancer (see Table 7).
Table 7. The frequency of three cancers in the unified cancer registry and self-reported questionnaire
Source Breast Cervix Ovary
Unified cancer registry 360 149 46
Self-respondent 262 528 152
Unified cancer registry and self-respondent 133 56 13
Unified cancer registry or self-respondent 489 621 185
C.2 The unified cancer registry-self respondent with proxy dataset (UCR-SPQ)
A total of 77,390 subjects were obtained for the unified cancer registry-self reported
questionnaire dataset in this study through merging the unified cancer registry and
the self respondent with proxy respondent dataset. There were 39,493 (51.03%)
female subjects; among these, 153 (0.39%) had breast cancer in the unified cancer
registry and also self-reported breast cancer, 65 (0.16%) had registry and self-
reported cervical cancer, and 15 (0.04%) had registry and self-reported ovarian
cancer (see Table 8).
Table 8. The frequency of three cancers in the unified cancer registry and self-respondents with proxy dataset
Source Breast Cervix Ovary
Unified cancer registry 429 189 55
Self-respondents and proxy respondents 378 696 241
Unified cancer registry and self-respondents with proxy respondents 153 65 15
Unified cancer registry or self-respondents with proxy respondents 654 820 281
D. Stratification of the working dataset
D.1 Stratification by cancer diagnostic date in the unified cancer registry, self-reported diagnostic date and date of questionnaire return
D.1.1 Distribution of three female cancers in the unified cancer registry-self respondent without proxy respondent dataset (UCR-SQ)
After stratification by the cancer diagnostic date, self reported diagnostic date and
when the questionnaire was returned, the tables below show more details about the
distribution of self-reported cancer by occurrence of cancer in this dataset, presented
by year of diagnosis when known (Tables 9, 9a, and 9b).
D.1.2 Distribution of three female cancers in the unified cancer registry-self respondent with proxy respondent dataset (UCR-SPQ)
Details of the three cancers reported by proxy respondents are shown in Tables 10,
10a and 10b, stratified by the date the questionnaire was returned. After
stratification by the cancer diagnostic date, self-reported diagnostic date and the
period in which the questionnaire was returned, Tables 11, 11a and 11b show more
details about the distribution of UCR-SPQ, presented by year of diagnosis when known.
Table 9. Distribution of breast cancer in UCR-SQ dataset by year of diagnosis when known
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  0  0
1972-1987  0  1  0  0  12  13
1988-1993  0  4  5  0  6  15
1994-1997  0  0  8  3  11  22
1998-2000  0  0  0  1  4  5
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  1  1  0  11  13
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  0  0  9  9
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  31  21  25  17,448  17,525
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  8  65  1  1  48  123
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  1  6  4  11  20  42
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  4  4
Cancer reported with unknown date and unknown when returned questionnaire  1  8  2  1  4  16
No report of cancer in survey-returned questionnaire in 1993 or before  0  11  61  64  8,708  8,844
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  1  15  16
No report of cancer in survey-returned questionnaire in 1998-2000  0  2  2  4  1,825  1,833
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  55  55
No report of cancer in survey-unknown when returned questionnaire  0  0  5  0  573  578
Total  10  129  110  111  28,753  29,113
Table 9a. Distribution of cervical cancer in UCR-SQ dataset, by year of diagnosis when known
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  0  0
1972-1987  1  4  0  0  74  79
1988-1993  0  16  4  0  84  104
1994-1997  0  0  6  1  43  50
1998-2000  0  0  0  0  41  41
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  2  1  0  31  34
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  1  1  20  22
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  2  29  18  3  17,220  17,272
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  6  10  0  0  127  143
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  2  1  0  30  33
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  2  2
Cancer reported with unknown date and unknown when returned questionnaire  0  0  0  0  20  20
No report of cancer in survey-returned questionnaire in 1993 or before  1  23  9  2  8,789  8,824
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  16  16
No report of cancer in survey-returned questionnaire in 1998-2000  0  3  0  0  1,839  1,842
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  57  57
No report of cancer in survey-unknown when returned questionnaire  0  2  1  0  571  574
Total  10  91  41  7  28,964  29,113
Table 9b. Distribution of ovary cancer in UCR-SQ dataset, by year of diagnosis when known
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  1  1
1972-1987  0  0  0  0  19  19
1988-1993  0  1  0  0  15  16
1994-1997  0  0  1  0  11  12
1998-2000  0  0  0  0  3  3
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  0  0  0  24  24
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  0  0  16  16
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  5  3  7  17,496  17,511
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  4  2  0  0  30  36
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  1  1  2  15  19
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  1  1
Cancer reported with unknown date and unknown when returned questionnaire  0  1  0  0  4  5
No report of cancer in survey-returned questionnaire in 1993 or before  0  3  10  2  8,916  8,931
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  16  16
No report of cancer in survey-returned questionnaire in 1998-2000  0  0  1  0  1,855  1,856
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  58  58
No report of cancer in survey-unknown when returned questionnaire  0  1  0  1  587  589
Total  4  14  16  12  29,067  29,113
Table 10. Distribution of breast cancer in Self-reported Questionnaire (Proxy Responses)
Data Source  Unified Cancer Registry
Proxy responses  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
Proxy Reports from New Twinset
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  1  2  2  15  20
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  1  0  1  2
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  11  8  6  6,465  6,490
Proxy reports from Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  1  8  0  0  58  67
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  2  1  0  14  17
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  1  4  5
Cancer reported with unknown date and unknown when returned questionnaire  0  1  0  0  4  5
No report of cancer in survey-returned questionnaire in 1993 or before  0  2  10  8  2,751  2,771
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  6  6
No report of cancer in survey-returned questionnaire in 1998-2000  0  0  2  0  711  713
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  12  12
No report of cancer in survey-unknown when returned questionnaire  0  0  1  1  273  275
Total  1  25  25  18  10,314  10,383
Table 10a. Distribution of cervical cancer in Self-reported Questionnaire (Proxy Responses)
Data Source  Unified Cancer Registry
Proxy responses  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
Proxy Reports from New Twinset
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  1  2  0  82  85
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  1  0  0  23  24
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  3  3
No report of cancer in survey  1  10  5  1  6,383  6,400
Proxy reports from Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  0  3  0  0  37  40
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  0  2  0  11  13
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  0  0
Cancer reported with unknown date and unknown when returned questionnaire  0  0  0  0  3  3
No report of cancer in survey-returned questionnaire in 1993 or before  0  4  3  0  2,791  2,798
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  6  6
No report of cancer in survey-returned questionnaire in 1998-2000  0  3  1  0  713  717
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  17  17
No report of cancer in survey-unknown when returned questionnaire  0  3  0  0  274  277
Total  1  25  13  1  10,343  10,383
Table 10b. Distribution of ovarian cancer in Self-reported Questionnaire (Proxy Responses)
Data Source  Unified Cancer Registry
Proxy responses  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
Proxy Reports from New Twinset
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  0  1  0  30  31
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  0  0  14  14
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  2  0  0  6,465  6,467
Proxy reports from Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  0  0  0  0  26  26
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  0  1  0  14  15
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  0  0
Cancer reported with unknown date and unknown when returned questionnaire  0  0  0  0  3  3
No report of cancer in survey-returned questionnaire in 1993 or before  0  4  1  0  2,807  2,812
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  6  6
No report of cancer in survey-returned questionnaire in 1998-2000  0  0  0  0  715  715
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  17  17
No report of cancer in survey-unknown when returned questionnaire  0  0  0  0  277  277
Total  0  6  3  0  10,374  10,383
Table 11. Distribution of breast cancer in combined Self-Proxy dataset
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  0  0
1972-1987  0  1  0  0  12  13
1988-1993  0  4  5  0  6  15
1994-1997  0  0  8  3  11  22
1998-2000  0  0  0  1  4  5
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  2  3  2  26  33
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  1  0  10  11
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  42  29  31  23,910  24,012
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  9  73  1  1  106  190
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  1  8  5  11  34  59
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  1  8  9
Cancer reported with unknown date and unknown when returned questionnaire  1  9  2  1  8  21
No report of cancer in survey-returned questionnaire in 1993 or before  0  13  71  72  11,459  11,615
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  1  21  22
No report of cancer in survey-returned questionnaire in 1998-2000  0  2  4  4  2,536  2,546
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  67  67
No report of cancer in survey-unknown when returned questionnaire  0  0  6  1  846  853
Total  11  154  135  129  39,064  39,493
Table 11a. Distribution of cervical cancer in combined Self-Proxy dataset
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  0  0
1972-1987  1  4  0  0  74  79
1988-1993  0  16  4  0  84  104
1994-1997  0  0  6  1  43  50
1998-2000  0  0  0  0  41  41
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  3  3  0  113  119
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  1  1  1  43  46
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  3  3
No report of cancer in survey  3  39  23  4  23,600  23,669
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  6  13  0  0  164  183
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  2  3  0  41  46
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  2  2
Cancer reported with unknown date and unknown when returned questionnaire  0  0  0  0  23  23
No report of cancer in survey-returned questionnaire in 1993 or before  1  27  12  2  11,580  11,622
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  22  22
No report of cancer in survey-returned questionnaire in 1998-2000  0  6  1  0  2,552  2,559
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  74  74
No report of cancer in survey-unknown when returned questionnaire  0  5  1  0  845  851
Total  11  116  54  8  39,304  39,493
Table 11b. Distribution of ovarian cancer in combined Self-Proxy dataset
Data Source  Unified Cancer Registry
Self-reported Questionnaire  1972-1987  1988-1993  1994-1997  1998-2000  Not in unified cancer registry  Total
New Twinset
Before 1972  0  0  0  0  1  1
1972-1987  0  0  0  0  19  19
1988-1993  0  1  0  0  15  16
1994-1997  0  0  1  0  11  12
1998-2000  0  0  0  0  3  3
Cancer occurred without reporting date but returned questionnaire in 2000 or before  0  0  1  0  54  55
Cancer occurred without reporting date but returned questionnaire in 2001-2002  0  0  0  0  30  30
Cancer occurred without reporting date and unknown when returned questionnaire  0  0  0  0  0  0
No report of cancer in survey  0  7  3  7  23,958  23,975
Old Twinset
Cancer reported with unknown date and returned questionnaire in 1993 or before  4  2  0  0  56  62
Cancer reported with unknown date and returned questionnaire in 1994-1997  0  0  0  0  0  0
Cancer reported with unknown date and returned questionnaire in 1998-2000  0  1  2  2  29  34
Cancer reported with unknown date and returned questionnaire in 2001-2002  0  0  0  0  1  1
Cancer reported with unknown date and unknown when returned questionnaire  0  1  0  0  7  8
No report of cancer in survey-returned questionnaire in 1993 or before  0  7  11  2  11,723  11,743
No report of cancer in survey-returned questionnaire in 1994-1997  0  0  0  0  22  22
No report of cancer in survey-returned questionnaire in 1998-2000  0  0  1  0  2,570  2,571
No report of cancer in survey-returned questionnaire in 2001-2002  0  0  0  0  75  75
No report of cancer in survey-unknown when returned questionnaire  0  1  0  1  864  866
Total  4  20  19  12  39,438  39,493
D.2 Stratification by residence in the unified cancer registry-self respondent
without proxy respondent dataset (UCR-SQ)
A total of 27,566 (94.69%) subjects were coded as California residents and 4,968
(17.06%) were coded as Los Angeles County residents out of 29,113 in the unified
cancer registry-self respondent without proxy respondent dataset (UCR-SQ). Since
there was no resident information available in the proxy respondent dataset, we did
not stratify by residence in the unified cancer registry-self respondent with proxy
respondent dataset (UCR-SPQ).
E. Accuracy measurement
E.1 Sensitivity
Tables 12a, 12b, 13a, 13b, 14a, and 14b show the distribution and measures of
sensitivity, specificity, PPV and kappa of the three self-reported female cancers
according to their status in the unified cancer registry, which was used as the
‘gold standard’. The ‘a’ tables include only the responses based on direct self reports,
whereas the ‘b’ tables include the proxy-reported cases in addition to the
self-reported cases.
When comparing the self-reported data (without the proxy information) to the
unified cancer registry (from Tables 12a, 13a and 14a), the site-specific sensitivities
showed great variation (Table 15). Of the women who truly had breast cancer
(according to the UCR), 58.08% also self-reported it in the questionnaire,
Table 12a. Sensitivity, Specificity, PPV and agreement for self-reported breast cancer with UCR as the gold standard
Reported in UCR
Self-report  Yes  No  Total
Yes  133  129  262
No  96  28,624  28,720
Total  229  28,753  28,982
Sensitivity = 0.5808
Specificity = 0.9955
PPV = 0.5076
Kappa = 0.5379
Observed agreement = 0.9922
Table 12b. Sensitivity, Specificity, and PPV for self and proxy-reported breast cancer with UCR as gold standard
Reported in UCR
Self + proxy report  Yes  No  Total
Yes  153  225  378
No  125  38,839  38,964
Total  278  39,064  39,342
Sensitivity = 0.5504
Specificity = 0.9942
PPV = 0.4048
Kappa = 0.4621
Observed agreement = 0.9911
Table 13a. Sensitivity, Specificity, and PPV for self-reported cervical cancer with UCR as gold standard
Reported in UCR
Self-report  Yes  No  Total
Yes  56  472  528
No  81  28,492  28,573
Total  137  28,964  29,101
Sensitivity = 0.4088
Specificity = 0.9837
PPV = 0.1061
Kappa = 0.1621
Observed agreement = 0.9810
Table 13b. Sensitivity, Specificity, and PPV for self and proxy-reported cervical cancer with UCR as gold standard
Reported in UCR
Self + proxy report  Yes  No  Total
Yes  65  631  696
No  109  38,673  38,782
Total  174  39,304  39,478
Sensitivity = 0.3736
Specificity = 0.9839
PPV = 0.0934
Kappa = 0.1434
Observed agreement = 0.9813
Table 14a. Sensitivity, Specificity, and PPV for self-reported ovarian cancer with UCR as gold standard
Reported in UCR
Self-report  Yes  No  Total
Yes  13  138  151
No  20  28,928  28,948
Total  33  29,066  29,099
Sensitivity = 0.3939
Specificity = 0.9953
PPV = 0.0861
Kappa = 0.1397
Observed agreement = 0.9946
Table 14b. Sensitivity, specificity, and PPV for self and proxy-reported
ovarian cancer with UCR as gold standard
                          Reported in UCR
Self + proxy report     Yes         No      Total
Yes                      15        226        241
No                       26     39,212     39,238
Total                    41     39,438     39,479
Sensitivity = 0.3659
Specificity = 0.9943
PPV = 0.0622
Kappa = 0.1048
Observed agreement = 0.9936
40.88% of the women who truly had cervical cancer and 39.39% of the women who
truly had ovarian cancer self-reported their cancer.
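As a check on the arithmetic, the measures printed under each 2x2 table can be recomputed directly from the four cell counts. The following is a minimal Python sketch, not the software used in this thesis; the function name `diagnostic_measures` and the argument names are illustrative assumptions, and the counts are those of Table 12a.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Validity measures for a 2x2 table, treating the registry (UCR) as the reference.

    tp: self-report yes, UCR yes    fp: self-report yes, UCR no
    fn: self-report no,  UCR yes    tn: self-report no,  UCR no
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),          # reported among registry cases
        "specificity": tn / (tn + fp),          # not reported among registry non-cases
        "ppv": tp / (tp + fp),                  # registry-confirmed among self-reports
        "observed_agreement": (tp + tn) / total,
    }


# Cell counts taken from Table 12a (self-reported breast cancer vs. UCR).
breast_self = diagnostic_measures(tp=133, fp=129, fn=96, tn=28624)
for name, value in breast_self.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these values reproduce the sensitivity, specificity, PPV and observed agreement listed under Table 12a; the same function can be applied to the counts of the other five tables.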
After stratifying by survey version (the original or ‘old’ version of the
questionnaire, mailed before 1993, and a second or ‘new’ version, mailed between
1998 and 2001), the sensitivity for breast cancer was 85.16% in the earlier cohort and
23.76% in the surveys returned after 1998. For cervical cancer, the sensitivity was
39.58% before 1993 and 41.57% after 1998. For ovarian cancer, the sensitivity was
68.75% before 1993 and 11.76% after 1998.
Table 15. Summary of sensitivity of self-reported three female cancers in the UCR-SQ dataset, by site
Type of cancer   Cancers found in registry (No.)   True positive self reports (No.)   Sensitivity (%)   95% CI* (%)
Breast           229                               133                                58.08             51.40-64.55
Cervix           137                               56                                 40.88             32.56-49.60
Ovary            33                                13                                 39.39             22.91-57.86
*CI: Confidence interval
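This section does not state how the 95% confidence intervals were computed. Their asymmetry around the point estimates suggests an exact binomial (Clopper-Pearson) interval, so the sketch below assumes that method; it is an illustration under that assumption, not a record of the original analysis. The function names `binom_cdf` and `exact_ci` are hypothetical, and the bounds are found by simple bisection using only the Python standard library.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability P(X = k), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k, n, p):
    """Cumulative probability P(X <= k)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05, tol=1e-10):
    """Clopper-Pearson interval for x successes out of n trials."""

    def bisect(f, lo, hi):
        # f is increasing on [lo, hi], negative at lo and positive at hi.
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    p_hat = x / n
    # Lower bound: P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, 0.0, p_hat)
    # Upper bound: P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else bisect(
        lambda p: alpha / 2 - binom_cdf(x, n, p), p_hat, 1.0)
    return lower, upper


# Breast cancer row of Table 15: 133 true positives among 229 registry cases.
breast_lower, breast_upper = exact_ci(133, 229)
print(f"{100 * breast_lower:.2f}-{100 * breast_upper:.2f}")
```

If the exact method was indeed the one used, the printed interval should be close to the 51.40-64.55 shown in Table 15; a normal-approximation interval would be symmetric around 58.08 and slightly narrower.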
When proxy information on the non-responding co-twins was included (Tables
12b, 13b and 14b), the corresponding sensitivities decreased for all three
female cancers: 55.04% for breast cancer, 37.36% for cervical cancer, and 36.59%
for ovarian cancer (see Table 16).
Table 16. Summary of sensitivity of self-reported three female cancers in the California
twin study including self and proxy reports, by site
Type of cancer   Cancers found in registry (No.)   True positive reports (No.)   Sensitivity (%)   95% CI* (%)
Breast           278                               153                           55.04             48.98-60.98
Cervix           174                               65                            37.36             30.15-45.00
Ovary            41                                15                            36.59             22.12-53.06
*CI: Confidence interval
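Because the ‘b’ tables add the proxy-reported co-twins to the self-respondents of the ‘a’ tables, subtracting the cells of Table 12a from those of Table 12b should isolate the proxy-only reports for breast cancer. The sketch below illustrates this subtraction; it assumes the two tables are strictly nested, and the resulting proxy-only figures are derived here for illustration and are not reported in the thesis.

```python
# Cell counts for breast cancer: tp/fp/fn/tn relative to the UCR.
self_only = {"tp": 133, "fp": 129, "fn": 96, "tn": 28624}         # Table 12a
self_plus_proxy = {"tp": 153, "fp": 225, "fn": 125, "tn": 38839}  # Table 12b

# Assumption: Table 12b = Table 12a + proxy-reported co-twins, cell by cell.
proxy_only = {cell: self_plus_proxy[cell] - self_only[cell] for cell in self_only}

proxy_sensitivity = proxy_only["tp"] / (proxy_only["tp"] + proxy_only["fn"])
proxy_ppv = proxy_only["tp"] / (proxy_only["tp"] + proxy_only["fp"])

print(proxy_only)
print(f"proxy-only sensitivity: {proxy_sensitivity:.4f}")
print(f"proxy-only PPV: {proxy_ppv:.4f}")
```

Under this assumption, the proxy-only sensitivity and PPV are both lower than the self-report values in Table 12a, which is consistent with the drop seen when proxy reports are pooled with self-reports.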
After stratifying by survey date, the sensitivity for breast cancer was 84.25% before
1993 and decreased to 22.73% after 1998. For cervical cancer, the sensitivity
was 37.50% before 1993 and 37.27% after 1998. For ovarian cancer, the sensitivity
was 57.14% before 1993 and 15.00% after 1998.
E.2 Specificity
The specificities in the dataset linking the unified cancer registry to self-respondents
without proxy respondents (UCR-SQ) are reported in Tables 12a, 13a and 14a. Of the
women who did not have breast cancer (according to the UCR), 99.55% were also negative
according to self-reports; 98.37% of the women who did not have cervical cancer
were also negative according to self-reports, and 99.53% of the women who did not
have ovarian cancer were also negative according to self-reports (see Table 17 for a
summary by cancer site).
Table 17. Summary of specificity of self-reported three female cancers in the UCR-SQ dataset,
by site
Type of cancer   Cancers not found in registry (No.)   True negative reports (No.)   Specificity (%)   95% CI* (%)
Breast           28,753                                28,624                        99.55             99.47-99.63
Cervix           28,964                                28,492                        98.37             98.22-98.51
Ovary            29,066                                28,928                        99.53             99.44-99.60
*CI: Confidence interval
After stratifying by survey date, the specificity for breast cancer was 99.32% before
1993 and 99.70% after 1998. For cervical cancer, the specificity was
98.44% before 1993 and 98.33% after 1998. For ovarian cancer, the
specificity was 99.56% before 1993 and 99.50% after 1998.
When proxy reports were included (Tables 12b, 13b and 14b), the corresponding
specificities were almost the same for each of these three female cancers: 99.42% for
breast cancer, 98.39% for cervical cancer and 99.43% for ovarian cancer (see Table
18).
Table 18. Summary of specificity of self-reported three female cancers in the California twin study
including self and proxy reports, by site
Type of cancer   Cancers not found in registry (No.)   True negative self reports (No.)   Specificity (%)   95% CI* (%)
Breast           39,064                                38,839                             99.42             99.34-99.50
Cervix           39,304                                38,673                             98.39             98.27-98.52
Ovary            39,438                                39,212                             99.43             99.35-99.50
*CI: Confidence interval
After stratifying by survey date, the specificity for breast cancer was 98.97% before
1993 and 99.71% after 1998. For cervical cancer, the specificity was
98.50% before 1993 and 98.33% after 1998. For ovarian cancer, the
specificity was 99.39% before 1993 and 99.45% after 1998.
E.3 Positive predictive value (PPV)
The PPV for self-reports without proxy information was obtained from Tables 12a, 13a
and 14a. Of the breast cancer cases reported in the questionnaire, 50.76% were
confirmed as cases by the UCR; 10.61% of the cervical cancer cases and 8.61% of
the ovarian cancer cases reported in the questionnaire were confirmed as cases by the
UCR. When the proxy information was included (Tables 12b, 13b and
14b), the corresponding PPV decreased to 40.48% for breast cancer, 9.34% for
cervical cancer and 6.22% for ovarian cancer.
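The low PPVs sit alongside specificities above 98% because PPV also depends on how rare the cancer is in the cohort: when registry-confirmed cases are few, even a small false positive rate among the many non-cases produces more false positives than true positives. A minimal Python sketch of this relationship, using Bayes' rule with the Table 14a counts, is given below; the function name `ppv_from_rates` and the 10% prevalence used for contrast are illustrative assumptions.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV from sensitivity, specificity and prevalence (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Table 14a (self-reported ovarian cancer vs. UCR).
tp, fp, fn, tn = 13, 138, 20, 28928
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
prevalence = (tp + fn) / n   # registry-confirmed cases in the cohort

ppv_ovary = ppv_from_rates(sensitivity, specificity, prevalence)

# Hypothetical contrast: same sensitivity and specificity, 10% prevalence.
ppv_if_common = ppv_from_rates(sensitivity, specificity, 0.10)

print(f"prevalence: {prevalence:.5f}")
print(f"PPV at observed prevalence: {ppv_ovary:.4f}")
print(f"PPV at hypothetical 10% prevalence: {ppv_if_common:.4f}")
```

At the observed prevalence the formula returns the same value as the direct ratio 13/151 from Table 14a, while the hypothetical higher prevalence yields a much larger PPV with no change in sensitivity or specificity.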
E.4 Kappa value
E.4.1 The agreement between the self-respondent without
proxy information and the unified cancer registry
For the comparison between the self-respondents without proxy information and the
unified cancer registry (Tables 12a, 13a and 14a), the agreement corrected for chance
(kappa) for breast cancer was 0.5379 (95% CI: 0.4843-0.5914) and the observed
agreement was 0.9922 (95% CI: 0.9912-0.9932). For cervical cancer, the agreement
corrected for chance was 0.1621 (95% CI: 0.1238-0.2005) and the observed agreement
was 0.9810 (95% CI: 0.9793-0.9825). For ovarian cancer, the agreement corrected for
chance was 0.1397 (95% CI: 0.0712-0.2082) and the observed agreement was 0.9946
(95% CI: 0.9937-0.9954) (see Table 19).
Table 19. Agreement of self-respondent female cancers excluding proxy respondent in the California
twin study, by site
Type of cancer   Kappa    95% CI*         Straight agreement   95% CI*
Breast           0.5379   0.4843-0.5914   0.9922               0.9912-0.9932
Cervix           0.1621   0.1238-0.2005   0.9810               0.9793-0.9825
Ovary            0.1397   0.0712-0.2082   0.9946               0.9937-0.9954
*CI: Confidence interval
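The contrast between very high straight agreement and low kappa arises because almost all women are negative in both sources, so most of the observed agreement is expected by chance alone. The sketch below recomputes Cohen's kappa from the 2x2 counts of Tables 12a, 13a and 14a using the standard formula (observed minus expected agreement, divided by one minus expected agreement). The function name `cohen_kappa` is an illustrative assumption, and the confidence intervals are not reproduced here because the variance formula used for them is not stated in this section.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 table of self-report (rows) vs. registry (columns)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement: product of the marginal proportions for "yes" and for "no".
    p_chance_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_chance_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_expected = p_chance_yes + p_chance_no
    return (p_observed - p_expected) / (1 - p_expected)


# Cell counts from Tables 12a, 13a and 14a (self-report without proxy).
tables = {
    "Breast": dict(tp=133, fp=129, fn=96, tn=28624),
    "Cervix": dict(tp=56, fp=472, fn=81, tn=28492),
    "Ovary": dict(tp=13, fp=138, fn=20, tn=28928),
}

kappas = {site: cohen_kappa(**cells) for site, cells in tables.items()}
for site, kappa in kappas.items():
    print(f"{site}: kappa = {kappa:.4f}")
```

The computed values agree with the kappa column of Table 19 to within rounding in the fourth decimal place.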
E.4.2 The agreement between the self-respondent with proxy
information and the unified cancer registry
When the proxy respondent data were included (Tables 12b, 13b and 14b), the
agreement corrected for chance for breast cancer was 0.4621 (95% CI: 0.4146-0.5096)
and the observed agreement was 0.9911 (95% CI: 0.9901-0.9920). For cervical cancer,
the agreement corrected for chance was 0.1434 (95% CI: 0.1113-0.1755) and the
observed agreement was 0.9813 (95% CI: 0.9799-0.9826). For ovarian cancer, the
agreement corrected for chance was 0.1048 (95% CI: 0.0554-0.1542) and the observed
agreement was 0.9936 (95% CI: 0.9928-0.9944) (see Table 20).
Table 20. Agreement of self-respondent female cancers including proxy respondent
in the California twin study, by site
Type of cancer   Kappa    95% CI*         Straight agreement   95% CI*
Breast           0.4621   0.4146-0.5096   0.9911               0.9901-0.9920
Cervix           0.1434   0.1113-0.1755   0.9813               0.9799-0.9826
Ovary            0.1048   0.0554-0.1542   0.9936               0.9928-0.9944
*CI: Confidence interval
V. Discussion
Measuring the accuracy of self-reported cancer incidence in our twin cohort against
cancer registry data is very important, as it will help us to better understand
the patterns of inaccurate reporting of cancer history and to predict the future cancer
burden in our twin cohort. Knowledge of the accuracy of self-reported cancer in the
twin cohort dataset is also important in analyses to determine whether the etiology of
cancer is genetic or environmental. We are also assessing the ability to use the
linkage with the cancer registry to identify future cases in the cohort when contact
with the respondent may not be possible.
In this study, overall 46.12 percent of self-reported female breast, cervical
and ovarian cancers were verified by the unified cancer registry. The highest
sensitivity was observed for breast cancer (58.08%), and the lowest sensitivity was
for ovarian cancer (39.39%), using the unified cancer registry as the gold standard.
This was consistent with previous studies finding that breast cancer is the most accurately
reported, while cervical and ovarian cancers are less accurately reported, by survey
respondents when a cancer registry is used as the gold standard (Schrijvers, Stronks et al.
1994; Berthier, Grosclaude et al. 1997). Early validation studies that compared self-
reports with medical records found that only 33-61 percent of documented cancers
were reported at interview (Krueger 1957; Rockville 1965; Madow 1973). More
recently, studies comparing self-reported data with cancer registry records found that
estimates of the overall sensitivity of self-reported cancers ranged from 27 percent to
more than 90 percent (Colditz, Martin et al. 1986; Paganini-Hill and Chao 1993;
Berthier, Grosclaude et al. 1997; Desai, Bruce et al. 2001).
In terms of specificity, we found very high values for these three female cancers
in the present study. Breast cancer and ovarian cancer had almost the same
specificity (99.55% and 99.53%), followed by cervical cancer (98.37%). The high
specificity using the unified cancer registry as the gold standard implies that few
women without a registry-confirmed cancer reported one on the questionnaire.
These results were consistent with previous validation studies (Parikh-Patel, Allen et
al. 2003). When proxy data were included, the sensitivity decreased for breast cancer,
cervical cancer and ovarian cancer compared with self-respondents only, while the
specificity remained almost the same, indicating considerable uncertainty in the
proxy reports.
The kappa statistics for these three female cancers in the present study
showed relatively poor agreement between our self-reported questionnaire and the
unified cancer registry; the overall agreement between our twin data and the unified
cancer registry was 0.28. Breast cancer had moderate agreement (0.54,
95% CI: 0.48-0.59), which was compatible with a previous study (Schrijvers,
Stronks et al. 1994). However, cervical cancer and ovarian cancer had very poor
agreement. Since breast cancer incidence and mortality are higher than those of cervical
and ovarian cancers, detection and health promotion efforts may be more
aggressive for breast cancer. The higher agreement for breast cancer may therefore
reflect greater awareness of this cancer in the population. When proxy data
were included, the agreement decreased for breast cancer, cervical cancer and
ovarian cancer compared with self-respondents only, indicating considerable
uncertainty in the proxy reports.
In the present study, the overall sensitivity and agreement were lower than in
some previous studies (Bergmann, Calle et al. 1998; Parikh-Patel, Allen et al.
2003). One possible explanation for the observed difference in validity between this
study and previous studies is that the characteristics of the study populations
differed. Previous studies have shown higher sensitivity, but their methods of data
collection differed from that of the present study. For example, in some of the previous
studies, accuracy was assessed only on persons who were defined as having cancer
by one of the data sources. In some previous studies, subjects were restricted to
persons who had self-reported cancer. In this case, the measure of agreement does
not assess false negatives. Similarly, in other previous studies, interviews were
conducted among persons whose records contained documented evidence of
cancer. In this case, the measure of agreement does not incorporate false positives. In
our study, twin subjects were recruited through a general health interview survey,
not on the basis of their cancer status or cancer registry records. Therefore, the
characteristics of participants in our study may differ from those in previous studies, and
these studies may not be directly comparable to the present study.
Another issue may be that self-reported data might be nonspecific and prone to
errors. Previous studies have shown that individuals tend to underreport rather than
overreport cancer history in the general population (Desai, Bruce et al. 2001).
Although past research has shed some light on the accuracy of self-reported data,
patterns of misclassification of self-report are still not clear, especially in the context
of large population-based twin cohort studies. In a comparison of self-reported
cancer with cancer registry data, false-negative reporting was found to correlate with
increased time since cancer diagnosis and older age (Desai, Bruce et al. 2001).
False negative reports may have occurred in this study for several reasons. First,
respondents may have had trouble recalling their medical history, in particular over
several decades. Second, recall could be affected by poorer memory at older ages. For
instance, when stratified by survey date, the younger group (surveyed before 1993) had
higher sensitivity than the older group (surveyed around 2000): 85.16 percent versus
23.76 percent for breast cancer, and 68.75 percent versus 11.76 percent for ovarian
cancer. Third, respondents may lack the knowledge to accurately answer the
questions posed. Fourth, the respondents may lack understanding regarding their
cancer diagnosis. Fifth, if the cancer case was diagnosed outside of California, it is
possible that the unified cancer registry missed it. Finally, general errors
in completing the survey might account for some of the discrepancy between the
self-reported California twin study and the unified cancer registry data. Problems with
how the linkage was done may also have contributed.
False positives also occurred in this study, which might have been due to subjects
who did not understand their disease diagnosis. For example, a physician may report
an abnormal examination, such as a breast mass, that was not confirmed as a cancer
through a tissue biopsy. Similarly, a cervical exam may produce an abnormal Pap smear
that was not cervical cancer. In both of these cases, the patient might report this as
the corresponding cancer on the self-reported questionnaire, either for themselves or
their non-respondent co-twins. In addition, a patient with a diagnosed co-twin may
assume they would have the cancer due to the co-twin’s diagnosis or family history.
Our findings suggest that it is better to verify through medical records the self-
reported cancer status in this twin data set for further cancer epidemiologic studies.
The rates of self-reported breast cancer, cervical cancer and ovarian cancer in this
twin survey are not high. Use of self-reported twin data alone without validation in
studies that include certain cancers like breast cancer, cervical cancer and ovarian
cancer could lead to biased estimates of incidence and relative risk. The inaccuracy
of cancer status regarding breast cancer, cervical cancer and ovarian cancer in this
self-reported questionnaire will have a great impact on classical twin comparisons.
As previously mentioned, classical twin studies give us a greater insight into
genetics and environmental factors among twins. A higher concordance rate of
monozygotic twins versus dizygotic twins can imply a genetic effect. If a higher
concordance rate occurs among dizygotic versus monozygotic twins an
environmental effect may have occurred. In either case, it is important to ensure
accuracy in reporting of these twin cases to take advantage of this unique insight into
genetic versus environmental factors related to incident cancer.
The inaccuracy of self-reported cancer cases diminishes the ability to determine
whether genetic or environmental factors caused the cancer. For this reason
accuracy in these cancer reports is important. Knowledge of the etiology of cancer,
whether genetic or environmental, is important in determining where health
resources should be allocated. Optimal allocation can assist in reducing cancer
burden in society and the impact for individuals.
This validation study of self-reported cancer incidence against the unified cancer
registry has some major advantages over previous studies. First, we used a
unified cancer registry (UCR) as the gold standard, which combined two registry
datasets: the California Cancer Registry and the Cancer Surveillance Program. This
provided more comprehensive information on cancers in these California twins, at
least for Los Angeles County. In addition, adding the Cancer Surveillance
Program for Los Angeles County extended registry coverage back to
1972, 16 years earlier than the California Cancer Registry alone, for Los
Angeles County residents. Furthermore, this validation study is by far the largest to
date using twin data and a unified cancer registry.
Despite these strengths, this study has several limitations that are important to
consider when interpreting our results. First, when we selected our study
sample through data linkage, we did not have the social security number for
all individuals in both the self-reported questionnaire data and the cancer registry. To
link more records, combinations of other commonly available variables were used to
conduct the linkage. Mismatched records could substantially affect our estimates
of agreement and reporting accuracy, but there is no further information with which to
assess the mismatch rate. Second, the California Cancer Registry did not initiate
state-wide data collection until 1988; merging the Cancer Surveillance
Program data extends coverage back to 1972, but only for Los Angeles County and not
for all counties in California. Third, the diagnosis date for breast cancer, cervical
cancer and ovarian cancer was not available or was missing in more than two thirds
of the self-reported cancers in the survey; therefore the diagnosis date could often not be
used to check the accuracy and reliability of a self-reported cancer. Fourth,
geographical migration of the participants might also explain some of the low
sensitivity and poor agreement. For example, cancer cases who moved out of
California after their cancer diagnosis and did not answer the cancer questions would
lead to a mismatch between the cancer registry and self-report. Also, subjects diagnosed
with cancer outside of California would likely not appear in the unified
cancer registry. Fifth, generalization of our results is limited to female twin populations.
Additionally, since the California Cancer Registry and the Cancer Surveillance Program
in Los Angeles are well-established, high-quality cancer registries, these results may
not generalize to newer, non-SEER state cancer registries.
VI. Conclusion
In this validation study, we found the overall sensitivity of self-reported female
cancer incidence to be low, but it was still consistent with some previous validation
studies. Breast cancer was the most accurately reported by this twin cohort, while
cervical cancer and ovarian cancer were less accurately reported. For specificity, the
results were high for all three female cancers and consistent with other previous
studies. We found that the agreement between the self-reported cancer cases and the
unified cancer registry cases was poor for breast cancer, cervical cancer, and ovarian
cancer. The kappa value was highest for breast cancer and lower for cervical cancer
and ovarian cancer. However, for the kappa statistics, based on the confidence
intervals, none of these results were significant.
Proxy data for cancer studies in this twin cohort did not improve the accuracy of self-
reported female cancer and should be validated by a pilot study prior to analysis.
Additional studies similar to this in the future will not only help to address these
issues, but also assist in improving cancer reporting.
Bibliography
Bergmann, M. M., E. E. Calle, et al. (1998). "Validity of self-reported cancers in a
prospective cohort study in comparison with data from state cancer registries."
Am J Epidemiol 147(6): 556-62.
Berthier, F., P. Grosclaude, et al. (1997). "Prevalence of cancer in the elderly:
discrepancies between self-reported and registry data." Br J Cancer 75(3): 445-7.
Cockburn, M., J. Collett, et al. (2001). "Validation of the saliva-based H. pylori test,
heliSAL, and its use in prevalence surveys." Epidemiol Infect 126(2): 191-6.
Cockburn, M. G., A. S. Hamilton, et al. (2001). "Development and representativeness of
a large population-based cohort of native Californian twins." Twin Res 4(4): 242-
50.
Colditz, G. A., P. Martin, et al. (1986). "Validation of questionnaire information on risk
factors and disease outcomes in a prospective cohort study of women." Am J
Epidemiol 123(5): 894-900.
Deapen & Cockburn (2003). "Cancer in Los Angeles County ": pg4.
Desai, M. M., M. L. Bruce, et al. (2001). "Validity of self-reported cancer history: a
comparison of health interview data and cancer registry records." Am J Epidemiol
153(3): 299-306.
Fleiss, J. (1981). "Statistical Methods for Rates and Proportions." New York: Wiley.
Hamilton, A. S. and T. M. Mack (2000). "Use of twins as mutual proxy respondents in a
case-control study of breast cancer: effect of item nonresponse and
misclassification." Am J Epidemiol 152(11): 1093-103.
Harlow, S. D. and M. S. Linet (1989). "Agreement between questionnaire data and
medical records. The evidence for accuracy of recall." Am J Epidemiol 129(2):
233-48.
Harris, E. L. (1997). "Importance of heritable and nonheritable variation in cancer
susceptibility: evidence from a twin study." J Natl Cancer Inst 89(4): 270-2.
Hawkes, C. H. (1997). "Twin studies in medicine--what do they tell us?" QJM 90(5):
311-21.
Hrubec, Z. and C. D. Robinette (1984). "The study of human twins in medical research."
N Engl J Med 310(7): 435-41.
Kasriel, J. and L. Eaves (1976). "The zygosity of twins: further evidence on the
agreement between diagnosis by blood groups and written questionnaires." J
Biosoc Sci 8(3): 263-6.
Kerber, R. A. and M. L. Slattery (1997). "Comparison of self-reported and database-
linked family history of cancer data in a case-control study." Am J Epidemiol
146(3): 244-8.
Krueger, D. E. (1957). "Measurement of prevalence of chronic disease by household
interviews and clinical evaluations." Am J Public Health 47(8): 953-60.
Landis, J. R., J. M. Lepkowski, et al. (1982). "A statistical methodology for analyzing
data from a complex survey: the first National Health and Nutrition Examination
Survey." Vital Health Stat 2(92): 1-52.
Linet, M. S., S. D. Harlow, et al. (1989). "A comparison of interview data and medical
records for previous medical conditions and surgery." J Clin Epidemiol 42(12):
1207-13.
Madow, W. (1973). "Net differences in interview data on chronic conditions and
information derived from medical records." Vital and health statistics, series 2,
no. 57. Rockville, MD: National Center for Health Statistics (DHEW publication
no. (HSM) 73-1331).
Martin, N., D. Boomsma, et al. (1997). "A twin-pronged attack on complex traits." Nat
Genet 17(4): 387-92.
Nelson, L. M., W. T. Longstreth, Jr., et al. (1990). "Proxy respondents in epidemiologic
research." Epidemiol Rev 12: 71-86.
Nevitt, M. C., S. R. Cummings, et al. (1992). "The accuracy of self-report of fractures in
elderly women: evidence from a prospective study." Am J Epidemiol 135(5): 490-
9.
Paganini-Hill, A. and A. Chao (1993). "Accuracy of recall of hip fracture, heart attack,
and cancer: a comparison of postal survey data and medical records." Am J
Epidemiol 138(2): 101-6.
Paganini-Hill, A. and R. K. Ross (1982). "Reliability of recall of drug usage and other
health-related information." Am J Epidemiol 116(1): 114-22.
Parikh-Patel, A., M. Allen, et al. (2003). "Validation of self-reported cancers in the
California Teachers Study." Am J Epidemiol 157(6): 539-45.
Rapoport, J., D. Teres, et al. (1990). "Explaining variability of cost using a severity-of-
illness measure for ICU patients." Med Care 28(4): 338-48.
Rockville (1965). "Health interview responses compared with medical records." Vital
and health statistics, series 2, no. 7. National Center for Health Statistics (USPHS
publication no. 1000).
Schrijvers, C. T., K. Stronks, et al. (1994). "Validation of cancer prevalence data from a
postal survey by comparison with cancer registry records." Am J Epidemiol
139(4): 408-14.
Tretli, S., P. G. Lund-Larsen, et al. (1982). "Reliability of questionnaire information on
cardiovascular disease and diabetes: cardiovascular disease study in Finnmark
county." J Epidemiol Community Health 36(4): 269-73.
Yost, K. S., Robert (2002). "Colorectal Cancer in California." pg8.
Asset Metadata
Creator: Lian, Xinyun (author)
Core Title: Evaluation of the accuracy and reliability of self-reported breast, cervical, and ovarian cancer incidence in a large population-based cohort of native California twins
School: Graduate School
Degree: Master of Science
Degree Program: Applied Biostatistics and Epidemiology
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: biology, biostatistics, OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Cockburn, Myles G. (committee chair), Hamilton, Ann S. (committee member), Mack, Wendy (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-50956
Unique identifier: UC11338274
Identifier: 1437575.pdf (filename), usctheses-c16-50956 (legacy record id)
Legacy Identifier: 1437575.pdf
Dmrecord: 50956
Document Type: Thesis
Rights: Lian, Xinyun
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA