COMPUTATIONAL DESIGN FOR ANALYSIS OF SNP ASSOCIATION STUDIES

by

Chris Hsu









A Thesis Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(BIOSTATISTICS)


December 2008




Copyright 2008          Chris Hsu

 
                                                                                                                                             
Table of Contents

List of Tables
Abstract
Introduction

Chapter 1: Specialized Software
PLINK
TagSNPs
  SNP Imputation
  Haplotype Prediction
SAS Genetics

Chapter 2: General Software
SAS
  Data Step Commands
  Output Delivery System
  External Program Execution
SAS Macro
  Definition and Variables
  Statements/Functions
  Macro Execution
  Easy Usage
  Debugging
  SNPBLK Macro

Chapter 3: SNPBLK Documentation
Software Requirements
Execution
Input Datasets and Program Directories
Outputs
Scratch Files/Datasets
SAS Log/Macro Log
Genotyping Reports
SNP/Sample Screening
Sequential Screening
SNP Clustering
Sample Stratification
MAF
HWE
Logistic Regression
“Tall” Format
SAS vs. PLINK
OR Output
P-value Significant Figures
Haplotype/Imputed SNP Association
LD Block Display
Permutation Testing

References

Appendix A: Basic PLINK Usage
Appendix B: TagSNPs Imputation
Appendix C: Selected SAS Macro Code
Appendix D: SNPBLK Execution Panel (Macro Call)
Appendix E: SNPBLK Parameter Definitions
Appendix F: Published Work

















 
                                                                                                                                             
List of Tables

Table 1: SNP Call Rate and Exclusion
Table 2: Sample Call Rate and Exclusion
Table 3: MAF for Whites and by Case-Control Status
Table 4: Exact HWE Test for Black Controls
Table 5: SNP Dosage Association for Overall Population
Table 6: SNP Dosage/Ordinal Association for Overall Population

Abstract
SNP genotyping technology has advanced considerably in recent years, allowing
for faster data generation at significantly lower cost. Investigators can now test a large
number of SNPs across the human genome to locate putative risk alleles. Case-control
study design, in particular, offers direct and unbiased estimation of disease risk.
However, as the number of SNPs that can be genotyped continues to increase rapidly, the
complexity and intensity of computation become important issues to consider. We offer a
simple and automated approach to the computation of case-control genotype data,
of particular interest to users of the statistical analysis package SAS®.














 
                                                                                                                                             
Introduction
We have implemented a SAS macro interface program called SNPBLK that can
dynamically process input datasets, execute external genotype analysis software, and sort
outputs. SNPBLK is essentially a control module that directs and regulates the computing
procedures according to the user’s specifications. Such an interface program greatly
facilitates various types of genotype computation and helps ensure the accuracy of
results, especially in the analysis of highly stratified, large-scale genotyping scans.
Because SNPBLK directly mediates the execution of external specialized programs, users
(who would typically be somewhat familiar with SAS) are relieved of the task of
learning each specialized program, whose usage can vary markedly from
program to program. In particular, SNPBLK automatically formats the input genotype
datasets to meet the requirements of each specialized program.
As an interface program, SNPBLK unifies the usage of multiple specialized
programs under a single execution panel with simple parameters, so the learning
curve for users is minimal. Because it employs multiple specialized programs, SNPBLK
can output various types of genotype statistics, from basic summary statistics such as Minor
Allele Frequency (MAF) and Hardy-Weinberg Equilibrium (HWE) to more empirical
haplotype prediction, in a single execution pass. The results are formatted and written to
separate worksheets in one MS Excel file. In short, SNPBLK offers two distinct
advantages for the computation of genotype data. First, it provides continuous and
dynamic execution of a variety of SNP analyses. Second, it demands little of users
without extensive experience in computing and genotype data analysis.


 
                                                                                                                                             
Chapter 1: Specialized Software
Before we describe the functionality of SNPBLK, we would like to introduce several
specialized programs that are widely used to analyze case-control genotype data.
SNPBLK assigns to each of these programs the genotype statistics that it computes
best.
PLINK  
PLINK [1] (http://pngu.mgh.harvard.edu/~purcell/plink/) is a Windows DOS/UNIX command
line based program that offers a comprehensive array of statistical analyses for genotype
data. The main strength of PLINK is its functional diversity. In addition to summary statistics
and haplotype-based analyses, PLINK also contains other important statistical tools such as
permutation testing, epistasis (SNP*SNP interaction), data simulation, and family-based
association analysis. It has also implemented a novel SNP annotation feature that links
various SNP databases and compiles summary information on user-inputted SNPs.
However, because of its wide-ranging implementation, some PLINK features do not
perform as well as their counterparts in more specialized programs. The selection of tag
SNPs in PLINK, for example, will occasionally choose tag SNPs that yield lower R^2 than
those chosen by TagSNPs or Haploview. Despite this shortcoming,
PLINK remains a very well-developed statistical program. For example, the SNP
imputation procedure in PLINK is completely automated, from the selection of tag SNPs
to the imputation of untyped SNPs to the association testing of imputed SNPs. PLINK is
frequently updated with new features and options, so its tag SNP
selection may well improve in the future.
 

 
                                                                                                                                             
PLINK can be automated by SNPBLK via the DOS prompt (cf. External Program
Execution in Chapter 2). Automation is necessary because PLINK’s execution
commands lack flexibility. For example, PLINK does not allow the user to compute
statistics for particular groups of SNPs or samples (e.g., ethnic groups). PLINK can only
output results for all possible strata, not for selected strata. To perform analyses for
selected clusters of SNPs or samples, the user must first manually reformat the genotype
dataset and SNP information file. Another weakness is that PLINK can only execute one
type of analysis at a time; for example, it cannot output results for MAF and
HWE in one execution pass. HWE analysis, in particular, cannot be stratified, which is
inconvenient because a separate run is required for each stratum. These
issues hinder the flow of computation but are resolved by the automation mechanism of
SNPBLK (cf. SNP Clustering and Sample Stratification in Chapter 3). Currently, SNPBLK
assigns PLINK only to compute MAF and HWE. However, other methods from PLINK (e.g., SNP
annotation, epistasis) will be considered for automation in SNPBLK.
For reference, we have included some basic PLINK execution commands in Appendix A.
TagSNPs

TagSNPs [2] (http://www-rcf.usc.edu/~stram/tagSNPs.html) is one of the pioneering
haplotype analysis programs. Like PLINK, it can be executed from the DOS or UNIX
command line. As the name implies, TagSNPs specializes in the selection of haplotype
tagging SNPs (htSNPs). It selects the optimal tag SNPs based on three R^2 measures:
haplotype R_h^2, SNP R_s^2, and pairwise R^2. By selectively maximizing these three
statistics, the uncertainty in predicting common haplotypes is minimized. PLINK, on the
other hand, relies only on pairwise R^2 to search for tag SNPs; therefore its selected tags
are occasionally less than optimal. TagSNPs defines Gabriel-style LD blocks with limited
haplotype variation and estimates haplotype frequencies using the partition-ligation E-M
algorithm [3].
SNP Imputation
In addition to tag SNP selection and haplotype frequency estimation, another
important feature of TagSNPs is SNP imputation, which predicts the allele dosage of
SNPs that are untyped in the case-control dataset but typed in a densely genotyped
reference panel (e.g., the International HapMap), using the haplotype or pairwise tagging
SNPs that were typed in the case-control dataset. SNP imputation
assumes that the haplotype variation in the case-control samples is well captured by the
reference samples. It allows us to estimate disease risk for untyped
SNPs in the case-control samples and has been performed in candidate pathway analysis [4].
Let us describe how SNP imputation is performed using TagSNPs. First, we
select htSNPs from the densely genotyped reference dataset using a chosen R^2 threshold.
Next, we generate the tag test files, which list the untyped SNPs and the respective
pairwise tags (usually 1 or 2) that predict each untyped SNP with at least a minimum
pairwise R^2. The test files must be generated separately for each ethnic group
because haplotype structures usually differ markedly among ethnic groups. The
test files must also be formatted with TagSNPs commands that specify the names of the
reference and case-control datasets as well as the prediction output files. Each imputation
of an untyped SNP generates a prediction output file that lists the predicted allele dosage
for every case-control sample. Note that some case-control samples with rare haplotypes
not captured by the reference samples will have uncertain dosage predictions that deviate
from the integer values 0, 1, and 2. Sometimes the deviation is large (e.g., 0.5 or 1.5).
Like the test files, the imputation itself must be run separately for each ethnic
population. The final step merges the individual prediction output files into a collective
imputed genotype dataset that can be analyzed for imputed SNP association.
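
The final merging step can be sketched in SAS as follows. This is only an illustration under stated assumptions: the text does not describe the actual layout of the TagSNPs prediction files, so the file names (pred_<snp>.txt), the two-column layout (sample ID, dosage), and the macro name %mergepred are all hypothetical. Two toy files are written to the WORK folder first so that the sketch runs on its own, and countw stands in for any argument-counting helper.

```sas
/* Hypothetical layout: one text file per imputed SNP, named pred_<snp>.txt, */
/* each line holding a sample ID and a predicted allele dosage.              */
%let dir = %sysfunc(pathname(work));      /* scratch folder used for the demo */

/* Write two toy prediction files so the sketch is self-contained. */
data _null_;
   file "&dir/pred_rs1.txt";
   put "S1 0.00" / "S2 1.02" / "S3 1.50";
run;
data _null_;
   file "&dir/pred_rs2.txt";
   put "S1 2.00" / "S2 0.98" / "S3 0.03";
run;

%macro mergepred(snps);
   %local i n snp;
   %let n = %sysfunc(countw(&snps, %str( )));
   /* Read each prediction file; the dosage column is named after its SNP. */
   %do i = 1 %to &n;
      %let snp = %scan(&snps, &i, %str( ));
      data pred_&snp;
         infile "&dir/pred_&snp..txt";
         length id $ 20;
         input id $ &snp;
      run;
      proc sort data=pred_&snp;
         by id;
      run;
   %end;
   /* Merge all per-SNP datasets into one imputed genotype dataset. */
   data imputed;
      merge %do i = 1 %to &n; pred_%scan(&snps, &i, %str( )) %end; ;
      by id;
   run;
%mend mergepred;

%mergepred(rs1 rs2)

proc print data=imputed;
run;
```

In a multi-ethnic study the same call would be repeated for each ethnic group, since the prediction files are produced per group.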
In the DNA repair breast cancer [4] and colorectal cancer studies, we performed
SNP imputation using the densely genotyped MEC reference panel containing 2630
SNPs. SNP imputation was much more time-consuming than other analyses, mainly
because of the extensive data formatting required to create the pairwise tag test files and
merge the output prediction files described above. SNP imputation using TagSNPs could be
expedited by automating tag selection in the reference samples,
imputation of untyped SNPs, and merging of the prediction output files. Such automation
would enable us to perform SNP imputation for much larger densely genotyped reference
panels. Recently, several promising programs specialized in SNP imputation were
released, namely BIM-BAM [5] and MACH [6]. Unfortunately, these programs are currently
restricted to Linux, which prevents SNPBLK automation via the DOS command line.
Haplotype Prediction
SNP imputation for a densely genotyped reference dataset is an extension of
haplotype dosage prediction in case-control samples based on estimated haplotype
frequencies [2]. Given defined LD block structures in a genomic region, TagSNPs searches
for optimal htSNPs and predicts both additive and categorical dosages of the common
haplotypes (usually those with frequency > 5%) in each block. This process of defining
blocks and repeatedly imputing haplotypes for multi-ethnic populations has been
automated by a SAS macro interface program called GENEBLK [7], the model program for
SNPBLK. GENEBLK automates haplotype prediction on a single gene or region basis. To
execute GENEBLK, the user creates a genotype dataset containing the SNPs intended for
haplotype prediction. The user also specifies the haplotype block structures (in terms of
SNP order in the input genotype dataset), the minimum haplotype frequency, and the TagSNPs
program directory. GENEBLK then repeatedly invokes TagSNPs to
perform block-specific haplotype prediction. The predicted haplotype dosages for each
sample are merged into a SAS dataset that can be used to estimate haplotype risk
using logistic regression. Several case-control studies have focused on haplotype
association testing using GENEBLK/TagSNPs, with notable results [8,9]. We
discuss the automation of haplotype association testing further in Chapter 3 (cf.
Haplotype/Imputed SNP Association).
Overall, TagSNPs is a well-designed program with several very important features such as
SNP imputation and haplotype prediction. It also has a SNP screening function that drops
SNPs based on MAF, HWE, and genotyping rates. Haplotype prediction in TagSNPs is
particularly useful because it has been automated by GENEBLK, which allows haplotype
prediction for multiple populations. The overlap between SNPBLK and GENEBLK
allows the user to easily link the two macro programs and perform automated haplotype
prediction.



 
                                                                                                                                             
SAS Genetics  
Using our SAS interface program to automate the execution of external programs
like PLINK and TagSNPs, while efficient, still requires much data processing that is highly
dependent on the formats of the external programs. It would be ideal if SAS could compute
all types of SNP analyses internally without relying on other programs, and using SAS
exclusively for the computation of genotype data is in fact feasible. SNP association is
best computed using the logistic procedure in SAS, which is superior to
PLINK’s logistic function for reasons we discuss in Chapter 3 (cf. SAS vs. PLINK). SAS also
has an add-on package designed for genetic marker analysis. “SAS Genetics”, as it is called,
contains procedures such as proc allele and proc haplotype that output nearly all of the
essential genotype statistics we require. Recently, it added proc HTSNP, which addresses SNP
tagging and imputation. However, one significant reason kept us from
building the SAS Genetics procedures into our interface program, even though doing so
would be much simpler than automating external programs: although SAS Genetics is
included in the main SAS package, the user must activate it by paying an additional
licensing fee. We felt that most users would not want to pay this fee
when programs like PLINK and TagSNPs are free, and the main SAS software, though
widely used for data analysis, is already quite costly. Nevertheless, because it is fully
integrated with the rest of SAS, SAS Genetics remains attractive software that we may
use in future developments.




 
                                                                                                                                             
Chapter 2: General Software
SAS  
Although several specialized programs are available to
compute all types of genotype statistics, we still require a general platform that can
manage and process data and outputs. Statistical Analysis System, or SAS, is the ideal
general software for these data tasks. While specialized programs can be
interchanged to compute certain genotype statistics without significant compromise in
results, there is a wide base of SAS users for whom SNPBLK offers an attractive
alternative to direct use of these other programs, since for many epidemiologists and data
analysts SAS is the most suitable general software for basic data processing.
Data Step Commands
There are a number of SAS features that specifically address our needs in data
handling for genotype analysis. First and foremost, the SAS data step
offers many simple yet effective statements. Reading raw genotype files or raw
outputs from external programs, for example, can be done using infile and input.
Two options of the infile statement are particularly useful. One is DLM,
which allows the user to read genotype data formatted with different delimiters
(e.g., space, tab, comma). The other is DSD, which treats two consecutive
delimiters as a missing value (e.g., a missing genotype). SAS input is fast: reading a
raw genotype file with 1400 SNPs and 4000 samples can be completed in seconds. Once the
data are read, various formatting tasks can be performed using statements such as array,
keep, drop, rename, and merge, as well as logic statements. Data files can be written
in similar fashion using file/put. Users may also resort to proc import and proc export to
read and write data files, respectively, though these two procedures are much slower
than infile/file, particularly for very large data files.
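
A minimal sketch of these data step statements is given here. The dataset and variable names (geno, id, snp1-snp3) and the toy records are made up for illustration, and counting copies of allele G is an arbitrary choice of coded allele.

```sas
/* Toy comma-delimited genotype records read inline. The empty field in */
/* the second record is a missing genotype, which DSD keeps as missing. */
data geno;
   infile datalines dlm=',' dsd missover;
   length id $ 8 snp1-snp3 $ 2;
   input id $ snp1 $ snp2 $ snp3 $;
   datalines;
S1,AA,AG,GG
S2,AG,,GG
S3,GG,AG,AA
;
run;

/* Recode each genotype into an allele dosage with an array. */
data dosage;
   set geno;
   array g{3} $ snp1-snp3;
   array d{3} d1-d3;
   do j = 1 to 3;
      if g{j} = ' ' then d{j} = .;       /* missing genotype stays missing */
      else d{j} = countc(g{j}, 'G');     /* copies of allele G: 0, 1, or 2 */
   end;
   drop j snp1-snp3;
run;

proc print data=dosage;
run;
```

For a real file, `infile datalines` would be replaced by `infile "path-to-file"` with the same DLM and DSD options.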
Output Delivery System  
A useful feature in SAS is the Output Delivery System (ODS), which
writes results from statistical procedures to SAS datasets. It is a global
feature that applies to most SAS procedures, from the simple proc freq to the advanced
proc mixed, and its usage is quite simple. The user first wraps the statements ods
trace on/ods trace off around a statistical procedure to find the ODS names of the
statistics the procedure produces; these names are listed in the SAS log. The user then
adds the statement ods output name=data_name to the procedure, where name is the ODS
name of a specific statistic and data_name is the name of the SAS dataset to which that
statistic is written.
Please see Appendix C for a detailed example.
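
Independently of the Appendix C example, the two-step ODS usage can be sketched with proc freq on a toy dataset. The dataset and variable names here (toy, status, genotype, ct_freq, ct_chisq) are illustrative only; CrossTabFreqs and ChiSq are the ODS table names that proc freq reports for a two-way table with the chisq option.

```sas
data toy;
   input status $ genotype $ @@;
   datalines;
case AA case AG control AG control GG case GG control AA
;
run;

/* Step 1: list the ODS table names of the procedure in the SAS log. */
ods trace on;
proc freq data=toy;
   tables status*genotype / chisq;
run;
ods trace off;

/* Step 2: route the tables of interest to SAS datasets. */
ods output CrossTabFreqs=ct_freq ChiSq=ct_chisq;
proc freq data=toy;
   tables status*genotype / chisq;
run;
```

After the second run, ct_freq and ct_chisq are ordinary SAS datasets that can be merged, filtered, or exported like any other.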
External Program Execution  
Another useful SAS feature is the ‘X’ command, which allows the user to execute
operating system commands (e.g., at the DOS command prompt) from within a SAS program.
For example, we can execute external programs like PLINK or TagSNPs from
SNPBLK by typing ‘X’ followed by the program directory and the specific execution
commands. ‘X’ will automatically open the DOS command prompt, change from the root to
the specified program directory, and execute the external program using the commands that
follow ‘X’. The result is the same as if we had manually opened the
command prompt and executed the external program, but without the hassle of changing
directories and typing the commands. ‘X’ is not directly related to data
processing, but it is an essential command that keeps the flow of computation
uninterrupted, which is especially convenient when many repeated executions of external
programs are required (e.g., stratified MAF/HWE analyses using PLINK). Moreover,
because the execution commands for external programs can be embedded within a SAS
program, we can program these commands dynamically to take on different parameter
values specified by the user.
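
As a rough sketch of such a dynamically programmed command, the macro below builds one PLINK call per analysis type. It is not the SNPBLK code: the macro name %runplink, the folder paths, and the file stem mydata are hypothetical, a Windows host with PLINK 1.x installed is assumed, and the program is called by its full path rather than by changing directory. The flags --noweb, --file, --freq, --hardy, and --out are standard PLINK 1.x options.

```sas
/* Assumed values: adjust to the local installation. */
%let plinkdir = C:\plink;     /* hypothetical folder holding plink.exe        */
%let datadir  = C:\geno;      /* hypothetical folder holding mydata.ped/.map  */
%let data     = mydata;       /* hypothetical file stem                       */

/* Close the DOS window when the command ends and wait for it to finish. */
options noxwait xsync;

%macro runplink(analysis);
   /* analysis = freq (allele frequencies) or hardy (HWE test) */
   x "&plinkdir\plink --noweb --file &datadir\&data --&analysis --out &datadir\&data._&analysis";
%mend runplink;

%runplink(freq)
%runplink(hardy)
```

Because the command string is ordinary text, any part of it (file stem, analysis flag, output name) can be driven by macro variables, which is what allows repeated stratified runs without manual typing.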
SAS Macro  
We have described a number of basic features in SAS that facilitate the analysis
of genotype data. Next we would like to describe macro programming, a crucial SAS
programming facility that allows us to dynamically regulate data processing and
computation. The majority of the code in SNPBLK is, in fact, made up of ordinary data
steps and statistical procedures. Using relatively few lines of macro code, we can automate
these data steps and statistical procedures to perform a variety of computations.
Definition and Variables
The purpose of macro programming is to repeatedly execute a set of routine data
steps and statistical procedures according to specifications supplied by the user. To
begin, let us describe the basic format of a simple macro program:
%macro sample (var1, var2,..);
routine code
%mend;

A macro program or function is introduced by the macro declaration %macro, followed by
the name of the macro, sample, and then a list of macro variables enclosed in parentheses
and separated by commas. The %mend statement terminates the macro definition. Inside
the macro is a set of routine SAS code, which can consist of data steps, statistical
procedures, macro statements, or macro calls. We invoke the sample
macro with the macro call %sample(value1, value2,..), where each value in the macro call
corresponds, in exactly the same order, to a variable defined in the macro. A common
error occurs when the values listed in the macro call are inconsistent with the
variables defined in the macro. In %sample, the macro variables are defined at
the start of the program and take on the user-supplied values in the macro call.
However, macro variables can also be declared inside a macro using the %let statement
followed by the variable name and value. For example, %let race=w; declares a macro
variable called race. Note that there should be no single or double quotes around the
declared value: a macro variable holds plain text that is substituted into the code, so it
is neither a character nor a numeric variable in the data step sense, a peculiar
property that has confused many. When referring to a macro variable, ‘&’ must be
prefixed to the variable name (e.g., &race). Lastly, macro variables can be used as
placeholders for dataset variables or as global variables for controlling certain
mechanisms. For example, we can include macro variables in PLINK execution
commands to pass user-supplied values for various execution options.
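
A small runnable instance of this format is sketched here, using the sashelp.class dataset that ships with SAS. The macro body and the names data, var, and ttl are illustrative and are not taken from SNPBLK.

```sas
%macro sample(data, var);
   %local ttl;
   /* %let assigns plain text: no quotes around the value */
   %let ttl = Frequency of &var in &data;

   title "&ttl";
   proc freq data=&data;
      tables &var;
   run;
   title;
%mend sample;

/* Values are matched to the macro variables by position. */
%sample(sashelp.class, sex)
%sample(sashelp.class, age)
```

Each call generates the same proc freq step with a different dataset variable substituted, which is the template behavior described above.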
Statements/Functions
Next, let us discuss SAS macro statements and functions. Some macro statements
are analogous to standard SAS statements. For example, %if..%then and %do..%while
are two commonly used macro logic statements. Similarly, some macro
functions are derived directly from standard SAS functions. An example is
%scan(&var, i, dlm), which returns the i-th substring of the macro variable var
as delimited by dlm (e.g., /, *, |). Unlike in the data step, the delimiter is not
enclosed in quotes. It is important to note that macro
statements/functions (e.g., %scan) are used strictly for manipulating macro variables.
One should not mix standard dataset variables with macro statements/functions.
SAS macro was initially developed with a very limited number of functions, and many
standard SAS functions cannot be used directly in macro code. To apply a data step
function such as substr(var,i,l), which returns the substring of length l starting at the
i-th position of var, to a macro variable, one can wrap it in %sysfunc, which executes a
standard SAS function from within the macro facility. For example,
%sysfunc(substr(&var,&i,&l)) returns the corresponding substring of the macro variable
var. (For substr in particular a built-in macro version, %substr, also exists; for most
other functions %sysfunc is the only route.) We should store the result in another macro
variable, for example, %let rc=%sysfunc(substr(&var,&i,&l));.
%sysfunc can also be used to inspect dataset observations and attributes. For example, the
following macro code checks whether a dataset is empty, i.e., has no observations.
%let dsid=%sysfunc(open(data_name));
%let rc=%sysfunc(fetch(&dsid));
%let rc2=%sysfunc(close(&dsid));
We first open the SAS dataset data_name and store the dataset identifier returned by
open in a macro variable (dsid); then we use the fetch function to check whether the
first observation in the dataset can be read. The macro variable rc takes the value 0 if
the first observation exists and -1 if the dataset is empty. It is important to close the
dataset whenever it is opened; the return code of close is stored in rc2 so that it is not
written into the generated code.
In addition to macro functions derived directly or converted indirectly from
standard SAS functions, there are also tools specific to the macro facility. An
essential one is call symputx('var1', var2), a routine called from within a data step that
stores the value of var2, a dataset variable or a string, in the macro variable var1. In
SNPBLK, call symputx is used to store the number of SNPs in a cluster of genes or the
number of samples in some stratum (e.g., a race group) in macro variables. call symputx is
an example of data-driven programming, in which data values are used to control the
program. Previously, we mentioned that a macro variable is neither string nor numeric;
therefore, to treat macro variables as numbers in arithmetic, we must use the function
%eval, for example %eval(&var+1), which evaluates its argument as an integer expression.
The macro functions just described are built in. Users can also implement their own macro
functions to perform desired tasks. In Appendix C, we include a handy macro function
%numarg(phrase) that counts the number of arguments in a phrase delimited by spaces.
%numarg(phrase) is defined in pure macro code and can be used in any
macro program. Of course, users may also implement macro functions designed for
very specific tasks not applicable to other macro programs.
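
The passage from data values to macro variables can be sketched with the data step routine call symputx and the macro function %eval. The example uses sashelp.class, with sex standing in for a stratum variable such as race; the macro variable names n_F, n_M, and total are illustrative.

```sas
/* Count the observations in each stratum. */
proc freq data=sashelp.class noprint;
   tables sex / out=counts;      /* counts holds the variables sex and count */
run;

/* Store each stratum size in a macro variable named n_<stratum>. */
data _null_;
   set counts;
   call symputx(cats('n_', sex), count);
run;

/* Macro variable values are text, so %eval is needed for integer arithmetic. */
%let total = %eval(&n_F + &n_M);
%put Females=&n_F Males=&n_M Total=&total;
```

The stored counts can then drive later macro logic, for example as the upper bound of a %do loop.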
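We can combine macro variables, macro statements, and macro functions into
an algorithm that automatically parses a genotype dataset. For example,
/* count the race groups supplied by the user */
%let nrace=%numargs(&racegrps);
%do i=1 %to &nrace;
   /* pick the i-th race group from the space-delimited list */
   %let race=%scan(&racegrps,&i,%str( ));
   /* write the observations of that race group to their own dataset */
   data &data._&race;
      set &data;
      if (&racevar eq "&race");
   run;
%end;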
This macro code parses a full genotype dataset into race-specific datasets. Using
%numargs, we first determine the number of race groups in &racegrps, the list supplied
by the user, and store this number in the macro variable &nrace. Then we execute a %do
loop that parses the full dataset by keeping, in each iteration, the observations whose
race, &racevar, equals the designated race value (&race). This simple algorithm can be
applied to N-way sample stratification as well, by repeatedly parsing the original dataset
using nested %do loops.
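
A self-contained variant of this algorithm, wrapped in a macro so that the %do loop is legal and run on a toy dataset, might look as follows. The macro name %splitby and the toy data are hypothetical, and %sysfunc(countw()) is used here as a stand-in for the thesis's own argument-counting macro, whose code is given in Appendix C.

```sas
%macro splitby(data, racevar, racegrps);
   %local i nrace race;
   /* countw stands in for the argument-counting helper of Appendix C */
   %let nrace = %sysfunc(countw(&racegrps, %str( )));
   %do i = 1 %to &nrace;
      %let race = %scan(&racegrps, &i, %str( ));
      data &data._&race;
         set &data;
         if &racevar eq "&race";
      run;
   %end;
%mend splitby;

/* Toy genotype dataset with a race code per sample. */
data geno;
   input id $ race $ snp1 $;
   datalines;
S1 w AA
S2 b AG
S3 w GG
S4 j AG
;
run;

/* Creates the datasets geno_w, geno_b, and geno_j. */
%splitby(geno, race, w b j)
```

A two-way stratification (for example, race by case-control status) would add a second, inner %do loop over the second list and extend the subsetting condition accordingly.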
Earlier we showed a simple undivided macro program. A full macro program
in fact consists of several macro functions, each
implemented separately to perform a particular task. As a programming note, it is
generally inadvisable to nest the definition of one macro function inside another. Such
practice often results in errors that are difficult to decipher (cf. Debugging).
Macro Execution  
Let us describe macro execution in greater detail. To execute a macro program or
a function, we need to invoke it with a function call, which contains a list of user-inputted
values that match the declared macro variables in the function. When a macro is
submitted, the macro processor translates the macro variables and statements to standard
SAS code that can be compiled and executed. Therefore, a macro program is a template
that writes out executable code. Each time a macro is invoked, the macro processor
renews the macro variables in the macro with a new set of parameter values from the
function call, thereby producing desired variations in the executing code.
Easy Usage
With an efficiently designed SAS macro program, computations can be precisely
controlled by the user-supplied parameter values in the macro call. We would like to
emphasize that, to execute the macro, the user does not need to edit any portion of the
macro program code; only the parameter values in the macro call should be modified.
Therefore, users do not need to understand the details of the program, and even those
with minimal SAS grounding can execute it competently. Earlier, we mentioned
that a SAS macro interface program allows users to execute
specialized programs without deep familiarity with those programs. Now we can extend
this statement to SAS as well: an automated SAS macro greatly simplifies the use of
specialized programs and of SAS itself.
Debugging
Although SAS macro offers great convenience and efficiency, novice and even
experienced programmers will agree that it can be extremely frustrating to debug a
faulty SAS macro. Keep in mind that, in macro programming, we are writing a program that
writes out another program, so the macro code we write must first be
processed by the macro interpreter. It is during macro interpretation that most errors
occur, especially when many macro variables and statements are used in the program.
Furthermore, macro programming errors are difficult to decipher because they refer to the
generated code rather than the raw macro code. That said, there are several
remedies to alleviate the agony of macro debugging. First, SAS has log printing options
(mprint, symbolgen, macrogen, and spool) that print the interpreter’s output code,
with values substituted for each macro variable, in the log window during the
execution of a macro program. These options are quite helpful in detecting and deciphering
potential macro bugs (cf. SAS Log/Macro Log in Chapter 3). Second, as we write a
macro program, we should view the macro code as the executable code it will generate,
despite the syntax differences. Keep in mind that every macro variable and statement
needs to be translated by the macro interpreter into valid executable SAS statements.
Lastly, as a general programming practice, one should write multiple macro functions,
each designed to perform a specific task. An undivided macro program is likely to contain
errors, some of which go undetected because the syntax is acceptable. The errors from a
continuous and nested macro program are more difficult to debug because they are
harder to pinpoint, even with detailed macro log output. By dividing a program
into multiple independent functions, we can test the operation of each function
individually and attribute potential errors to specific functions. Although designing a
macro program that consists of multiple macro functions may initially seem more difficult,
this programming practice will greatly reduce debugging time.
SNPBLK Macro
We felt compelled to take the reader on a lengthy excursion into SAS macro
programming because of its practicality in the intensive computation of genotype data. Using
SAS macro programming, we were able to design an efficient interface program that can
automate a variety of computations for genotype datasets with varying sample strata and
SNP clusters. In our experience of genotype data analysis, some or all of the intended
genotype analyses almost always need to be repeated several times because of
changes to the genotype data (e.g., re-genotyping, re-sampling). SNPBLK allows
the user to re-run the desired analyses in one pass without extensive manual data
re-formatting. Efficiency and automation, which we have repeatedly emphasized, are the key
properties of SNPBLK. The design of SNPBLK grew out of the analysis of
several SNP association studies. Appendix F contains a published work that
used many of the SNPBLK features discussed here.

SNPBLK Documentation  
The following serves as preliminary documentation for SNPBLK.
Software Requirements
In addition to SAS, the user should install PLINK
(http://www.broad.mit.edu/mpg/haploview/download.php) and TagSNPs
(http://www-rcf.usc.edu/~stram/tagSNPs.html). SNPBLK v1.0 and formal user
documentation will be available at http://www-scf.usc.edu/~chrishsu/SNPBLK.html in
November 2008.
Execution  
SNPBLK is an automated SAS macro interface program that computes basic
summary and haplotype-based statistics for case-control SNP genotype data. The
program consists of the SAS macro code and its execution panel, or macro call. To
execute the program, the user simply enters alphanumeric values for the parameters in
the macro call and submits it. Not all parameters are
required to have input values. Each parameter in the macro call is separated by a comma.
There can be spaces or non-alphanumeric symbols in the parameter values. For the exact
format and detailed definition of each parameter, please see Appendices D and E. Most
execution errors are caused by incorrect parameter values. The user should carefully verify
the definition and format of each parameter.
We stress that it is not necessary to download the macro code for SNPBLK. The
SNPBLK macro call includes a URL pointing to the macro code. When the macro call is
executed, it automatically downloads the macro code from the web, provided that internet
access is available. The user can also choose to save the SNPBLK macro permanently. Simply
download the source code from http://www-scf.usc.edu/~chrishsu/SNPBLK.sas and replace
the URL after filename in the macro call with the file path of the source code. Note that
with either execution method, it is not necessary to explicitly open the macro code in
SAS during execution. In fact, to prevent inadvertent modifications, we recommend
that the macro code remain closed.
It is also possible to batch execute SNPBLK without opening SAS.
Simply right-click on the file icon of the macro call (saved as a .sas file) and select “batch
submit”. Of course, the user should still check the validity of each parameter prior to
execution. Note that the macro call is simply a text file and can be opened in any text
editor.
SNPBLK will call external programs (e.g., PLINK, TagSNPs, Haploview) to
compute particular genotype statistics, but the user does not need to interact directly with
these programs. DOS command prompt windows will repeatedly appear and disappear,
indicating the execution of the external programs.
Input Datasets and Program Directories
There are 3 required input datasets: genotype, SNP information, and covariate.
Currently, these input datasets must be in SAS format. Future revisions may allow for
automatic input of non-SAS genotype files with different delimiters (e.g., comma, tab).
The genotype dataset must have an alphanumeric sample ID variable followed by SNP
variables, which can be coded either as numbers (“1 1” “2 2” “3 3” “4 4”) or letters
(“A A” “C C” “G G” “T T”), with a space between the two alleles. Missing genotypes
must be coded as “0 0”. There can be covariate variables (e.g., sex, age) in the
genotype dataset, though they are not required. The name of the genotype dataset must be
entered in the macro call (data), as must the variable name for the sample IDs (ID), which
must be alphanumeric strings less than 30 characters in length.
The second required dataset is SNP information, which is required to have at least
3 variables: SNP ID, chromosome #, and SNP base position. SNP ID must be an
alphanumeric string less than or equal to 30 characters in length. The chromosome # must be
an integer from 1 to 22, or the character ‘X’ or ‘Y’ for the sex chromosomes. Base position
must be a positive number. If chromosome # and base position are unknown, their values
can be arbitrarily assigned. The chromosome #s can be identical, but base positions must
be distinct for every SNP. The variable names of SNP ID, chromosome #, and base
position must be entered in the macro call using the parameters snpvar, chromo, and
posvar, respectively. Other relevant SNP information, such as gene or metabolic pathway, can
also be included, though it is not required. These additional fields will be output in the
results along with the 3 required fields.
One crucial point concerning the genotype and SNP information datasets is that the order
of SNP genotypes in the genotype dataset must match the order of SNP IDs in the SNP
information dataset. For example, the 1st genotype column in the genotype dataset must
refer to the 1st SNP ID row in the information dataset, the 2nd column must refer to the
2nd row, and so on.
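The ordering requirement can be checked before a run. The sketch below is a Python illustration, not SNPBLK code; it assumes, purely for the example, that the genotype column names are the SNP IDs themselves, and `check_snp_order` is a hypothetical helper.

```python
def check_snp_order(genotype_columns, snp_info_ids):
    """Return (position, column, snp_id) for every position at which the
    genotype column order and the SNP information row order disagree."""
    mismatches = []
    pairs = zip(genotype_columns, snp_info_ids)
    for position, (column, snp_id) in enumerate(pairs, start=1):
        if column != snp_id:
            mismatches.append((position, column, snp_id))
    # A difference in length is also reported, since zip() stops at the shorter list.
    if len(genotype_columns) != len(snp_info_ids):
        mismatches.append(("length", len(genotype_columns), len(snp_info_ids)))
    return mismatches

same_order = check_snp_order(["rs1", "rs2", "rs3"], ["rs1", "rs2", "rs3"])
swapped = check_snp_order(["rs1", "rs2", "rs3"], ["rs1", "rs3", "rs2"])
```

An empty result means that the two orders agree; any entry identifies the first place to look when results appear to be attached to the wrong SNP.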
The last required input dataset is the covariate dataset, which must contain sample
ID, disease status (statusvar), and all other covariate and stratifying variables required for
analysis. If there are no covariates or stratifying variables (e.g., univariate analysis), then
only sample ID and disease status are required in the covariate dataset. Disease status, or
case-control status, must be coded as 1=cases and 0=controls. Sex, if present in the study,
should be coded as M=males and F=females.  
The user should also input the directories of the PLINK, TagSNPs, and Haploview
executables (plinkdir, tagsnpsdir, haploviewdir) in the macro call. The SAS library name
where the input datasets are saved must also be specified (lname).
Outputs
All results are outputted as EXCEL spreadsheets compiled in a single file, except
for the LD display in Haploview. The user should specify the directory of the EXCEL file
(outdir), as well as the name of the file (outname). The user can also view the outputs in
SAS format datasets, which are identical to the EXCEL spreadsheets.
Scratch Files/Datasets
During execution, SNPBLK will generate output text files from PLINK and
TagSNPs as well as extraneous SAS datasets. These output files and datasets can take up
much disk space. The user can choose to keep or automatically delete these files and
datasets using the parameters delsasdataflag and delplinkdataflag. Note that raw output
files from PLINK can be useful for verifying results.
SAS Log/Macro Log
For analysis of studies with more than 1000 SNPs and multiple stratifying
variables, we recommend writing the SAS log to a separate text file. The SAS log window
has a limited size. Upon reaching the limit, SAS will prompt the user to clear the log
window. This prompt halts SAS execution until the user responds to it.
To avoid interruptions in computation, the user can send the SAS log to a separate text
file using logflag. Another useful log option (macrologflag) pertains to the debugging of
the macro program. If errors occur during execution, the user can
review a detailed log of the compiled SAS code with values substituted for each macro
variable.
Genotyping Reports  
SNPBLK can output the call rates for each SNP and sample. Using the parameter
callreportflag, the user can generate SNP and sample call reports. The SNP
call report lists the number and percentage of cases/controls/all samples that failed
genotyping for each SNP. Likewise, the sample call report lists the number and
percentage of SNPs that failed genotyping for each sample. Examples of
genotyping call reports are given in Tables 1 and 2. Note that the % of SNP/sample failure
is the # of failed SNPs/samples divided by the total # of unscreened SNPs/samples in the dataset.
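The two call reports amount to counting missing genotypes along the two axes of the genotype table. The following Python sketch shows the arithmetic on a toy dataset; it is not the SNPBLK implementation, and the function name `call_reports`, the data layout, and the toy IDs are assumptions made for illustration. Only the missing code “0 0” is taken from the input format described above.

```python
MISSING = "0 0"  # missing-genotype code required by the input format

def call_reports(genotypes):
    """genotypes: {sample_id: {snp_id: "A G"}}.
    Returns (snp_report, sample_report); each maps an ID to
    (number failed, proportion failed) over the unscreened data."""
    samples = list(genotypes)
    snps = list(next(iter(genotypes.values())))
    snp_report = {}
    for snp in snps:
        failed = sum(genotypes[s][snp] == MISSING for s in samples)
        snp_report[snp] = (failed, failed / len(samples))
    sample_report = {}
    for s in samples:
        failed = sum(genotypes[s][snp] == MISSING for snp in snps)
        sample_report[s] = (failed, failed / len(snps))
    return snp_report, sample_report

toy = {
    "C000002": {"snp1": "A A", "snp2": "A G", "snp3": "0 0", "snp4": "G G"},
    "C000003": {"snp1": "0 0", "snp2": "G G", "snp3": "0 0", "snp4": "A G"},
}
snp_report, sample_report = call_reports(toy)
```

Both denominators use the full, unscreened dataset, in line with the note above.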
SNP/Sample Screening
We implemented several screening criteria for exclusion of SNPs and samples,
which can be switched on or off by the parameter screenflag. The user can exclude SNPs
by 1) SNP genotyping failure rate (screensamplerate), 2) overall MAF (screenmaf), 3)
HWE p-value (screenhwe), and 4) force-out SNPs (forcesnps). For 1), the user can specify
the maximum threshold for genotyping failure. Any SNP with a higher failure rate will be
excluded. For 2), the user can specify the minimum overall MAF threshold. Any SNP
with a lower overall MAF will be excluded. For 3), the user can specify the p-value
threshold for the HWE test. For studies with multiple ethnic populations, the user can also
specify the minimum number of populations that must fail HWE in order for a SNP to be
excluded (npophwe). For 4), the user can force-exclude individual SNPs by listing their
SNP IDs.
For sample screening, the user can exclude samples based on 1) sample genotyping
failure rate (screensubjects) and 2) force-out samples (forcesubjects) by specifying
the maximum genotyping failure rate and listing sample IDs, respectively.
SNP/sample screening is best used in conjunction with genotyping call reports,
which will contain an extra column “exclude” indicating whether a particular sample or
SNP was excluded and for what reason(s). For passing samples and SNPs, this exclusion
column will be blank.
Sample outputs for SNP/sample screening and exclusion are shown in Table 1
and Table 2.

Table 1: SNP Call Rate and Exclusion
SNP         fail_ca  %_fail_ca  fail_co  %_fail_co  fail_all  %_miss_all  MAF    exclude
EP300_0002  6        0.0037     10       0.0051     16        0.0045      0.124  HWE fail
CYCLIND_01  412      0.2556     81       0.0413     493       0.138       0.452  Genotype fail
FOXA1_003   51       0.0316     54       0.0275     105       0.0294      0.017  Low MAF
MPG_001     25       0.0155     21       0.0107     46        0.0129      0.021  Low MAF/Force out
CREBBP_02   10       0.0062     11       0.0056     21        0.0059      0.263
Reasons for SNP exclusions (exclude): HWE fail=HWE p-value below intended p-value threshold.
Genotype fail=# of sample failures per SNP exceeds failure threshold. Low MAF=MAF below
threshold (SNP too rare). Force out=SNP force-excluded by the user. An empty cell in
“exclude” indicates that a SNP passes the screening criteria. ca=cases, co=controls. Failure
rates in the % columns are expressed as proportions.

Table 2: Sample Call Rate and Exclusion
Labid    n_fail_snp  percent_fail  status  race  sex  age  ageg  exclude
C000002  2           0.0014        1       L     M    52   1
C000003  298         0.21          1       L     F    69   3     Sample fail
C000283  3           0.0021        1       J     F    68   3     Force out
C000331  502         0.35          1       J     F    61   2     Sample/Force out
C000334  4           0.0028        1       J     F    77   3
Reasons for sample exclusions (exclude): Sample fail=# of SNP failures per sample exceeds
failure threshold. Force out=sample force-excluded by the user. An empty cell in “exclude”
indicates that a sample passes the screening criteria. percent_fail is expressed as a proportion.

Table 3: MAF for Whites and by Case-Control Status
SNP          minor  major  n_W  maf_W  n_W_ca  maf_W_ca  n_W_co  maf_W_co
CYP19_004    A      G      454  0.403  120     0.396     334     0.406
CYP19_023    A      G      460  0.035  122     0.029     338     0.037
pgr_012      C      T      428  0.357  119     0.328     309     0.369
igfbp1_001   G      A      451  0.356  121     0.393     330     0.342
igfbp3_05    C      T      459  0.207  122     0.225     337     0.200
hsd17b2_023  G      A      455  0.410  122     0.422     333     0.405
Minor and major alleles are based on the overall population. ca=cases, co=controls.

Sequential Screening  
By default, the above screenings of SNPs and samples are performed
independently. The screening of SNPs is based on the genotyping rates of unscreened
samples, and the screening of samples is based on the genotyping rates of unscreened
SNPs. However, users may want to first screen SNPs based on unscreened samples,
exclude the failed SNPs, and then screen the samples using the screened SNPs. This
sequential screening method (screenseqflag) can reduce the number of samples dropped
because SNPs with poor assays are removed prior to sample screening.  
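The difference between independent and sequential screening can be seen on a small example. The Python sketch below is illustrative only and does not reproduce SNPBLK; the function `screen`, its argument names, and the toy data are hypothetical, and the thresholds are arbitrary.

```python
MISSING = "0 0"

def fail_rate(values):
    """Proportion of missing genotypes among the given values."""
    values = list(values)
    return sum(v == MISSING for v in values) / len(values)

def screen(genotypes, max_snp_fail, max_sample_fail, sequential=False):
    """Return (kept_snps, kept_samples). genotypes: {sample: {snp: genotype}}."""
    samples = list(genotypes)
    snps = list(next(iter(genotypes.values())))
    # SNPs are always screened against all (unscreened) samples.
    kept_snps = [m for m in snps
                 if fail_rate(genotypes[s][m] for s in samples) <= max_snp_fail]
    # Independent screening rates samples on all SNPs;
    # sequential screening rates them on the passing SNPs only.
    basis = kept_snps if sequential else snps
    kept_samples = [s for s in samples
                    if basis and fail_rate(genotypes[s][m] for m in basis) <= max_sample_fail]
    return kept_snps, kept_samples

toy = {
    "s1": {"a": "A A", "b": "A G", "c": "G G", "bad": "0 0"},
    "s2": {"a": "A G", "b": "A A", "c": "G G", "bad": "0 0"},
    "s3": {"a": "A A", "b": "A A", "c": "A G", "bad": "0 0"},
    "s4": {"a": "G G", "b": "A G", "c": "A A", "bad": "A A"},
}
independent = screen(toy, max_snp_fail=0.5, max_sample_fail=0.2)
sequential = screen(toy, max_snp_fail=0.5, max_sample_fail=0.2, sequential=True)
```

In this toy dataset a single poorly assayed SNP pushes three samples over the sample threshold under independent screening, while sequential screening retains all four samples.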
SNP Clustering  
We implemented several methods of SNP clustering (snpcluster) that allow the
user to analyze selected clusters of SNPs, rather than every SNP in the study. This is
particularly useful for genome-wide scans, where users may be more interested in
SNPs confined to certain regions of the genome. SNPBLK will analyze only the SNPs in the
selected clusters, thus greatly reducing the computation time.
1) Individual SNPs (snplist) – the user can select individual SNPs by listing SNP
IDs.
2) SNP clustering by gene (genelist) – the user can select individual genes by
listing the gene names (e.g., TP53 FANCA MLH3). Only SNPs located within each gene
will be analyzed.
3) SNP clustering by position (chrposlist) – the user can select SNPs within
specific regions of the genome by indicating the start and end positions of the regions
(e.g., 82516749-82694327 48747484-48782349 53339009-55615387). Only SNPs confined to
the regions will be analyzed.
4) All SNPs – every SNP in the study will be included in the analysis.
For 1), 2), and 3), the user can input as many as 5000 individual SNPs, genes, and
position ranges.
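Clustering by position reduces to an interval test on each SNP's base position. The sketch below is a Python illustration under stated assumptions, not SNPBLK code: the range string follows the start-end format shown above, the interval bounds are treated as inclusive (an assumption), chromosome is ignored for brevity, and the SNP names and positions are invented.

```python
def parse_ranges(chrposlist):
    """Parse 'start-end start-end ...' into a list of (start, end) integer pairs."""
    ranges = []
    for token in chrposlist.split():
        start, end = token.split("-")
        ranges.append((int(start), int(end)))
    return ranges

def select_by_position(snp_info, chrposlist):
    """snp_info: list of (snp_id, base_position).
    Keep the SNPs whose position falls inside any listed range."""
    ranges = parse_ranges(chrposlist)
    return [snp for snp, pos in snp_info
            if any(start <= pos <= end for start, end in ranges)]

snp_info = [("snpA", 82600000), ("snpB", 48700000), ("snpC", 48750000)]
selected = select_by_position(snp_info, "82516749-82694327 48747484-48782349")
```

Only the selected SNPs would then be passed on to the downstream computations, which is where the saving in computation time comes from.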
Sample Stratification  
The automation of sample stratification is a key component of SNPBLK. For
studies with multiple stratifiers, SNPBLK can automatically create stratified sample
datasets that undergo routine computations. We implemented several sample stratification
schemes.
One-way stratification (stratvarsflag) – the user can perform 1-way stratification by
indicating the names of the stratifying variables (stratvars) and the specific categories
within each stratifying variable (stratvarsvals). The user can input as many stratifying
variables as desired. The user can also limit analyses to particular strata in each stratifying
variable. For example, if the user only wants to analyze a particular BMI stratum (e.g.,
BMI>23) out of all possible BMI strata defined in the data, the user would input the code
value for BMI>23 in the parameter stratvarsvals. Note that, when entering the stratifying
variable names and stratum values, variable names are separated by spaces and the groups of
strata values for the different variables are separated by vertical bars (cf. Appendix D).
Furthermore, the order of the groups of strata values must be identical to the order of the
variable names. In other words, the first group of strata values refers to the first stratifying
variable entered, the second group refers to the second variable, and so on.
Case stratification (castratflag) – this is almost identical to one-way stratification
except that stratification is only performed on the case samples while the control samples
remain unstratified. For example, the user may be interested in the comparison between
case samples with localized and advanced disease. As before, the user can enter as many
case stratifying variables (castrat) and desired strata for each variable (castratval) as needed.
Two-way stratification (twostratflag) – users may want to stratify the analyses on
two variables jointly. A common example of two-way stratification is race and sex.
Suppose there are 5 race groups in the study; two-way stratification with sex would then
yield 10 race- and sex-specific strata. To perform two-way stratification, the user should
input the names of the paired stratifying variables (twostratvars) separated by ‘*’, as in
race*sex, assuming the variables are named accordingly in the dataset. Any number of
stratifying variable pairs can be entered. The user can also indicate the specific two-way
stratified groups to be analyzed. Suppose the 5 race groups are Whites (W), African Americans
(B), Japanese (J), Hawaiians (H), and Latinos (L); the user can limit the analyses to, for
example, African American women and Japanese men by entering B*F and J*M in the parameter
twostratvarvals. As before, any number of paired stratified groups can be entered, as long as
their input order matches the order of the variable pairs they refer to.
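The count of two-way strata and the selection of specific groups can be illustrated in a few lines. This Python sketch is not SNPBLK code; the race and sex codes are those of the example above, while `select_strata` and its behavior on an empty request are assumptions made for illustration.

```python
from itertools import product

races = ["W", "B", "J", "H", "L"]
sexes = ["M", "F"]

# Every joint race*sex stratum: 5 race groups x 2 sexes = 10 strata.
all_strata = [f"{r}*{s}" for r, s in product(races, sexes)]

def select_strata(all_strata, requested):
    """Keep only the requested strata (e.g. 'B*F J*M');
    an empty request keeps every stratum."""
    wanted = requested.split()
    if not wanted:
        return list(all_strata)
    unknown = [w for w in wanted if w not in all_strata]
    if unknown:
        raise ValueError(f"unknown strata: {unknown}")
    return wanted

chosen = select_strata(all_strata, "B*F J*M")
```

Rejecting unknown stratum codes early is a design choice of this sketch; it mirrors the earlier remark that most execution errors stem from incorrect parameter values.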
N-way stratification - we are currently implementing N-way sample stratification,
even though it is uncommon to stratify samples on more than 3 variables. Once
implemented, N-way stratification will simplify the sample stratification scheme. The
stratification parameters previously described will be generalized. The user will be
allowed to input different types of stratification (e.g., 1-way, 2-way, 3-way) in a single
parameter in the macro call (e.g., BMI race sex race*sex BMI*race BMI*sex
race*sex*BMI). The desired strata for each type of stratification can also be entered in a
single parameter.
SNP clustering and sample stratification can be used in conjunction, allowing the
user to limit analyses to particular clusters of SNPs for particular strata of samples.
MAF  
Minor allele frequency (mafflag) is a simple yet essential summary statistic,
particularly for SNP screening. MAF computation is performed using PLINK. However,
SNPBLK facilitates the computation procedure by automatically formatting the input
genotype datasets and the raw MAF results. The user is not required to input PLINK-specific
commands. The user can select any possible combination of 1- or 2-way stratification
for MAF analysis. Within each stratum, the user can choose to further stratify MAF by
disease status (mafstatusflag). The user can also select the number of decimal places of the
MAF (mafdecimal). Sample MAF output is shown in Table 3. Note that the MAF in each stratum
is based on the minor allele of the overall sample, so a stratum-specific MAF may occasionally
exceed 0.5 if allele inversions occur among population strata.
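The remark about stratum-specific MAF exceeding 0.5 follows directly from fixing the minor allele on the pooled sample. The Python sketch below illustrates this on invented genotypes; it is not how SNPBLK or PLINK computes MAF, it assumes a biallelic SNP, and the names `allele_counts` and `stratified_maf` are hypothetical.

```python
from collections import Counter

def allele_counts(genotypes):
    """Count alleles in genotype strings such as 'A G'; '0 0' (missing) is skipped."""
    counts = Counter()
    for g in genotypes:
        if g != "0 0":
            counts.update(g.split())
    return counts

def stratified_maf(genotypes_by_stratum):
    """Frequency, within each stratum, of the allele that is minor in the pooled sample."""
    pooled = allele_counts(g for gs in genotypes_by_stratum.values() for g in gs)
    minor = min(pooled, key=pooled.get)
    result = {}
    for stratum, gs in genotypes_by_stratum.items():
        counts = allele_counts(gs)
        result[stratum] = counts[minor] / sum(counts.values())
    return minor, result

toy = {
    "W": ["A A", "A A", "A A", "A G", "0 0"],
    "J": ["G G", "G G", "A G"],
}
minor, maf = stratified_maf(toy)
```

Here G is the minor allele overall, yet it is the more common allele in stratum J, so the reported value for J is above 0.5.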
HWE
Hardy-Weinberg equilibrium (hweflag) is another important summary statistic
computed using PLINK. As with MAF, the user does not need to input any PLINK-specific
commands in SNPBLK. Like MAF, HWE analysis can be stratified by 1 or 2 variables,
and each stratum is further stratified by disease status, though users may only be
interested in HWE among control samples. There are two HWE tests available (hwetest):
the standard Pearson chi-square test (chisq) and the more recently developed exact
test [10] (exact). The latter test is more suitable for testing rare SNPs (MAF<0.01).
Besides p-values from HWE testing, genotype frequencies and expected and observed
heterozygosity are also output (cf. Table 4).
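The heterozygosity columns and the Pearson chi-square version of the test can be computed from the three genotype counts alone. The following Python sketch is an illustration, not the PLINK or SNPBLK implementation, and it does not implement the exact test; `hwe_summary` is a hypothetical name. The counts in the example are those of rs3800378 in Table 4.

```python
import math

def hwe_summary(n_aa, n_ab, n_bb):
    """Observed/expected heterozygosity and the 1-df Pearson chi-square
    HWE p-value from the three genotype counts of a biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele a
    q = 1.0 - p
    o_het = n_ab / n                     # observed heterozygosity
    e_het = 2 * p * q                    # expected heterozygosity under HWE
    expected = (n * p * p, n * e_het, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return o_het, e_het, chi2, p_value

o_het, e_het, chi2, p_value = hwe_summary(39, 71, 55)
```

For these counts the observed and expected heterozygosity round to 0.4303 and 0.4953, the values listed in Table 4. The p-value will differ from the one in the table, which comes from the exact test.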
Logistic Regression  
The odds ratio (OR) and its associated confidence interval (CI) are the most indicative
measures of SNP association. SNPBLK computes linear trend and ordinal ORs and
associated CIs using maximum likelihood estimation in the logistic regression
procedure in SAS. SNPBLK also performs the Breslow-Day heterogeneity test for
stratum-specific risks. The association testing is controlled by the parameter assoflag.
“Tall” Format
Before actual computation, one essential formatting task is automatically
performed by SNPBLK. The standard “wide” format genotype dataset is transposed to a
“tall” format dataset. This transposition greatly accelerates computation in proc
logistic. This simple but effective genotype data reformatting has been adopted for a
genome-wide association study [11]. Let us describe why and how this transposition is done.
Ordinarily, to analyze the standard “wide” genotype dataset in SAS, one logistic
procedure must be run for each SNP variable. This is highly inefficient since,
regardless of the simplicity of the computation, each regression procedure requires a certain
base execution time and overhead computer memory. Running a large number of
procedures results in a considerable slowdown in computation. Eventually, as
memory is depleted, computation slows to a near halt. SAS has attempted to
address this “memory leak” issue by offering a patch program, which we found to be
ineffective. The best solution to this problem is to reduce the number of regressions
required by converting a horizontal genotype dataset with M SNP genotype variables and
N samples (assuming M>>N) to a vertical dataset with 1 genotype variable and M*N
rows. The single SNP genotype variable includes all genotypes for every sample.
Intuitively, we can think of this data transformation as chopping up the standard “wide”
genotype dataset by each SNP variable and vertically stacking the single-SNP genotype
datasets (including the covariates) on top of each other. Such a reconfiguration can be easily
done using the transpose procedure in SAS. After converting from “wide” to “tall”
format, we only need to apply one logistic procedure to compute every SNP in the
dataset. In the “tall” format, each SNP is treated as a stratum, rather than as a
variable. Therefore, we run proc logistic with one genotype variable and the other
covariate variables in the model BY each SNP. Obviously, different sample clusters and
types of OR still require separate logistic procedures, but the number of required calls to
proc logistic is greatly reduced. For example, the DNA repair colorectal cancer
study [12], which has roughly 1400 SNPs with 2 stratifying variables (race, sex), would
require almost 50,000 separate calls to proc logistic to compute all possible 1-way and 2-way
stratified trend and ordinal OR estimates using the standard “wide” genotype format. In
comparison, the “tall” genotype format requires only 36 procedure calls to compute the
same analyses. Because we run a small number of regression procedures using the
“tall” format, we do not exhaust computer memory, allowing computation speed to
remain undiminished. To gauge the speed of logistic computation in SAS using the “tall”
format, we simulated a genotype dataset with 120K SNPs and 4300 samples. Using a
modest CPU (Pentium Centrino II 2.0 GHz), SAS computed the overall trend OR and
p-values for 120K SNPs in approximately 3 hours, averaging over 600 SNPs per minute.
This is almost a 10-fold improvement over the same analysis using the untransposed “wide”
format, where computation speed averages only about 60 SNPs per minute, not to mention
the eventual deceleration due to the memory leak. One weakness of this data transposition
approach, however, is that the “tall” format requires a tremendous amount of disk space. Our
120K simulated SAS dataset occupies over 20 gigabytes. We can expect a “tall” genome
scan dataset to exceed 200 gigabytes in size. The “tall” format is, in fact, a terribly
inefficient way to store a genotype dataset, because the data for every sample are needlessly
duplicated once per SNP in the study. “Tall” formatting is a maneuver to fit the
stratum-specific computation in SAS proc logistic. Nevertheless, in practice, the disk space
issue for a “tall” dataset is rather inconsequential, since hard drives with terabytes of
capacity can be readily obtained. The SAS code for the conversion to the “tall” genotype
format and the subsequent logistic regression is shown in Appendix C.
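The “wide” to “tall” stacking described above can be shown on a toy dataset. The Python sketch below is an analogy only; the actual conversion is done with the SAS transpose procedure (Appendix C), and the function `wide_to_tall`, the variable names, and the toy records are hypothetical.

```python
def wide_to_tall(wide_rows, id_var, snp_vars, covariate_vars):
    """Stack one block of rows per SNP: each output row holds the sample ID,
    the covariates, the SNP name, and that SNP's genotype."""
    tall = []
    for snp in snp_vars:                      # one stacked block per SNP
        for row in wide_rows:
            record = {id_var: row[id_var], "snp": snp, "genotype": row[snp]}
            for cov in covariate_vars:
                record[cov] = row[cov]        # covariates repeat once per SNP
            tall.append(record)
    return tall

wide = [
    {"id": "C000002", "sex": "M", "snp1": "A A", "snp2": "A G", "snp3": "G G"},
    {"id": "C000003", "sex": "F", "snp1": "A G", "snp2": "0 0", "snp3": "G G"},
]
tall = wide_to_tall(wide, "id", ["snp1", "snp2", "snp3"], ["sex"])
```

With N=2 samples and M=3 SNPs the result has M*N=6 rows, and the `snp` column plays the role of the BY variable. The repetition of the covariates in every block is also what makes the “tall” format costly in disk space. For the benchmark above, 120,000 SNPs in about 180 minutes corresponds to roughly 667 SNPs per minute.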
SAS vs. PLINK
Some may ask why we do not call PLINK to compute ORs as we did for
HWE and MAF, instead resorting to a seemingly clumsy SAS regression procedure.
After all, PLINK has implemented a proficient multivariate logistic regression function
that outputs the same trend and ordinal ORs, CIs, and p-values as SAS. We also initially
expected that PLINK would exceed SAS in computation speed, since SAS requires
larger overhead memory that hinders CPU processing. However, we benchmarked
logistic regression speed in SAS (using the “tall” formatted genotype dataset)
and PLINK and found that the two programs are actually equally fast in their logistic
computation: PLINK also averages about 600 SNPs per minute for a dataset with 4300
samples. Stratified analyses with smaller sample sizes would of course be even faster in
both programs. With such computation speed, both PLINK and SAS are certainly capable
of analyzing large-scale genotype datasets. However, we are much more inclined to use
SAS over PLINK for logistic computation because it is more troublesome to set up a
regression model in PLINK than in SAS. To adjust for categorical covariates in
PLINK, for example, the user is required to first convert the categorical variables to
dummy-coded variables before they can be included in the model. SAS, on the other hand,
allows the user to define categorical covariates more easily using the class
statement in proc logistic without any additional variable formatting. SAS allows the user
to explicitly declare the logistic model, while PLINK offers a somewhat implicit model
definition. There are also other issues in output format that make PLINK much less
attractive than SAS for association testing. Overall, we find SAS to be a more suitable
platform than PLINK for logistic regression computation.
OR Output
As with HWE and MAF, SNPBLK can automatically stratify logistic regression
for different possible sample and SNP clusters. SNPBLK automatically adjusts for
traditional confounders like age, sex, and race if they are present in the dataset. The user
can also adjust for other potential confounders using the parameter covarlist. Two types
of ORs can be output: ordinal and/or trend (ordorflag). The user can choose the
decimal places of the OR (ordecimal) and the concatenation of OR and CI (orclflag). If
SNP screening was performed, the user can choose to show results with or without the
excluded SNPs (exsnpsflag). Of course, the ORs are computed without excluded samples.
Another useful feature is tophitsflag, which outputs the top N SNPs with the most significant
overall p-values. N can be any number smaller than the total number of SNPs in the study
(ntophits). This feature provides a succinct summary of the most significant SNPs in the
association analysis. Lastly, it is possible to combine association, MAF, and HWE
outputs in a single summary table using the parameter assohwemafflag. Sample outputs
for dosage and ordinal SNP association (OR) from SNPBLK can be found in Tables 5
and 6.

Table 4: Exact HWE test in Black Controls
snp        gene   n_rec_B_co  n_het_B_co  n_dom_B_co  o_het_B_co  e_het_B_co  p_hwe_B_co
rs4150474  XPB    36          244         317         0.4087      0.3892      0.25
rs1805409  PARP1  0           7           592         0.01169     0.01162     1.0
rs3800378  FANCE  39          71          55          0.4303      0.4953      0.12
rs958535   XRCC4  0           24          579         0.0398      0.03901     1.0
rs3803467  NEIL1  0           0           167         0           0           1.0
Genotype counts for homozygous recessive (n_rec), heterozygous (n_het), and homozygous
dominant (n_dom) genotypes. Observed heterozygosity (o_het), expected heterozygosity (e_het),
and p-value for the exact test (p_hwe).

Table 5: Linear Dosage Association for Overall Population
snp        minor  major  n_all_ca  n_all_co  or_all  lcl_all  ucl_all  p_all  or_cl_w          het_race_p
CYP19_001  T      A      840       2104      1.07    0.93     1.22     0.34   1.07(0.93,1.22)  0.79
CYP19_002  C      G      840       2065      1.02    0.88     1.18     0.81   1.02(0.88,1.18)  0.36
CYP19_003  T      C      838       2081      0.94    0.83     1.05     0.27   0.94(0.83,1.05)  0.22
Odds ratio (or), confidence interval (lcl, ucl), and p-values (p) estimated from unconditional
logistic regression adjusted for potential covariates (e.g., age, race) in the study. P-value
for heterogeneity among race groups estimated using the Breslow-Day heterogeneity test
(het_race_p).

Table 6: SNP Dosage/Ordinal Association for Overall Population
snp        effect  chr  gene  n_all_ca  n_all_co  or_all  lcl_all  ucl_all  or_cl_all          p_all
rs4150474  trend   2    XPB   1263      2983      0.99    0.85     1.14     0.99 (0.85, 1.14)  0.84
rs4150474  A/C     2    XPB   275       755       0.92    0.77     1.10     0.92 (0.77, 1.1)   0.36
rs4150474  C/C     2    XPB   49        108       1.13    0.78     1.63     1.13 (0.78, 1.63)  0.51
Ordinal OR: wildtype/variant vs. wildtype/wildtype and variant/variant vs. wildtype/wildtype

P-value Significant Figures
SAS and other statistical software packages often output raw p-values with many digits.
Unlike other statistics such as OR and CI, raw p-values should not be rounded to a fixed
number of decimal places defined by the user, because a very small p-value would then be
reduced to zero. Instead, p-values should be rounded to a designated number of significant
figures regardless of their magnitude. To do so, we have implemented a parameter for
designating the number of significant figures in p-values (pvalsig). Surprisingly, SAS does
not have a specialized function for designating significant figures, so this function had to
be implemented as a macro (cf. SAS Macro Functions). The parameter pvalsig applies to all
types of p-values, whether they come from HWE testing or logistic OR estimation.
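Rounding to significant figures rather than decimal places can be sketched as follows. This is a Python illustration of the general technique, not the SAS macro function used in SNPBLK, and the name round_sig is hypothetical.

```python
from math import floor, log10


def round_sig(p, sig=2):
    """Round p to `sig` significant figures, whatever its magnitude."""
    if sig < 1:
        raise ValueError("sig must be at least 1")
    if p == 0:
        return 0.0
    exponent = floor(log10(abs(p)))  # position of the leading digit
    return round(p, sig - 1 - exponent)


print(round_sig(0.000123456, 2))  # small p-value keeps two meaningful digits
print(round_sig(0.84, 2))
```

Rounding 0.000123456 to two decimal places would give 0.0, whereas two significant figures give 0.00012, which is why the significant-figure rule is preferred for p-values.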
Haplotype/Imputed SNP Association  
Two important haplotype-based association analyses are currently being automated in
SNPBLK: predicted haplotype and imputed SNP associations. Haplotype prediction and SNP
imputation are well implemented in TagSNPs, which, like PLINK, can be executed
automatically from a SAS macro. In fact, haplotype prediction using TagSNPs has already
been automated in GENEBLK¹², a SAS macro program that outputs a haplotype prediction
dataset containing various types of predicted haplotype dosages for each case-control
sample according to user-defined haplotype blocks in a gene of interest. SNPBLK is modeled
largely after GENEBLK. Because they share many design cues, SNPBLK and GENEBLK can be
merged without execution conflicts. SNPBLK can enhance GENEBLK by automating haplotype
prediction for multiple genes or chromosomal segments using its SNP clustering function
described earlier. The user can input multiple genes or chromosomal segments intended for
haplotype prediction. The user can also choose to define block structures manually for
each gene or chromosomal segment, or allow TagSNPs to define haplotype blocks
automatically. After the predicted haplotype dataset is generated, SNPBLK will compute
various types of OR. Linking GENEBLK and SNPBLK promises complete automation of haplotype
association analysis, requiring the user only to input putative genes or chromosomal
segments (coordinates) for testing haplotype risk. We believe such an approach will greatly
facilitate haplotype analysis for genome-wide scans, where numerous genes and chromosomal
segments are likely to be considered for putative haplotype risk.
Similarly, we would like to automate SNP imputation procedures in SNPBLK using TagSNPs.
Several procedures need to be automated. First, we need to generate pairwise tag tests for
a densely genotyped reference panel (e.g., HapMap) using typed tags (in the case-control
dataset) as predictors. Second, we need to impute allele dosage for untyped SNPs using the
pairwise tests. Lastly, we need to generate an imputed genotype dataset that can be used
for imputed SNP association. Once completed, the user should only be required to specify
the reference panel dataset and the criteria for pairwise tagging.
LD Block Display
One of the most useful features of Haploview, a pioneering software package that
specializes in haplotype analysis, is its cleverly designed graphical display of linkage
disequilibrium (LD) block structures, which has been cited frequently in the literature.
Therefore, we would like to automate the Haploview LD display in SNPBLK as well. Like
TagSNPs and PLINK, Haploview can be executed from SNPBLK via the DOS command line.
Ordinarily, to display LD structures for genes or regions of interest in Haploview, the
user needs to supply formatted genotype and SNP information text files containing exactly
the SNPs in the intended genes or regions. For large-scale genotype scans where many
putative genes and regions are involved, preparing these input files is time-consuming and
error-prone. Again, because SNPBLK has implemented a simple and comprehensive SNP
clustering function, the user can easily input multiple genes or chromosomal regions in
the macro call without any manual formatting. SNPBLK would automatically call Haploview to
process the formatted files and display LD block structures as well as other haplotype
statistics.
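The SNP clustering step that precedes such a call can be sketched as a simple filter over the SNP information dataset. The example below is a Python illustration under stated assumptions, not the SNPBLK code: the function names are hypothetical, the keys snp_id, gene, and position mirror the variable names in the macro call of Appendix D, the three SNP rows are invented for illustration, and chromosome matching is ignored for brevity.

```python
def parse_ranges(chrposlist):
    """Parse a space-separated list such as '100-200 500-900'
    into a list of (start, end) integer pairs."""
    ranges = []
    for token in chrposlist.split():
        start, end = token.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def select_snps(snpinfo, genes=(), ranges=()):
    """Return IDs of SNPs that lie in a listed gene or positional range."""
    wanted = set(genes)
    selected = []
    for row in snpinfo:
        in_gene = row["gene"] in wanted
        in_range = any(lo <= row["position"] <= hi for lo, hi in ranges)
        if in_gene or in_range:
            selected.append(row["snp_id"])
    return selected


# Invented SNP rows, for illustration only
snpinfo = [
    {"snp_id": "snpA", "gene": "TP53", "position": 7500000},
    {"snp_id": "snpB", "gene": "XPB", "position": 82600000},
    {"snp_id": "snpC", "gene": "XPB", "position": 100},
]
chosen = select_snps(snpinfo, genes=["TP53"],
                     ranges=parse_ranges("82516749-82694327"))
print(chosen)
```

The selected SNP list is what would then be written to the formatted genotype and SNP information files passed to the external program.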
Permutation Testing
Another future implementation on our agenda is permutation testing, which provides
verification for significant hits from multiple association tests. One simple but
effective permutation test is the reshuffling of samples, which alters the relationships
between genotypes and disease status but does not affect the relationships among SNPs
(e.g., LD structure). For large-scale genotype datasets, this is clearly a very
computationally intensive procedure. Therefore, we would like to adopt an adaptive
permutation approach, in which we preferentially permute SNPs that appear to have a better
chance of achieving high significance during permutation. SNPs with relatively large
p-values (e.g., 0.8) will be dropped after a few permutations, since it is unlikely that
these SNPs will ever achieve high significance. Because association testing is already
done in SAS, we would like to implement this adaptive permutation approach in SAS as well,
without relying on external programs. We believe permutation testing can be efficiently
implemented using SAS macro programming, as shown by published macro programs¹³,
particularly for case-control genotype data¹⁴.
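The adaptive scheme described above can be sketched in a few lines. This is a Python illustration under several assumptions, not the planned SAS implementation: the test statistic (absolute difference in mean allele dosage between cases and controls), the (b+1)/(m+1) empirical p-value, and the parameter names check_after and drop_above are all choices made for this example. Disease status is shuffled once per permutation and applied to every SNP, so relationships among SNPs are left intact.

```python
import random


def dosage_gap(dosages, status):
    """Absolute difference in mean dosage between cases (1) and controls (0)."""
    cases = [d for d, s in zip(dosages, status) if s == 1]
    controls = [d for d, s in zip(dosages, status) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def adaptive_permutation(genotypes, status, n_perm=1000,
                         check_after=20, drop_above=0.8, seed=0):
    """Return {snp: (empirical p-value, permutations used)}.

    After `check_after` permutations, a SNP whose running p-value
    exceeds `drop_above` is dropped from further permutation."""
    rng = random.Random(seed)
    observed = {snp: dosage_gap(d, status) for snp, d in genotypes.items()}
    exceed = {snp: 0 for snp in genotypes}
    used = {snp: 0 for snp in genotypes}
    active = set(genotypes)
    shuffled = list(status)
    for i in range(1, n_perm + 1):
        if not active:
            break
        rng.shuffle(shuffled)  # one shuffle shared by all SNPs
        for snp in active:
            if dosage_gap(genotypes[snp], shuffled) >= observed[snp]:
                exceed[snp] += 1
            used[snp] = i
        if i >= check_after:
            active = {snp for snp in active
                      if (exceed[snp] + 1) / (i + 1) <= drop_above}
    return {snp: ((exceed[snp] + 1) / (used[snp] + 1), used[snp])
            for snp in genotypes}


# Toy data: one perfectly associated SNP and one monomorphic SNP
status = [1] * 10 + [0] * 10
genotypes = {"assoc": [2] * 10 + [0] * 10, "null": [1] * 20}
result = adaptive_permutation(genotypes, status, n_perm=200)
print(result)
```

In this toy example the monomorphic SNP is dropped at the first check, while the associated SNP is carried through all permutations, which is where the computational saving comes from.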

References

14. Dana Aeschliman. SAS MACROs for SNP-phenotype association studies:
implementations of the MAX test and MAX-maxT algorithms. Marie-Pierre Dubé
Statistical Genetics Research Group, Montreal Heart Institute. October 19, 2007

13. G.K. Balasubramani, Stephen R. Wisniewski, Hongwei Zhang, Heather F. Eng.
Development of an efficient SAS macro to perform permutation tests for two independent
samples. Computer Methods and Programs in Biomedicine (2005) 79, 179-187

9. Cheng I, Penney KL, Stram DO, Le Marchand L, Giorgi E, Haiman CA, Kolonel LN,
Pike M, Hirschhorn J, Henderson BE, Freedman ML. Haplotype-based association
studies of IGFBP1 and IGFBP3 with prostate and breast cancer risk: the multiethnic
cohort. Cancer Epidemiol Biomarkers Prev. 2006 Oct;15(10):1993-7.

11. Gauderman, WJ et al., A Genome-wide Association Study of Childhood Respiratory
Outcomes. [In press]

4. Haiman CA, Hsu C, de Bakker PI, Frasco M, Sheng X, Van Den Berg D, Casagrande
JT, Kolonel LN, Le Marchand L, Hankinson SE, Han J, Dunning AM, Pooley KA,
Freedman ML, Hunter DJ, Wu AH, Stram DO, Henderson BE. Comprehensive association
testing of common genetic variation in DNA repair pathway genes in relationship with
breast cancer risk in multiple populations. Hum Mol Genet. 2008 Mar 15;17(6):825-34.
Epub 2007 Dec 3.

8. Haiman CA, Stram DO, Pike MC, Kolonel LN, Burtt NP, Altshuler D, Hirschhorn J,
Henderson BE.  A comprehensive haplotype analysis of CYP19 and breast cancer risk:
the Multiethnic Cohort.  Hum Mol Genet. 2003 Oct 15;12(20):2679-92. Epub 2003 Aug
27.

12. Haiman CA, de Bakker PI, Frasco M, Sheng X, Van Den Berg D, Stram DO,
Henderson BE, Comprehensive association testing of common genetic variation in DNA
repair pathway genes in relationship with colorectal cancer risk in multiple populations.
[In Preparation]

6. Li Y and Abecasis GR (2006) Mach 1.0: Rapid Haplotype Reconstruction and Missing
Genotype Inference. Am J Hum Genet S79 2290

1. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J,
Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007)  PLINK: a toolset for whole-
genome association and population-based linkage analysis. American Journal of Human
Genetics, 81.

3. Qin ZS, Niu T and Liu JS (2002) Partition-Ligation EM Algorithm for Haplotype
Inference with Single Nucleotide Polymorphisms. Am. J. Hum. Genet. 71 1242-7

5. Servin B, Stephens M. Imputation-based analysis of association studies: candidate
regions and quantitative traits. PLoS Genet. 2007 Jul;3(7):e114. Epub 2007 May 30.

2. Daniel O. Stram, Christopher A. Haiman, Joel N. Hirschhorn, David Altshuler,
Laurence N. Kolonel, Brian E. Henderson, Malcolm C. Pike. Choosing Haplotype-Tagging
SNPS Based on Unphased Genotype Data Using a Preliminary Sample of Unrelated Subjects
with an Example from the Multiethnic Cohort Study. Hum Hered 2003;55:27-36

7. Stram DO. GENEBLK-SAS Macro for Haplotype Imputation

10. Wigginton JE, Cutler DJ and Abecasis GR. A Note on Exact Tests of Hardy-Weinberg
Equilibrium. Am J Hum Genet (2005) 76: 887-93

Appendix A – Basic PLINK Usage

Pedfile (ID, sex, status, genotypes..)  
ID2722   2    0    A G C T
ID2220   1    1    A A T T

Mapfile (chromosome, SNP name, positions)
12  snp1 83294214  
4   snp2  12453566

// The order of genotypes and SNP names must match.

Cluster file (family ID, individual ID, strata)

ID2722    ID2722         J_0
ID2220    ID2220         W_1
ID1602    ID1602         J_1
ID1826    ID1826         W_0

// This yields 4 strata in the analysis; does not apply to HWE

// Covariate files (family ID, individual ID, covariates)
// two linear covariate variables
ID2722    ID2722         6    1
ID2220    ID2220         4    3
ID1602    ID1602         5    4

// two categorical (dummy) covariates with 6 and 4 levels, respectively
FID IID COV1_5 COV1_1 COV1_3 COV1_2 COV1_4 COV2_3 COV2_4 COV2_2
ID2722    ID2722     0 0 0 0 0 0 0 0
ID2220    ID2220     0 0 0 0 0 1 0 0
ID1602    ID1602     1 0 0 0 0 0 1 0

// dummy coding can be converted from linear coding using
--write-covar --dummy-coding

MAF
--ped name.ped // genotype file; genotypes coded 'A A' or '1 1', with '0 0' for missing
--no-fid  // no family ID required
--no-parents  // no linkage information required
--map name.map  // SNP information file
--map3 // 3 fields in SNP information file
--out result // output file name
--freq  // MAF analysis
--within cluster.txt // stratification cluster file
--mind 1 // maximum per-sample missing genotype rate; samples above it are excluded
--geno 1 // maximum per-SNP missing genotype rate; SNPs above it are excluded
--maf 0  // MAF threshold for SNP exclusion
--hwe 0  // HWE p-value threshold for SNP exclusion
--1     // disease status coding control=0 case=1
--noweb  // skip the web check for the most recent version

HWE
--hardy   // cannot use --within for stratification

Recoding
--recodeA   // from genotype coding to minor allele count coding

Association
--logistic  // run logistic regression
--genotypic // ordinal OR
--hethom   // heterozygote and homozygote terms instead of additive and dominance-deviation terms
--ci 0.95  // confidence interval
--covar covariate.txt  // adjust for linear or categorical covariates
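The options above are combined into a single command line when PLINK is executed from another program. The sketch below, in Python rather than SAS, only assembles the argument list from the flags listed in this appendix; the function name plink_command and the executable name plink are assumptions, and the command is built but not run.

```python
def plink_command(ped, mapfile, out, analysis=("--freq",), plink="plink"):
    """Assemble a PLINK argument list from the options in this appendix."""
    command = [plink, "--ped", ped, "--map", mapfile, "--map3",
               "--no-fid", "--no-parents", "--noweb", "--out", out]
    command.extend(analysis)
    return command


# MAF analysis
maf_cmd = plink_command("name.ped", "name.map", "result")

# Covariate-adjusted ordinal association with 95% confidence intervals
asso_cmd = plink_command(
    "name.ped", "name.map", "result",
    analysis=("--logistic", "--genotypic", "--hethom",
              "--ci", "0.95", "--covar", "covariate.txt", "--1"),
)
print(" ".join(maf_cmd))
print(" ".join(asso_cmd))
```

A list built this way could be passed to subprocess.run to execute PLINK, which corresponds to the external program execution that SNPBLK performs from SAS.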

Appendix B – TagSNPs Imputation

snps="snpinfo.txt";  // SNP info. for reference dataset
file="reference.txt"  // reference genotype dataset
format="(a4,t26,8000i2)" // format of genotype dataset: sample ID has length 4;
                         // genotype data starts at column 26 and has "A A" format
code=2 lrecl=15000;   // record length
use rs7085679 rs5019235 ; em; // SNP imputation using EM
predict s pif="caco.ped" // case-control genotype dataset (population-specific)
fmt="(a7,t26,8000i2)"   // format of caco genotype dataset
readtags="allcasecontrol.map" // SNP info. for caco dataset

pof="C:..\rs7085679.dat" // directory and name of prediction output file
quit;  // end of imputation

Appendix C – Selected SAS Macro Code  

/* macro function for counting the number of arguments in a phrase */
%macro numargs(arg);
  %local n;
  %let n=1;
  %do %while (%scan(&arg,%eval(&n),%str( ))^=%str());
     %let n=%eval(&n+1);
  %end;
  %eval(&n-1)
%mend numargs;

/* Association testing using "tall" format */
/* varlist, snplist, and covarlist are placeholders for variable lists */
proc sort data=geno;  /* standard "wide" genotype data */
  by varlist;  /* all variables except SNP variables should be included here */
run;
/* convert the "wide" to "tall" format; without an ID statement the
   transposed column is named geno1, so it is renamed to geno */
proc transpose data=geno out=geno_tall(rename=(geno1=geno)) name=snpn prefix=geno;
  by varlist;
  var snplist; /* listing SNP variables */
run;
proc sort data=geno_tall;  /* BY-group processing requires sorting by SNP */
  by snpn;
run;
proc logistic data=geno_tall descending;
    class covarlist; /* listing categorical covariates */
    model status = covarlist geno;
    by snpn;
    title "overall trend OR";
    ods output oddsratios=or ParameterEstimates=p;
run;
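The wide-to-tall conversion performed by the transpose step can be illustrated outside SAS. The sketch below is a Python illustration, not part of SNPBLK: the function name wide_to_tall is hypothetical, the sample rows are invented, and the column names snpn and geno follow the listing above.

```python
def wide_to_tall(rows, snp_columns):
    """Turn one row per subject (one column per SNP) into one row per
    subject-SNP pair, sorted by SNP for by-SNP model fitting."""
    tall = []
    for row in rows:
        fixed = {k: v for k, v in row.items() if k not in snp_columns}
        for snp in snp_columns:
            tall.append({**fixed, "snpn": snp, "geno": row[snp]})
    tall.sort(key=lambda r: r["snpn"])  # stable sort keeps subject order
    return tall


# Invented wide data: two subjects, two SNP dosage columns
wide = [
    {"labid": "ID1", "caco": 1, "rs1": 0, "rs2": 2},
    {"labid": "ID2", "caco": 0, "rs1": 1, "rs2": 1},
]
tall = wide_to_tall(wide, ["rs1", "rs2"])
for record in tall:
    print(record)
```

Once the data are in tall form, a single model statement can be repeated for every SNP by grouping on snpn, which is what the BY statement does in the SAS listing.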

Appendix D - SNPBLK Execution Panel (Macro Call)
filename snpblk url "http://www.scf.usc.edu/~chrishsu/snpblk.sas";
%include snpblk;
filename snpblk clear;

%SNPBLK (
/*files and directories*/
logflag=0,
macrologflag=1,
delsasdataflag=0,
delplinkdataflag=0,
plinkdir=C:\chris\stram\PLINK\plink-1.03-dos,  
lname=bccoe,  
dir=C:\Chris\Stram\bccoe\SASdatasets,
data=geno_ed,
snpinfo=snpinfo,  
covar=covar,
outname=bccoe_tab1,  
outdir=C:\chris\stram\bccoe\results,  
/*variable names*/
id=labid,
statusvar=caco,  
agegvar=ageg,
racevar=ethnic,  
sexvar=,  
castrat=,  
stratvars=epuse_rec,
twostratvars=,
covarlist=,
snpvar=snp_id,  
chromo=chromo,  
genevar=gene,
posvar=position,  
/*snp and subject screening*/  
screenflag=0,
seqflag=0,
snprate=0.1,
forcesamples= ,
samplerate=0.1,
forcesnps=,
minmaf=0,   /* to keep all SNPs (even monomorphic ones), set to a negative value */
minhwep=0,
Nstrathwe=,
/*snp and sample clustering*/
allflag=1,
raceflag=0,  
racegrps=B H J L W,  
sexflag=0,  
sexgrps=M F,    
racesexflag=0,  
castratflag=0,  
castratval=1 2 | 1 2,
stratvarsflag=0,
stratvarsvals=0 1,
twostratflag=0,
twostratvarvals=,
snpcluster=1,  
genelist= TP53 FANCA MLH3,  
chrposlist=82516749-82694327 48747484-48782349 53339009-55615387,
snplist= ,  
/*output selection*/
psigfig=2,
callreportflag=1,  
mafflag=0,  
mafstatusflag=1,
mafdecimal=0.001,
hweflag=0,  
hwetest=exact,
assoflag=0,  
ordorflag=0,  
ordecimal=0.01,
orclflag=1,
exsnpsflag=1,
tophitsflag=0,
ntophits=100,
assohwemafflag=0
);

Appendix E - SNPBLK Parameter Definitions
All Boolean flag variables are defined as 1=yes or 0=no. Parameters marked by ‘*’ must
have input values; other parameters can be left blank.

*logflag – output of SAS execution to a separate text file.  

*macrologflag – detailed macro execution log  

*delsasdataflag – deletion of sas datasets generated during execution.

*delplinkdataflag – deletion of PLINK text files generated during execution

*plinkdir – directory where PLINK executable is stored.

*tagSNPsdir – directory where TagSNPs executable is stored.

*haploviewdir – directory where Haploview executable is stored.

*lname - name of SAS library where input (genotype, SNP info., covariate) datasets are
stored.

*dir – computer directory address for the folder containing input files

*data - name of SAS genotype dataset (coded as "1 1" "2 2"... or "AA" "CC"... with "0 0"
for missing genotype). Besides SNP genotype variables, this dataset must also contain at
least SNP ID and disease status.

*snpinfo - name of SNP information SAS dataset. It should contain at least 3 variables:
SNP ID, chromosome #, and base positions.

*covar – name of covariate dataset.

*outname – name of output EXCEL file desired by the user.

*outdir – directory of output EXCEL file.

*id - variable name for sample identification #(alphanumeric values less than 30
characters in length)

*statusvar - the name of phenotype variable (i.e. disease status), which should
be coded as 0=controls, 1=cases.  

agegvar- variable name for age groups, which should be coded as discrete variable (e.g.,
1, 2, 3...).  

racevar - name of race variable.  

sexvar - name of sex variable.

castrat – case stratification variable names.

stratvars – ordinary stratification variable names.

twostratvars – list of variable names for two-way stratification.

covarlist – list of covariates other than age, sex, and race.

*snpvar - variable name list of SNP names (e.g., rs #) in the snpinfo dataset.

*chromo - variable name for chromosome # (1-22) in the snpinfo dataset.

genevar – variable name for genes.

*posvar - variable name for snp position/coordinate in the snpinfo dataset.

*screenflag – initiation of SNP/sample screening. If no screening is required, all
subsequent screening options can be left blank.

seqflag – sequential screening (first SNP then sample)

snprate – minimum genotyping failure rate for SNPs in each sample.

forcesamples – force sample exclusion by listing sample IDs.
 
samplerate – minimum genotyping failure rate for samples in each SNP.
 
forcesnps -force SNP exclusion by listing SNP IDs.

minmaf – minimum MAF for excluding SNPs.
 
minhwep - minimum HWE p-value (controls only) for   excluding SNPs in
controls.
 
Nstrathwe – number of control strata that must fail HWE for excluding a SNP.

*allflag – analysis for all samples.

*raceflag – stratify by race.

racegrps – selective race groups for stratification.

*sexflag - stratify by sex.

sexgrps – selective sex groups for stratification.

*racesexflag – joint stratification for race and sex.

*castratflag – case stratification.
 
castratval – desired strata for case stratification.

*stratvarsflag – normal stratification.
 
stratvarsvals – desired strata for normal stratification.

*twostratflag – two-way stratification.

twostratvarvals – strata values for two-way stratification.

*snpcluster – 1=all SNPs, 2=by genes, 3=by positional ranges, 4=individual SNPs.

*psigfig – number of significant figures in p-values.

*callreportflag – output of SNP/sample genotype call reports.

*mafflag – output of MAF analysis.

mafstatusflag – stratify MAF by disease status.

mafdecimal – decimal places in MAF.

*hweflag – output of HWE analysis.

hwetest – types of HWE tests (exact or chisq)

*assoflag – output of association analysis.

ordorflag – output of ordinal OR in addition to trend OR.
 
ordecimal – decimal places of OR.

orclflag – output of concatenated OR and CI.

exsnpsflag – output of excluded SNPs if SNP screening was applied.

*tophitsflag – output of top N SNPs with most significant overall p-values.

ntophits – specifying N.

*assohwemafflag – combined outputs of association, HWE, and MAF in one table.

Appendix F – Published Work

Comprehensive Association Testing of Common Genetic Variation in DNA Repair
Pathway Genes in Relationship with Breast Cancer Risk in Multiple Populations

Christopher A. Haiman¹*, Chris Hsu¹, Paul de Bakker², Melissa Frasco¹, Xin Sheng¹,
David Van Den Berg¹, John T. Casagrande¹, Laurence N. Kolonel³, Loic Le Marchand³,
Susan E. Hankinson⁴, Jiali Han⁴, Alison M. Dunning⁵, Karen A. Pooley⁵, Matthew L.
Freedman²,⁶, David J. Hunter⁴, Anna H. Wu¹, Daniel O. Stram¹, Brian E. Henderson¹

¹Department of Preventive Medicine, University of Southern California, Keck School of
Medicine, Los Angeles, California, 90089 USA; ²Program in Medical & Population
Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, 02142 USA;
³Epidemiology Program, Cancer Research Center, University of Hawaii, Honolulu,
Hawaii, 96813 USA; ⁴Epidemiology Department, Harvard School of Public Health,
Boston, Massachusetts, 02115 USA; ⁵Cancer Research UK, Department of Oncology,
Strangeways Research Laboratory, University of Cambridge, UK; ⁶Dana-Farber Cancer
Institute, Department of Medical Oncology, Boston, Massachusetts, 02115 USA


*To whom correspondence should be addressed
E-mail: haiman@usc.edu
Telephone: (323) 442-7755
Fax: (323) 865-0127 
Abstract
SNP genotyping technology has advanced considerably in recent years, allowing for faster data generation at significantly lower cost. Investigators can now test a large number of SNPs across the human genome to locate putative risk alleles. Case-control study design, in particular, offers direct and unbiased estimation of disease risk. However, as the number of SNPs that can be genotyped continues to increase rapidly, the complexity and intensity of computation become important issues to consider. We offer a simple and automated approach to the computation of case-control genotype data especially of interest to users of the statistical analysis package SAS®.
Asset Metadata
Creator Hsu, Chris (author) 
Core Title Computational design for analysis of SNP association studies 
Contributor Electronically uploaded by the author (provenance) 
School Keck School of Medicine 
Degree Master of Science 
Degree Program Biostatistics 
Publication Date 10/22/2008 
Defense Date 08/30/2008 
Publisher University of Southern California (original), University of Southern California. Libraries (digital) 
Tag automation,HWE,linear regression,logistic regression,MAF,OAI-PMH Harvest,PLINK,SAS Macro,SNP Association 
Language English
Advisor Stram, Daniel O. (committee chair), Haiman, Christopher A. (committee member), Setiawan, Wendy (committee member) 
Creator Email chrishsu@usc.edu,chrishsu1978@pcc.edu 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-m1694 
Unique identifier UC1298630 
Identifier etd-Hsu-2441 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-116263 (legacy record id),usctheses-m1694 (legacy record id) 
Legacy Identifier etd-Hsu-2441.pdf 
Dmrecord 116263 
Document Type Thesis 
Rights Hsu, Chris 
Type texts
Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Repository Name Libraries, University of Southern California
Repository Location Los Angeles, California
Repository Email cisadmin@lib.usc.edu