NONPARAMETRIC ESTIMATION OF AN UNKNOWN
PROBABILITY DISTRIBUTION USING MAXIMUM LIKELIHOOD
AND BAYESIAN APPROACHES
by
Alona Chubatiuk
__________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)
August, 2013
Copyright 2013 Alona Chubatiuk
Dedication
To my family
Acknowledgments
I would like to start with “Thank you, Thank you, Thank you” to my advisor Professor Alan
Schumitzky. I have been blessed to have the opportunity to work with him, to learn from him, to
listen to him, and simply to know him as a person. I cannot thank him enough for all his
patience, help and support during the times I went through the screening exams, my oral exam,
the birth of my son, and finally the preparation of the thesis manuscript and the defense. He was always
there for me. I would also like to thank Professors Jay Bartroff and David D’Argenio for serving
on both my oral exam and defense committees and offering valuable suggestions and advice.
My sincerest thanks to my parents Valentyna and Valentyn Chubatiuk for their love and
patience. Having them as an example, I always dreamed of becoming a professor of
mathematics. And they believed and supported me all the way through without any doubt.
A special thank you to my husband Oleksandr and my son Pavel for creating a happy, loving and
peaceful atmosphere at home that always gave me energy for another working day. Also many
thanks to my mother-in-law, Irina, who basically took over the work in my house to give me time
to focus on my research.
None of this would have been possible without the encouragement and support from my friends and
colleagues Meg Aboulian, Tatiana Tatarinova, Dmytro Chebotarov, Vitalii Ostrovskyi, Jinlin
Song, Selenne Garcia-Torres, Xin Wang, Kasia Wasilewska and many more. Thank you to all of
you.
Finally, thank you to the USC Graduate School and WiSE for the Dissertation Completion
Fellowship and Fellowship Support to Accommodate Pregnancy, Birth of a Child, and Adoption
for PhD Students. These helped me no less than all I described above.
Table of Contents
Dedication………………………………………………………………………………………….i
Acknowledgments………………………………………………………………………………...ii
List of Tables……………………………………………………………………………………..iv
List of Figures……………………………………………………………………………………..v
Abstract……………………………………………………………………………………………x
Chapter 1. Introduction…………………………………………………………………………1
Chapter 2. Nonparametric Maximum Likelihood (NPML) Estimation………………………...6
Chapter 3. Computer Simulations for NPML………………………………………………….31
Chapter 4. NPML Estimation for Stochastic Models………………………………………...109
Chapter 5. Nonparametric Bayesian (NPB) Estimation……………………………………..135
Chapter 6. Theoretical Foundations of the Gibbs Sampler…………………………………...144
Chapter 7. Computer Simulations for Gibbs Sampler………………………………………..147
Chapter 8. Stick-breaking Approach in NPB Estimation…………………………………….155
Chapter 9. Maximum Likelihood of Stick-Breaking (Connection between NPML and
NPB)……………………………………………………………………………………………171
Bibliography……………………………………………………………………………………174
Appendix………………………………………………………………………………………..178
List of Tables
Table 1: This table shows the number of 95 percent bootstrapped confidence intervals out
of 100 that contain the true value of the expected value and the standard
deviation of the ‘bimod’ data example………………………………………………..30
Table 2: Galaxy data……………………………………………………………………………..36
Table 3. Rolling Thumbtacks example data……………………………………………………..97
Table 4: Eye Tracking example, measurements……………………………………………......102
Table 5: The table shows the results of two NPML methods, NPEM1 and EM1+IP, computed
using different levels of process noise. It also compares the results when the
process noise was included in the simulation but was or was not present in the
model………………………………………………………………………………123
Table 6: This table shows the comparison of three different methods for estimating the
unknown probability distribution for the Galaxy Data example………………….168
List of Figures
Chapter 2
Fig 1: The graph of for …………………………………..24
Chapter 3
Fig.1: The graph of the D function for Lindsay simple example………………………………..34
Fig.2: The graph of the , Lindsay simple example………………………………………….35
Fig. 3: The graph of the D function for Galaxy Data for K=3…………………………………...40
Fig. 4: The graph of the discrete distribution of for K=3 shows 3 support points. Galaxy data
(The height of each vertical line is the corresponding probability on the support point) ……….40
Fig. 5: The graph of the discrete distribution of for K= 82 shows 4 support points. Galaxy data
…………………………………………………………………………………………………...43
Fig. 6: The graph of the D function for Galaxy Data with 3 support points. Galaxy data………44
Fig. 7: The graph of the discrete distribution of for K= 4 shows 4 support points. Galaxy
data……………………………………………………………………………………………….45
Fig 8: The graph of the simulated data for Case 1 and Model 1…………………………………56
Fig 9: The graph of the D function for K = 20, Case 1 and Model 1……………………………57
Fig 10: The graph of the for K = 20, Case 1 and Model 1 it shows 4 support points……...57
Fig 11: True distribution of Model 1………………………………………………………….58
Fig 12: The graph of the D function for K = 4, Case 1 and Model 1……………………………59
Fig 13: The graph of the for K = 4, Case 1 and Model 1 it shows 4 support points……….60
Fig 14: The graph of the simulated data for Case 1 and Model 2………………………………..62
Fig 15: The graph of the D function for K = 20, Case 1 and Model 2…………………………..62
Fig 16: The graph of the for K = 20, Case 1 and Model 2 it shows 5 support points……...63
Fig 17: True distribution of Model 2………………………………………………………….63
Fig 18: The graph of the D function for K = 5, Case 1 and Model 2……………………………65
Fig 19: The graph of the for K = 5, Case 1 and Model 2 it shows 5 support points……….65
Fig 20: The graph of the simulated data for Case 2 and Model 1………………………………..67
Fig 21: The graph of the D function for K = 20, Case 2 and Model 1…………………………..68
Fig 22: The graph of the D function for K = 3, Case 2 and Model 1……………………………69
Fig 23: The graph of the D function for K = 8, Case 2 and Model 2……………………………71
Fig 24: The graph of the simulated data for Case 3 and Model 1………………………………..73
Fig 25: The graph of the D function for K = 20, Case 3 and Model 1…………………………..74
Fig 26: The graph of the for K = 20, Case 3 and Model 1 it shows 3 support points……...74
Fig 27: The graph of the D function for K = 3, Case 3 and Model 1……………………………75
Fig 28: The graph of the for K = 3, Case 3 and Model 1 it shows 3 support points……….76
Fig 29: The graph of the D function for K = 6, Case 3 and Model 2……………………………77
Fig 30: The graph of the for K = 6, Case 3 and Model 2 it shows 6 support points……….78
Fig 31: True distribution of Model 2………………………………………………………….78
Fig 32: The graph of the D function for K = 3, Case 4 and Model 1……………………………80
Fig 33: The graph of the simulated data for Case 4 and Model 2………………………………..81
Fig 34: The graph of the D function for K = 6, Case 4 and Model 2……………………………82
Fig. 35: a) Smoothed graph of the estimate for K = 6, Case 4 and Model 2; it shows 6 support points;
b) Graph of the true distribution………………………………………………………………...82
Fig 36: The graph of the D function. PK example with additive noise………………………….86
Fig 37: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This figure
shows the graphs of (solid line) true distribution of the volume of distribution (dashed line ‘--
’) smoothed MLE of the true distribution and (spikes) …………………………………….87
Fig 38: The graph of the D function. EM1+IP…………………………………………………..89
Fig 39: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . EM1+IP….90
Fig 40: The graph of the D function. NPEM 2…………………………………………………..92
Fig 41: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . NPEM2.....93
Fig 42: The graph of the D function. EM2+IP…………………………………………………..95
Fig 43: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . EM2+IP.....96
Fig 44: The graph of for K=10. Rolling thumbtacks…………………………………….....99
Fig 45: The graph of the D function for K=10. Rolling thumbtacks…………………………...100
Fig 46: The graph of for K=3. Rolling thumbtacks……………………………………….101
Fig 47: The graph of the D function for K=3. Rolling thumbtacks…………………………….101
Fig 48: The graph of for K=6. Eye Tracking example…………………………………….104
Fig 49: The graph of the D function for K=6. Eye Tracking example…………………………104
Fig 50: The graph of for K=101. Eye Tracking example…………………………………108
Fig 51: The graph of the D function for K=101. Eye Tracking example………………………108
Chapter 4
Fig 1: This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) in case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1……………………………………………128
Fig 2: This figure shows the graphs of (solid line) true distribution of the volume of distribution
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) in case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1……………………………………………133
Chapter 7
Fig. 1: The graph of Galaxy data………………………………………………150
Fig. 2: The graph of (6 support points). Galaxy Data…………………………………….151
Fig 3: The graph of the simulated data for the above values of parameters. PK example……..153
Fig. 4: Frequency Histogram of Gibbs sample for 20 subjects (elimination rate constants on x – axis). PK
example…………………………………………………………………………………………153
Fig. 5: The graph of for (5 support points). PK example……………………….154
Fig. 6: The graph of true distribution of . PK example……………………………………...154
Fig. 7: Frequency Histogram of Gibbs sample for 100 subjects (elimination rate constants on x-
axis). PK example………………………………………………………………………………154
Chapter 8
Fig 1: Illustration of Stick Breaking (taken from Zoubin Ghahramani’s Tutorial in 2005)……158
Chapter 9
Fig 1: The graph of , there are 6 support points. Galaxy data……………………………..172
Fig 2: The graph of , there are 20 support points in total but only 7 of them have significant
weight. Galaxy data…………………………………………………………………………….173
Abstract
Suppose we observe a random sample Y_1, ..., Y_N of independent but not necessarily identically
distributed random variables Y_i, for i = 1, ..., N. Assume also that the conditional density of Y_i
given θ_i is known and denoted by p(Y_i | θ_i), where the θ_i's are unobserved random
parameters that are independent and identically distributed with common but unknown
distribution function F.
The objective is to estimate F given the data. We used two different approaches to get
the estimate of F: Nonparametric Maximum Likelihood (NPML) and Nonparametric Bayesian
(NPB).
For the nonparametric maximum likelihood approach, convex analysis shows that the maximum
likelihood (ML) estimator of F is a discrete distribution with no more than N support
points, where N is the number of subjects in the population. The positions and weights of the
support points are unknown. We developed four different methods of estimation: NPEM1,
NPEM2, EM1+IP and EM2+IP. These methods are based on the Expectation-Maximization
(EM) and Primal-Dual Interior Point (IP) algorithms. The NPEM1 algorithm uses the EM
algorithm to maximize the log-likelihood function over the set of support points and weights.
NPEM2 fixes a large grid of points first and then uses the EM algorithm to maximize the log-
likelihood only over the set of weights. The NPEM2 algorithm is much faster than the NPEM1
algorithm, but to find the optimal solution a very large grid of points has to be used. Therefore, in
the case of multidimensional parameters, NPEM2 can be impractical. The EM1+IP algorithm is a
hybrid of the NPEM1 and IP algorithms: the support points are updated in the same way
as in NPEM1, but the weights are updated using the Primal-Dual Interior Point (IP) algorithm. This
method is much faster than NPEM1 and comparable in speed with NPEM2; however, it does not
need a large grid of points. EM1+IP requires no more than N support points to start the
estimation process and therefore can be used in cases of multidimensional parameters. The
EM2+IP algorithm is a hybrid of NPEM2 and IP. It starts with a large grid of support points and
then updates the weights using the IP algorithm. This method gives results almost instantly compared
to the other methods. However, the optimality of the results, as in the case of the
NPEM2 algorithm, depends very much on the initial grid of points.
Another highlight of this thesis is the ability to build nonparametric maximum likelihood
estimates of the distributions of the parameters of linear stochastic models, i.e. population models
with process noise. The advantage of models with process noise is that, in addition to the
measurement errors, the uncertainties in the model itself (the process noise) are taken into
account. For example, in the case of the PK problem, errors such as
• dose errors
• sample timing errors
• model misspecifications
are not included in the deterministic models. Stochastic models, on the other hand, can
account for them.
We used linear Kalman-Bucy filtering [J70] to find the likelihoods of these models and then
employed all the above methods to find the maximum likelihood estimates of an unknown
distribution F.
In the NPB approach, the distribution F is itself a random variable. The first requirement is to
place a prior distribution on F. A convenient way of doing this is with the so-called Dirichlet
process (DP) prior, see [F73, WW97, WW98]. A Dirichlet process prior on a random
distribution F is defined by a probability distribution, which is our prior guess of F, and a scale
parameter, which expresses the strength of that guess. We write F ~ DP for the Dirichlet process.
The main reason we consider the NPB approach in the PK Population Problem is its
ability to provide Bayesian credibility intervals, no matter what the sample size is. This is not
possible with NPML in general; however, it is possible to get bootstrapped confidence intervals
for NPML that are somewhat similar to Bayesian credibility intervals. In both cases we get
approximations of the accuracy of the estimates. Even though both NPML and NPB give us estimates
of accuracy, the Bayesian nature of the estimated population distribution is preferable.
Two algorithms were developed to find NPB estimates of the distribution F: Marginal NPB and
Full Conditional NPB. The Marginal NPB approach uses Gibbs sampling to estimate the
posterior expectation of the distribution F given the data, but it does not solve the classical NPB
problem of finding the posterior distribution of F given the data. The Full NPB approach uses the
Stick-Breaking representation of the Dirichlet process and gives a way to sample from the
posterior distribution of F given the data. This approach allows us to build credibility intervals
on all the desired quantities of the distribution F.
Finally, we established a very nice connection between the NPML and NPB methods. Using the Full
NPB approach we can sample potential estimates of the true distribution F; moreover, we can
calculate the value of the log-likelihood function for each of those distributions. All these
likelihood values are smaller than or equal to the maximum likelihood value found by the
NPML methods. Equality is approached with a large enough number of draws from the posterior
distribution of F given the data.
Chapter 1. Introduction
In this thesis, we focus on the “Nonparametric Estimation of an Unknown Probability
Distribution Using Maximum Likelihood and Bayesian Approaches.” The general statistical
problem is to estimate an unknown probability distribution of a population of nonlinear
models from noisy data. There are many applications of this problem in medicine, computational
biology, and engineering.
One important medical example is in the analysis of clinical trials. A new drug is given to a
population of subjects. The drug’s behavior is stochastically determined by an unknown subject-
specific vector parameter. This pharmacokinetic parameter varies significantly between subjects
(usually because of genetic or epigenetic factors), and this accounts for the variability
of the drug response in the population. The pharmacokinetic population problem is to determine
the probability distribution of based on the noisy clinical data.
In the general case, our objective is to use two different approaches, nonparametric maximum
likelihood (NPML) and nonparametric Bayesian (NPB) to establish new algorithms for
estimating the unknown probability distribution given the data.
For the maximum likelihood approach, convex analysis shows that the maximum likelihood
(ML) estimator of is a discrete distribution with no more than N support points where N
is the number of subjects in the population. The position and weights of the support points are
unknown. We have programmed an algorithm based on the Expectation-Maximization (EM)
method to determine this estimator, see [S91]. The discrete property of the estimator is important for optimal
drug dosage design as it reduces integration problems over to summation problems.
In 1983, Lindsay [L83] proved the fundamental result about nonparametric maximum likelihood
(NPML) estimation for mixing distributions. However, this result was not widely known in PK
circles until 1986 when Mallet’s ground-breaking paper [M86] appeared. Both Lindsay’s and
Mallet’s results were based on convexity theory. In 1991, Schumitzky [S91] derived an EM
algorithm for the NPML estimator called NPEM1. What is interesting is that in 1984, before
Mallet’s result, Katz and D’Argenio [KD84] proposed a nonparametric estimation method which
can now be viewed as the first iteration of the NPEM1 algorithm of Schumitzky.
In the early 1990s, Leary and Burke developed a primal-dual interior point NPML method
which was hundreds of times faster than NPEM1 [L01]. This last algorithm is essentially the
nonparametric adaptive grid (NPAG) algorithm supplied by the Laboratory of Applied
Pharmacokinetics at USC [LJSG01]. A recent version is now available in an R-based program
called P-metrics, which was developed by Dr. Michael Neely of USC, see [NGYSJ 12] and
[T13].
In this thesis we propose a hybrid method (EM1+IP) which uses the best aspects of NPEM1 and
NPAG and which should be more efficient for high dimensional problems.
Another highlight of this thesis is the ability to build nonparametric maximum likelihood
estimates of the distributions of the parameters of linear stochastic models, i.e. population
models with process noise. We used linear Kalman-Bucy filtering [J70] to find the likelihoods of
these models and then employed the NPEM1 and EM1+IP methods to find the maximum
likelihood estimate of the unknown distribution.
In the NPB approach, the distribution F is itself a random variable. The first requirement is to
place a prior distribution on F. A convenient way of doing this is with the so-called Dirichlet
process (DP) prior, see [F73, WW97, WW98]. A Dirichlet process prior on a random
distribution F is defined by a probability distribution (our prior guess of F) and a scale parameter
(the strength of that guess). We write F ~ DP for the Dirichlet process.
The main reason we consider the NPB approach in the PK Population Problem is its
ability to provide Bayesian credibility intervals, no matter what the sample size is. This is not
possible with NPML in general; however, it is possible to get bootstrapped confidence intervals
for NPML that are somewhat similar to Bayesian credibility intervals. In both cases we get
approximations of the accuracy of the estimates. Even though both NPML and NPB give us estimates
of accuracy, the Bayesian nature of the estimated population distribution is preferable.
One of the goals in the Bayesian case is to determine the conditional expectation
of F given the data. This posterior expected distribution takes the place of the NPML estimate in drug dosage design.
One important result for pharmacokinetic applications is that, like the NPML estimate, the posterior expected
distribution is also discrete. Of course, the number of support points is no longer bounded by N, but
that makes the estimate more robust.
Most methods that find this posterior expectation employ a marginal approach: it turns out that F can be
“integrated out”, leaving a much simpler problem for the individual parameters. The resulting model for
these parameters is given by the Polya Urn representation shown below.
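In generic notation, with F_0 denoting the prior guess and α the strength parameter of the DP prior (these symbols are ours and may differ from the original notation), the standard Polya Urn (Blackwell-MacQueen) representation is
\[
\theta_i \mid \theta_1,\ldots,\theta_{i-1} \;\sim\; \frac{\alpha}{\alpha+i-1}\,F_0 \;+\; \frac{1}{\alpha+i-1}\sum_{j=1}^{i-1}\delta_{\theta_j}, \qquad i=1,\ldots,N,
\]
so each new parameter is either a fresh draw from the prior guess or a copy of one of the previously drawn parameters.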
The marginal approach leads to a Gibbs sampler algorithm for estimating the
expected value of F given the data, not the posterior distribution of F itself. This approach is used by practically all
researchers, and is sufficient for the problems they consider. However, it does not solve the
classical NPB population analysis problem, i.e. to estimate the full conditional (posterior) distribution of
F given the data.
To solve this problem we propose the Full Conditional Method. In this method we estimate
the full conditional distribution of F given the data, using Stick-Breaking priors, and show
some examples of how to build the credibility intervals.
In very broad terms, the major results we achieved in this thesis are as follows:
Result 1. To use the NPML approach to find an estimate of F given the data; to program
algorithms based on the EM methods to determine the NPML estimate of F and to
illustrate the results on different data sets; and to show how to build bootstrap
confidence intervals for the moments of the estimate.
Result 2. To use a combination of the EM and Interior Point methods to find an estimate of F
given the data and to program the EM1+IP algorithm, which is more efficient for high-dimensional
problems than other methods that find the NPML estimate.
Result 3. To use the NPEM1 and EM1+IP methods to find the NPML estimate for stochastic population models,
i.e. models with process noise.
Result 4. To use the NPB approach to find an estimate of F by placing a DP prior on it and
determining the posterior distribution of F; and to program two algorithms, one based on Gibbs
sampling to determine the posterior expected distribution of F given the data, and one based on
Stick-Breaking priors (the representation of the DP prior) to determine the posterior distribution of F
and build credibility intervals for the moments of F.
Result 5. To use Stick-Breaking priors to find the maximum likelihood value and thus to establish the connection between
NPB and NPML.
Chapter 2. Nonparametric Maximum Likelihood (NPML) Estimation
Consider the following statistical model:
, (2.1)
(2.2)
where
Y_i is a vector of independent but not necessarily identically distributed measurements with
known conditional density given θ_i;
θ_i is a vector of unknown parameters defined on a space Θ in finite-dimensional
Euclidean space;
F is an unknown probability distribution on (the Borel subsets of) Θ.
In general, the notation X ~ G will mean that the random quantity X has distribution G.
Assume that, given F, the θ_i are independent and identically distributed random vectors with
common (but unknown) probability distribution F on Θ.
Our main problem is to estimate F based on the data Y = (Y_1, ..., Y_N).
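In generic notation, the model just described has the standard mixture form
\[
Y_i \mid \theta_i \;\sim\; p(Y_i \mid \theta_i), \qquad \theta_1,\ldots,\theta_N \;\overset{iid}{\sim}\; F, \qquad i=1,\ldots,N,
\]
with F an unknown probability distribution on Θ; the symbols are the generic ones used throughout this chapter.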
Nonparametric Maximum Likelihood (NPML):
The log-likelihood function of F can be written as follows
(2.3)
where
(2.4)
Then a distribution is a maximum likelihood estimate of F if it solves the following optimization
problem
(2.5)
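In this notation the NPML objective has the standard form
\[
L(F) \;=\; \sum_{i=1}^{N}\log p(Y_i \mid F), \qquad p(Y_i \mid F) \;=\; \int_{\Theta} p(Y_i \mid \theta)\,dF(\theta),
\]
and a maximum likelihood estimate is any distribution attaining the maximum of L(F) over all probability distributions on Θ.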
Convex optimization framework:
Notice that the maximum of the likelihood function depends only on the values of the mixture
likelihood vector
(2.6)
Therefore we can reformulate the problem of finding that maximizes as finding
which mixture likelihood vector gives the largest value to the likelihood. It can be done in
three steps:
1. Construct the set of feasible solutions in :
2. Construct the objective function that has to be maximized over
3. If is the element of that maximizes the objective function , then it is
straightforward to obtain such that .
We start with the construction of the set of feasible solutions .
Let us first define the vector of atomic probability densities
(2.7)
such that
(2.8)
It follows from fundamental results of convex geometry that if we define
(2.9)
and if is compact then
where is the convex hull of . This representation of allows us to think about the
elements of as linear combinations of the elements of .
Now we present the main theorem of Lindsay [L83] that proves the existence and uniqueness of
the solution to the problem of maximizing the objective function over the set .
Theorem 1: Assume is compact and contains at least one point with positive
likelihood. Then there exists a unique vector that maximizes over the set .
Moreover, it follows from Carathéodory's theorem that the maximizer can be represented as a mixture
with at most N points of support.
Therefore we can conclude that a maximum likelihood estimate can be found in the class of discrete distributions with at
most N support points, which we write as:
(2.10)
where are the support points in and are the weights such
that .
These delta terms represent point masses at the support points; their defining property is that
integration against them reduces to evaluation at the support point, and more generally
integration against a discrete distribution reduces to a weighted sum. This property of transforming integration into summation is
crucial in the pharmacokinetic application of dosage regimen design.
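Written out in generic notation, such a discrete candidate and its defining property are
\[
F(\theta) \;=\; \sum_{k=1}^{K} w_k\,\delta_{\phi_k}(\theta), \qquad K \le N,\;\; w_k \ge 0,\;\; \sum_{k=1}^{K} w_k = 1,
\qquad
\int g(\theta)\,dF(\theta) \;=\; \sum_{k=1}^{K} w_k\, g(\phi_k)
\]
for any continuous function g; this is the integration-to-summation property referred to above.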
Thus the likelihood maximization problem can now be written as
(2.11)
However, the above problem of maximizing the likelihood function is still very difficult if
approached directly because of the high dimensionality of the parameters. That is why we use an
iterative method based on the Expectation-Maximization algorithm, see [S91], presented
next, which determines the global maximum of the likelihood function.
Notes: 1) In the pharmacokinetic world, the father of Nonparametric Maximum Likelihood is
Alain Mallet. Although Mallet’s paper [M86] appeared after Lindsay’s paper [L83], Mallet’s
1983 PhD Dissertation (in French) proved the main results of the Lindsay paper.
2) In [S91], there were two algorithms described for solving the optimization problem in Eq.
(2.11). One of these is the basis for the method used by the USC Laboratory of Applied
Pharmacokinetics, see [LJSG01]. The other one is the algorithm described in this thesis. It has
never been implemented before now.
In this thesis we will define and implement four different algorithms for finding the NPML estimate, called
NPEM1, NPEM2, EM1+IP and EM2+IP. We describe each separately.
Nonparametric Expectation-Maximization (NPEM1) algorithm for solving optimization
problem (2.11):
Step 1. Initiate:
Step 2. E-step: compute
Step 3. M-step: find that maximizes
Step 4. If , where is a very small number, then stop. Otherwise, go to
step 2.
Where that maximizes can be found as follows:
(2.12)
(2.13)
where
(2.14)
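The standard EM updates for this problem, of the kind Eqs. (2.12)-(2.14) describe, can be written in generic notation as
\[
w_k^{new} \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{w_k\,p(Y_i \mid \phi_k)}{\sum_{l=1}^{K} w_l\,p(Y_i \mid \phi_l)},
\qquad
\phi_k^{new} \;=\; \arg\max_{\phi}\;\sum_{i=1}^{N}\frac{w_k\,p(Y_i \mid \phi_k)}{\sum_{l=1}^{K} w_l\,p(Y_i \mid \phi_l)}\,\log p(Y_i \mid \phi),
\]
for k = 1, ..., K: the weights are obtained by averaging the posterior probabilities of the support points over the subjects, and each support point is moved to maximize a responsibility-weighted log-likelihood.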
The derivations of these formulas, together with the proof that the EM algorithm never
decreases the value of the likelihood function at each iteration, can be found in [S91]. However, it
is well known that, in general, the EM algorithm does not always converge to a global maximum.
Therefore we used so-called directional derivatives to check whether the distribution found
by our iterative scheme is in fact the global maximum of the likelihood function.
Define
(2.15)
Theorem 2 (Lindsay [L83]):
is the distribution that maximizes if and only if
Moreover the support of is contained in the set .
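In generic notation, the directional derivative and the optimality condition of Theorem 2 read
\[
D(\theta, F) \;=\; \sum_{i=1}^{N}\frac{p(Y_i \mid \theta)}{p(Y_i \mid F)} \;-\; N,
\]
with F a maximizer of the likelihood if and only if D(θ, F) ≤ 0 for all θ in Θ, and with the support of the maximizer contained in the set {θ : D(θ, F) = 0}.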
Nonparametric Expectation-Maximization 2 (NPEM2) algorithm for solving optimization
problem (2.11) with fixed support points:
In this case a (usually large) set of points is fixed. The corresponding
optimization problem is P2:
NPEM2 Algorithm
Step 1. Initiate:
Step 2. E-step: compute
Step 3. M-step: find that maximizes
Step 4. If , where is a very small number, then stop. Otherwise, go to
step 2.
It follows:
that maximizes is given by:
(2.16)
where
(2.17)
The derivations of these formulas, together with the proof that the NPEM2 algorithm
increases the value of the likelihood function at each iteration, can be found in [S91]. Again,
since in general the EM algorithm does not always converge to a global maximum, we used the
directional derivative of Eq. (2.15) to check whether the distribution found by our iterative
scheme is in fact the global maximum of the likelihood function. On the other hand, because of
the convexity of the optimization problem as a function of the weights, any limit point of the NPEM2
algorithm is a global maximum. This is discussed below.
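A minimal MATLAB sketch of the fixed-grid NPEM2 weight update is given below. The matrix Psi of likelihood values must be computed beforehand for the particular model, and all names are illustrative rather than taken from the thesis programs.

function [w, loglik] = npem2_sketch(Psi, tol, maxit)
% Psi is an N-by-K matrix with Psi(i,k) = p(Y_i | phi_k) on a fixed grid phi_1, ..., phi_K.
[N, K] = size(Psi);
w = ones(K, 1) / K;                       % start from uniform weights on the grid
loglik = -Inf;
for it = 1:maxit
    mix = Psi * w;                        % mix(i) = p(Y_i | F) = sum_k w_k * Psi(i,k)
    new_loglik = sum(log(mix));
    if new_loglik - loglik < tol          % EM never decreases the likelihood; stop when the gain is tiny
        loglik = new_loglik;
        break
    end
    loglik = new_loglik;
    w = w .* (Psi' * (1 ./ mix)) / N;     % EM update: w_k <- (1/N) * sum_i w_k * Psi(i,k) / mix(i)
end
end

Each pass through the loop never decreases the log-likelihood, and grid points whose weights shrink toward zero can be pruned afterwards.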
Computational Aspects of NPEM1
1. NPEM1 Algorithm Vector-Matrix notation
E-M Steps
Support Point Update
1)
2) (2.18)
Eq. (2.18) is just a nonlinear least squares problem and straightforward to solve.
When Y is a linear (or affine) function of , then the optimization problem in Eq. (2.18) can be
solved analytically. Using
Now taking transpose of both sides:
3) (2.19)
If is independent of then
4) (2.20)
Convergence of EM Algorithms and Convex Optimization
The PhD thesis of Baek [B06], written under the supervision of James Burke of the University of
Washington, gives a nice connection between the EM algorithms and convexity theory. Since
Baek’s thesis is not readily accessible, we outline the main points here. Also from these results
we get a global convergence result for the NPEM2 algorithm.
The general optimization problem is
P1:
NPEM1 Algorithm
In this case and the NPEM1 algorithm optimizes over both
.
From Eqs. (2.12)-(2.13), the computational steps are then
where
and and let vector of weights (probabilities) such
that .
The optimization problem P1 is a constrained problem, convex in , but not necessarily convex
in . Including the constraints in the optimization of P1 gives the Lagrangian
function with multipliers and :
(2.21)
.
Then the Karush-Kuhn-Tucker (KKT) conditions [BV04] are:
where
Then KKT2)-KKT4) imply
KKT5)
Further, the solution
to KKT1) can be obtained by solving the optimization problem
KKT6)
Now compare eqs. (2.12)-(2.13) to KKT5)-KKT6). We see that a limit point of the NPEM1
algorithm is a KKT point of problem P1 . Therefore the KKT condition is a necessary condition
for the limit points of NPEM1. This gives a connection between the NPEM1 algorithm and the
NPAG algorithm.
NPEM2 Algorithm
In this case a (usually large) grid of points is fixed. The corresponding
optimization problem is
P2:
The NPEM2 algorithm now optimizes over only . Of course, by the Lindsay and
Mallet theorems, only at most points in will eventually have non-zero probability.
The NPEM2 algorithm is given in eqs. (2.16)-(2.17). The computational steps are then
where
The optimization problem P2 is now a constrained convex problem. Including the constraints in
the optimization of P2 gives the Lagrangian function with multipliers
and :
(2.22)
Now the Karush-Kuhn-Tucker (KKT) conditions are:
where
Then KKT2b)-KKT4b) imply
KKT5b)
Now compare eqs. (2.16)-(2.17) to this last equation. We again see that a limit point of the
NPEM2 algorithm is a KKT point of problem P2. Therefore the KKT condition is a necessary
condition for the limit points of NPEM2. However, since P2 is convex and differentiable
[KT51], the KKT condition is also a sufficient condition for a limit point of NPEM2 to be a
global optimum of P2. That is, for the convex problem P2, a solution is a global optimum of P2
if and only if the KKT condition is satisfied. This proves the following theorem.
Theorem 3: Let be a limit point of NPEM2. Then is a global optimum of P2.
Remark: In general, the limit points of an EM algorithm are only stationary points.
The Primal-Dual Interior Point method for optimizing eq. (2.13) with fixed support points
First consider the NPEM2 algorithm with the support points fixed. The optimization of
(2.23)
with respect to is a convex optimization problem. And although the update given in
NPEM2:
is likelihood improving, it can be exceptionally slow to converge. The idea of Robert Leary
(now at the Pharsight Corporation) and James Burke (at the University of Washington) was to
replace this update of by an optimization method consistent with modern convexity theory.
Namely, optimize eq. (2.13) by the Primal-Dual Interior Point method.
Primal Dual Interior Point Method
We now describe the primal dual interior point method as it applies to our NPML problem.
The actual algorithm we use for this step is a Matlab program written by Bradley Bell of the
University of Washington, see Bell [B12]. Since this algorithm is written in computer code, we
give a mathematical description in what follows.
First we derive the Karush-Kuhn-Tucker (KKT) conditions for our NPML optimization problem.
These conditions are essential for our Primal Dual Interior Point program. This material is
standard in Convex Optimization theory so no proofs will be given. An excellent reference is the
book by Boyd and Vandenberghe [BV04].
We first describe the KKT conditions in a general setting. Consider the primal problem
P1:
The Lagrangian for P1 is then
and the Lagrangian dual function is:
The subsequent dual problem is
D1:
Important properties are:
• The dual problem is always convex
• The primal and dual optimal values satisfy
• Slater’s condition: If there is an
• Duality gap is the quantity and
Note: we use this last property in the Primal Dual Interior Point algorithm as a stopping
condition.
Corresponding to the problems P1 and D1 are the KKT conditions:
(Stationarity)
(Primal feasibility)
(Dual feasibility)
(Complementarity)
For proof see [BV04]. The main result is then:
Theorem 4: If satisfies Slater’s condition, then are primal-dual optimal if
and only if
satisfy the KKT conditions.
For proof see [BV04].
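For a problem in the standard form of [BV04], namely minimize f(x) subject to g_i(x) ≤ 0 for i = 1, ..., m, with Lagrangian L(x, λ) = f(x) + Σ_i λ_i g_i(x), these four conditions read, at a primal-dual pair (x*, λ*):
\[
\nabla f(x^*) + \sum_{i=1}^{m}\lambda_i^*\,\nabla g_i(x^*) = 0, \qquad
g_i(x^*) \le 0, \qquad
\lambda_i^* \ge 0, \qquad
\lambda_i^*\, g_i(x^*) = 0 .
\]
Equality constraints, if present, contribute an additional multiplier term to the stationarity equation.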
We now apply the above theory to our NPML problem. Assume a fixed grid of support points.
As before, the corresponding optimization problem is
P2: maximize: , w.r.t
subject to:
Now let and define the function
Then the problem P2 now can be written as:
P: minimize:
subject to: { }
where is the column vector of all ones
Note: We write P as a minimization problem because is a linear function of and is
a convex function of and we want to apply the above theory.
First for convenience we include the constraint in problem P as a penalty in the
objective function. For proof see [B06] ( Prop. 4.2).
P:
Now writing this problem in the form P1, we have
,
The Lagrangian and dual Lagrangian function become:
Since the Lagrangian is convex in the primal variable, necessary and sufficient conditions for its minimizer
are the gradient conditions:
Regrouping terms in the Lagrangian
and substituting the above gradient conditions in
gives
The dual problem is then
D: maximize −Φ(v) w.r.t. (y, v): y ≥ 0, v ≥ 0, y + Ψ^T v = N·1_K
and, written in minimization form,
D: minimize Φ(v) w.r.t. (y, v): y ≥ 0, v ≥ 0, y + Ψ^T v = N·1_K
To get the KKT conditions we need the additional derivatives:
Then from the general KKT conditions we have:
Stationarity:
Primal Feasibility:
Dual Feasibility:
Complementarity:
Using matrix notation these conditions become
where we have combined the two conditions
and where
Interior Point Method:
Relaxed Problem
Now we want to remove the constraint in the optimization problem P
A perfect version would be
But is not differentiable. So we approximate by for
As . See graph below.
Fig 1: The graph of for
The new “relaxed” optimization problem becomes
Remember
Now the KKT conditions are exactly the same for the first two equations and the third equation
becomes . The resulting relaxed KKT conditions are then
The change of variables: in KKT makes the first equation become
, while the rest remains the same. The Interior Point Method applies Newton’s
method to the perturbed nonlinear equations
where is decreased to zero.
The procedure terminates when
where
is the distance between the current primal and dual solutions.
Further .
Newton’s Method: We now give a brief description of Newton’s method applied to this system.
The vector function is defined by
The Jacobian of is then
The Newton step at a point is obtained by solving the linear equations
(2.24)
After each solution to eq. (2.24), the relaxation parameter is reduced at the same rate as the
duality-gap quantity. For numerical details see Bell [B12].
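Schematically, writing the perturbed KKT system above as F_mu(x) = 0 with Jacobian J(x), the damped Newton iteration used here has the generic form
\[
J(x)\,\Delta x \;=\; -\,F_{\mu}(x), \qquad x \;\leftarrow\; x + \alpha\,\Delta x,
\]
where the step length α in (0, 1] is chosen so that the iterates stay strictly in the interior of the feasible region, and μ is driven to zero as the duality gap shrinks.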
The program in this form is equivalent to the first cycle of the USC NPAG program.
Hybrid of the NPEM 1 and the Interior Point Method (EM1+IP):
There is no reason why the primal dual interior point method cannot be adjoined to the algorithm
of NPEM1. In this case, instead of a very large grid of fixed support points, there are only N
support points at each iteration. And eq. (2.12) is as before. But the update of the weights in eq.
(2.13) is replaced by the primal dual interior point method. More precisely, the computational
steps are then
EM1+IP
Step 1.
Step 2.
Since Step 1 is likelihood improving by the EM argument of [S91] and Step 2 is likelihood
improving by definition, it follows that Steps 1 and 2 together are likelihood improving.
Remarks: The EM1+IP algorithm has the potential to be more efficient for high-dimensional
problems than the current USC-NPAG algorithm.
Hybrid of the NPEM 2 and the Interior Point Method (EM2+IP):
We also decided to join the primal dual interior point method with the algorithm of NPEM2. In
this case we take a very large grid of fixed support points as in NPEM2. But the update of the
weights in eqs. (2.16)-(2.17) is replaced by the primal dual interior point method. More precisely,
the computational steps are then
EM2+IP
Step 1. Take a large grid of support points
Step 2.
Step 3. Throw away points with zero probabilities
Since Step 1 is fixed and Step 2 is likelihood improving by definition, it follows that Steps 1 and
2 together are likelihood improving.
Remarks: The EM2+IP is identical to the first iteration of the current USC-NPAG algorithm.
Accuracy of NPML estimates
It is known that, in general, it is not possible to build confidence intervals on the unknown
distribution itself, since it is an infinite-dimensional parameter. However, bootstrap confidence
intervals can be used to get asymptotic confidence intervals on any desired quantity of the
NPML estimate of the distribution. We will show how this can be done in the example
below.
“Bimod” example:
The “bimod” dataset [PG08] consists of 100 points drawn from the bimodal mixture
. The question is to estimate the distribution of the mean of this
data.
The following model was used in this example.
Consider and is the above simulated “bimod” dataset. Then for
(2.24)
which is the density of a normal distribution with mean and standard deviation . We will consider
. Let be the number of support points of and be the vector of unknowns as it was
defined in (2.11) then
(2.25)
where
(2.26)
The updated values of can be found using (2.12) and (2.13). Thus for any we
have
To find the value of that maximizes the above function we will use the formula (2.20) with
:
Thus if we write it all out we get
(2.27)
and
(2.28)
where as before
(2.29)
The analysis was done in the following way.
• The NPML estimate was found using the NPEM1 algorithm.
• The mean and sigma of the estimated distribution were computed.
• The 95 percent bootstrap confidence interval was built using 100 bootstrap samples of the
“bimod” data set, and the type of interval was chosen to be the bootstrap bias-corrected and
accelerated percentile method (a sketch of this step is given after this list).
Bootstrap bias-corrected and accelerated percentile method (BCa)
The bias-corrected and accelerated percentile method, suggested by Efron [E87], seeks to
adjust for the bias and skewness of the sampling distribution. It adjusts the percentiles
selected from the bootstrap percentile method to be the endpoints of the confidence
intervals. The adjusted percentiles are
where denotes the standard normal cumulative distribution function and
is the th percentile of the standard normal distribution. Here adjusts for the
skewness of the sampling distribution, and adjusts for the bias of
the estimator where is the estimate of calculated for the bootstrap sample and is
the number of bootstrap samples used to calculate the confidence interval.
• This process was repeated 100 times to check whether approximately 5 of the confidence
intervals miss the true values of the mean and sigma.
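A minimal MATLAB sketch of this bootstrap check is given below (it requires the Statistics Toolbox for bootci). The bimodal sample, the true values and the use of the plain sample mean and standard deviation as the statistics are placeholders: in the analysis above, the statistics are the mean and sigma of the NPML estimate refit to each bootstrap sample.

rng(1);                                                    % for reproducibility
y_bimod  = [-2 + randn(50, 1); 2 + randn(50, 1)];          % placeholder bimodal sample
true_mean = 0;  true_std = sqrt(5);                        % placeholder true values for this sample
nboot = 100;                                               % 100 bootstrap resamples, as above
ci_mean = bootci(nboot, {@mean, y_bimod}, 'Type', 'bca');  % 95% BCa interval for the mean
ci_std  = bootci(nboot, {@std,  y_bimod}, 'Type', 'bca');  % 95% BCa interval for the standard deviation
hit_mean = ci_mean(1) <= true_mean && true_mean <= ci_mean(2);   % does the interval cover the truth?
hit_std  = ci_std(1)  <= true_std  && true_std  <= ci_std(2);

Repeating this whole procedure 100 times and counting how many intervals cover the true values yields a table of the kind reported next.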
The results are shown in the following table.
Quantity                 # of intervals (out of 100) that contain the true value
Mean                     93
Standard deviation       91
Table 1: This table shows the number of 95 percent bootstrapped confidence intervals out of
100 that contain the true value of the expected value and the standard deviation of the ‘bimod’
data example.
Chapter 3. Computer Simulations for NPML
Lindsay simple example
This example is given in Lindsay [L83, Ex. 2]. It is a useful example because of its simplicity
and the fact that the optimal solution is known analytically.
Suppose that where and for some
(3.1)
Assuming that is the number of support points of and that is defined as in (2.11) we
have
(3.2)
Then for further simplification we write
(3.4)
Then by (2.12) and (2.13) for some the updated values of are
(3.5)
and
(3.6)
where
(3.7)
It was shown in Lindsay [L83] that the maximum likelihood estimate has two support
points, at with weight and at with weight . Notice that the number
of support points is the same as the size of the given data.
Consider now the NPEM1 algorithm described above. It was implemented using MATLAB. The
program Run_NPEM_Ex6_1.m works in the following way:
1. Type the name of the file in the command window and press return. It gives the output as:
>> Run_NPEM_Ex6_1
tolerance =
2. Input the tolerance level, which we denote by and press return. The
result is
>> Run_NPEM_Ex6_1
tolerance = 0.00001
theta_0 =
3. Input the initial value for = [2, 5] and press return.
>> Run_NPEM_Ex6_1
tolerance = 0.00001
theta_0 = [2, 5]
w_0 =
4. Input the initial value for and press return. Then the output is
>> Run_NPEM_Ex6_1
tolerance = 0.00001
theta_0 =[2, 5]
w_0 =[0.5000, 0.5000]
count =
13
theta =
1.0001 4.0006
w =
0.3333 0.6667
LogLikelihood =
-1.7919
x =
4.0000
max_value =
2.8395e-004
where count is the number of iterations it takes the algorithm to converge; theta and w are the
maximum likelihood estimates of the true values of the vectors of support points and weights; LogLikelihood is the
value of the log-likelihood function evaluated at the resulting values of theta and w; x is the
value that maximizes the D function and max_value is the corresponding maximum value. We can see that the
value of the maximum is within acceptable numerical error of zero. Run_NPEM_Ex6_1.m
also gives the graph of the function which is shown in Fig.1 below:
Fig.1: The graph of the D function for Lindsay simple example.
We can see that the D function is non-positive for all parameter values and, moreover, the set of values for
which it is equal to zero contains the points 1.0001 and 4.0006; therefore, by Theorem 3, we can
conclude that the algorithm converges to a global maximum and the resulting distribution is the
maximum likelihood estimate of the true distribution of .
Fig.2: The graph of the . Lindsay simple example.
If we compare the results we got with the solution presented by Lindsay we can see that they are
essentially the same up to some rounding error.
Galaxy Data
We will use the Galaxy Data to illustrate the NPEM1 algorithm. This data set was published by
Roeder (1990) and represents the recession velocities, in units of 10^3 km/s, of 82 galaxies from
six well-separated conic sections of space. The error in the calculations is said to be less than
0.05 units.
This problem is a benchmark for practically all density estimation programs, see [EW95].
The histogram of the data is given below:
Recession velocities in 1000 km/s of 82 galaxies
9.172 9.350 9.483 9.558 9.775 10.227 10.406 16.084 16.170 18.419 18.552 18.600 18.927
19.052 19.070 19.330 19.343 19.343 19.440 19.473 19.529 19.541 19.547 19.663 19.846 19.856
19.863 19.914 19.918 19.973 19.989 20.166 20.175 20.179 20.196 20.215 20.221 20.415 20.629
20.795 20.821 20.846 20.875 20.986 21.137 21.492 21.701 21.814 21.921 21.960 22.185 22.209
22.242 22.249 22.314 22.374 22.495 22.746 22.747 22.888 22.914 23.206 23.241 23.263 23.484
23.538 23.542 23.666 23.706 23.711 24.129 24.285 24.289 24.366 24.717 24.990 25.633 26.960
26.995 32.065 32.789 34.279
Table 2: Galaxy data
The distribution of velocities could answer the question about the clumping of galaxies: if
galaxies are clumped, then the distribution of velocities is multimodal, and if they are not
clumped, then the distribution would increase at the beginning and then gradually tail off.
In terms of our notation the model can be described as follows. Consider and ’s from
the above table then for
(3.8)
which is a density of normal distribution with mean and standard deviation . We will
consider . Let be the number of support points of and be the vector of
unknowns as it was defined in (2.11) then
(3.9)
where
(3.10)
The updated values of can be found using (2.12) and (2.13). Thus for any we
have
To find the value of that maximizes the above function we will use the formula (2.20) with
:
Thus if we write it all out we get
(3.11)
and
(3.12)
where as before
(3.13)
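A compact MATLAB sketch of the NPEM1 iteration for this normal-kernel model is shown below, before turning to the program itself. The names are illustrative (they are not those used in Run_NPEM_Galaxy.m), and the closed-form support-point update is the weighted mean that the normal model admits.

function [phi, w, loglik] = npem1_normal_sketch(y, phi, sigma, tol, maxit)
% y: N-by-1 data vector; phi: initial support points; sigma: known kernel standard deviation.
% (normpdf requires the Statistics Toolbox.)
N = numel(y);  K = numel(phi);
y = y(:);  phi = phi(:)';
w = ones(1, K) / K;                                            % start from uniform weights
loglik = -Inf;
for it = 1:maxit
    Psi = normpdf(repmat(y, 1, K), repmat(phi, N, 1), sigma);  % Psi(i,k) = p(y_i | phi_k)
    mix = Psi * w';                                            % p(y_i | F)
    new_loglik = sum(log(mix));
    if new_loglik - loglik < tol
        loglik = new_loglik;
        break
    end
    loglik = new_loglik;
    R   = (Psi .* repmat(w, N, 1)) ./ repmat(mix, 1, K);       % posterior responsibilities
    w   = mean(R, 1);                                          % weight update, cf. Eq. (2.13)
    phi = (y' * R) ./ sum(R, 1);                               % support-point update: responsibility-weighted means
end
end

Run on the 82 galaxy velocities with a starting grid and the chosen sigma, an iteration of this kind produces output of the form shown next; the D function of Eq. (2.15) can then be evaluated at the result to check global optimality.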
The program Run_NPEM_Galaxy.m uses the NPEM1 algorithm described above to find the
MLE of the true distribution of . Here we will illustrate the results we got together with the
description of the way the program works.
1. Type Run_NPEM_Galaxy in the command window and press return. The result is
>> Run_NPEM_Galaxy
number of support points =
2. Input the number of support points, say K = 3 and press return. The output is
>> Run_NPEM_Galaxy
number of support points = 3
count =
13
theta =
9.7502 21.4032 32.9443
w =
0.0859 0.8769 0.0372
LogLikelihood =
-212.6803
max_value =
29.5478
where, as in the previous example, count is the number of iterations it takes the
algorithm to converge, theta and w are the maximum likelihood estimates of the true values of
the vectors and , LogLikelihood is the value of the log-likelihood function evaluated at the
resulting values of theta and w and max_value is the maximum value of the function .
If we compare the results we got with those presented in the paper written by Murray Aitkin, we
can see that in the case of three support points the values are the same, but if we look at the
the graph of the function which is also produced by Run_NPEM_Galaxy.m and is
presented in the Fig. 3 below
40
Fig. 3: The graph of the D function for Galaxy Data for K=3.
we can see that it takes positive values and attains its maximum of 29.5478 at a point;
therefore, by Theorem 3, we can conclude that the resulting distribution is not an MLE of the true
distribution when K = 3. Run_NPEM_Galaxy.m also produces Fig. 4, which represents the
resulting discrete distribution of which is presented below
Fig. 4: The graph of the discrete distribution of for K=3 shows 3 support points. Galaxy data
example (The height of each vertical line is the corresponding probability on the support point)
Moreover, if we run the program for K = 82 and the same settings, the resulting values are:
theta =
Columns 1 through 8
9.7420 9.7420 9.7420 9.7420 9.7420 9.7420 9.7420 9.7420
Columns 9 through 16
9.7420 9.7420 9.7420 9.7420 9.7420 9.7420 9.7420 21.1342
Columns 17 through 24
21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342
Columns 25 through 32
21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342
Columns 33 through 40
21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342
Columns 41 through 48
21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342 21.1342
Columns 49 through 56
21.1342 21.1342 21.1342 23.9844 23.9844 23.9844 23.9844 23.9844
Columns 57 through 64
23.9844 23.9844 23.9844 23.9844 23.9844 23.9844 23.9844 23.9844
Columns 65 through 72
23.9844 23.9844 23.9844 32.9931 32.9931 32.9931 32.9931 32.9931
Columns 73 through 80
32.9931 32.9931 32.9931 32.9931 32.9931 32.9931 32.9931 32.9931
Columns 81 through 82
32.9931 32.9931
w =
Columns 1 through 8
0.0088 0.0089 0.0088 0.0086 0.0084 0.0075 0.0070 0.0062
Columns 9 through 16
0.0053 0.0045 0.0039 0.0035 0.0026 0.0013 0.0004 0.0000
Columns 17 through 24
0.0004 0.0005 0.0013 0.0026 0.0028 0.0041 0.0055 0.0071
Columns 25 through 32
0.0098 0.0112 0.0134 0.0155 0.0203 0.0224 0.0268 0.0283
Columns 33 through 40
0.0310 0.0346 0.0361 0.0375 0.0397 0.0407 0.0407 0.0408
Columns 41 through 48
0.0403 0.0386 0.0374 0.0349 0.0332 0.0308 0.0283 0.0253
Columns 49 through 56
0.0205 0.0184 0.0135 0.0110 0.0117 0.0110 0.0092 0.0080
Columns 57 through 64
0.0071 0.0061 0.0046 0.0038 0.0032 0.0023 0.0018 0.0015
Columns 65 through 72
0.0010 0.0006 0.0002 0.0006 0.0009 0.0016 0.0018 0.0021
Columns 73 through 80
0.0025 0.0027 0.0028 0.0030 0.0032 0.0032 0.0032 0.0032
Columns 81 through 82
0.0031 0.0030
LogLikelihood =
-212.1969
Then we can see that the value of the log-likelihood function has increased, and the resulting
discrete distribution, which is presented below, shows that we have 4 different support points.
Fig. 5: The graph of the discrete distribution of for K= 82 shows 4 support points. Galaxy data
Indeed, if we run the program for K = 4, the results are
>> Run_NPEM_Galaxy
number of support points = 4
count =
972
theta =
9.7420 21.1342 23.9844 32.9931
w =
0.0858 0.7941 0.0832 0.0369
LogLikelihood =
-212.1969
max_value =
1.1422e-006
We can see that the maximum value of the function is very close to zero and the graph,
which is presented below, shows that the support points are among those for which .
Therefore, by Theorem 3, we can say that the distribution we found is the optimal one.
Fig. 6: The graph of the D function for Galaxy Data with 3 support points. Galaxy data
We also present the graph of the resulting MLE of the true distribution of ,
Fig. 7: The graph of the discrete distribution of for K= 4 shows 4 support points. Galaxy data
Pharmacokinetics (PK) example with multiplicative measurement noise
We will apply the NPEM1 algorithm to simulated data representing a series of
measurements of the drug concentration in the blood of each patient in the group. The drug
concentration in the organism t hours after administration of the initial dose can be
described by the following exponential decay function
(3.14)
where is the elimination rate constant and is the volume of distribution.
Suppose that we have N subjects (patients) who were administered an initial dose of
some drug. Then we take a series of measurements of the drug concentration in the blood of
each subject. These measurements can be described by the following PK model:
(3.15)
where the measurement is the drug concentration for the i-th subject at a given time, and the volume of distribution
and the elimination rate constant are subject-specific. The measurement error term is
assumed to be standard normally distributed, and the scale factor is possibly unknown.
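In generic notation, with D the dose, V_i the volume of distribution and K_i the elimination rate constant of subject i, t_{ij} the sampling times, σ the scale factor and ε_{ij} standard normal errors (these symbols are ours and may differ from the original notation), a model of the kind described reads
\[
C_i(t) \;=\; \frac{D}{V_i}\,e^{-K_i t}, \qquad
y_{ij} \;=\; \frac{D}{V_i}\,e^{-K_i t_{ij}}\,\bigl(1 + \sigma\,\varepsilon_{ij}\bigr), \qquad \varepsilon_{ij} \sim N(0,1),
\]
so that the measurement noise is multiplicative, consistent with the title of this example.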
The unknown parameters are . We assume that they are
independent and identically distributed given their common but unknown distribution
function .
Therefore using our earlier notation let and the objective is to estimate the
distribution given the data .
While illustrating the work of the algorithm on this data we will consider two different approaches:
linear and nonlinear.
Linear approach
Case 1. For simplicity we first assume the volumes of distribution to be known; therefore, in this case, the unknown
parameter is one-dimensional for each subject. In order to “linearize” the model we take the log of both
sides of equation (3.15); then we get
(3.16)
Notice that for small enough values of , therefore we have
and since and are known constants we can write
where (3.17)
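With the same generic symbols, the linearization argument can be written as
\[
\log y_{ij} \;=\; \log\frac{D}{V} \;-\; K_i\,t_{ij} \;+\; \log\bigl(1+\sigma\,\varepsilon_{ij}\bigr)
\;\approx\; \log\frac{D}{V} \;-\; K_i\,t_{ij} \;+\; \sigma\,\varepsilon_{ij},
\]
using log(1 + x) ≈ x for small x; since D and V are known in this case, the constant log(D/V) can be moved to the left-hand side, leaving a model that is linear in the one-dimensional parameter with additive Gaussian noise.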
If we use our earlier notations we have that the conditional density of given
is
(3.18)
Then if we denote to be the vector of measurements of the th subject and by
the independence of s the conditional density of given is
(3.19)
If we now consider to be the number of support points and to be the vector of parameters
and their corresponding weights as in (2.11) then the conditional density of given is
(3.20)
and the log-likelihood function is
(3.21)
Then the updated values according to (2.12) and (2.13) are
As for the Galaxy data we can find this maximum explicitly by solving the following equation
for
thus
(3.22)
and
(3.23)
where
(3.24)
Case 2. Let us now consider the volumes of distribution to be unknown as well; therefore, in this case, the parameters
are two-dimensional, denoted as where . Then the “linearized” PK model
can be written as
where and (3.25)
Therefore for some conditional density of given is
(3.26)
Then similarly if we denote to be a vector of all measurements for th subject, the conditional
density of given is
(3.27)
Assuming is the number of support points the conditional density of given and log-
likelihood function are
(3.28)
Also the updated values of according to (2.12) and (2.13) are
Then the updated values of and can be found by the equation (2.20), if we write the
model (3.25) in the matrix notation:
Therefore if we write it all out, for all we have
(3.29)
and
(3.30)
where
(3.31)
Nonlinear approach
Let us now consider a PK model as it was defined at the beginning
(3.32)
Case 3. For simplicity, as before, we first assume the volume of distribution to be known and the unknown parameter
to be one-dimensional for each subject. Therefore, for some , if
we denote then the conditional density of given is
(3.33)
Then, by independence, the conditional density of given is
Letting to be the number of support points, as earlier, the conditional density of given
and the log-likelihood function are
Then the updated values of according to (2.12) and (2.13) are
(3.34)
where .
Case 4. Let us now consider the more general case when the volumes are assumed to be unknown. Therefore,
keeping the same notation as in the linear approach, we let the parameter
be two-dimensional, where the first and second components represent the
unknown elimination rate and the volume of distribution, respectively. Then if we let
the formulas we used in case 1 stay unchanged. Thus if is the number
of support points of MLE of the true distribution of , the updated values of are
(3.35)
where and .
Data simulation
As we mentioned before, to illustrate the NPEM1 algorithm we use simulated data. Let us
now describe the simulations. The observations were simulated from the corresponding models
using the following parameters:
When the volume of distribution is assumed known we set it to a fixed value; otherwise it is also simulated. The elimination rate constants
are simulated either from a single normal distribution or from a mixture of two normal
distributions, that is:
Model 1:
Model 2: , where (we use )
We will call these two cases unimodal and bimodal, respectively.
The error term is taken from a standard normal distribution and the scale factor is set to a fixed value.
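A MATLAB sketch of this simulation is given below. All numerical values (number of subjects, dose, volume, sampling times, mixture means and standard deviations, and noise scale) are placeholders and are not the values used in the thesis.

N = 20;  D = 100;  V = 2;  sigma = 0.03;           % placeholder constants
t = 1:5;                                           % placeholder sampling times (hours)
bimodal = true;                                    % false: Model 1 (unimodal), true: Model 2 (bimodal)
if bimodal
    comp = rand(N, 1) < 0.5;                       % mixture component indicator
    Kel  = comp .* (0.3 + 0.03 * randn(N, 1)) + (~comp) .* (0.6 + 0.03 * randn(N, 1));
else
    Kel  = 0.5 + 0.05 * randn(N, 1);               % single normal elimination rates
end
C = (D / V) * exp(-Kel * t);                       % noise-free concentrations, one row per subject
Y = C .* (1 + sigma * randn(size(C)));             % multiplicative measurement noise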
Algorithm illustration
Linear approach. One-dimensional case. (Case 1)
The program Run_NPEM_linear.m uses the data simulated by Model 1 and Model 2 and the
algorithm described above to find the MLE of the true distribution of . Here we will illustrate
the results we got together with the description of the way the program works. We will start with
the data for the linear case first.
1. Type Run_NPEM_linear in the command window and press return. The result is
>> Run_NPEM_linear
number of support points =
2. Input the number of support points, say K = N = 20 and press return. The output is
>> Run_NPEM_linear
number of support points = 20
dimension of theta =
3. Input the dimension of theta; we start with the one-dimensional case
>> Run_NPEM_linear
number of support points = 20
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2”
4. Input the type of the distribution from where the elimination rate constant is simulated,
we start with the single normal distribution, i.e. we type 1. The output is
>> Run_NPEM_linear
number of support points = 20
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 1
count =
627
theta =
Columns 1 through 7
0.3588 0.3588 0.3588 0.3588 0.3588 0.3588 0.3588
Columns 8 through 14
0.3588 0.5035 0.5035 0.5035 0.5035 0.5362 0.5362
Columns 15 through 20
0.6213 0.6213 0.6213 0.6213 0.6213 0.6213
w =
Columns 1 through 7
0.0000 0.0000 0.0000 0.0001 0.0009 0.0158 0.0357
Columns 8 through 14
0.0478 0.0355 0.1349 0.1928 0.2606 0.1298 0.0893
Columns 15 through 20
0.0071 0.0295 0.0158 0.0034 0.0009 0.0000
LogLikelihood =
-26.4905
D_max_val =
-7.5292e-04
We can see that the maximum value of the D function is very close to zero, and the graph, which is presented below, confirms that the D function is nonpositive everywhere. Moreover, we also present the graph of the MLE of the true distribution of , where we can notice 4 distinct points of support; they are among those points where the D function is equal to zero.
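For reference, a minimal MATLAB sketch of how such a D function can be evaluated on a grid is given below, assuming the standard NPML form D(theta) = sum_i p(y_i | theta) / p(y_i | F_hat) - N; likfun is again a placeholder for the model-specific likelihood.
% Sketch: evaluate the D function on a grid of candidate support points,
% assuming D(theta) = sum_i p(y_i|theta)/p(y_i|F_hat) - N.
% likfun(yi, theta) is a placeholder for the subject likelihood of the model at hand.
function D = dfun(grid, y, theta, w, likfun)
    N  = numel(y);  K = numel(w);
    pF = zeros(N,1);
    for i = 1:N
        for k = 1:K
            pF(i) = pF(i) + w(k)*likfun(y{i}, theta(k));  % p(y_i | F_hat)
        end
    end
    D = zeros(size(grid));
    for g = 1:numel(grid)
        pg = zeros(N,1);
        for i = 1:N
            pg(i) = likfun(y{i}, grid(g));                % p(y_i | theta = grid(g))
        end
        D(g) = sum(pg ./ pF) - N;
    end
end
% At the NPML estimate D <= 0 everywhere, with D = 0 at the support points.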
Fig 8: The graph of the simulated data for Case 1 and Model 1.
Fig 9: The graph of the D function for K = 20, Case 1 and Model 1
Fig 10: The graph of the for K = 20, Case 1 and Model 1; it shows 4 support points
Fig 11: True distribution of Model 1
Let us now run the program for K = 4. The way it works is absolutely the same; therefore we just
show the results.
>> Run_NPEM_linear
number of support points = 4
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 1
count =
1327
theta =
0.3588 0.5035 0.5362 0.6213
w =
0.1003 0.6238 0.2191 0.0567
LogLikelihood =
-26.4905
D_max_val =
-7.5612e-04
Notice that the value of the likelihood function did not change and the maximum value of the D function is still very close to zero. These results and the figures presented below allow us to conclude, by Theorem 3, that the distribution we found is the MLE of the true distribution of .
Fig 12: The graph of the D function for K = 4, Case 1 and Model 1
Fig 13: The graph of the for K = 4, Case 1 and Model 1; it shows 4 support points
Let us now run the program for 20 support points, one-dimensional but the type of the
distribution, from where the elimination rate constant is simulated, is now taken to be a
mixture of two normal distributions. Since the way the program works in this case is also the
same as it was shown before we will only show the output of the program.
>> Run_NPEM_linear
number of support points = 20
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 2
count =
600
theta =
Columns 1 through 8
0.2168 0.2168 0.2168 0.2997 0.2997 0.2997 0.2999 0.3535
Columns 9 through 16
0.3535 0.3535 0.5332 0.5332 0.5332 0.5332 0.6705 0.6705
Columns 17 through 20
0.6705 0.6705 0.6705 0.6705
w =
Columns 1 through 8
0.0044 0.0302 0.0436 0.0414 0.0958 0.1566 0.1791 0.1515
Columns 9 through 16
0.0574 0.0395 0.0100 0.0244 0.0340 0.0319 0.0081 0.0275
Columns 17 through 20
0.0294 0.0233 0.0101 0.0018
LogLikelihood =
-2.4748
D_max_val =
-4.9737e-08
As we expected, since we used , the maximum value of the function is close to zero, and the following graphs suggest that the MLE of the true distribution of might have 5 distinct points of support.
Fig 14: The graph of the simulated data for Case 1 and Model 2.
Fig 15: The graph of the D function for K = 20, Case 1 and Model 2
Fig 16: The graph of the for K = 20, Case 1 and Model 2; it shows 5 support points
Fig 17: True distribution of Model 2
Let us test this assumption and run our program for .
>> Run_NPEM_linear
number of support points = 5
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 2
count =
461
theta =
0.2168 0.2998 0.3535 0.5332 0.6705
w =
0.0783 0.4732 0.2480 0.1003 0.1002
LogLikelihood =
-2.4748
D_max_val =
-2.9370e-08
Indeed, we can see that the maximum value of the function is very close to zero and the graph, which is presented below, shows that the support points are among those for which . Therefore, by Theorem 3, we can say that the distribution we found is the MLE of the true distribution of .
Fig 18: The graph of the D function for K = 5, Case 1 and Model 2
Fig 19: The graph of the for K = 5, Case 1 and Model 2; it shows 5 support points
Linear approach
Two-dimensional case. (Case 2)
Let us now move to the two-dimensional . We will run the program Run_NPEM_linear.m
for the data simulated by the Model 1 where the elimination rate constant is simulated
from the single normal distribution.
>> Run_NPEM_linear
number of support points = 20
dimension of theta = 2
for unimodal model input “1”, for bimodal model input “2” 1
count =
95
theta =
Columns 1 through 8
0.3962 0.3962 0.3962 0.3962 0.3962 0.3962 0.3962 0.3962
3.0362 3.0362 3.0362 3.0362 3.0362 3.0362 3.0362 3.0362
Columns 9 through 16
0.3962 0.4915 0.4915 0.4915 0.5502 0.5502 0.5502 0.5502
3.0362 2.8859 2.8859 2.8859 3.0786 3.0786 3.0786 3.0786
Columns 17 through 20
0.5502 0.5502 0.5502 0.5502
3.0786 3.0786 3.0786 3.0786
w =
Columns 1 through 8
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0034 0.0215
Columns 9 through 16
0.0400 0.0709 0.2125 0.1808 0.1106 0.2297 0.0963 0.0272
Columns 17 through 20
0.0070 0.0001 0.0000 0.0000
LogLikelihood =
3.5667
D_max_val =
4.6850e-04
The results show that the function is nonpositive everywhere (up to numerical tolerance), and moreover we can see that there are approximately three distinct points of support of the distribution we found.
Fig 20: The graph of the simulated data for Case 2 and Model 1
Fig 21: The graph of the D function for K = 20, Case 2 and Model 1
We will test our guess by running the program for . The results are
>> Run_NPEM_linear
number of support points = 3
dimension of theta = 2
for unimodal model input “1”, for bimodal model input “2” 1
count =
140
theta =
0.3962 0.4915 0.5502
3.0362 2.8859 3.0786
w =
0.0649 0.4642 0.4709
LogLikelihood =
3.5667
D_max_val =
4.7606e-04
We can see that there are 3 distinct points of support, as we guessed earlier. Moreover, the maximum value of the function is close to zero and the graph of the function, which is presented below, shows that the points we found are among the zeros of this function. Therefore, by Theorem 3, we can conclude that the distribution we found is the MLE of the true distribution of .
Fig 22: The graph of the D function for K = 3, Case 2 and Model 1
In the case when the elimination rate constant is simulated from the mixture of two normal distributions, a similar analysis shows that the MLE of the true distribution of in this case has 8 points of support and the value of the likelihood function is -4.7029. The
following is the output of Run_NPEM_linear.m for , dimension of is 2 and the
bimodal case.
>> Run_NPEM_linear
number of support points = 8
dimension of theta = 2
for unimodal model input “1”, for bimodal model input “2” 2
count =
36
theta =
0.2957 0.2957 0.2957 0.3466 0.4301 0.5825 0.6666 0.6666
2.8686 2.8686 2.8686 3.0729 3.1708 2.9520 3.0955 3.0955
w =
0.0002 0.0462 0.3818 0.3718 0.0493 0.0504 0.0986 0.0017
LogLikelihood =
-4.7029
D_max_val =
-3.1112e-04
Fig 23: The graph of the D function for K = 8, Case 2 and Model 2
Nonlinear approach
One-dimensional case. (Case 3)
The program Run_NPEM_nonlinear.m uses the algorithm described above to find the MLE of
the true distribution of . Here we will illustrate the results we got together with the description
of the way the program works.
We will start with the one-dimensional and the unimodal case.
1. Type Run_NPEM_nonlinear in the command window and press return. The result is
>> Run_NPEM_nonlinear
number of support points =
2. Input the number of support points, say K = N = 20 and press return. The output is
>> Run_NPEM_nonlinear
number of support points = 20
dimension of theta =
3. Input the dimension of theta, we start with one dimensional case
>> Run_NPEM_nonlinear
number of support points = 20
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2”
4. Input the type of the distribution from where the elimination rate constant is simulated,
we start with the single normal distribution, i.e. we type 1. The output is
>> Run_NPEM_nonlinear
number of support points = 20
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 1
count =
86
theta =
Columns 1 through 8
0.4377 0.4377 0.4377 0.4377 0.4377 0.4377 0.4377 0.4377
Columns 9 through 16
0.4377 0.4377 0.5013 0.5013 0.5675 0.5675 0.5676 0.5676
Columns 17 through 20
0.5676 0.5676 0.5676 0.5676
w =
Columns 1 through 8
0.0000 0.0000 0.0000 0.0000 0.0000 0.0009 0.0041 0.0117
Columns 9 through 16
0.0722 0.0758 0.1622 0.2454 0.1796 0.1651 0.0557 0.0214
Columns 17 through 20
0.0055 0.0002 0.0000 0.0000
LogLikelihood =
8.8038
D_max_val =
7.1013e-05
Fig 24: The graph of the simulated data for Case 3 and Model 1
Fig 25: The graph of the D function for K = 20, Case 3 and Model 1
Fig 26: The graph of the for K = 20, Case 3 and Model 1; it shows 3 support points
The results indicate that the MLE of the true distribution of might have three distinct points of support and the likelihood function is equal to 8.8038.
Let us run the program for . The output is
>> Run_NPEM_nonlinear
number of support points = 3
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 1
count =
100
theta =
0.4377 0.5013 0.5675
w =
0.1649 0.4076 0.4275
LogLikelihood =
8.8038
D_max_val =
-5.8039e-05
Fig 27: The graph of the D function for K = 3, Case 3 and Model 1
Fig 28: The graph of the for K = 3, Case 3 and Model 1; it shows 3 support points
Indeed, by Theorem 3, the distribution we found is the MLE of the true distribution of in this case.
We did a similar analysis for the bimodal case and found that the MLE of the true distribution of in this case has 6 points of support and the value of the likelihood function is -78.5026. The following is the output of Run_NPEM_nonlinear.m for , dimension of is 1 and the bimodal case.
>> Run_NPEM_nonlinear
number of support points = 6
dimension of theta = 1
for unimodal model input “1”, for bimodal model input “2” 2
count =
41
theta =
0.1663 0.3006 0.3644 0.5021 0.5882 0.6589
w =
0.0516 0.5757 0.1754 0.0550 0.0562 0.0862
LogLikelihood =
-78.5026
D_max_val =
4.3684e-05
Fig 29: The graph of the D function for K = 6, Case 3 and Model 2
Fig 30: The graph of the for K = 6, Case 3 and Model 2; it shows 6 support points
Fig 31: True distribution of Model 2
Nonlinear approach
Two-dimensional case. (Case 4)
Let us now present the resulting distributions we found for the two-dimensional . We first start with the unimodal case. The distribution we got in this case has 3 points of support and
the likelihood is equal to 7.4824. The following is the output of Run_NPEM_nonlinear.m for
, dimension of is 2 and the unimodal case. The code for this program is presented in
the appendix.
>> Run_NPEM_nonlinear
number of support points = 3
dimension of theta = 2
for unimodal model input "1", for bimodal model input "2" 1
count =
61
theta =
0.3923 0.5028 0.5497
22.7622 19.1202 23.3316
w =
0.1520 0.4140 0.4341
LogLikelihood =
7.4824
D_max_val =
-1.6324e-04
Fig 32: The graph of the D function for K = 3, Case 4 and Model 1
In case when the elimination rate constant is simulated from the mixture of two
normal distributions the results we got show that the MLE of the true distribution of
has 10 points of support and the likelihood function is equal to -45.3473. The output of
the Run_NPEM_nonlinear.m for , two-dimensional and bimodal case is
represented below.
>> Run_NPEM_nonlinear
number of support points = 10
dimension of theta = 2
for unimodal model input "1", for bimodal model input "2" 2
count =
95
theta =
Columns 1 through 8
0.2740 0.2740 0.2740 0.2986 0.3455 0.4384 0.5830 0.6658
19.4619 19.4619 19.4619 19.0598 23.0426 26.1566 20.6542 23.7547
Columns 9 through 10
0.6658 0.6658
23.7547 23.7547
w =
Columns 1 through 8
0.0001 0.0090 0.2032 0.2289 0.3642 0.0526 0.0430 0.0377
Columns 9 through 10
0.0608 0.0005
LogLikelihood = -45.3490
D_max_val = 0.0157
Fig 33: The graph of the simulated data for Case 4 and Model 2
Fig 34: The graph of the D function for K = 6, Case 4 and Model 2
Fig. 35: (a) Smoothed graph for K = 6, Case 4 and Model 2; it shows 6 support points. (b) Graph of the true distribution.
Pharmacokinetics (PK) example with additive measurement noise
We will apply the NPEM1 algorithm to simulated data representing a series of measurements of the drug concentration in the blood of each patient in the group. The drug
concentration in the organism hours after administration of the initial dose can be
described by the following exponential decay function
(3.36)
where is the elimination rate constant and is the volume of distribution.
Suppose that we have subjects (patients) who were administered an initial dose of
some drug. Then we take a series of measurements of the drug concentration in the blood of
each subject. These measurements can be described by the following PK model:
(3.37)
where is the drug concentration for th subject at a time , is the volume of distribution
and is the elimination rate constant for th subject. The measurement error term is
assumed to be standard normally distributed and denotes a possibly unknown scale factor.
The unknown parameters are . We assume that they are
independent and identically distributed given their common but unknown distribution
function .
Therefore using our earlier notation let and the objective is to estimate the
distribution given the data .
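A minimal MATLAB sketch of how observations of the form (3.37) can be simulated is shown below; the dose, sampling times and all parameter values are placeholders chosen only for illustration.
% Sketch: simulate concentrations from the PK model (3.37).
% Dose, sampling times and all parameter values below are placeholders.
N = 20;  Dose = 100;  t = 1:5;  sigma = 0.1;
K = 0.3 + 0.05*randn(N,1);                 % elimination rate constants
V = 20  + 2*randn(N,1);                    % volumes of distribution
y = zeros(N, numel(t));
for i = 1:N
    y(i,:) = (Dose/V(i))*exp(-K(i)*t) + sigma*randn(1, numel(t));
end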
NPEM 1 algorithm illustration
Let us now illustrate the main formulas we need to apply the NPEM1 algorithm for this model.
Let the parameter be two-dimensional, where the first and the second components and represent the unknown elimination rate and the volume of distribution respectively. If we denote , then the conditional density of given is
(3.38)
Then by the independence if the conditional density of given is
Letting be the number of support points, as earlier, the conditional density of given and the log-likelihood function are
Thus if is the number of support points of MLE of the true distribution of , the updated
values of are
(3.39)
where and .
Data Simulation
To illustrate the NPEM1 algorithm we described above we used the simulated data. The
observations were simulated according to (3.37) using the following parameters:
The volume of distribution is taken to be . The elimination rate constant is
simulated from the mixture of two normal distributions that is
,
where (we used ). The measurement error term covariance matrix is taken to be
.
Results: We ran Run_NPEM_nonlinear.m for support points and got the following
results.
Elapsed time is 1983 seconds.
count = 113
theta =
Columns 1 through 9
0.3357 0.1891 0.2925 0.2925 0.6184 0.6184 0.3357 0.2294 0.3357
23.1324 20.2387 20.9565 20.9565 22.6233 22.6233 23.1324 29.1334 23.1324
Columns 10 through 18
0.3543 0.2737 0.3357 0.6184 0.3357 0.3357 0.2925 0.3357 0.3357
15.2039 18.2366 23.1324 22.6233 23.1324 23.1324 20.9537 23.1324 23.1324
Columns 19 through 20
0.6184 0.6184
22.6233 22.6233
w =
Columns 1 through 9
0.0101 0.1015 0.0001 0.0863 0.0387 0.0004 0.0001 0.0479 0.0007
Columns 10 through 18
0.0916 0.1246 0.0043 0.0095 0.0000 0.1003 0.2324 0.0000 0.0000
Columns 19 through 20
0.0836 0.0678
LogLikelihood = -44.4991
D_max_val = 0.6171
Fig 36: The graph of the D function. PK example with additive noise
Fig 37: (a) The graphs of the true distribution of the elimination rate (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). (b) The graphs of the true distribution of the volume of distribution (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes).
Conclusions:
The results we got show that the MLE of the true distribution of has 7 points of support and the likelihood function is equal to -44.4991. We can clearly see from figures 37 (a) and (b) that the estimates are very close to the true distributions.
EM1 +IP algorithm illustration:
Let us now illustrate the way EM1+IP algorithm works for this model.
Step 1 The update on the support points is done by the equation (3.39)
Step 2 The update on the weights is done by the Primal-Dual Interior Point method.
The data was simulated exactly the same way as in the NPEM1 algorithm. Moreover the
EM1+IP algorithm is a part of the Run_NPEM_nonlinear.m algorithm.
Results: We ran Run_NPEM_nonlinear.m for support points and got the following
results.
Elapsed time is 1295 seconds.
count = 70
theta =
Columns 1 through 9
0.3357 0.1891 0.2937 0.2936 0.6184 0.6184 0.3357 0.2294 0.2938
23.1324 20.2387 21.0130 21.0106 22.6233 22.6233 23.1324 29.1335 21.0169
Columns 10 through 18
0.3543 0.2737 0.3357 0.6184 0.3357 0.3357 0.2925 0.3357 0.3357
15.2038 18.2364 23.1324 22.6233 23.1324 23.1324 20.9544 23.1324 23.1324
Columns 19 through 20
0.6184 0.6184
22.6233 22.6233
w =
Columns 1 through 9
0.0165 0.1015 0.0000 0.0000 0.0400 0.0400 0.0165 0.0479 0.0000
Columns 10 through 18
0.0916 0.1246 0.0165 0.0400 0.0165 0.0165 0.3188 0.0165 0.0165
Columns 19 through 20
0.0400 0.0400
LogLikelihood = -44.4991
D_max_val = 0.6176
Fig 38: The graph of the D function. EM1+IP
Fig 39: (a) The graphs of the true distribution of the elimination rate (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). (b) The graphs of the true distribution of the volume of distribution (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). EM1+IP.
Conclusions:
The results of the EM1+IP algorithm show that the MLE of the true distribution of also has 7 points of support and the likelihood function is equal to -44.4991. We can clearly see from figures 39 (a) and (b) that the estimates are very close to the true distributions and, moreover, they are very similar to those we got for the NPEM1 algorithm. There is one major difference in performance between these two programs: running time. Running the EM1+IP algorithm takes almost half the time needed for NPEM1.
NPEM2 algorithm illustration:
Let us now illustrate the way NPEM2 algorithm works for this model.
Step 1 Choose a big set of support points.
Step 2 The update on the weights is done by the equation (3.39).
The data was simulated exactly the same way as in the NPEM1 algorithm. The NPEM2
algorithm is also a part of the Run_NPEM_nonlinear.m algorithm.
Results: We ran Run_NPEM_nonlinear.m for a random set of support points of size 2129 and
got the following results.
Elapsed time is 1295.925732 seconds.
count = 358
theta =
Columns 1 through 9
0.2945 0.3402 0.2723 0.3303 0.2271 0.6104 0.3745 0.3358 0.3546
20.9180 22.8809 18.2520 15.6299 29.4580 22.9248 15.2124 23.2104 15.2417
Columns 10 through 15
0.6218 0.2819 0.6241 0.6259 0.1923 0.2802
22.3901 20.3394 22.5659 22.5220 20.2368 20.9106
w =
Columns 1 through 9
0.3056 0.0001 0.1153 0.0559 0.0467 0.0540 0.0413 0.1146 0.0001
Columns 10 through 15
0.0931 0.0102 0.0517 0.0013 0.1033 0.0069
LogLikelihood = -44.5343
D_max_val = 0.6695
Fig 40: The graph of the D function. NPEM 2
Fig 41: (a) The graphs of the true distribution of the elimination rate (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). (b) The graphs of the true distribution of the volume of distribution (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). NPEM2.
Conclusions: The algorithm NPEM2 works relatively fast and gives good results. The optimization of the likelihood in this case depends very much on the set of points we choose at the beginning. However, the main reason that makes this algorithm important to consider is that the EM algorithm converges to the global maximum in this case, according to Theorem 3.
EM2 +IP algorithm illustration:
Let us now illustrate the way EM2+IP algorithm works for this model.
Step 1 Choose a big set of support points.
Step 2 The update on the weights is done by the Primal-Dual Interior Point method.
The data was simulated exactly the same way as in the NPEM1 algorithm. The EM2+IP
algorithm is also a part of the Run_NPEM_nonlinear.m algorithm.
Results: We ran Run_NPEM_nonlinear.m for a random set of support points of size 2129 and
got the following results.
Elapsed time is 37.027546 seconds.
count = 2
theta =
Columns 1 through 9
0.2945 0.2723 0.3303 0.2271 0.6104 0.3745 0.3358 0.6218 0.2819
20.9180 18.2520 15.6299 29.4580 22.9248 15.2124 23.2104 22.3901 20.3394
Columns 10 through 12
0.6241 0.1923 0.2802
22.5659 20.2368 20.9106
w =
Columns 1 through 9
0.3063 0.1152 0.0559 0.0467 0.0498 0.0414 0.1146 0.0934 0.0111
Columns 10 through 12
0.0568 0.1033 0.0055
LogLikelihood = -44.5343
D_max_val = 0.6696
Fig 42: The graph of the D function. EM2+IP
Fig 43: (a) The graphs of the true distribution of the elimination rate (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). (b) The graphs of the true distribution of the volume of distribution (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes). EM2+IP.
Conclusions:
The algorithm EM2+IP works extremely fast, dramatically faster than NPEM2 (about 37 seconds versus roughly 1296 seconds in this example). However, the answer it gives depends strongly on the initial set of points that is chosen, since this set does not change during the optimization. This algorithm is very similar to the USC-NPAG algorithm; the difference is that the USC-NPAG algorithm moves the grid while optimizing the likelihood.
Rolling thumbtacks example
Beckett and Diaconis [1994] generated binary strings from rolls of common thumbtacks. For the
data set, 320 thumbtacks were flicked 9 times each. Conditioned on the tack, it is assumed the
flicks are independent. Let X denote the number of times each tack landed point up. The data
are summarized in Table 3 below:
y 0 1 2 3 4 5 6 7 8 9
Count 0 3 13 18 48 47 67 54 51 19
Table 3. Rolling Thumbtacks example data
In this example and
(3.40)
where .
Assuming that is the number of support points of and we have
(3.41)
Then for some the updated values of are
(3.42)
To find this argmax we just need to solve the following equation for
therefore
(3.43)
and
(3.44)
where
(3.45)
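A minimal MATLAB sketch of these updates is given below, assuming the binomial likelihood p(y | theta) = C(9, y) theta^y (1 - theta)^(9-y) with 9 flicks per tack; the number of iterations and the initial grid are arbitrary choices, not the settings of Run_NPEM_ThEx.m.
% Sketch of the NPEM updates for the thumbtack example, assuming a binomial
% likelihood with 9 flicks per tack; this is not the actual Run_NPEM_ThEx.m code.
yvals  = 0:9;
counts = [0 3 13 18 48 47 67 54 51 19];                 % Table 3
y = repelem(yvals, counts)';  N = numel(y);             % N = 320 tacks
C = arrayfun(@(v) nchoosek(9, v), y);                   % binomial coefficients
K = 10;  theta = linspace(0.1, 0.9, K);  w = ones(1, K)/K;
for iter = 1:2000
    lik = zeros(N, K);
    for k = 1:K
        lik(:,k) = C .* theta(k).^y .* (1 - theta(k)).^(9 - y);
    end
    mix  = lik * w';                                    % p(y_i | F)
    post = lik .* w ./ mix;                             % responsibilities
    w     = mean(post, 1);                              % weight update
    theta = ((post' * y) ./ (9 * sum(post, 1)'))';      % closed-form update, as in (3.43)
end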
It is found in [Liu, 1996] that the maximum likelihood estimate of has three support points at
with weights. Using our method we got the following results:
>> Run_NPEM_ThEx
number of support points K =10
count = 4202
theta = 0.4385 0.4385 0.6020 0.6020 0.8144 0.4385 0.5997 0.8144 0.4385
0.4385
w = 0.0531 0.0032 0.1451 0.1449 0.1463 0.1282 0.1504 0.2229 0.0019 0.0040
LogLikelihood = -640.2757
max_value = -4.0177e-04
Fig 44: The graph of for K=10. Rolling thumbtacks
Fig 45: The graph of the D function for K=10. Rolling thumbtacks
>> Run_NPEM_ThEx
number of support points K =3
count = 5194
theta = 0.6008 0.4381 0.8144
w = 0.4411 0.1892 0.3697
LogLikelihood = -640.2757
max_value = -8.7616e-04
Fig 46: The graph of for K=3. Rolling thumbtacks
Fig 47: The graph of the D function for K=3. Rolling thumbtacks
Eye Tracking Example
y 0 1 2 3 4 5 6 7 8 9 10 11 12 14 15 17 22 24 34
Count 46 14 9 4 2 3 3 3 1 2 2 2 2 1 2 2 1 1 1
Table 4: Eye Tracking example, measurements (101 observations summarized by value and count)
The model is given by Eqs. (3.46) with
(3.46)
There are 101 data points , but there are only 19 distinct values in the set . The data are presented in Table 4.
Assuming that is the number of support points of and we have
(3.47)
Then for some the updated values of are
(3.48)
To find this argmax we just need to solve the following equation for
therefore
(3.49)
and
(3.50)
where
(3.51)
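For illustration, a minimal MATLAB sketch of these updates is given below, assuming the Poisson likelihood p(y | theta) = exp(-theta) theta^y / y! suggested by the count data; the iteration count and starting grid are arbitrary, and this is not the actual Run_NPEM_EtrackEx.m code.
% Sketch of the NPEM updates for the eye-tracking counts, assuming a Poisson
% likelihood; this is not the actual Run_NPEM_EtrackEx.m code.
yvals  = [0 1 2 3 4 5 6 7 8 9 10 11 12 14 15 17 22 24 34];
counts = [46 14 9 4 2 3 3 3 1 2 2 2 2 1 2 2 1 1 1];
y = repelem(yvals, counts)';  N = numel(y);             % N = 101 measurements
K = 6;  theta = linspace(0.5, 35, K);  w = ones(1, K)/K;
for iter = 1:5000
    lik  = exp(-theta) .* theta.^y ./ factorial(y);     % N-by-K by implicit expansion
    mix  = lik * w';
    post = lik .* w ./ mix;                             % responsibilities
    w     = mean(post, 1);                              % weight update
    theta = ((post' * y) ./ sum(post, 1)')';            % closed-form update, as in (3.49)
end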
We got the following results:
For K=6, eps = 10^(-9)
>> Run_NPEM_EtrackEx
number of support points K =6
count = 747
theta = 0.0000 1.0912 6.0335 12.0468 20.1079 31.7978
w = 0.3266 0.3826 0.1418 0.1094 0.0276 0.0119
LogLikelihood = -218.7495
Fig 48: The graph of for K=6. Eye Tracking example
Fig 49: The graph of the D function for K=6. Eye Tracking example
For K=101, eps =10^(-9)
>> Run_NPEM_EtrackEx
number of support points K =101
count = 608
theta =
Columns 1 through 12
0.0000 0.0000 0.0000 1.0912 1.0912 1.0912 1.0912 1.0912 1.0912 1.0912
1.0912 1.0912
Columns 13 through 24
6.0334 6.0334 6.0334 6.0334 6.0334 6.0334 6.0334 6.0334 6.0334 6.0335
6.0335 6.0335
Columns 25 through 36
6.0335 6.0341 12.0467 12.0468 12.0468 12.0468 12.0468 12.0468 12.0468 12.0468
12.0468 12.0468
Columns 37 through 48
12.0468 12.0468 12.0468 12.0468 12.0468 12.0468 12.0468 12.0468 12.0468
12.0468 12.0468 12.0468
Columns 49 through 60
12.0468 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079
20.1079 20.1079 20.1079
Columns 61 through 72
20.1079 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079 20.1079
20.1079 20.1079 20.1079
Columns 73 through 84
31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978
31.7978 31.7978 31.7978
Columns 85 through 96
31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978 31.7978
31.7978 31.7978 31.7978
Columns 97 through 101
31.7978 31.7978 31.7978 31.7978 31.7978
w =
Columns 1 through 12
0.1667 0.0917 0.0681 0.0754 0.0740 0.0591 0.0502 0.0436 0.0289 0.0219
0.0182 0.0114
Columns 13 through 24
0.0115 0.0123 0.0129 0.0128 0.0126 0.0122 0.0118 0.0111 0.0101 0.0088
0.0084 0.0070
Columns 25 through 36
0.0060 0.0042 0.0053 0.0064 0.0067 0.0067 0.0067 0.0066 0.0064 0.0063
0.0059 0.0056
Columns 37 through 48
0.0055 0.0052 0.0049 0.0045 0.0041 0.0039 0.0037 0.0032 0.0030 0.0028
0.0024 0.0020
Columns 49 through 60
0.0016 0.0012 0.0017 0.0017 0.0018 0.0017 0.0017 0.0017 0.0016 0.0015
0.0014 0.0014
Columns 61 through 72
0.0013 0.0012 0.0011 0.0011 0.0009 0.0009 0.0008 0.0007 0.0007 0.0006
0.0005 0.0005
Columns 73 through 84
0.0003 0.0003 0.0004 0.0005 0.0005 0.0005 0.0005 0.0005 0.0005 0.0005
0.0005 0.0005
Columns 85 through 96
0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004
0.0003 0.0003
Columns 97 through 101
0.0003 0.0003 0.0003 0.0003 0.0002
LogLikelihood = -218.7495
Fig 50: The graph of for K=101. Eye Tracking example
Fig 51: The graph of the D function for K=101. Eye Tracking example
The results show that the number of support points of is 6 and they are listed in the K=6
results section.
Chapter 4. NPML Estimation for Stochastic Models
In the previous chapters we discussed estimation of an unknown distribution of the population
parameters. The models we looked at were deterministic, i.e. we only considered measurement
noise
, (4.1)
where are known measurements, which we consider to be independent but not
necessarily identically distributed. is a known vector function, is a vector of unknown
parameters and is a measurement noise, which is considered to be normally distributed with
zero mean and some possibly unknown covariance matrix. The vectors , are
considered to be i.i.d with common but unknown distribution function F. The question is to
estimate F based on the given measurements .
However, one of the drawbacks of this model is that it does not take into account the uncertainties in the model itself, i.e. process noise. For example, in the case of the PK problem, the
errors like
• dose errors
• sample timing errors
• model misspecifications
are not included in the deterministic models. The stochastic models on the other hand can
overcome this problem.
Discrete time Stochastic population model:
Consider the ith subject in a sample population. The discrete time stochastic system that
corresponds to it can be written as follows:
(4.2)
where at time
- state vector
- known input
- vector of PK/PD parameters for the ith subject
- known vector function that defines a structural model
is a process noise vector
The measurements at time are
(4.3)
where
is the measurement noise
- known vector function that defines a measurement model
Population Problem: Now let , be the vector of measurements for the ith subject
as above. Then as before
are independent but not necessarily identically distributed
random variables , for . Assume also that the conditional density of given
is known and denoted by , where the ’s are unobserved random parameters that are
independent and identically distributed with common but unknown distribution function . The
likelihood is
The population problem is as before: maximize relative to whatever class of
distributions is being considered.
Historical Background: In a number of papers, Mortensen et al considered the above maximum
likelihood problem for the case where is assumed to be multivariate normal with unknown
mean vector and unknown covariance matrix. They developed software programs for Matlab
and R [M07]. More recently, [DL11] under the same normal hypothesis, developed software
programs for MONOLIX. We are going to study this problem using NPML methods.
In 1991, Mark Welle wrote an MS Thesis, under the supervision of Alan Schumitzky, which
considered NPML for a simple one compartment model with process noise. We have
generalized this work and made it compatible with the NPEM and IP programs.
It turns out that for linear problems, defined below, the likelihood can be calculated exactly using the linear Kalman-Bucy filter. For nonlinear problems, this cannot be
done and the more complicated nonlinear filtering must be used [J70]. We will therefore focus
on the linear case.
Linear case: In case when both and are linear functions of and
respectively the stochastic system is called linear. Otherwise it is called nonlinear. In this case
the system (4.2) and (4.3) can be written the following way:
(4.5)
where , and are known matrices and given the vectors and are
mutually independent for all k.
The log-likelihood function in this case can be written as follows:
(4.6)
where
(4.7)
Then the log-likelihood function now has the following form:
(4.8)
Let .
Proposition 1: The density is normal with mean m and covariance matrix .
Proof: We will drop the i subscript for cleaner notation.
Then for we have where ,
and . , , and are deterministic
vectors. Thus is a sum of normal random variables and therefore is normally distributed
itself. Then by induction it can be shown that is also normally distributed. The mean m and
covariance matrix are calculated using the Kalman-Bucy filter [J70].
(4.9)
where and are calculated as follows:
Linear Kalman-Bucy filter equations
Prediction step
(4.10)
(4.11)
Updating step
(4.12)
(4.13)
(4.14)
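A minimal MATLAB sketch of this prediction/updating recursion, used to evaluate the likelihood of one subject's measurement sequence, is given below; the names A, Q, H, R, m, P are generic placeholders standing in for the model quantities of Eqs. (4.10)-(4.14), and a scalar measurement is assumed.
% Sketch of the discrete Kalman-Bucy recursion for evaluating p(y_i | theta_i).
% A, Q, H, R and the initial m, P are placeholders for the model-specific
% quantities in (4.10)-(4.14); the measurement is assumed scalar.
function loglik = kalman_loglik(y, A, Q, H, R, m, P)
    loglik = 0;
    for k = 1:numel(y)
        % prediction step
        m = A*m;            P = A*P*A' + Q;
        % updating step
        S = H*P*H' + R;     e = y(k) - H*m;
        K = P*H'/S;
        m = m + K*e;        P = (eye(size(P)) - K*H)*P;
        loglik = loglik - 0.5*(log(2*pi*S) + e^2/S);    % scalar innovation assumed
    end
end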
Continuous state discrete observations linear stochastic population model:
(4.15)
where
- state vector
- known input
- vector of PK/PD parameters
, and are known matrices
- is the white Gaussian process with the covariance matrix , i.e.
with positive semi-definite matrix
- measurement vector
is the measurement noise. We consider it to be Gaussian sequence with zero
mean and the covariance matrix .
Let on the interval . Then the first equation of (4.2) can be integrated over the
interval :
(4.15)
where is a fundamental matrix of the first equation of (4.2), that is
Also and . It is shown in Jazwinski
[J70] that is a zero mean Gaussian sequence with the covariance matrix
(4.16)
Example 1:
PK one-compartment model is described by the linear stochastic differential equation for the
state vector and discrete linear equation for the observations at time ,
, (4.17)
, (4.18)
(4.19)
where
K – is an elimination rate
V – is the volume of distribution
D – is the initial dose of the drug
- is the white Gaussian process with the covariance matrix
- is a measurement error that is considered to be normally distributed with zero mean and
covariance matrix
Note: In equation (4.14) we need to know the value of .
Solution: Before we can use the theory we described above we should integrate the stochastic
equation (4.17) over the intervals . We will use the Itô formula with the function
. Then
If we integrate now from to we get
The bottom equation can be written as follows
(4.20)
where
(4.21)
is a white Gaussian sequence with zero mean and the covariance
Note: To integrate the last integral we need to know how W(t) depends on t.
If W(t) is a constant, say then
(4.22)
Therefore now we have a discrete system
(4.23)
The log-likelihood function that is defined by (4.7) can now be written as follows
where by proposition 1
The linear Kalman-Bucy filter equations are written as:
Prediction step
(4.24)
(4.25)
Updating step
(4.26)
(4.27)
(4.28)
Simulation
To illustrate the algorithm we described above we used the simulated data. The observations
were simulated according to (4.23) using the following parameters:
The volume of distribution is taken to be . The elimination rate constant is
simulated from the mixture of two normal distributions that is
,
where (we used ). The initial condition on the signal is where
and measurement error term covariance matrix is taken to be .
Program Run_NPEM_nonlinear_kalman.m overview:
• NPEM 1 with process noise
The program Run_NPEM_nonlinear_kalman.m uses the simulated data described above to find
the maximum likelihood estimate of the distribution of unknown parameters
using Kalman filtering to get and then it uses NPEM_nonlinear.m program from the
section: Nonlinear approach. Two-dimensional case. (Case 4) to find . We also modified
the way to choose the initial values of the parameters for the EM algorithm. We take
to be some random points on the interval (0.1, 0.7) and then the volume of distribution
.
We looked at several different situations in which process noise was either included or excluded in the simulation and in the program, and we considered different values for it. The results are described in the following table.
• Hybrid of the NPEM 1 and Interior Point algorithms (EM1+IP):
EM1+IP algorithm is a part of the Run_NPEM_nonlinear_kalman.m. The user just has to switch
between two optimization methods. This method is a combination of NPEM 1 and Interior Point
algorithms that can be described in the following way:
Step 1: Initialization of the algorithm. Initial values for the parameters and weights are set at this point. Usually this is done in the following way: is a set of at most N random points taken from the interval where the elimination rate is believed to belong. For our example the interval is (0.1, 0.7). Then the volume of distribution . All the constants are defined in this step too.
Step 2: The parameters are updated according to the EM algorithm, the same way as
it was done in the NPEM 1 algorithm.
Step 3: The weights are updated according to the Interior Point method. The way Interior Point
method works is described in section 2.
Step 4: The difference between log-likelihoods is checked. If the difference is small, we have found the maximum likelihood estimate of the true distribution; otherwise we repeat from step 2.
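Schematically, the four steps can be arranged in a loop as in the MATLAB sketch below; em_support_update, ip_weight_update and loglik are hypothetical placeholder functions, not the actual components of Run_NPEM_nonlinear_kalman.m, and the starting volume of 20 and Y (the measurement data) are assumed values used only to make the skeleton concrete.
% Schematic skeleton of the EM1+IP cycle (Steps 1-4). The helper functions and
% the starting values are placeholders, not the actual thesis code.
N     = 20;                                % number of subjects (placeholder)
K     = 0.1 + 0.6*rand(1, N);              % Step 1: random elimination rates in (0.1, 0.7)
V     = 20*ones(1, N);                     %         assumed starting volume of distribution
w     = ones(N, 1)/N;
L_old = -Inf;  tol = 1e-6;
while true
    [K, V] = em_support_update(Y, K, V, w);   % Step 2: EM update of the support points
    w      = ip_weight_update(Y, K, V);       % Step 3: Interior Point update of the weights
    L_new  = loglik(Y, K, V, w);              % Step 4: check the change in log-likelihood
    if abs(L_new - L_old) < tol, break; end
    L_old  = L_new;
end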
Process noise in the data simulation | Process noise assumed in the program | Simulated data | Loglikelihood function (NPEM 1 / EM1+IP) | (NPEM 1 / EM1+IP)
0 0
y = 3.0318 2.4652 1.4451 0.8318 0.5916
3.8141 2.5238 1.9795 1.2461 1.5058
4.1810 3.1880 2.4719 2.1553 1.1058
3.6645 1.9045 2.2693 1.0860 1.3571
2.6750 1.6603 0.4935 0.2337 0.6709
3.4643 2.6637 1.8797 1.4125 0.7581
-44.4991 0.0739
3.9512 2.4454 2.1788 1.6535 1.0127
2.2899 1.1181 0.3411 0.9731 0.4810
2.8839 1.6842 1.7039 1.8491 1.0013
4.8104 3.2157 2.3594 2.0940 1.1465
4.2350 3.3786 3.1700 2.1661 1.9050
3.4235 2.7739 1.9768 1.4916 1.5822
2.9950 2.1396 1.6545 1.3703 0.6233
4.4025 2.7943 1.9540 2.2768 1.7389
3.2249 2.5655 1.9551 1.3237 0.8217
3.6158 3.3738 2.0733 1.7889 1.0597
4.6233 3.2999 1.6755 1.4241 1.2517
4.1627 3.0668 2.4418 2.4528 2.1717
2.2047 1.3471 0.7285 0.2067 0.0544
2.4372 0.9972 0.6792 0.0700 0.5235
-44.4991 0.0739
-46.9579 0.0346 0.1 0
y = 2.9803 2.4120 1.4556 0.7496 0.5644
3.8929 2.4253 1.9122 1.0439 1.4493
4.1691 3.2325 2.5893 2.1454 1.0509
3.5969 1.9100 2.3438 1.1250 1.4337
2.6276 1.6523 0.3679 0.1338 0.5570
3.5057 2.6040 1.8024 1.4661 0.8684
3.9765 2.6862 2.3170 1.6160 0.7539
2.3401 1.1759 0.1561 1.0024 0.4985
2.9567 1.7214 1.8198 1.8021 1.0501
4.8857 3.1586 2.4195 2.2902 1.2349
4.1081 3.3691 3.0937 2.1834 1.9876
3.4272 2.6797 2.0425 1.4497 1.5872
3.0232 2.1404 1.6564 1.2716 0.5444
4.4719 2.7188 2.0330 2.2849 1.6383
3.3788 2.6207 2.0040 1.2972 0.8387
3.5074 3.1263 1.8006 1.6011 0.9607
4.4034 3.1084 1.4809 1.3076 1.2133
4.1283 3.0530 2.3440 2.3777 2.1803
2.3482 1.3492 0.6917 0.0717 -0.0270
2.4349 1.0532 0.8673 0.1959 0.6842
-46.9579 0.0346
-49.1069 0.0511 0.1 0.1
y = 2.9803 2.4120 1.4556 0.7496 0.5644
3.8929 2.4253 1.9122 1.0439 1.4493
4.1691 3.2325 2.5893 2.1454 1.0509
3.5969 1.9100 2.3438 1.1250 1.4337
2.6276 1.6523 0.3679 0.1338 0.5570
3.5057 2.6040 1.8024 1.4661 0.8684
3.9765 2.6862 2.3170 1.6160 0.7539
2.3401 1.1759 0.1561 1.0024 0.4985
2.9567 1.7214 1.8198 1.8021 1.0501
4.8857 3.1586 2.4195 2.2902 1.2349
4.1081 3.3691 3.0937 2.1834 1.9876
3.4272 2.6797 2.0425 1.4497 1.5872
3.0232 2.1404 1.6564 1.2716 0.5444
4.4719 2.7188 2.0330 2.2849 1.6383
3.3788 2.6207 2.0040 1.2972 0.8387
3.5074 3.1263 1.8006 1.6011 0.9607
4.4034 3.1084 1.4809 1.3076 1.2133
4.1283 3.0530 2.3440 2.3777 2.1803
2.3482 1.3492 0.6917 0.0717 -0.0270
2.4349 1.0532 0.8673 0.1959 0.6842
-49.1069 0.0511
-275.4227 -2.8045 1 0
y = 2.5166 1.9335 1.5503 0.0101 0.3196
4.6022 1.5391 1.3068 -0.7755 0.9402
4.0621 3.6332 3.6465 2.0563 0.5568
2.9889 1.9598 3.0141 1.4751 2.1225
2.2010 1.5806 -0.7630 -0.7650 -0.4680
3.8784 2.0659 1.1060 1.9480 1.8614
4.2043 4.8538 3.5603 1.2787 -1.5752
2.7922 1.6964 -1.5086 1.2665 0.6558
3.6124 2.0565 2.8629 1.3789 1.4892
5.5637 2.6447 2.9598 4.0559 2.0308
2.9656 3.2830 2.4071 2.3385 2.7310
3.4602 1.8317 2.6336 1.0724 1.6320
3.2771 2.1478 1.6741 0.3836 -0.1655
5.0964 2.0388 2.7440 2.3577 0.7330
4.7638 3.1178 2.4439 1.0595 0.9918
-274.8439 -2.8578
2.5313 0.8984 -0.6531 -0.0892 0.0700
2.4242 1.3850 -0.2713 0.2593 0.8681
3.8182 2.9295 1.4635 1.7018 2.2573
3.6394 1.3685 0.3606 -1.1433 -0.7595
2.4142 1.5574 2.5602 1.3291 2.1305
-141.6724 -0.4788 1 1
y = 2.5166 1.9335 1.5503 0.0101 0.3196
4.6022 1.5391 1.3068 -0.7755 0.9402
4.0621 3.6332 3.6465 2.0563 0.5568
2.9889 1.9598 3.0141 1.4751 2.1225
2.2010 1.5806 -0.7630 -0.7650 -0.4680
3.8784 2.0659 1.1060 1.9480 1.8614
4.2043 4.8538 3.5603 1.2787 -1.5752
2.7922 1.6964 -1.5086 1.2665 0.6558
3.6124 2.0565 2.8629 1.3789 1.4892
5.5637 2.6447 2.9598 4.0559 2.0308
2.9656 3.2830 2.4071 2.3385 2.7310
3.4602 1.8317 2.6336 1.0724 1.6320
3.2771 2.1478 1.6741 0.3836 -0.1655
5.0964 2.0388 2.7440 2.3577 0.7330
4.7638 3.1178 2.4439 1.0595 0.9918
2.5313 0.8984 -0.6531 -0.0892 0.0700
2.4242 1.3850 -0.2713 0.2593 0.8681
3.8182 2.9295 1.4635 1.7018 2.2573
3.6394 1.3685 0.3606 -1.1433 -0.7595
2.4142 1.5574 2.5602 1.3291 2.1305
-141.6647 -0.5361
Table 5: The table shows the results of the two NPML methods, NPEM 1 and EM1+IP, calculated using different levels of process noise. It also compares the results obtained when process noise was included in the simulation but was or was not accounted for in the model.
[Figure panels, shown for both NPEM 1 and EM1+IP with process noise, cases (a)-(e)]
Fig 1: This figure shows the graphs of the true distribution of the elimination rate (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes), in the case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1
[Figure panels, shown for both NPEM 1 and EM1+IP with process noise, cases (a)-(e)]
Fig 2: This figure shows the graphs of the true distribution of the volume of distribution (solid line), the smoothed MLE of the true distribution (dashed line ‘--’), and the discrete MLE (spikes), in the case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1
Conclusions:
The above results can be classified into three situations:
1. The data was generated with no process noise and the model did not assume any noise
2. The data was generated with process noise (0.1) and
a) the model assumed no noise
b) the model assumed process noise
3. The data was generated with process noise (1) and
a) the model assumed no noise
b) the model assumed process noise
The random variables that represent the PK parameters and are assumed to be independent; therefore the covariance of and is zero. In the above simulations we see that if the data was simulated with process noise, then assuming the process noise in the solution makes the estimated covariance closer to zero than ignoring it.
We can also notice that the more noise we included, the more dispersed and chaotic the marginal distributions of and became when we neglected the process noise in the solution. Therefore the estimates produced when the process noise is included in the solution are definitely better.
Chapter 5. Nonparametric Bayesian (NPB) Estimation
Unlike the NPML approach, in the Nonparametric Bayesian (NPB) approach the distribution is considered to be an unknown random distribution, and therefore we need to place a prior
distribution on . A convenient prior in our case is the Dirichlet process prior. Let us now
define the Dirichlet Distribution and the Dirichlet Process.
Dirichlet distribution
Let be a finite space and let
and (5.1)
Note that is also called the dimensional simplex.
Then a random vector is said to have a Dirichlet distribution with parameter ,
, which is denoted by , if the density function is
(5.2)
for .
Notice that in the case when , the density of the Dirichlet distribution is the same as the Beta density with parameters . In the case when one of the , the Dirichlet distribution is defined by setting the corresponding and interpreting the density as a density on the lower-dimensional set.
Properties of the Dirichlet distribution
The properties below are important for motivating, and in some cases proving, corresponding properties of the more general Dirichlet Process. Proofs are given in [GR03]. In this thesis, the derivations are included for completeness.
1. Let be independent Gamma random variables with parameters where
and , then ~ (5.3)
2. If and are integers such that then
~ (5.4)
3. If then
(5.5)
4. Let where and given , let be - valued i.i.d . Then
the posterior distribution of given is where is the point (delta)
measure such that .
5. Let and let be the st observation from . Then for some ,
the predictive probability of given is
(5.6)
Dirichlet Distribution through the Polya Urn Model
Let be the -algebra of all possible subsets of and let be a non-null measure on
such that for some measurable subset .
Consider a bag with balls of which are of color , (assuming
for all ). Draw balls at random from the urn, replacing each ball by two balls of the
same color. Then let if the th ball is of color . Then
(5.7)
which is the same as property 5 of the Dirichlet distribution. Therefore the Dirichlet distribution follows the Polya Urn Model.
Dirichlet Process
Let be a subset of and let be a -algebra generated by the Borel subsets of . Let
be a set of probability distributions on
Definition: Let be a non-null finite measure on , and
so that . Then the Dirichlet Process is a unique distribution over the space
with parameter , such that the relation
(5.8)
holds for any and any measurable partition of , i.e. for all
, for all and . The probability measure is
our prior guess of and the parameter is the “strength” of the guess. The existence of this
process is shown in Ferguson [F73].
Properties
The only properties of the Dirichlet process that are needed in this thesis are as follows:
(D1)
(D3)
Note: In the above properties, is the point (delta) measure such that
. In Property (D4), the l.h.s. is a probability density, while the r.h.s. is a probability distribution (or measure). This “abuse of notation” is common in papers in this research area. Property (D4) should be written as:
By context it will always be clear, which is density and which is measure.
Properties (D1)-(D3) are proved in [F73].
Proof (D4):
We will show the proof for the discrete space, and the general result follows from there.
Let then the predictive probability of given is
since given , is independent of . Then
An important property of DP also follows from (D2) and (D4):
(5.9)
Sketch of the proof of Eq. (5.9):
This can be shown as follows, let then
Example 1:
Consider the simple model:
so that we observe directly the PK parameters . Now rewrite eq. (D2) as follows:
which is also the
nonparametric maximum likelihood estimate .
So (D2) can be written as:
i.e. a mixture of the prior estimate and the NPML estimate.
It follows that as we have , i.e. the expected posterior of
tends to the maximum likelihood estimate of (data eventually overcomes the prior).
Example 2: Let be a deterministic discrete distribution such that
Then the random distribution:
(5.10)
belongs to .
Note: is a deterministic distribution while is a random distribution because the are r.v.
The proof of this example follows from the properties of the Dirichlet distribution.
For some applications can be very large, for example when is used as non-informative
prior.
Property (D4) and Eq. (5.9) are very important in calculations with Dirichlet processes. They give a way of sampling from the posterior distribution of given as follows:
(5.11)
w.p. means “with probability”
This last representation implies the “replication” property of the . It can be shown that the
number of distinct in is of order log(n), see [A73]. This ultimately
limits the computational complexity of the problem.
Equivalence Theorem: Not only does the Dirichlet process imply (D4) and so satisfies a Polya
Urn Model, but if any random sequence { } satisfies the Polya Urn Model, then with
for some random distribution , see [BM73].
Chinese Restaurant Process: Another interpretation of (D4) is given by the Chinese Restaurant
Process [A85, P96], describing a Chinese restaurant with an infinite number of infinitely large
round tables. The first customer will be seated at an empty table with probability 1. The next
customer has two choices: to join the first customer or to pick another empty table. When the (n + 1)st customer arrives, the customer chooses the first unoccupied table with probability or an occupied ith table with probability , where is the number of people sitting at that table and is a scalar parameter of the process.
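The seating rule can be simulated directly; the short MATLAB sketch below is only an illustration of the process, with an arbitrary strength parameter and number of customers.
% Sketch: simulate table assignments from the Chinese Restaurant Process.
alpha = 1;  n = 100;
table  = zeros(1, n);                     % table index chosen by each customer
counts = 1;  table(1) = 1;                % the first customer starts the first table
for c = 2:n
    p_new = alpha/(alpha + c - 1);        % probability of opening a new table
    p_occ = counts/(alpha + c - 1);       % probability of joining each occupied table
    u = rand;
    if u < p_new
        counts(end+1) = 1;  table(c) = numel(counts);
    else
        i = find(u - p_new < cumsum(p_occ), 1);
        counts(i) = counts(i) + 1;  table(c) = i;
    end
end
% The number of occupied tables grows slowly (like log(n)), which mirrors the
% replication property mentioned above.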
For convenience, in the rest of this section, and throughout Sections 5, 6 and 7, we then assume
“strength” parameter .
Bayesian Version of Statistical Problem: Eqs. (2.1)-(2.2) now become:
(5.12a)
(5.12b)
(5.12c)
The main property of the Dirichlet process that is needed for PK dosage regimen design is:
(5.13)
Consider the representation (5.13). Suppose we could sample from the posterior probability
. Call these samples , . Then it follows that the
posterior expectation of is approximately:
(5.14)
This is the main formula needed for PK control applications
For this thesis, to obtain samples from we will use Gibbs Sampler (GS) discussed in
the next section.
Chapter 6. Theoretical Foundations of the Gibbs Sampler
Again in this section, we take the “strength” parameter .
Gibbs Sampler [CS92]:
To obtain a sample from the joint distribution , proceed as follows:
If th sample is then
(6.1)
The Gibbs sampler generates a dependent Markov chain. It is shown in [CS92] that under
reasonable assumptions, the limiting distribution of the Gibbs samples is .
Gibbs Sampler (GS) for the model (5.12a)-(5.12c)
GS requires sampling from the full conditional distributions:
By Bayes’ law: . Now is calculated from Eq. (5.12a)
and is calculated from property (D4) by rearranging the order of .
Gibbs Sampler
Let be any point in .
For sample , , from
(6.2)
where now
(In other words, during a GS cycle, the most current updates are used)
Eq. (6.2) can be written as follows
(6.3)
where
(6.4)
Note that is the posterior density of and assumed prior .
Sampling from Eq. (6.3) is performed by:
(6.5)
Note: GS generates a dependent Markov chain whose limit is
. Implementing GS becomes an art. One usually throws away the first segment of the chain (“burn-in”) and then “thins” out the rest of the chain.
Convergence tests must then be used. One of the simplest and most effective tests for
convergence is the method of Brooks and Gelman [BG28]. In this method at least two chains are
run independently and statistical tests are made to see if the chains converge to the same
quantities. Actually visual evidence is usually sufficient.
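For completeness, a minimal MATLAB sketch of the usual univariate potential scale reduction diagnostic for several parallel chains is shown below; it is only a sketch of the idea behind the Brooks and Gelman check, not their full multivariate method.
% Sketch: Gelman-Rubin potential scale reduction factor for a scalar quantity,
% given chains as an n-by-m matrix (n iterations, m independent chains).
function Rhat = psrf(chains)
    [n, m]  = size(chains);
    means   = mean(chains, 1);                     % per-chain means
    B       = n * var(means);                      % between-chain variance
    W       = mean(var(chains, 0, 1));             % within-chain variance
    varplus = (n-1)/n * W + B/n;                   % pooled variance estimate
    Rhat    = sqrt(varplus / W);                   % values near 1 indicate convergence
end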
Chapter 7. Computer Simulations for Gibbs Sampler
Linear model
(7.1)
(7.2)
(7.3)
(7.4)
We will need the following formula for the product of two normal densities.
Let be the normal density with mean vector and covariance matrix , evaluated at the
point :
where
Matrix Identity [LS72]
(7.5)
For scalar case:
(7.6)
1. Galaxy Data
To illustrate the Gibbs sampler (and NPML) and the replication property of DP, we describe
results for the “Galaxy Data” problem. This problem is a benchmark for practically all density
estimation programs, see [EW95]. The data consists of 82 galaxy velocities ranging from 9 units
to 35 units.
The histogram of the data is given below.
Frequency Histogram of Galaxy Data (velocities on x – axis)
The model for the Galaxy data is a scalar linear model. In terms of our linear model, Eqs. (7.1)-
(7.4), we have:
(7.7)
Back to Eq. (6.4)
(7.8)
To summarize
(7.9)
Results:
Below is the smoothed graph of that was obtained using the Gibbs sampler, which generated samples for the following values of the parameters: = 20.8315
(sample mean of the data), = 4.5681 (sample standard deviation of the data). To produce this
graph we used the following argument.
Write . Then from Eq. (4.14)
We can approximate as a normal distribution with mean and very small standard deviation . (We used = 0.1). Write . Then
(7.10)
would be a sum of normals. Then using Matlab we graph the density of as a function of
.
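Concretely, the smoothing can be sketched in MATLAB as below; the vector samples stands for the pooled Gibbs draws, and the placeholder draws and grid limits are chosen only for illustration.
% Sketch: smooth the Gibbs draws into a density by replacing each point mass
% with a normal of small standard deviation (0.1, as in the text above).
samples = 20.8315 + 4.5681*randn(1000,1);       % placeholder standing in for the Gibbs draws
sd   = 0.1;
x    = linspace(5, 40, 500);                    % velocity grid (limits are illustrative)
dens = zeros(size(x));
for m = 1:numel(samples)
    dens = dens + exp(-(x - samples(m)).^2/(2*sd^2))/(sd*sqrt(2*pi));
end
dens = dens/numel(samples);
plot(x, dens);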
Fig. 1: The graph of
Galaxy Data
For comparison we also display the graph
Fig. 2: The graph of (6 support points). Galaxy Data
2. Linearized PK model
Let us consider the PK model of Eq. (3.15). As before
(7.11)
where and ’s are considered to be known and is an unknown parameter. For
simplicity, we consider the “linearized” version of the model, which can be derived as follows.
Take log on both sides of the Eq. (7.11), then
Notice that for small enough values of , therefore we have
and since and are known constants we can write
where (7.12)
If we use the notation as in (5.12a)-(5.12c) for , then
(7.13)
(7.14)
, (7.15)
where , , and are known
Then using Eq.(6.4) and (7.5) we have
(7.16)
Results:
Below is the histogram of the Gibbs sample that was obtained using the above algorithm, which generated samples for the following values of the parameters: standard
error term, , , = 0.3
(sample mean of the simulated values for the elimination rate constant ), =
0.0190 (sample variance of the simulated values of the elimination rate constant ). In a real
data problem (not simulated) we can find
and by first running NPML and then using sample
mean and covariance of .
Fig 3: The graph of the simulated data for the above values of parameters. PK example
Fig. 4: Frequency Histogram of Gibbs sample for 20 subjects (elimination rate constants on x –
axis). PK example
For comparison we also display the smoothed graph of and the graph of the true distribution of (we know it exactly since we simulated the data for this example, in particular and ). The value of the smoothing standard deviation is 0.01.
Fig. 5: The graph of for (5 support points). PK example
Fig. 6: The graph of true distribution of . PK example
If we compare Fig. 4 with the true distribution of in Fig. 6 we can definitely see that the NPB estimate shows two bumps. They are even more obvious if instead of 20 subjects we use 100 subjects. We present the histogram of the Gibbs sample for 100 subjects as before.
Fig. 7: Frequency Histogram of Gibbs sample for 100 subjects (elimination rate constants on x-
axis). PK example
Chapter 8. Stick-Breaking Approach in NPB Estimation
Consider the following statistical model:
, (8.1)
(8.2)
where
is a vector of independent but not necessarily identically distributed measurements with
known density
is a vector of unknown parameters defined on a space in finite dimensional
Euclidean space ( )
is an unknown probability distribution of
In general will mean that or has distribution .
Assume that given are independent and identically distributed random vectors with
common (but unknown) probability distribution on .
Our main problem is to estimate based on the data
We will use the Nonparametric Bayesian (NPB) approach to estimate . In the NPB approach,
the distribution is itself a random variable. The first requirement is to place a prior
distribution on . A convenient way of doing this is with the so-called Dirichlet process (DP)
prior, see [F73, WW97, WW98]. A Dirichlet process prior on a random distribution is
defined by a probability distribution and a scale parameter , where is our prior guess
of and the parameter will be the strength of the guess. We write for the Dirichlet process.
Bayesian Version of Statistical Problem: Eqs. (8.1)-(8.2) now become:
(8.3a)
(8.3b)
(8.3c)
Now the population analysis problem is to estimate the full conditional distribution of given
the data .
Most methods to solve this problem employ a marginal approach: it turns out that can be
“integrated out”, leaving a much simpler problem for the . The resulting model for the is given by the Polya Urn representation:
The marginal approach leads to a Gibbs sampler algorithm, described in section 5, for estimating
i.e., the expected value of given the data – not . This approach is used
by practically all researchers, and is sufficient for the problems they consider. However, it does
not solve the classical population analysis problem as stated above, i.e. to estimate the full
conditional distribution of
.
To solve this problem we propose the Full Conditional Method.
In this method we estimate the full conditional distribution of given the data , not just its expected value.
The Full Conditional method begins with the definition of the Dirichlet Process called the Stick-
Breaking representation, see [S94] and Ishwaran and James [IJ01]. Consider the random
distribution:
(8.4)
where the random vectors are independent and identically distributed (i.i.d) from the
known distribution and the weights are defined from the so-called stick-breaking
process as follows:
where
is the Beta distribution with parameters . The name “Stick Breaking”
comes from the fact that the are random cuts on a stick of length 1 and the are the lengths
of the pieces broken off. This gives an informal proof that the sum to 1. It is shown in [S94]
that the random distribution if and only if can be written in the form of Eq.(8.4).
Intuitively “Stick Breaking” construction works as follows. Suppose is represented as in
Eq.(8.4). Suppose there is a stick with length 1. Let , and regard
them as fractions for how much we take away from the remainder of the stick every time. Then
can be calculated by the length we take away each time
Fig 1: Illustration of Stick Breaking (taken from Zoubin Ghahramani’s Tutorial in 2005)
By this construction, we will have almost surely. In this way, we get an explicit
construction of
(8.5)
Below, we show how to use this Stick Breaking representation to estimate , not just
. The corresponding estimate of leads to the estimates of all moments and
their corresponding credibility intervals.
Truncated Stick-Breaking: Ishwaran and James [IJ01] consider replacing the infinite sum in
Eq. (8.4) by a truncated version:
F_K = Σ_{k=1}^K w_k δ_{φ_k},                              (8.6)
where it is now required that V_K = 1, so that the weights sum to one. They show that the
truncated version F_K is virtually indistinguishable from F for sufficiently large K. The only
problem now is how large K has to be. Ishwaran and James [IJ01] have suggested a value of K
that is sufficient. We will show that this number can be dramatically reduced.
Full Conditional Approach: Let us assume now that we have a sufficient number of
components K in Eq. (8.6) to accurately approximate the infinite sum in Eq. (8.5). Then, using the
truncated version of the Stick-Breaking definition of the Dirichlet process, we can rewrite our
model in the following way:
F_K = Σ_{k=1}^K w_k δ_{φ_k},  φ_k ~ G_0 i.i.d.,  w = (w_1, ..., w_K) from the truncated stick-breaking process,
θ_i | F_K ~ F_K i.i.d.,  Y_i | θ_i ~ p_i(Y_i | θ_i),  i = 1, ..., N.      (8.7)
Eq. (8.7) defines a mixture model with an unknown but finite number of components. Much is
known about this subject, see [T06], [TBS08], and [FS10]. For a fixed number of components
K, the posterior distribution of the weights w_k and the support points φ_k can be determined
by Blocked Gibbs Sampling [IJ01, Sec. 5.2].
Blocked Gibbs Sampling:
The model in (8.7) can be written in the following way:
Y_i | S_i, φ ~ p_i(Y_i | φ_{S_i}),   S_i | w ~ Multinomial(w_1, ..., w_K),
φ_k ~ G_0 i.i.d.,   w = (w_1, ..., w_K) from the truncated stick-breaking process.      (8.8)
Since θ_i takes one of the values φ_1, ..., φ_K, we let θ_i = φ_{S_i}, where S_i = k with probability w_k.
Note: The set (φ_{S_1}, ..., φ_{S_N}) may contain values that are not distinct. For example, let N = 7 and
S = (1,2,4,3,2,3,4). This set of indices corresponds to the set (φ_1, φ_2, φ_4, φ_3, φ_2, φ_3, φ_4)
of values of the θ_i's. Then let S* be the set of distinct values of
S; in the case of the above example S* = (1,2,3,4). Note also that the set of distinct
values of S is not necessarily the same as (1, ..., K).
The blocked Gibbs sampler allows sampling from the full posterior distribution of
(φ, S, w) by iteratively sampling from the conditional distributions of φ given (S, Y), of S given (φ, w, Y), and of w given S.
Then each sample of (φ, S, w) will define a random probability measure
F_K = Σ_{k=1}^K w_k δ_{φ_k},
which eventually gives a sample from the posterior distribution of F.
Note: The posterior distribution of F_K is not the same as the posterior distribution of F; however, as K
goes to infinity, a sample of F_K converges to a sample from the posterior distribution of F, see [IJ01].
Blocked Gibbs Sampling requires sampling from the full conditional distributions just like
regular Gibbs Sampling. Let us obtain them.
The easiest way to calculate these densities is to consider the joint density p(Y, φ, S, w).
Use Bayes' law repeatedly:
p(Y, φ, S, w) = p(Y | φ, S) p(φ) p(S | w) p(w).
Here the prior p(w) induced by the truncated stick-breaking weights is GD(a, b),
where GD(a, b) is a generalized Dirichlet distribution that can be defined as
follows.
Definition: The generalized Dirichlet distribution (GD) is a generalization of the Dirichlet
distribution with a more general covariance structure and almost twice the number of parameters.
We say that the vector w = (w_1, ..., w_K) has a generalized Dirichlet distribution GD(a, b) with parameters
a = (a_1, ..., a_{K−1}) and b = (b_1, ..., b_{K−1}) if the density function of (w_1, ..., w_{K−1}) can be written as:
f(w_1, ..., w_{K−1}) = ∏_{k=1}^{K−1} [Γ(a_k + b_k) / (Γ(a_k) Γ(b_k))] w_k^{a_k − 1} (1 − w_1 − ... − w_k)^{γ_k},
where γ_k = b_k − a_{k+1} − b_{k+1} for k = 1, ..., K−2, γ_{K−1} = b_{K−1} − 1, and Γ is the gamma function.
By the independence assumptions in (8.8),
p(Y, φ, S, w) = [∏_{i=1}^N p_i(Y_i | φ_{S_i}) w_{S_i}] [∏_{k=1}^K g_0(φ_k)] p(w).
Now let x be any one of the blocks of variables (φ, S, w) and let x_− denote the remaining
variables. Then p(x | x_−, Y_1, ..., Y_N) ∝ p(Y, φ, S, w). Now we can write all the formulas we need.
1. Sample φ_k, k = 1, ..., K, the following way:
p(φ_k | S, Y) ∝ g_0(φ_k) ∏_{i: S_i = k} p_i(Y_i | φ_k).        (8.10)
Therefore φ_k is either sampled from G_0 if no observation is assigned to component k (i.e. M_k = 0), or otherwise from
the density proportional to g_0(φ_k) ∏_{i: S_i = k} p_i(Y_i | φ_k).
2. Sample S_i, i = 1, ..., N, from:
P(S_i = k | w, φ, Y) ∝ w_k p_i(Y_i | φ_k),  k = 1, ..., K.
Therefore
S_i | w, φ, Y ~ Multinomial(w*_{i1}, ..., w*_{iK}),  w*_{ik} = w_k p_i(Y_i | φ_k) / Σ_{l=1}^K w_l p_i(Y_i | φ_l).      (8.11)
3. Sample w = (w_1, ..., w_K) from:
p(w | S) = GD(a, b),
which is a generalized Dirichlet distribution
where a_k = 1 + M_k, b_k = α_0 + Σ_{r=k+1}^K M_r, and M_k = #{i : S_i = k}.
The way you sample from this distribution is (see the sketch below):
V_k ~ Beta(1 + M_k, α_0 + Σ_{r=k+1}^K M_r), k = 1, ..., K−1,  V_K = 1,
w_1 = V_1,  w_k = V_k ∏_{j<k} (1 − V_j), k = 2, ..., K.        (8.12)
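For concreteness, here is a minimal Matlab sketch of Step 3, i.e. drawing the weights from the generalized Dirichlet full conditional via the stick-breaking form in (8.12); the function name sample_weights is illustrative, and M denotes the vector of counts M_k = #{i : S_i = k}.

% Minimal sketch of (8.12): sample w | S from the generalized Dirichlet
% V_k ~ Beta(1 + M_k, alpha0 + M_{k+1} + ... + M_K), V_K = 1,
% w_1 = V_1, w_k = V_k * prod_{j<k}(1 - V_j)
function w = sample_weights(M, alpha0)       % M = counts M_k, k = 1..K
    K = length(M);
    V = zeros(1,K);
    for k = 1:K-1
        V(k) = betarnd(1 + M(k), alpha0 + sum(M(k+1:K)));
    end
    V(K) = 1;                                % forces the weights to sum to one
    w = zeros(1,K);
    remaining = 1;
    for k = 1:K
        w(k) = V(k) * remaining;
        remaining = remaining * (1 - V(k));
    end
end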
Consequently, for fixed K, we will get samples (φ^(1), w^(1)), ..., (φ^(T), w^(T)) of the support points
and the weights by following these
steps:
1. Initiate:
Let w^(0) = (1/K, ..., 1/K) and generate S_i^(0) ~ Multinomial(w^(0)), i = 1, ..., N.
2. Draw the t-th sample, t = 1, 2, ..., as follows:
Step 1: For k = 1, ..., K, sample φ_k^(t) from (8.10), using the distinct values appearing in S^(t−1)
(components with no assigned observations are drawn from G_0, the others from their conditional posteriors).
Step 2: For i = 1, ..., N, sample S_i^(t) from (8.11).
Step 3: Sample w^(t) from (8.12).
Note: As opposed to the Gibbs Sampler for the Marginal Method, here the Gibbs Sampler directly
involves the distribution F_K.
Sampling the posterior of F, and the moments of F.
Let F_K^(1), ..., F_K^(T) be samples of F_K from the Gibbs Sampler after the sampler
has "converged". Then we have:
Samples from the posterior of F given the data: F_K^(1), ..., F_K^(T).
Samples from the moments of F: Let m_j(F) = ∫ θ^j dF(θ) be the j-th
moment of F.
Then samples of m_j(F) are given by:
m_j(F_K^(t)) = Σ_{k=1}^K w_k^(t) (φ_k^(t))^j,  t = 1, ..., T.
In particular, samples from the first moment or mean are given by
m_1(F_K^(t)) = Σ_{k=1}^K w_k^(t) φ_k^(t).
A histogram of the values of m_j(F_K^(t)) gives the estimated distribution of m_j(F). Bayesian
credibility intervals are derived from this.
Samples from the expected value E[F | Y_1, ..., Y_N]: For this quantity we calculate (1/T) Σ_{t=1}^T F_K^(t).
Note: To check our algorithm, we can compare our estimate of E[F | Y_1, ..., Y_N] with the estimate
from the marginalization method.
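As an illustration, here is a minimal Matlab sketch (assuming w and phi are the (T x K) arrays of retained weights and support points from the blocked Gibbs sampler, as produced by the program in Appendix 3) of how posterior samples of the mean and standard deviation of F, and intervals for them, can be obtained; the empirical-quantile intervals shown here are one possible choice (the intervals in Table 6 below were obtained by the t-based procedure described later).

% Minimal sketch: posterior samples of moments of F from blocked GS output
% w, phi are (T x K): one retained Gibbs sample of (w, phi) per row
m1 = sum(w .* phi, 2);                   % samples of the mean of F
m2 = sum(w .* phi.^2, 2);                % samples of the second moment of F
sd = sqrt(m2 - m1.^2);                   % samples of the std of F
% 95% credibility intervals from the empirical quantiles of the samples
ci_mean = quantile(m1, [0.025 0.975]);
ci_sd   = quantile(sd, [0.025 0.975]);
hist(m1)                                 % histogram of posterior samples of the mean of F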
Galaxy Data
To illustrate the Gibbs sampler (and NPML) and the replication property of the DP, we describe
results for the "Galaxy Data" problem. This problem is a benchmark for practically all density
estimation programs, see [EW95]. The data consist of 82 galaxy velocities ranging from about 9
to 35 (in units of 1,000 km/s).
The histogram of the data is given below.
Frequency Histogram of Galaxy Data (velocities on x – axis)
The model for the Galaxy data is a scalar linear model. The equations (8.8) can now be written
as:
Y_i | θ_i ~ N(θ_i, σ²),  θ_i = φ_{S_i},  S_i | w ~ Multinomial(w_1, ..., w_K),
φ_k ~ G_0 = N(μ_{G_0}, σ_{G_0}²) i.i.d.,  w from the truncated stick-breaking process with parameter α_0.    (8.13)
Blocked GS for the Galaxy data: fix some K and repeat the following steps L times; we also discarded
a "burn-in" of the first B samples.
1. Initiate: Let w^(0) = (1/K, ..., 1/K) and generate S_i^(0) ~ Multinomial(w^(0)), i = 1, ..., N.
2. Draw the t-th sample as follows:
Step 1: For each component k with no assigned observations (S_i ≠ k for all i), sample
φ_k^(t) ~ N(μ_{G_0}, σ_{G_0}²),
where μ_{G_0} = sample mean of the data, σ_{G_0}² = sample variance of the data;
for each component k with assigned observations, sample
φ_k^(t) ~ N(μ_k*, σ_k*²),
where μ_k* = (σ_{G_0}² Σ_{i: S_i = k} Y_i + σ² μ_{G_0}) / (M_k σ_{G_0}² + σ²),  σ_k*² = σ_{G_0}² σ² / (M_k σ_{G_0}² + σ²),
and M_k = #{i : S_i = k}. (A small sketch of this step is given after these steps.)
Step 2: Sample S_i^(t), i = 1, ..., N, from (8.11) with p_i(Y_i | φ_k) = N(Y_i; φ_k^(t), σ²).
Step 3: Sample w^(t) from (8.12).
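A minimal Matlab sketch of Step 1 for the Galaxy model, mirroring the function getfi_normal in Appendix 3; here mu0, sig0sq and sigsq denote the base-measure mean, base-measure variance and observation variance, and the function name sample_phi is illustrative.

% Minimal sketch of Step 1 (Galaxy data): sample phi_k | S, Y
% G0 = N(mu0, sig0sq); Y_i | theta_i ~ N(theta_i, sigsq)
function phi = sample_phi(y, S, K, mu0, sig0sq, sigsq)
    phi = zeros(1,K);
    for k = 1:K
        idx = (S == k);                      % observations assigned to component k
        Mk  = sum(idx);
        if Mk == 0
            phi(k) = mu0 + sqrt(sig0sq)*randn;            % draw from the prior G0
        else
            postvar  = sig0sq*sigsq/(Mk*sig0sq + sigsq);  % conjugate posterior variance
            postmean = (sig0sq*sum(y(idx)) + sigsq*mu0)/(Mk*sig0sq + sigsq);
            phi(k)   = postmean + sqrt(postvar)*randn;
        end
    end
end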
Results: Below is the smoothed graph of E[F | Y_1, ..., Y_N] that was obtained using the
Blocked Gibbs sampler. The sampler generated L = 10,000 samples, with the first B = 1,000 samples burned (see Appendix 3), for the following
values of the parameters: α_0 = 1, K = 20, μ_{G_0} = 20.8315 (sample mean of the data),
and σ_{G_0}² equal to the sample variance of the data. To produce this graph we used the following argument.
Write F_K^(t) = Σ_{k=1}^K w_k^(t) δ_{φ_k^(t)}.
We could approximate each δ_φ as a normal distribution with mean φ and very small std σ_δ.
(We used σ_δ = 0.1). Write N(θ; φ_k^(t), σ_δ²) for this normal density. Then
E[F | Y_1, ..., Y_N](θ) ≈ (1/T) Σ_{t=1}^T Σ_{k=1}^K w_k^(t) N(θ; φ_k^(t), σ_δ²),       (8.14)
where T is the number of retained samples, would be a sum of normals. Then, using Matlab (see program in the Appendix), we graph this
density as a function of θ.
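A minimal Matlab sketch of the smoothing in (8.14), assuming w and phi are the retained (T x K) Gibbs samples and sd_delta is the small smoothing standard deviation (we used 0.1); the grid limits are illustrative.

% Minimal sketch of (8.14): smooth E[F | data] by replacing each delta
% function with a narrow normal of std sd_delta
sd_delta = 0.1;
grid = linspace(5, 40, 500);              % velocity grid for the Galaxy data
dens = zeros(size(grid));
[T, K] = size(phi);
for t = 1:T
    for k = 1:K
        dens = dens + w(t,k) * normpdf(grid, phi(t,k), sd_delta);
    end
end
dens = dens / T;                          % average over retained Gibbs samples
plot(grid, dens)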
The following table shows the comparison of three different methods for estimating the unknown
probability distribution F.

Quantity of interest    NPML       Marginal GS    Blocked GS
E[θ]                    20.8315    20.8437        20.7874   CI [20.7774, 20.7975]
Std[θ]                  4.4536     4.3983         4.2160    CI [4.2057, 4.2262]

Table 6: This table shows the comparison of three different methods for estimating the unknown
probability distribution F for the Galaxy Data example.
To obtain the expected value of θ in the case of the NPML estimate we used the formula
E[θ] = Σ_l w_l θ_l, where (θ_l, w_l) are the NPML support points and weights; in the case of the marginal GS we used the mean of the
estimated posterior expectation E[F | Y_1, ..., Y_N] (Appendix 2); and in the case of the blocked GS
estimate we calculated the expected value of the mean distribution (1/J) Σ_{j=1}^J F_K^(j),
which is (1/J) Σ_{j=1}^J Σ_{k=1}^K w_k^(j) φ_k^(j).
To obtain the standard deviation of θ we used the following formulas. In the case of the NPML
estimate, Std[θ] = (Σ_l w_l θ_l² − (Σ_l w_l θ_l)²)^{1/2}; for the marginal GS we used
the standard deviation of the same estimated mixture.
To calculate the standard deviation of θ in the case of the blocked GS estimate we used
Std[θ] = ((1/J) Σ_{j=1}^J (s_j² + m_j²) − ((1/J) Σ_{j=1}^J m_j)²)^{1/2},
where m_j = Σ_k w_k^(j) φ_k^(j) and s_j² = Σ_k w_k^(j) (φ_k^(j) − m_j)².
Notice that only in the case of the blocked GS are we able to calculate confidence intervals for both
quantities of interest, E[θ] and Std[θ]. We did it the following way. First we ran the blocked GS
algorithm J = 10,000 times; for each of those runs we used 10,000 iterations and considered
only the last sample of the w's and φ's, which forms a distribution F_K^(j).
For each of those distributions we calculated the expected value of θ, call it m_j, and the
standard deviation of θ, call it s_j. Therefore we now have a sample of J
expectations and a sample of J standard deviations. If we assume that these samples are normally distributed,
then the confidence intervals of the true means can be calculated as follows. The confidence
interval of the true mean of the expectations is (1/J) Σ_j m_j ± t_{0.975, J−1} · s_m / √J, where
s_m is the sample standard deviation of the m_j and t_{0.975, J−1} is the 97.5th
percentile of the t-distribution with J−1
degrees of freedom. The confidence interval of the true mean of the sample of standard deviations
is calculated the same way, just replacing the m_j with the s_j.
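A minimal Matlab sketch of this confidence-interval calculation, assuming m and s are the length-J vectors of per-run expected values and standard deviations described above:

% Minimal sketch: t-based confidence intervals for the true mean of the
% per-run expectations m(1..J) and per-run standard deviations s(1..J)
J  = length(m);
tq = tinv(0.975, J-1);                        % 97.5th percentile, J-1 degrees of freedom
ci_mean = [mean(m) - tq*std(m)/sqrt(J), mean(m) + tq*std(m)/sqrt(J)];
ci_std  = [mean(s) - tq*std(s)/sqrt(J), mean(s) + tq*std(s)/sqrt(J)];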
Chapter 9. Maximum Likelihood of Stick-Breaking (Connection between
NPML and NPB)
Let F_K^(t) = Σ_{k=1}^K w_k^(t) δ_{φ_k^(t)} be the distribution at the t-th iteration of the blocked Gibbs Sampler (GS)
algorithm.
1) The likelihood of F_K^(t) is
L(F_K^(t)) = p(Y_1, ..., Y_N | F_K^(t)) = ∏_{i=1}^N p(Y_i | F_K^(t)) = ∏_{i=1}^N ∫ p(Y_i | θ) dF_K^(t)(θ) = ∏_{i=1}^N Σ_{k=1}^K w_k^(t) p(Y_i | φ_k^(t)),
where the last equality is true since F_K^(t) is a finite sum of weights and delta functions.
Then the log-likelihood can be written as follows:
l(F_K^(t)) = log(L(F_K^(t))) = log(∏_{i=1}^N [Σ_{k=1}^K w_k^(t) p(Y_i | φ_k^(t))])
= Σ_{i=1}^N log[Σ_{k=1}^K w_k^(t) p(Y_i | φ_k^(t))].        (9.1)
(A short Matlab sketch of this computation is given after (9.2) below.)
2) If we let l_max(F_K) = max_t l(F_K^(t)), then we can compare l_max(F_K) to the value of the log-
likelihood function of F_ML, the maximum likelihood estimate of the unknown distribution
F of θ that we found using the nonparametric maximum likelihood (NPML) approach.
Since F_ML maximizes the likelihood function, the following inequality is always true:
l_max(F_K) ≤ l(F_ML).        (9.2)
Note: The equality is possible with a big enough number of iterations of the blocked GS.
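A minimal Matlab sketch of (9.1) for the Galaxy model (scalar normal likelihood with standard deviation sigma), assuming w and phi hold one retained Gibbs sample of the weights and support points; the function name loglik_FK is illustrative.

% Minimal sketch of (9.1): log-likelihood of a discrete F_K = sum_k w_k delta_{phi_k}
% for the Galaxy model, where p(Y_i | phi_k) = N(Y_i; phi_k, sigma^2)
function ll = loglik_FK(y, w, phi, sigma)
    N  = length(y);
    ll = 0;
    for i = 1:N
        ll = ll + log(sum(w .* normpdf(y(i), phi, sigma)));
    end
end
% Evaluating loglik_FK over all retained Gibbs samples and taking the maximum
% gives l_max(F_K), which can then be compared with l(F_ML) as in (9.2).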
We decided to check if this inequality is satisfied for the results of the full nonparametric
Bayesian (Full NPB) approach. First we present the results for the Galaxy data example.
Galaxy data
Model:
Y_i | θ_i ~ N(θ_i, σ²),  θ_i ~ F_K,
with F_K from the truncated stick-breaking representation and base distribution G_0 = N(μ_{G_0}, σ_{G_0}²).
We ran the blocked GS for 10,000 iterations with the parameter values of Appendix 3 (α_0 = 1, K = 20),
the sample mean μ_{G_0} and the sample variance σ_{G_0}² of the data. The maximum value of the
log-likelihood l_max(F_K) that we obtained and the value of the log-likelihood of F_ML
that we found before satisfy the inequality (9.2) in the case of the Galaxy data example; moreover, the two values are very close.
In the two figures below we present the graphs of F_ML and F_K.
Fig 1: The graph of F_ML; there are 6 support points. Galaxy data.
Fig 2: The graph of F_K; there are 20 support points in total, but only 7 of them have significant
weight. Galaxy data.
The distributions are very similar. There are two reasons why F_K has 7 significant support points and
F_ML has 6 support points. First, we need to have more blocked GS iterations. Second, the
solution F_ML of the NPML problem is not unique, but the value of the log-likelihood function
l(F_ML) is. Therefore, even with a large enough number of blocked GS iterations, the
distributions can still look somewhat different.
Bibliography
[A85] Aldous D.J. Exchangeability and related topics. Volume 1117 of In Ecole d’ Ete de
Probabilites de Saint-Flour XII (Edited by P. L. Hennequin) Springer Lecture Notes in
Mathematics. 1985.
[B06] Baek, Y. (2006). An Interior Point Approach to Constrained Nonparametric Mixture
Models. PhD Dissertation, Mathematics Department, University of Washington.
[B12] Bell, Bradley (2012) http://moby.ihme.washington.edu/bradbell/non_par/non_par.xml
[BG98] Brooks S and Gelman A. General methods for monitoring convergence of iterative
simulations. Journal of Computational and Graphical Statistics 7: 434–455, 1998.
[BM73] Blackwell D and MacQueen JB. Ferguson Distributions Via Polya Urn
Schemes. The Annals of Statistics, Vol. 1, No. 2 (1973), pp. 353-355.
[BV04] Boyd, Stephen; Vandenberghe, Lieven (2004). Convex Optimization. Cambridge:
Cambridge University Press.
[CG92] Casella G and George EI. Explaining the Gibbs Sampler. The American
Statistician, 46 (1992), pp. 167-174.
[DK84] Katz, D., and D.Z. D'Argenio. Discrete approximation of multivariate densities with
application to Bayesian estimation. Computational Stat. Data Anal. 2:27-36, 1984.
[DL11] Delattre, M.; Lavielle, M. PAGE 20 (2011) Abstr 2054 [www.page-
meeting.org/?abstract=2054]
[E94] Escobar MD. Estimating Normal Means with a Dirichlet Process Prior. J. of the American
Statistical Association. Vol. 89: 268-277, 1994
[EW95] Escobar MD, West M. Bayesian Density Estimation and Inference Using Mixtures.
Journal of the American Statistical Association 1995 Jun; 90(430):577-588.
[F73] Ferguson T. A Bayesian Analysis of Some Nonparametric Problems. Ann.
Statist. Vol. 1, (1973), 209-230.
[G89] Geweke J. Bayesian Inference in Econometric Models Using Monte Carlo Integration
Econometrica. 57: 1317-1339, 1989.
[GR03] Ghosh JK and Ramamoorthi RV. Bayesian Nonparametrics. Springer-Verlag, New York, 2003.
[HO05] Hjort NL and Ongaro A. Exact Inference for Random Dirichlet Means. Statistical
Inference for Stochastic Processes, 8:227–254, 2005.
[IJ01] Ishwaran H and James LF. Gibbs Sampling Methods for Stick-Breaking Priors. J. of the
American Statistical Association, 96: 161-173, 2001.
[IT02] Ishwaran H, Takahara G. Independent and Identically Distributed Monte Carlo
Algorithms for Semi-parametric Linear Mixed Models, Journal of the American Statistical
Association, Vol. 97, No. 460 Dec., 2002, pp. 1154- 1166
[J70] Jazwinski, Andrew H. (1970), Stochastic processes and filtering theory, Academic Press,
New York.
[KLW94] Kong A, Liu JS, Wong WH. Sequential Imputations and Bayesian Missing Data
Problems. Journal of the American Statistical Association 1994 Mar; 89(425): 278-288.
[KT51] Kuhn, H. W.; Tucker, A. W. (1951). Nonlinear programming. Proceedings of 2nd
Berkeley Symposium. Berkeley: University of California Press. pp. 481–492.
[L01] Leary, R., Jelliffe, R., Schumitzky, A., & Guilder, M. V. (2001). An adaptive grid non-
parametric approach to pharmacokinetic and dynamic (PK/PD) population models. Computer-
Based Medical Systems, IEEE Symposium.
[L83] Lindsay BG. The Geometry of Mixture Likelihoods: A General Theory. Annals of
Statistics 1983; 11:86-94.
[Liu 96] Liu JS. Nonparametric Hierarchical Bayes Via Sequential Imputations. The Annals of
Statistics 1996 Jun; 24(3):911-930.
[LJSG01] Leary R, Jelliffe R, Schumitzky A, Van Guilder M. An adaptive grid non-parametric
approach to population pharmacokinetic/dynamic (PK/PD) population models. Proceedings, 14th
IEEE Symposium on Computer Based Medical Systems 2001, Jul, 389-394.
[Lo84] Lo AY. On a class of Bayesian nonparametric estimates: I. Density Estimates. Annals of
Statistics 1984; 12: 351-357.
[M07] S. Mortensen, S. Klim, B. Dammann, N. Kristensen, H. Madsen, R. Overgaard "A
Matlab framework for estimation of NLME models using stochastic differential equations",
Journal of Pharmacokinetics and Pharmacodynamics vol:34, pages: 623-642, 2007.
[M86] Mallet A. A maximum likelihood estimation method for random coefficient regression
models. Biometrika 1986; 73:645-656.
[NDYSJ 12] Neely M, van Guilder M, Yamada W, Schumitzky A and Jelliffe R (2012), Pmetrics: a non-
parametric and parametric pharmacometric package for R. Therapeutic Drug Monitoring, 34(4), 467–476.
[P96] Pitman J. Statistics, Probability and Game Theory. (Chapter: Some developments of the
Blackwell-Mac Queen urn scheme, volume 30 of IMS Lecture Notes-Monograph series.) 1996
(pp. 245-267)
[PD04] Park K and D’Argenio DZ. Control of Uncertain Pharmacodynamic Systems. In:
Advanced Methods of Pharmacokinetic and Pharmacodynamic Systems Analysis – III. Edited by
DZ D’Argenio. New York, Plenum Press, 2004; 275-293.
[Q98] Fernando A. Quintana, Nonparametric Bayesian Analysis for Assessing Homogeneity in
k x I Contingency Tables with Fixed Right Margin Totals. Journal of the American Statistical
Association, Vol. 93, No. 443, Sep. 1998, pp. 1140-1149.
[S91] Schumitzky A. Nonparametric EM Algorithms for Estimating Prior Distributions. Applied
Math and Computation 1991; 45:141-157.
[T13] Tatiana Tatarinova, Michael Neely, Jay Bartroff, Michael van Guilder, Walter Yamada,
David Bayard, Roger Jelliffe, Robert Leary, Alyona Chubatiuk, Alan Schumitzky, (2013) Two
general methods for population pharmacokinetic modeling: non-parametric adaptive grid and
non-parametric Bayesian. Journal of Pharmacokinetics and Pharmacodynamics, Volume 40,
Issue 2, pp 189-199
[W91] Welle M, (1991). Implementation of a Nonparametric Algorithm for Estimating Prior
Distributions. MS Thesis, Department of Mathematics. University of Southern California.
[W96] Wakefield JC. Bayesian individualization via sampling based methods. Journal of
Pharmacokinetics and Biopharmaceutics, 24: 103-131, 1996.
[WSRG94] Wakefield J, Smith A, Racine-Poon A, Gelfand AE. Bayesian Analysis of Linear
and Nonlinear Population Models. Applied Statistics. 43: 201-222, 1994.
[WW97] Wakefield J, Walker S. Bayesian nonparametric population models: formulation and
comparison with likelihood approaches. J Pharmacokinet. Biopharm. 1997 Apr; 25(2):235-253.
[WW98] Wakefield J, Walker S. Population models with a nonparametric random coefficient
distribution. Sankhya Series B 1998;60:196-212.
Appendix
Appendix 1. NPML PK Code –Matlab
% Matlab program for NPML general non-linear PK problem
%Main program:
K = input('number of support points = ');
dim = input ('dimension of theta = ');
type = input('for unimodal model input "1", for bimodal model input "2" ');
format compact
[y, t, sigma, theta_0, w_0, eps] = initial_data(K, type, dim);
y
% theta_0
% w_0
[count, theta, w, LogLikelihood, PYL] = NPEM_nonlinear(y, t, w_0, theta_0, eps, sigma);
count
theta
w
LogLikelihood
if dim == 1
Dfun = @(theta_parameter)D(theta_parameter, y, t, sigma, PYL);
NDfun = @(theta_parameter)(-1)*Dfun(theta_parameter);
x = fminbnd(NDfun,0.2,0.8);
D_max_val = Dfun(x)
fplot(Dfun,[0,1])
grid on
plot_as_pulse(theta,w,2,1,'b')
end
if dim == 2
Dfun = @(theta_parameter)D(theta_parameter, y, t, sigma, PYL);
K = (0.1:0.01:1);
V = (10:0.1:50);
for i = 1:length(K)
for j = 1:length(V)
Z(i,j) = Dfun([K(i),V(j)]);
end
end
D_max_val = max(max(Z))
surf(K, V, Z')
end
function [y, t, sigma, theta_0, w_0, eps] = initial_data(K, type, dim)
rand('seed', 0)
theta_0 = zeros(1,K);
[t,y] = simulation(type, dim);
sigma = 0.3;
a = 0.1;
b = 0.8;
c = 14;
d = 26;
for l = 1:K
if dim == 1
theta_0(l) = a + (l-rand)*((b - a)/K);
end
if dim == 2
theta_0(1,l) = a + (l-rand)*((b - a)/K);
theta_0(2,l) = c + (l-rand)*((d - c)/K);
end
end
w_0 =(1/K) * ones(1,K);
eps =10^(-8);
function [t,y] = simulation(type, dim)
N = 20;
J = 5;
t = zeros(N,J);
k = zeros(1,N);
sigma = 0.3;
rand('seed', 0)
randn('seed', 0)
for i = 1:N
t(i,:) = [1.5 2 3 4 5.5];
end
% Model (1) where K is generated as a single normal distribution
if type == 1
mu = 0.5;
sigma_k = 0.06;
k = randn(1,N) * sigma_k + mu;
end
% Model (2) where K is generated as a mixture of two normal distributions
if type == 2
mu = [0.3 0.6];
sigma_k = [0.06 0.06];
p = 0.8;
for i = 1:N
w = rand;
if w <= p
k(i) = randn * sigma_k(1) + mu(1);
else
k(i) = randn * sigma_k(2) + mu(2);
end
end
end
% main loop, simulation of y
if dim == 1
for i = 1:N
for j = 1:J
y(i,j) =5*exp((-1)*k(i)* t(i,j))*(1 + randn * sigma);
end
end
end
if dim == 2
for i = 1:N
V(i) = randn * 2 + 20;
for j = 1:J
y(i,j) = (100/V(i)) *exp((-1)*k(i)* t(i,j))*(1 + randn * sigma);
end
end
end
function [count, theta, w, LogLikelihood, PYL] = NPEM_nonlinear(y, t, w_0, theta_0, eps, sigma)
%---matrix: y(NxJ), t(NxJ), w_0(1xK), theta_0(dim x K)
%---scalar: eps,sigma,linar
[N,J] = size(y);
[dim,K] = size(theta_0);
old_theta = theta_0;
old_w = w_0;
new_theta = zeros(dim,K);
count = 0;
% initial likelihood calculation--------------------------
PYL = p_y_l(y, t, old_theta, old_w, sigma); %% PYL = p(Y_i | lamda) is a vector (Nx1)
new_likelihood = L(PYL);
likelihood = new_likelihood - 2 * eps;
% main loop-----------------------------------------------
while (abs(new_likelihood - likelihood) >= eps)&&(count <= 1000)
PTYL = p_t_y_l(y, t, old_theta, old_w, PYL, sigma); %% PTYL = p(theta(l) | Yi, lamda) is a matrix (NxK)
% theta_prime calculation------------------------------
if dim == 1
for l = 1:K
fun = @(theta)Q(l, y, t, theta, sigma, PTYL);
new_theta(l) = fminbnd(fun,0.1,0.8);
end
end
if dim == 2
for l = 1:K
fun = @(theta)Q(l, y, t, theta, sigma, PTYL);
new_theta(:,l) = fminsearch(fun,[0.2,14]);
end
end
% w_prime calculation------------------------------------
new_w = (1/N) * sum(PTYL,1);
% making new parameters old before it goes to the next loop
old_theta = new_theta;
old_w = new_w;
PYL = p_y_l(y, t, old_theta, old_w, sigma);
likelihood = new_likelihood;
new_likelihood = L(PYL);
count = count+1;
end
LogLikelihood = new_likelihood;
theta = old_theta;
w = old_w;
function p = p_y_l(y, t, theta_0, w_0, sigma)
N = length(y(:,1));
K = length(theta_0(1,:));
f_lamda = zeros(N,K);
for i = 1:N
for l = 1:K
f_lamda(i,l) = w_0(l) * prob(y(i,:), t(i,:), theta_0(:,l), sigma);
end
end
p_y_lamda = sum(f_lamda,2);
p = p_y_lamda;
function p = prob(y, t, theta, sigma)
J = length(y);
dim = length(theta);
z = zeros(1,J);
for j = 1:J
if dim == 1
m =5*exp((-1)*theta*t(j));
z(j) = normpdf(y(j),m,m*sigma) ;
end
if dim == 2
m = (100/theta(2))*exp((-1)*theta(1)*t(j));
z(j) = normpdf(y(j),m,m*sigma) ;
end
end
p = prod(z);
function p = p_t_y_l(y, t, theta_0, w_0, p_y_lamda, sigma)
N = length(y(:,1));
K = length(theta_0(1,:));
F = zeros(N,K);
for i = 1:N
if p_y_lamda(i)==0
F(i,:) = (1/K)*ones(1,K);
else
for l = 1:K
F(i,l) = w_0(l) * prob(y(i,:), t(i,:), theta_0(:,l), sigma) / p_y_lamda(i);
end
end
end
p = F;
function f = Q(l, y, t, theta, sigma, p_theta_y_lamda)
N = length(y(:,1));
f_y = zeros(1,N);
for i = 1:N
z = prob(y(i,:), t(i,:), theta, sigma);
if z == 0 %% it is done to avoid log of zero
f = Inf;
else
f_y(i) = p_theta_y_lamda(i,l) * log(prob(y(i,:), t(i,:), theta, sigma));
end
end
f = -sum(f_y);
function L = L(p_y_l)
l_p = log(p_y_l);
L = sum(l_p);
function D = D(theta_parameter, y, t, sigma, PYL)
N = length(y);
D_components = zeros(1,N);
for i=1:N
D_components(i) = prob(y(i,:), t(i,:), theta_parameter, sigma)/PYL(i);
end
D = sum(D_components) - N;
end
To plot the 3D spikes graph we used the following code
clear
theta = [0.2779 19.3815; 0.2779 19.3815; 0.2779 19.3815; 0.2779
19.3815; 0.2779 19.3815; 0.2779 19.3815; 0.3051 19.0041;
0.3456 23.0530; 0.3456 23.0530; 0.3456 23.0530;
0.4384 26.1566; 0.4384 26.1566; 0.5830 20.6542;
0.5830 20.6542; 0.6658 23.7547; 0.6658 23.7547;
0.6658 23.7547; 0.6658 23.7547; 0.6658 23.7547;
0.6658 23.7547];
w = [0.0000, 0.0001, 0.0008, 0.0105, 0.1002,
0.1841, 0.1470, 0.0960, 0.1412, 0.1256, 0.0230,
0.0295, 0.0165, 0.0265, 0.0075, 0.0403, 0.0449,
0.0060, 0.0004, 0.0000];
r=[18:0.1:27];
t=[0.2:0.01:0.7];
Z=0;
F =@(q)MLF(q,theta',w,0.003);
for i = 1:length(t)
for j = 1:length(r)
Z(i,j) = F([t(i),r(j)]);
end
end
surf(t,r,Z');
function F = MLF(t,theta,w,sigma)
[M,N] = size(theta);
F = 0;
if M==1
for i = 1:N
F = F + (w(i)* normpdf(t, theta(i), sigma))/normpdf(theta(i), theta(i), sigma);
end
end
if M==2
for i = 1:N
F = F + (w(i)* normpdf(t(1), theta(1,i), sigma)*normpdf(t(2), theta(2,i), (sigma+0.09)))*(2*pi*sigma*(sigma+0.09));
end
end
To plot the true distribution of θ we used
trueF = @(theta)((0.8*normpdf(theta(1),0.3,0.06)+0.2*normpdf(theta(1),0.6,0.06))*normpdf(theta(2),20,2));
K = (0.2:0.02:0.7);
V = (18:0.5:27);
Z=zeros(length(K),length(V));
for i = 1:length(K)
for j = 1:length(V)
Z(i,j) = trueF([K(i),V(j)]);
end
end
surf(K, V, Z')
Appendix 2. Marginal NPB Galaxy Code –Matlab
% The Matlab program for NPB Galaxy Data
% Data
y = [ 9.1720 9.3500 9.4830 9.5580 9.7750 10.2270 10.4060 16.0840 16.1700 18.4190 18.5520
18.6000 18.9270 19.0520 19.0700 19.3300 19.3430 19.3490 19.4400 19.4730 19.5290 19.5410
19.5470 19.6630 19.8460 19.8560 19.8630 19.9140 19.9180 19.9730 19.9890 20.1660 20.1750
20.1790 20.1960 20.2150 20.2210 20.4150 20.6290 20.7950 20.8210 20.8460 20.8750 20.9860
21.1370 21.4920 21.7010 21.8140 21.9210 21.9600 22.1850 22.2090 22.2420 22.2490 22.3140
22.3740 22.4950 22.7460 22.7470 22.8880 22.9140 23.2060 23.2410 23.2630 23.4840 23.5380
23.5420 23.6660 23.7060 23.7110 24.1290 24.2850 24.2890 24.3660 24.7170 24.9900 25.6330
26.9600 26.9950 32.0650 32.7890 34.2790];
M = 10000;
burn = 5000;
sigma1 = 1;
gtheta = gsample(M,burn,y,sigma1);
mu = mean(y);
sigma0 = std(y);
sigma = 0.1;
F = @(theta)ExpF(theta,gtheta,mu,sigma0,sigma);
theta = 5:0.1:35;
K = length(theta);
for k = 1:K
k
X(k) = F(theta(k));
end
plot(theta,X)
function w = weights(i,y,theta,sigma1)
n= length(y);
mu = mean(y);
sigma0 = std(y);
unw = zeros(1,n);
for j=1:n
    if j==i
        q0 = normpdf(mu, y(i), sqrt(sigma0^(2)+sigma1^(2)));
        unw(j) = q0;                              % prior-predictive weight for a new draw from G0
    else
        unw(j) = normpdf(y(i),theta(j),sigma1);   % weight for keeping the j-th current value
    end
end
b = sum(unw);
w = (1/b)* unw;
______________________________________________________________________________
function theta = rulet(i,x,w,mu,sigma)
% rand('twister', 0)
%randn('state', 0)
n = length(w);
u = rand;
y = 0;
s = 0;
for j = 1:n
if j == i
if (s < u)&&(u <= (s+w(i)))
y = sigma*randn+mu;
end
s = s+w(i);
else
if (s < u)&&(u <= (s+w(j)))
y = x(j);
end
s = s+w(j);
end
end
theta = y;
_____________________________________________________________________________
function gtheta = gsample(M,burn,y,sigma1)
n = length(y);
G = zeros((M-burn),n);
mu = mean(y);
sigma0 = std(y);
sigmastar = sqrt(sigma1^(2) * sigma0^(2)/(sigma1^(2) + sigma0^(2))); % posterior std of theta_i given y_i
k = -burn;
G(1,:) = y;
x = y;
for m = 1:M
for i = 1:n
w = weights(i,y,x,sigma1);
mustar = (mu*sigma1^(2) + y(i)*sigma0^(2))/(sigma1^(2) + sigma0^(2));
x(i) = rulet(i,x,w,mustar,sigmastar);
end
k = k+1;
if k > 0
G(k,:) = x;
end
end
gtheta = G;
function EF = ExpF(theta,GS,mu,sigma0,sigma)
[M,N] = size(GS);
EF = (1/(N+1))* normpdf(theta,mu,sigma0);
for m = 1:M
for i = 1:N
EF = EF + (1/M)*(1/(N+1))* normpdf(theta, GS(m,i), sigma);
end
end
Appendix 3. Full NPB Galaxy code – Matlab
y = [ 9.1720 9.3500 9.4830 9.5580 9.7750 10.2270 10.4060 16.0840 16.1700
18.4190 18.5520 18.6000 18.9270 19.0520 19.0700 19.3300 19.3430 19.3490
19.4400 19.4730 19.5290 19.5410 19.5470 19.6630 19.8460 19.8560 19.8630
19.9140 19.9180 19.9730 19.9890 20.1660 20.1750 20.1790 20.1960 20.2150
20.2210 20.4150 20.6290 20.7950 20.8210 20.8460 20.8750 20.9860 21.1370
21.4920 21.7010 21.8140 21.9210 21.9600 22.1850 22.2090 22.2420 22.2490
22.3140 22.3740 22.4950 22.7460 22.7470 22.8880 22.9140 23.2060 23.2410
23.2630 23.4840 23.5380 23.5420 23.6660 23.7060 23.7110 24.1290 24.2850
24.2890 24.3660 24.7170 24.9900 25.6330 26.9600 26.9950 32.0650 32.7890
34.2790];
L = 10000;
burn = 1000;
K = 20;
alfa_0 = 1;
mu_G0 = mean(y);
sigma_G0 = std(y);
sigma_prob = 1;
[w,fi,S] = BlockedGS(y,alfa_0,K,mu_G0,sigma_G0,sigma_prob,L,burn);
I =length(w(:,1));
mu = zeros(1,I);
sigma = zeros(1,I);
for i = 1:I
mu(i) = w(i,:)*(fi(i,:))';
dif = fi(i,:)-mu(i);
wsqdif = zeros(1,K);
for j = 1:K
wsqdif(j) = w(i,j)*dif(j)^2;
end
sigma(i) = sqrt(sum(wsqdif));
end
avrmu = mean(mu);
stdmu = std(mu);
avrsigma = mean(sigma);
stdsigma = std(sigma);
Std = sqrt((1/I)*(sigma*sigma'+mu*mu')-avrmu^2);
ConfIntmu = [avrmu - 1.96*stdmu/sqrt(I) avrmu + 1.96*stdmu/sqrt(I)];
ConfIntsigma = [avrsigma - 1.96*stdsigma/sqrt(I) avrsigma + 1.96*stdsigma/sqrt(I)];
N = length(S(:,1));
C = zeros(1,N);
for i = 1:N
C(i) = length(unique(S(i,:)));
end
avrC = mean(C);
figure(3);
hist(C)
function [w,fi,S] = BlockedGS(y,alfa_0,K,mu_G0,sigma_G0,sigma_prob,L,burn)
w_0 = (1/K)*ones(1,K);
N =length(y);
fi = zeros((L-burn),K);
S = zeros((L-burn),N);
w = zeros((L-burn),K);
k = -burn;
for i = 1:N
S_new(i) = gets(w_0);
end
fi_new = zeros(1,K);
w_new = w_0;
for l = 1:L
S_old = S_new;
w_old = w_new;
fi_new = zeros(1,K);
S_new = zeros(1,N);
w_new = zeros(1,K);
fi_new = getfi_normal(y, mu_G0,sigma_G0, sigma_prob, S_old, K);
S_new = step2(y,fi_new,w_old,sigma_prob);
w_new = getw(alfa_0, S_new,K);
k = k+1;
if k > 0
fi(k,:) = fi_new;
S(k,:) = S_new;
w(k,:) = w_new;
end
end
fi_h = fi(1,:);
S_h = S(1,:);
w_h = w(1,:);
if k>1
for k = 2:K
fi_h = [fi_h,fi(k,:)];
S_h = [S_h,S(k,:)];
w_h = [w_h,w(k,:)];
end
end
figure(1)
clf
hist(S_h)
sigma = 0.2;
figure(2)
clf
F = @(t)(1/(L-burn))*MLF(t,fi_h,w_h,sigma);
fplot(F,[min(fi_h)-3*sigma, max(fi_h)+3*sigma])
end
%-------------------------------------------------------------------------------
function F = MLF(t,theta,w,sigma)
N = length(theta);
F = 0;
for i = 1:N
F = F + (w(i)* normpdf(t, theta(i), sigma))/normpdf(theta(i), theta(i), sigma);
end
end
function S = gets(w)
K = length(w);
u = rand;
y = 0;
s = 0;
for k = 1:K
if (s < u)&&(u <= (s+w(k)))
y = k;
end
s = s+w(k);
end
S = y;
function fi = getfi_normal(y, mu_G0,sigma_G0, sigma_prob, S, K)
N = length(S);
fi = zeros(1,K);
for k = 1:K
ind{k} = {};
end
for i = 1:N
ind{S(i)}{end+1} = i;
end
for k =1:K
mu = 0;
M = 0;
if isempty(ind{k})
fi(k) = randn*sigma_G0 + mu_G0;
else
M = length(ind{k});
for i = ind{k}
mu = mu+y(i{1});
end
fi(k) = randn*sqrt(sigma_G0*sigma_prob/(M*sigma_G0+sigma_prob)) + (mu*sigma_G0+mu_G0*sigma_prob)/(M*sigma_G0+sigma_prob);
end
end
function S = step2(y,fi_new,w_old,sigma_prob)
N = length(y);
K = length(fi_new);
S = zeros(1,N);
w1 = zeros(1,K);
for i = 1:N
for k = 1:K
w1(k) = w_old(k)*prob(y(i),fi_new(k),sigma_prob);
end
nor = sum(w1);
w = (1/nor)*w1;
S(i) = gets(w);
end
function w = getw(alfa_0, S_new,K)
M = zeros(1,K);
V = zeros(1,K);
N = length(S_new);
for k = 1:K
ind{k} = {};
end
for i = 1:N
ind{S_new(i)}{end+1} = i;
end
for k = 1:K
M(k) = length(ind{k});
end
A = 0;
B = 0;
for k = 1:K-1
A = 1+M(k);
B = alfa_0;
for r = k+1:K
B = B+M(r);
end
V(k) = betarnd(A,B);
end
V(K) = 1;
w(1)= V(1);
for k = 2:K
w(k) = w(k-1)*(1-V(k-1))*V(k)/V(k-1);
end
function p = prob(y,fi,sigma) %% f(Y_i | fi_S_i), the known conditional density of Y given fi
p = normpdf(y,fi,sigma);
end
Abstract
Suppose we observe a random sample Y₁,...,Y_N of independent but not necessarily identically distributed random variables Yᵢ ∈ ℝᵈ, for i = 1,...,N. Assume also that the conditional density of Yᵢ given θᵢ is known and denoted by pᵢ(Yᵢ|θᵢ), where the θᵢ's are unobserved random parameters that are independent and identically distributed with common but unknown distribution function F. ❧ The objective is to estimate F given the data Y₁,...,Y_N.
We used two different approaches to estimate F: Nonparametric Maximum Likelihood (NPML) and Nonparametric Bayesian (NPB). ❧ For the nonparametric maximum likelihood approach, convex analysis shows that the maximum likelihood (ML) estimator Fᴹᴸ of F is a discrete distribution with no more than N support points, where N is the number of subjects in the population. The positions and weights of the support points are unknown. We developed four different methods of estimation: NPEM 1, NPEM 2, EM1+IP and EM2+IP. These methods are based on the Expectation–Maximization (EM) and Primal-Dual Interior Point (IP) algorithms. The NPEM 1 algorithm uses the EM algorithm to maximize the log-likelihood function over the set of support points and weights. NPEM 2 fixes a large grid of points at first and then uses the EM algorithm to maximize the log-likelihood only over the set of weights. The NPEM 2 algorithm is much faster than the NPEM 1 algorithm, but to find the optimal solution a very large grid of points has to be used. Therefore, in the case of multidimensional parameters, NPEM 2 can be impractical. The EM1+IP algorithm is a hybrid of the NPEM 1 and IP algorithms in that the support points are updated the same way as in NPEM 1, but the weights are updated using the Primal-Dual Interior Point (IP) algorithm. This method is much faster than NPEM 1 and comparable in speed with NPEM 2; however, it does not need a large grid of points. EM1+IP requires no more than N support points to start the estimation process and therefore can be used in cases of multidimensional parameters. The EM2+IP algorithm is a hybrid of NPEM 2 and IP. It starts with a large grid of support points and then updates the weights using the IP algorithm. This method gives results instantly compared to the speeds of the rest of the methods. However, the optimality of the results, as in the case of the NPEM 2 algorithm, very much depends on the initial grid of points. ❧ Another highlight of this thesis is the ability to build nonparametric maximum likelihood estimates of the distributions of the parameters of linear stochastic models, i.e. population models with process noise. The advantage of models with process noise is that, in addition to the measurement errors, the uncertainties in the model itself (the process noise) are taken into account. For example, in the case of the PK problem, errors such as dose errors, sample timing errors and model misspecifications are not included in deterministic models. The stochastic models, on the other hand, can overcome this problem. ❧ We used linear Kalman-Bucy filtering [J70] to find the likelihoods of these models and then employed all the above methods to find the maximum likelihood estimates of an unknown distribution F. ❧ In the NPB approach, the distribution F is itself a random variable. The first requirement is to place a prior distribution on F. A convenient way of doing this is with the so-called Dirichlet process (DP) prior, see [F73, WW97, WW98]. A Dirichlet process prior on a random distribution F is defined by a probability distribution G₀ and a scale parameter α₀, where G₀ is our prior guess of F and the parameter α₀ is the strength of the guess. We write DP(α₀G₀) for the Dirichlet process. ❧ The main reason we consider the NPB approach in the PK Population Problem is its ability to calculate Bayesian credibility intervals, no matter what the sample size is.
This is not possible with NPML in general; however, it is possible to get bootstrapped confidence intervals for NPML which are somewhat similar to Bayesian credibility intervals. In both cases we get approximations of the accuracy of the estimates. Even though NPML and NPB both give us estimates of accuracy, the Bayesian nature of the estimated population distribution F is preferable. ❧ Two algorithms were developed to find NPB estimates of the distribution F, Marginal NPB and Full Conditional NPB. The Marginal NPB approach uses Gibbs sampling to estimate the posterior expectation of the distribution F given the data, but it does not solve the classical NPB problem of finding the posterior distribution of F given the data. The Full NPB approach uses the Stick-Breaking representation of the Dirichlet process and gives a way to sample from the posterior distribution of F given the data. This approach allows us to build credibility intervals on all the desired quantities of the distribution F. ❧ Finally, we established a very nice connection between the NPML and NPB methods. Using the Full NPB approach we can sample potential estimates of the true distribution F; moreover, we can calculate the value of the log-likelihood function for each of those distributions. All these values of the likelihood are smaller than or equal to the maximum likelihood value that was found by the NPML methods. Equality is possible with a big enough number of draws from the posterior distribution of F given the data.