NONPARAMETRIC ESTIMATION OF AN UNKNOWN
PROBABILITY DISTRIBUTION USING MAXIMUM LIKELIHOOD
AND BAYESIAN APPROACHES


by
Alona Chubatiuk

__________________________________________________



A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the  
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)




August, 2013

Copyright 2013  Alona Chubatiuk
Dedication






To my family














Acknowledgments

I would like to start with “Thank you, Thank you, Thank you” to my advisor, Professor Alan
Schumitzky. I have been blessed to have the opportunity to work with him, to learn from him, to
listen to him, and simply to know him as a person. I cannot thank him enough for all his
patience, help, and support during the times I went through the screening exams, my oral exam,
the birth of my son, and finally the preparation of the thesis manuscript and the defense. He was always
there for me. I would also like to thank Professors Jay Bartroff and David D’Argenio for serving
on both my oral exam and defense committees and for their valuable suggestions and advice.
My sincerest thanks to my parents, Valentyna and Valentyn Chubatiuk, for their love and
patience. Having them as an example, I always dreamed of becoming a professor of
mathematics, and they believed in me and supported me all the way through without any doubt.
A special thank you to my husband Oleksandr and my son Pavel for creating a happy, loving, and
peaceful atmosphere at home that always gave me energy for another working day. Many thanks
also to my mother-in-law, Irina, who basically took over the work in my house to give me time
to focus on my research.
None of this would have been possible without the encouragement and support of my friends and
colleagues Meg Aboulian, Tatiana Tatarinova, Dmytro Chebotarov, Vitalii Ostrovskyi, Jinlin
Song, Selenne Garcia-Torres, Xin Wang, Kasia Wasilewska, and many more. Thank you to all of
you.
Finally, thank you to the USC Graduate School and WiSE for the Dissertation Completion
Fellowship and the Fellowship Support to Accommodate Pregnancy, Birth of a Child, and Adoption
for PhD Students. These helped me no less than everything I described above.
Table of Contents

Dedication………………………………………………………………………………………….i
Acknowledgments………………………………………………………………………………...ii
List of Tables……………………………………………………………………………………..iv
List of Figures……………………………………………………………………………………..v
Abstract……………………………………………………………………………………………x
Chapter 1.    Introduction…………………………………………………………………………1
Chapter 2.    Nonparametric Maximum Likelihood (NPML) Estimation………………………...6
Chapter 3.    Computer Simulations for NPML………………………………………………….31  
Chapter 4.    NPML Estimation for Stochastic Models………………………………………...109
Chapter 5.    Nonparametric Bayesian (NPB) Estimation……………………………………..135  
Chapter 6.    Theoretical Foundations of the Gibbs Sampler…………………………………...144
Chapter 7.    Computer Simulations for Gibbs Sampler………………………………………..147
Chapter 8.    Stick-breaking Approach in NPB Estimation…………………………………….155
Chapter 9.    Maximum Likelihood of Stick-Breaking (Connection between NPML and
NPB)……………………………………………………………………………………………171      
Bibliography……………………………………………………………………………………174
Appendix………………………………………………………………………………………..178




List of Tables

Table 1: This table shows the number of 95 percent bootstrapped confidence intervals out
of 100 that contain the true value of the expected value and the standard
deviation for the ‘bimod’ data example………………………………………………..30
Table 2: Galaxy data……………………………………………………………………………..36
Table 3. Rolling Thumbtacks example data……………………………………………………..97
Table 4: Eye Tracking example, measurements……………………………………………......102
Table 5: The table shows the results of two NPML methods, NPEM1 and EM1+IP, that
were calculated using different levels of process noise. It also compares the
results when the process noise was included in the simulation but was or was
not included in the model………………………………………………123
Table 6: This table shows the comparison of three different methods for estimating the  
unknown probability distribution  for the Galaxy Data example………………….168

 






List of Figures

Chapter 2
Fig 1: The graph of  for …………………………………..24

Chapter 3
Fig.1: The graph of the D function for Lindsay simple example………………………………..34
Fig.2: The graph of the , Lindsay simple example………………………………………….35
Fig. 3: The graph of the D function for Galaxy Data for K=3…………………………………...40
Fig. 4: The graph of the discrete distribution of   for K=3 shows 3 support points. Galaxy data
(The height of each vertical line is the corresponding probability on the support point) ……….40
Fig. 5: The graph of the discrete distribution of  for K= 82 shows 4 support points. Galaxy data
…………………………………………………………………………………………………...43
Fig. 6: The graph of the D function for Galaxy Data with 3 support points. Galaxy data………44
Fig. 7: The graph of the discrete distribution of  for K= 4 shows 4 support points. Galaxy
data……………………………………………………………………………………………….45
Fig 8: The graph of the simulated data for Case 1 and Model 1…………………………………56
Fig 9: The graph of the D function for K = 20, Case 1 and Model 1……………………………57
Fig 10: The graph of the  for K = 20, Case 1 and Model 1; it shows 4 support points……...57
Fig 11: True distribution of  Model 1………………………………………………………….58
Fig 12: The graph of the D function for K = 4, Case 1 and Model 1……………………………59
Fig 13: The graph of the  for K = 4, Case 1 and Model 1; it shows 4 support points……….60
Fig 14: The graph of the simulated data for Case 1 and Model 2………………………………..62
Fig 15: The graph of the D function for K = 20, Case 1 and Model 2…………………………..62
Fig 16: The graph of the  for K = 20, Case 1 and Model 2; it shows 5 support points……...63
Fig 17: True distribution of  Model 2………………………………………………………….63
Fig 18: The graph of the D function for K = 5, Case 1 and Model 2……………………………65
Fig 19: The graph of the  for K = 5, Case 1 and Model 2; it shows 5 support points……….65
Fig 20: The graph of the simulated data for Case 2 and Model 1………………………………..67
Fig 21: The graph of the D function for K = 20, Case 2 and Model 1…………………………..68
Fig 22: The graph of the D function for K = 3, Case 2 and Model 1……………………………69
Fig 23: The graph of the D function for K = 8, Case 2 and Model 2……………………………71
Fig 24: The graph of the simulated data for Case 3 and Model 1………………………………..73
Fig 25: The graph of the D function for K = 20, Case 3 and Model 1…………………………..74
Fig 26: The graph of the  for K = 20, Case 3 and Model 1; it shows 3 support points……...74
Fig 27: The graph of the D function for K = 3, Case 3 and Model 1……………………………75
Fig 28: The graph of the  for K = 3, Case 3 and Model 1; it shows 3 support points……….76
Fig 29: The graph of the D function for K = 6, Case 3 and Model 2……………………………77
Fig 30: The graph of the  for K = 6, Case 3 and Model 2; it shows 6 support points……….78
Fig 31: True distribution of  Model 2………………………………………………………….78
Fig 32: The graph of the D function for K = 3, Case 4 and Model 1……………………………80
Fig 33: The graph of the simulated data for Case 4 and Model 2………………………………..81
Fig 34: The graph of the D function for K = 6, Case 4 and Model 2……………………………82
Fig. 35: a) Smoothed graph of  for K = 6, Case 4 and Model 2; it shows 6 support points;
b) Graph of the true distribution………………………………………………………………...82
Fig 36: The graph of the D function. PK example with additive noise………………………….86
Fig 37: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This figure
shows the graphs of (solid line) true distribution of the volume of distribution   (dashed line ‘--
’) smoothed MLE of the true distribution and (spikes) …………………………………….87
Fig 38: The graph of the D function. EM1+IP…………………………………………………..89
Fig 39: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . EM1+IP….90
Fig 40: The graph of the D function. NPEM 2…………………………………………………..92
Fig 41: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . NPEM2.....93
Fig 42: The graph of the D function. EM2+IP…………………………………………………..95
Fig 43: (a) This figure shows the graphs of (solid line) true distribution of the elimination rate  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . (b) This
figure shows the graphs of (solid line) true distribution of the volume of distribution  
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes) . EM2+IP.....96
Fig 44: The graph of for K=10. Rolling thumbtacks…………………………………….....99
Fig 45: The graph of the D function for K=10. Rolling thumbtacks…………………………...100
Fig 46: The graph of for K=3. Rolling thumbtacks……………………………………….101
Fig 47: The graph of the D function for K=3. Rolling thumbtacks…………………………….101
Fig 48: The graph of for K=6. Eye Tracking example…………………………………….104
Fig 49: The graph of the D function for K=6. Eye Tracking example…………………………104
Fig 50: The graph of for K=101. Eye Tracking example…………………………………108
Fig 51: The graph of the D function for K=101. Eye Tracking example………………………108
Chapter 4
Fig 1: This figure shows the graphs of (solid line) true distribution of the elimination rate
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes)  in case when  
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0  
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1……………………………………………128
Fig 2: This figure shows the graphs of (solid line) true distribution of the volume of distribution
(dashed line ‘--’) smoothed MLE of the true distribution and (spikes)  in case when  
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0  
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1……………………………………………133

Chapter 7
Fig. 1: The graph of  Galaxy data………………………………………………150
Fig. 2: The graph of (6 support points). Galaxy Data…………………………………….151
Fig 3: The graph of the simulated data for the above values of parameters. PK example……..153
Fig. 4: Frequency Histogram of Gibbs sample for 20 subjects (elimination rate constants on x – axis). PK
example…………………………………………………………………………………………153
Fig. 5: The graph of   for  (5 support points). PK example……………………….154
Fig. 6: The graph of true distribution of . PK example……………………………………...154
Fig. 7: Frequency Histogram of Gibbs sample for 100 subjects (elimination rate constants on x-
axis). PK example………………………………………………………………………………154
Chapter 8
Fig 1: Illustration of Stick Breaking (taken from Zoubin Ghahramani’s Tutorial in 2005)……158  
Chapter 9
Fig 1: The graph of , there are 6 support points. Galaxy data……………………………..172
Fig 2: The graph of , there are 20 support points in total but only 7 of them have significant
weight. Galaxy data…………………………………………………………………………….173





Abstract  

Suppose we observe a random sample , of independent but not necessarily identically
distributed random variables , for . Assume also that the conditional density of
given  is known and denoted by , where the ’s are unobserved random
parameters that are independent and identically distributed with common but unknown
distribution function .
The objective is to estimate  given the data . We used two different approaches to get
the estimate of : Nonparametric Maximum Likelihood (NPML) and Nonparametric Bayesian
(NPB).
For the nonparametric maximum likelihood approach, convex analysis shows that the maximum
likelihood (ML) estimator  of  is a discrete distribution with no more than N support
points where N is the number of subjects in the population.  The position and weights of the
support points are unknown.  We developed four different methods of estimation: NPEM1,
NPEM2, EM1+IP, and EM2+IP. These methods are based on the Expectation-Maximization
(EM) and Primal-Dual Interior Point (IP) algorithms. The NPEM1 algorithm uses the EM
algorithm to maximize the log-likelihood function over the set of support points and weights.
NPEM2 first fixes a large grid of points and then uses the EM algorithm to maximize the log-
likelihood only over the set of weights. The NPEM2 algorithm is much faster than the NPEM1
algorithm, but to find the optimal solution a very large grid of points has to be used. Therefore, in
the case of multidimensional parameters, NPEM2 can be impractical. The EM1+IP algorithm is a
hybrid of the NPEM1 and IP algorithms: the support points are updated the same way
as in NPEM1, but the weights are updated using the Primal-Dual Interior Point (IP) algorithm. This
method is much faster than NPEM1 and comparable in speed with NPEM2; however, it does not
need a large grid of points. EM1+IP requires no more than N support points to start the
estimation process and therefore can be used in cases of multidimensional parameters. The
EM2+IP algorithm is a hybrid of NPEM2 and IP. It starts with a large grid of support points and
then updates the weights using the IP algorithm. This method gives results almost instantly
compared to the other methods. However, the optimality of the results, as in the case of the
NPEM2 algorithm, depends very much on the initial grid of points.
Another highlight of this thesis is the ability to build nonparametric maximum likelihood
estimates of the distributions of the parameters of linear stochastic models, i.e., population models
with process noise. The advantage of models with process noise is that, in addition to the
measurement errors, the uncertainties in the model itself (the process noise) are taken into
account. For example, in the case of the PK problem, errors such as
• dose errors
• sample timing errors
• model misspecifications
are not included in the deterministic models. The stochastic models, on the other hand, can
overcome this problem.
We used linear Kalman-Bucy filtering [J70] to find the likelihoods of these models and then
employed all the above methods to find the maximum likelihood estimates of an unknown
distribution F.
In the NPB approach, the distribution  is itself a random variable. The first requirement is to
place a prior distribution on .  A convenient way of doing this is with the so-called Dirichlet
process (DP) prior, see [F73, WW97, WW98].  A Dirichlet process prior on a random
distribution  is defined by a probability distribution  and a scale parameter , where  is
our prior guess of and the parameter  will be the strength of the guess. We write

for the Dirichlet process.
The main reason we consider the NPB approach in the PK Population Problem is its
ability to calculate Bayesian credibility intervals, no matter what the sample size is. This is not
possible with NPML in general; however, it is possible to get bootstrapped confidence intervals
for NPML that are somewhat similar to Bayesian credibility intervals. In both cases we get
approximations of the accuracy of the estimates. Even though both NPML and NPB give us estimates
of accuracy, the Bayesian nature of the estimated population distribution  is preferable.
Two algorithms were developed to find NPB estimates of the distribution F: Marginal NPB and
Full Conditional NPB. The Marginal NPB approach uses Gibbs sampling to estimate the
posterior expectation of the distribution F given the data, but it does not solve the classical NPB
problem of finding the posterior distribution of F given the data. The Full NPB approach uses the
Stick-Breaking representation of the Dirichlet process and gives a way to sample from the
posterior distribution of F given the data. This approach allows us to build credibility intervals
for all the desired quantities of the distribution F.
Finally, we established a very useful connection between the NPML and NPB methods. Using the Full
NPB approach we can sample potential estimates of the true distribution F; moreover, we can
calculate the value of the log-likelihood function for each of those distributions. All these
likelihood values are smaller than or equal to the maximum likelihood value found by the
NPML methods. Equality is possible with a large enough number of draws from the posterior
distribution of F given the data.

Chapter 1.  Introduction  

In this thesis, we focus on the “Nonparametric Estimation of an Unknown Probability
Distribution Using Maximum Likelihood and Bayesian Approaches.”  The general statistical
problem is to estimate an unknown probability distribution  of a population of nonlinear
models from noisy data.  There are many applications of this problem in medicine, computational
biology, and engineering.

One important medical example is in the analysis of clinical trials. A new drug is given to a
population of subjects. The drug’s behavior is stochastically determined by an unknown subject-
specific vector parameter . This pharmacokinetic parameter  varies significantly (usually
based on genetic or epigenetic factors) between the subjects and that accounts for the variability
of the drug response in the population. The pharmacokinetic population problem is to determine
the probability distribution  of  based on the noisy clinical data.  

In the general case, our objective is to use two different approaches, nonparametric maximum
likelihood (NPML) and nonparametric Bayesian (NPB) to establish new algorithms for
estimating the unknown probability distribution  given the data.  

For the maximum likelihood approach, convex analysis shows that the maximum likelihood
(ML) estimator  of  is a discrete distribution with no more than N support points where N
is the number of subjects in the population.  The position and weights of the support points are
unknown. We have programmed an algorithm based on the Expectation-Maximization (EM)
method to determine , see [S91].  The discrete property of  is important for optimal
drug dosage design as it reduces integration problems over  to summation problems.  

In 1983, Lindsay [L83] proved the fundamental result about nonparametric maximum likelihood
(NPML) estimation for mixing distributions.  However, this result was not widely known in PK
circles until 1986, when Mallet’s ground-breaking paper [M86] appeared.  Both Lindsay’s and
Mallet’s results were based on convexity theory. In 1991, Schumitzky [S91] derived an EM
algorithm for the NPML estimator called NPEM1.  What is interesting is that in 1984, before
Mallet’s result, Katz and D’Argenio [KD84] proposed a nonparametric estimation method which
can now be viewed as the first iteration of the NPEM1 algorithm of Schumitzky.

In the early 1990’s, Leary and Burke developed a primal-dual interior point NPML method
which was hundreds of times faster than the NPEM1 [L01].  This last algorithm is essentially the
nonparametric adaptive grid (NPAG) algorithm supplied by the Laboratory of Applied
Pharmacokinetics at USC [LJSG01]. A recent version is now available in an R-based program
called P-metrics, which was developed by Dr. Michael Neely of USC, see [NGYSJ 12] and  
[T13].

In this thesis we propose a hybrid method (EM1+IP) which uses the best aspects of NPEM1 and
NPAG and which should be more efficient for high dimensional problems.

Another highlight of this thesis is the ability to build nonparametric maximum likelihood
estimates of the distributions of the parameters of linear stochastic models, i.e., population
models with process noise. We used linear Kalman-Bucy filtering [J70] to find the likelihoods of
these models and then employ the NPEM1 and EM1+IP methods to find the maximum
likelihood estimate of the unknown distribution.

In the NPB approach, the distribution  is itself a random variable. The first requirement is to
place a prior distribution on .  A convenient way of doing this is with the so-called Dirichlet
process (DP) prior, see [F73, WW97, WW98].  A Dirichlet process prior on a random
distribution  is defined by a probability distribution  and a scale parameter , where  is
our prior guess of and the parameter  will be the strength of the guess. We write

for the Dirichlet process.

The main reason we consider the NPB approach in the PK Population Problem is its
ability to calculate Bayesian credibility intervals, no matter what the sample size is. This is not
possible with NPML in general; however, it is possible to get bootstrapped confidence intervals
for NPML that are somewhat similar to Bayesian credibility intervals. In both cases we get
approximations of the accuracy of the estimates. Even though both NPML and NPB give us estimates
of accuracy, the Bayesian nature of the estimated population distribution  is preferable.

One of the goals in the Bayesian case is to determine , i.e., the conditional expectation
of  given the data. The distribution  takes the place of  in drug dosage design.
One important result for pharmacokinetic applications is that, like , the distribution
is also discrete.  Of course, the number of support points is no longer less than N, but
that makes  more robust.

Most methods that find  employ a marginal approach:  it turns out that  can be
“integrated out” leaving a much simpler problem for the .  The resulting model for the  is
given by the Polya Urn representation:

The marginal approach leads to a Gibbs sampler algorithm for estimating  i.e., the
expected value of given the data – not . This approach is used by practically all
researchers, and is sufficient for the problems they consider. However, it does not solve the
classical NPB population analysis problem, i.e. to estimate the full conditional distribution of
.

To solve this problem we propose the Full Conditional Method.  In this method we estimate
the full conditional distribution of  given the data , using Stick-Breaking priors, and show
some examples of how to build the credibility intervals.

In very broad terms, the major results we achieved in this thesis are as follows:

Result 1. To use the NPML approach to find an estimate of  given the data; to program
algorithms based on the EM methods to determine , the NPML estimate of ; to
illustrate the results on different data sets; and to show how to build bootstrap
confidence intervals for the moments of .

Result 2. To use a combination of the EM and Interior Point methods to find an estimate of  
given the data and to program the EM1+IP algorithm that is more efficient for high dimensional
problems than any other methods that find .

Result 3. To use NPEM1 and EM1+IP methods to find  of the stochastic population models,
i.e. with process noise.

Result 4. To use the NPB approach to find an estimate of  by placing a DP prior on it and
determining the posterior distribution of ; to program two algorithms, based on Gibbs
Sampling methods, to determine , the posterior expected distribution of  given the
data; and, based on Stick-Breaking priors (the representation of the DP prior), to determine
 and build credibility intervals for the moments of the distribution .

Result 5.  To use Stick Breaking priors to find and thus to establish the connection between
NPB and NPML.  
Chapter 2. Nonparametric Maximum Likelihood (NPML) Estimation  

Consider the following statistical model:
,                                                        (2.1)
 

                                                                                        (2.2)
where    

 is a vector of independent but not necessarily identically distributed measurements with
known   density  
 is a vector of  unknown parameters defined on a space in finite dimensional
Euclidean space ( )

is an unknown probability distribution on (the Borel subsets of) .  
 
In general  will mean that or  has distribution .

Assume that given  the  are independent and identically distributed random vectors with
common (but unknown) probability distribution on .
Our main problem is to estimate  based on the data


Nonparametric Maximum Likelihood (NPML):  

The log likelihood function of can be written as follows
                                           (2.3)
where
                                              (2.4)
Then  is a maximum likelihood estimate of  if it solves the following optimization
problem
                                          (2.5)
 
Convex optimization framework:
Notice that the maximum of the likelihood function depends only on the values of the mixture
likelihood vector  
                                         (2.6)
Therefore we can reformulate the problem of finding  that maximizes  as finding
which mixture likelihood vector  gives the largest value to the likelihood. It can be done in
three steps:
1. Construct the set of feasible solutions in :  

2. Construct the objective function  that has to be maximized over  
3. If  is the element of  that maximizes the objective function , then it is
straightforward to obtain such that .

We start with the construction of the set of feasible solutions .
Let us first define the vector of atomic probability densities  
                                           (2.7)
such that  
                                                     (2.8)

It follows from fundamental results of convex geometry that if we define
                                                         (2.9)
and if  is compact then  

where is the convex hull of . This representation of  allows us to think about the
elements of as linear combinations of the elements of .  
Now we present the main theorem of Lindsay [L83] that proves the existence and uniqueness of
the solution to the problem of maximizing the objective function  over the set .

Theorem 1: Assume  is compact and  contains at least one point with positive
likelihood. Then there exists a unique vector  that maximizes  over the set .

Moreover it follows by the Caratheodory’s theorem that  can be represented as  where
has at most  points of support.

Therefore we can conclude that  can be found in the class of discrete distributions with at
most  support points, which we write as:  
                                 (2.10)
where are the support points in  and are the weights such
that .

The term  represents the delta distribution on  with the defining property
for any continuous function  or, more generally,
.   This property of transforming integration into summation is
crucial in the pharmacokinetic application of dosage regimen design.
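For example, writing the discrete estimate in (2.10) as F^{ML}, any population expectation collapses to a finite sum:

\[ \int g(\theta)\, dF^{ML}(\theta) \;=\; \sum_{k=1}^{K} w_k\, g(\theta_k), \]

which is exactly what makes dosage-regimen calculations tractable.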


Thus the likelihood maximization problem can now be written as  
                                                                         
(2.11)

     

However the above problem of maximizing the likelihood function is still very difficult if
approached directly because of high dimensionality of the parameters. That is why we use an
iterative method based on the Expectation-Maximization algorithm, see [S91], which we present
next, that determines the global maximum of the likelihood function.      

Notes: 1) In the pharmacokinetic world, the father of Nonparametric Maximum Likelihood is
Alain Mallet. Although Mallet’s paper [M86] appeared after Lindsay’s paper [L83], Mallet’s
1983 PhD Dissertation (in French) proved the main results of the Lindsay paper.
 
2)  In [S91], there were two algorithms described for solving the optimization problem in Eq.
(2.11). One of these is the basis for the method used by the USC Laboratory of Applied
Pharmacokinetics see [LJSG01].  The other one is the algorithm described in this Thesis.  It has
never been implemented before now.

In this thesis we will define and implement four different algorithms for finding , called
NPEM1, NPEM2, EM1+IP and EM2+IP.  We describe each separately.

Nonparametric Expectation-Maximization  (NPEM1) algorithm for solving optimization
problem (2.11):
Step 1. Initiate:  
Step 2. E-step: compute  
Step 3. M-step: find  that maximizes  
Step 4. If , where  is a very small number, then stop. Otherwise, go to
step 2.
Where  that maximizes  can be found as follows:
                    (2.12)    
                                                                (2.13)
where
                           (2.14)
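The update formulas (2.12)-(2.14) are not legible in this transcript. In the standard notation of [S91] they take the following form (a reconstruction, with p_{ik} denoting the conditional probability that subject i is associated with support point k):

\[ p_{ik} = \frac{w_k\, f(y_i \mid \theta_k)}{\sum_{l=1}^{K} w_l\, f(y_i \mid \theta_l)}, \qquad w_k^{new} = \frac{1}{N}\sum_{i=1}^{N} p_{ik}, \qquad \theta_k^{new} = \arg\max_{\theta}\; \sum_{i=1}^{N} p_{ik}\, \log f(y_i \mid \theta). \]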
The derivations of these formulas together with the proof of the fact that EM algorithm never
decreases the value of the likelihood function at each iteration can be found in [S91]. However, it
is well known that in general, the EM algorithm does not always converge to a global maximum.
Therefore we used so-called directional derivatives to check if the distribution that we found
using our iterative scheme is in fact the global maximum of the likelihood function.
Define  
                                                  (2.15)

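The right-hand side of (2.15) is likewise not legible here; in Lindsay's standard notation the directional derivative reads (a reconstruction under that assumption)

\[ D(\theta, F) \;=\; \sum_{i=1}^{N} \frac{f(y_i \mid \theta)}{\int f(y_i \mid \varphi)\, dF(\varphi)} \;-\; N. \]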
Theorem 2 (Lindsay [L83]):
is the distribution that maximizes  if and only if

Moreover the support of is contained in the set .
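To make the iteration concrete, the following MATLAB sketch implements one way of organizing NPEM1 together with the directional-derivative check of Theorem 2. The function name, the argument list, and the use of fminsearch as the M-step maximizer are illustrative assumptions; this is not the author's Run_NPEM code.

% npem1_sketch.m -- minimal sketch of the NPEM1 iteration (illustration only).
% lik(y, t) must return the column vector of f(y_i | theta = t) for scalar t.
function [theta, w, L, D] = npem1_sketch(y, lik, theta, w, tol)
N = numel(y);  K = numel(theta);  L = -Inf;
while true
    % E-step: responsibilities p(i,k) proportional to w_k * f(y_i | theta_k)
    Psi = zeros(N, K);
    for k = 1:K
        Psi(:,k) = lik(y(:), theta(k));
    end
    mix = Psi * w(:);                     % mixture likelihood p(y_i | F)
    P   = (Psi .* w(:)') ./ mix;          % N-by-K responsibility matrix
    % M-step: update the weights, then each support point separately
    w = mean(P, 1);
    for k = 1:K
        obj = @(t) -P(:,k)' * log(lik(y(:), t));     % weighted log-likelihood
        theta(k) = fminsearch(obj, theta(k));
    end
    Lnew = sum(log(mix));
    if abs(Lnew - L) < tol, L = Lnew; break; end
    L = Lnew;
end
% Expose the directional derivative of Theorem 2: D(t) <= 0 for every t
% exactly when the returned discrete distribution is the global maximizer.
for k = 1:K, Psi(:,k) = lik(y(:), theta(k)); end
mix = Psi * w(:);
D = @(t) sum(lik(y(:), t) ./ mix) - N;
end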

Nonparametric Expectation-Maximization 2 (NPEM2) algorithm for solving optimization
problem (2.11) with fixed support points:
In this case a (usually large) set of points  is fixed.  The corresponding
optimization problem is P2:  


           
       

NPEM2 Algorithm
Step 1. Initiate:  
Step 2. E-step: compute  
Step 3. M-step: find  that maximizes  
Step 4. If , where  is a very small number, then stop. Otherwise, go to
step 2.
It follows:
that maximizes  is given by:
                                                        (2.16)
where
                               (2.17)
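Since (2.16)-(2.17) are also not legible in this transcript, we record the standard form of the fixed-grid weight update (a reconstruction in the notation of [S91]):

\[ w_k^{new} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{w_k\, f(y_i \mid \theta_k)}{\sum_{l=1}^{K} w_l\, f(y_i \mid \theta_l)}, \qquad k = 1,\dots,K. \]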
The derivations of these formulas, together with the proof that the NPEM2 algorithm
increases the value of the likelihood function at each iteration, can be found in [S91]. Again,
since in general the EM algorithm does not always converge to a global maximum, we used the
directional derivative of Eq. (2.15) to check if the distribution that we found using our iterative
scheme is in fact the global maximum of the likelihood function.  On the other hand, because of
the convexity of the optimization problem as a function of , any limit point of the NPEM2
algorithm is a global maximum.  This is discussed below.

Computational Aspects of NPEM1  
1. NPEM1 Algorithm Vector-Matrix notation  

E-M Steps
Support Point Update
1)  
 
2)                                 (2.18)

Eq. (2.18) is just a nonlinear least squares problem and straightforward to solve.
When Y is a linear (or affine) function of , then the optimization problem in Eq. (2.18) can be
solved analytically.  Using  


Now taking transpose of both sides:

3)                                                                 (2.19)
If    is independent of  then    
4)                                                             (2.20)

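Equations (2.19)-(2.20) did not survive extraction. Assuming the affine measurement model Y_i = A_i\theta + e_i with e_i \sim N(0,\Sigma_i) and \Sigma_i independent of \theta, the M-step maximizer has the familiar weighted least-squares form (a reconstruction under that assumption):

\[ \theta_k^{new} \;=\; \Bigl(\sum_{i=1}^{N} p_{ik}\, A_i^{T}\Sigma_i^{-1}A_i\Bigr)^{-1} \sum_{i=1}^{N} p_{ik}\, A_i^{T}\Sigma_i^{-1}\, Y_i . \]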
Convergence of EM Algorithms and Convex Optimization

The PhD thesis of Baek [B06], written under the supervision of James Burke of the University of
Washington, gives a nice connection between the EM algorithms and convexity theory.  Since
Baek’s thesis is not readily accessible, we outline the main points here.  Also from these results
we get a global convergence result for the NPEM2 algorithm.
The general optimization problem is
P1:                                                                                            
   
NPEM1 Algorithm
             

In this case and the NPEM1 algorithm optimizes over both
.
From Eqs. (2.12)-(2.13), the computational steps are then
             

                                                                         
where
                   
and and let  vector of weights (probabilities) such
that .
The optimization problem P1 is a constrained problem, convex in , but not necessarily convex
in .  Including the constraints in the optimization of P1 gives the Lagrangian

function with multipliers and :
                                           
(2.21)
.
Then the Karush-Kuhn-Tucker (KKT) conditions [BV04] are:

where  
 

Then KKT2)-KKT4) imply
KKT5)    
Further, the solution

to KKT1) can be obtained by solving the optimization problem  
KKT6)    

Now compare eqs. (2.12)-(2.13) to KKT5)-KKT6).  We see that a limit point of the NPEM1
algorithm is a KKT point of problem P1 .  Therefore the KKT condition is a necessary condition
for the limit points of NPEM1. This gives a connection between the NPEM1 algorithm and the
NPAG algorithm.

NPEM2 Algorithm

In this case a (usually large) grid of points  is fixed.  The corresponding
optimization problem is  
P2:  



The NPEM2 algorithm now optimizes over only . Of course, by the Lindsay and
Mallet theorems, only at most points in  will eventually have non-zero probability.  
The NPEM2 algorithm is given in eqs. (2.16)-(2.17).  The computational steps are then
                                                             
where
                   
The optimization problem P2 is now a constrained convex problem.  Including the constraints in
the optimization of P2 gives the Lagrangian function with multipliers
and :
                  (2.22)

Now the Karush-Kuhn-Tucker (KKT) conditions are:

where  
 
Then KKT2b)-KKT4b) imply
KKT5b)


Now compare eqs. (2.16)-(2.17) to this last equation. We again see that a limit point of the
NPEM2 algorithm is a KKT point of problem P2. Therefore the KKT condition is a necessary
condition for the limit points of NPEM2. However, since P2 is convex and differentiable
[KT51], the KKT condition is also a sufficient condition for the limit points of NPEM2 to be a
global optimum of P2.  That is, for a convex problem P2, a solution is a global optimum of P2
if and only if the KKT condition is satisfied. This proves the following theorem.

Theorem 3: Let be a limit point of NPEM2.  Then is a global optimum of P2.  

Remark: In general, the limit points of an EM algorithm are only stationary points.

The Primal-Dual Interior Point method for optimizing eq. (2.13) with fixed support points
First consider the NPEM2 algorithm with fixed support points .
      (2.23)
with respect to is a convex optimization problem.  And although the update given in
NPEM2:
           
is likelihood improving, it can be exceptionally slow to converge.  The idea of Robert Leary
(now at the Pharsight Corporation) and James Burke (at the University of Washington) was to
replace this update of by an optimization method consistent with modern convexity theory.  
Namely, optimize eq. (2.13) by the Primal-Dual Interior Point method.
Primal Dual Interior Point Method

We now describe the primal dual interior point method as it applies to our NPML problem.
The actual algorithm we use for this step is a Matlab program written by Bradley Bell of the
University of Washington, see Bell [B12].  Since this algorithm is written in computer code, we
give a mathematical description in what follows.

First we derive the Karush-Kuhn-Tucker (KKT) conditions for our NPML optimization problem.
These conditions are essential for our Primal Dual Interior Point program. This material is
standard in Convex Optimization theory so no proofs will be given.  An excellent reference is the
book by Boyd and Vandenberghe [BV04].  

We first describe the KKT conditions in general setting. Consider the primal problem
P1:  



The Lagragian for P1 is then


and the Lagrangian dual function is:
 
The subsequent dual problem is
D1:  
Important properties are:
• The dual problem is always convex
• The primal and dual optimal values  satisfy  
• Slater’s condition: If there is an  
• Duality gap is the quantity  and

 
Note: we use this last property in the Primal Dual Interior Point algorithm as a stopping
condition.
Corresponding to the problems P1 and D1 are the KKT conditions:
         (Stationarity)
                              (Primal feasibility)
                               (Dual feasibility)
                  (Complementarity)
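The formulas for these four conditions did not survive extraction. For reference, in the generic setting of [BV04] (minimize f_0(x) subject to f_i(x) \le 0, i = 1,\dots,m, and Ax = b) they read:

\[ \nabla f_0(x^\star) + \sum_{i=1}^{m}\lambda_i^\star\,\nabla f_i(x^\star) + A^{T}\nu^\star = 0 \qquad \text{(Stationarity)} \]
\[ f_i(x^\star) \le 0, \quad Ax^\star = b \qquad \text{(Primal feasibility)} \]
\[ \lambda_i^\star \ge 0 \qquad \text{(Dual feasibility)} \]
\[ \lambda_i^\star\, f_i(x^\star) = 0, \quad i = 1,\dots,m \qquad \text{(Complementarity)} \]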
For proof see [BV04].  The main result is then:
Theorem 4: If   satisfies Slater’s condition, then are primal-dual optimal if
and only if

satisfy the KKT conditions.
For proof see [BV04].  

We now apply the above theory to our NPML problem.  Assume a fixed grid of support points
.  As before the corresponding optimization problem is
P2: maximize: , w.r.t  
      subject to:  
Now let  and define the function

Then the problem P2 now can be written as:  
P: minimize:  
    subject to: { }
where  is the  column vector of all ones
Note: We write P as a minimization problem because  is a linear function of  and  is
a convex function of  and we want to apply the above theory.
First for convenience we include the constraint  in problem P as a penalty in the
objective function. For proof see [B06] ( Prop. 4.2).
P:        
Now writing this problem in the form P1, we have
,  
The Lagrangian and dual Lagrangian function become:


Since

is convex in , a necessary and sufficient conditions for the minimizer of
are the gradient conditions:

Regrouping terms in the Lagrangian  
 
and substituting the above gradient conditions in

gives

The dual problem is then
D:   maximize  −Φ(v)  w.r.t. (y, v):   y ≥ 0,  v ≥ 0,   y + Ψ^T v = N·1_K
And written in minimization form
D:   minimize  Φ(v)  w.r.t. (y, v):   y ≥ 0,  v ≥ 0,   y + Ψ^T v = N·1_K

To get the KKT conditions we need the additional derivatives:

Then from the general KKT conditions we have:
Stationarity:
Primal Feasibility:
Dual Feasibility:
Complementarity:
Using matrix notation these conditions become
 
where we have combined the two conditions  
and where  

Interior Point Method:
Relaxed Problem
Now we want to remove the constraint in the optimization problem P
A perfect version would be

But is not differentiable. So we approximate by for  
As  .  See graph below.


Fig 1: The graph of  for  

The new “relaxed” optimization problem becomes
 
Remember  

Now the KKT conditions are exactly the same for the first two equations and the third equation
becomes .  The resulting relaxed KKT conditions are then

The change of variables:  in KKT makes the first equation become
, while the rest remains the same.  The Interior Point Method applies Newton’s
method to the perturbed nonlinear equations

where  is decreased to zero.
The procedure terminates when  

where  
   
is the distance between the current primal and dual solutions.  
Further .  
Newton’s Method: We now give a brief description of the Newton’s method applied to .
The vector function  is defined by

The  Jacobian of  is then

The Newton step at a point  is obtained by solving the linear equations
                                     (2.24)
After each solution to eq. (2.24), the value of   is reduced at the same rate as the quantity
.  For numerical details see Bell [B12].
The program in this form is equivalent to the first cycle of the USC NPAG program.
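As a rough, generic stand-in for the specialized primal-dual routine of Bell [B12], the same convex weight problem can be handed to MATLAB's built-in interior-point solver. The sketch below is an illustration for experimenting with the fixed-grid weight update, not the NPAG implementation; the function name is made up, and fmincon requires the Optimization Toolbox.

% ip_weights_sketch.m -- illustration only; NOT the primal-dual code of [B12].
% Solves  maximize sum_i log( (Psi*w)_i )  subject to  w >= 0, sum(w) = 1,
% where Psi(i,k) = f(y_i | theta_k), using MATLAB's interior-point algorithm.
function w = ip_weights_sketch(Psi)
K     = size(Psi, 2);
negLL = @(w) -sum(log(Psi * w));            % convex objective to minimize
w0    = ones(K, 1) / K;                     % start at the uniform weights
opts  = optimoptions('fmincon', 'Algorithm', 'interior-point', 'Display', 'off');
w = fmincon(negLL, w0, [], [], ones(1, K), 1, zeros(K, 1), [], [], opts);
end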

Hybrid of the NPEM 1 and the Interior Point Method (EM1+IP):
There is no reason why the primal dual interior point method cannot be adjoined to the algorithm
of NPEM1.  In this case, instead of a very large grid of fixed support points, there are only N
support points at each iteration.  And eq. (2.12) is as before. But the update of the weights in eq.
(2.13) is replaced by the primal dual interior point method.  More precisely, the computational
steps are then

EM1+IP
Step 1.    
Step 2.    
Since Step 1 is likelihood improving by the EM argument of [S91] and Step 2 is likelihood
improving by definition, it follows that Steps 1 and 2 together are likelihood improving.
Remarks:  The EM1+IP algorithm has the potential to be more efficient for high dimensional
problems than the current USC-NPAG algorithm.
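A schematic of the two steps, reusing the illustrative routines sketched earlier in this chapter (an NPEM1-style support-point update followed by the interior-point weight solve); the names and the stopping rule are assumptions, not the original implementation.

% em1_ip_sketch.m -- schematic of the EM1+IP hybrid (illustration only).
function [theta, w, L] = em1_ip_sketch(y, lik, theta, tol)
N = numel(y);  K = numel(theta);  w = ones(K, 1) / K;  L = -Inf;
while true
    Psi = zeros(N, K);
    for k = 1:K, Psi(:,k) = lik(y(:), theta(k)); end
    P = (Psi .* w') ./ (Psi * w);                 % E-step responsibilities
    for k = 1:K                                   % Step 1: EM support-point update
        obj = @(t) -P(:,k)' * log(lik(y(:), t));
        theta(k) = fminsearch(obj, theta(k));
    end
    for k = 1:K, Psi(:,k) = lik(y(:), theta(k)); end
    w = ip_weights_sketch(Psi);                   % Step 2: interior-point weight update
    Lnew = sum(log(Psi * w));
    if abs(Lnew - L) < tol, L = Lnew; break; end
    L = Lnew;
end
end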


Hybrid of the NPEM 2 and the Interior Point Method (EM2+IP):
We also decided to join the primal dual interior point method with the algorithm of NPEM2.  In
this case we take a very large grid of fixed support points as in NPEM2.  But the update of the
weights in eqs. (2.16)-(2.17) is replaced by the primal dual interior point method.  More precisely,
the computational steps are then
EM2+IP
Step 1.    Take a large grid of support points  
Step 2.    
Step 3.    Throw away points with zero probabilities
Since Step 1 is fixed and Step 2 is likelihood improving by definition, it follows that Steps 1 and
2 together are likelihood improving.
Remarks:  The EM2+IP is identical to the first iteration of the current USC-NPAG algorithm.  
Accuracy of NPML estimates
It is known that, in general, it is not possible to build confidence intervals on the unknown
distribution  since it is an infinite-dimensional parameter. However, bootstrap confidence
intervals can be used to obtain asymptotic confidence intervals on any desired quantity of the
NPML estimate  of the distribution . We will show how this can be done in the example
below.
“Bimod” example:
The “bimod” dataset [PG08] consists of 100 points drawn from the bimodal mixture
. The question is to estimate the distribution of the mean of this
data.  
The following model was used in this example.
Consider  and  is the above simulated “bimod” dataset. Then for
 
                                                 (2.24)
which is a density of normal distribution with mean  and standard deviation . We will consider
.  Let  be the number of support points of  and  be the vector of unknowns as it was
defined in (2.11) then
                                                                (2.25)
where

                                                             (2.26)
The updated values of  can be found using (2.12) and (2.13). Thus for any  we
have



To find the value of  that maximizes the above function we will use the formula (2.20) with
:    

Thus if we write it all out we get
                                             (2.27)
and
                                             (2.28)
where as before
                               (2.29)
The analysis was done the following way.
• The  was found using NPEM1 algorithm
• The mean and sigma of was found  
      and  
• The 95 percent bootstrap confidence interval was built using 100 bootstrap samples of the
“bimod” data set and the type of interval is chosen to be the bootstrap bias-corrected and
accelerated percentile method.    
 
Bootstrap bias-corrected and accelerated percentile method (Bbca)
The bias-corrected and accelerated percentile method, suggested by Efron [E87], seeks to
adjust for the bias and skewness of the sampling distribution. It adjusts the percentiles
selected from the bootstrap percentile method to be the endpoints of the confidence
intervals. The adjusted percentiles are

where  denotes the standard normal cumulative distribution function and  
is the th percentile of the standard normal distribution. Here  adjusts for the
skewness of the sampling distribution, and  adjusts for the bias of
the estimator where  is the estimate of  calculated for the bootstrap sample and  is
the number of bootstrap samples used to calculate the confidence interval.  
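The adjusted-percentile formulas themselves are not legible in this transcript. In Efron's [E87] standard statement (a reconstruction), for a nominal 100(1-2\alpha)% interval the endpoints are the \alpha_1- and \alpha_2-quantiles of the bootstrap distribution, where

\[ \alpha_1 = \Phi\!\Bigl(\hat z_0 + \frac{\hat z_0 + z_{\alpha}}{1-\hat a\,(\hat z_0 + z_{\alpha})}\Bigr), \qquad \alpha_2 = \Phi\!\Bigl(\hat z_0 + \frac{\hat z_0 + z_{1-\alpha}}{1-\hat a\,(\hat z_0 + z_{1-\alpha})}\Bigr), \qquad \hat z_0 = \Phi^{-1}\!\Bigl(\tfrac{1}{B}\,\#\{\hat\theta^{*}_{b} < \hat\theta\}\Bigr), \]

with \hat a the acceleration constant, commonly estimated by the jackknife.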
• This process was repeated 100 times to check if we get approximately 5 of the confidence
intervals to miss the true values of the mean and sigma.
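A schematic of one replication of this analysis, using the illustrative npem1_sketch routine from earlier in this chapter. The mixture used to simulate a 'bimod'-style sample, the N(theta, 1) kernel, the 10-point starting grid, and the call to bootci (Statistics Toolbox) are assumptions made for the sketch; they are not necessarily the settings used to produce Table 1.

% bimod_bootstrap_sketch.m -- one bootstrap replication of the analysis above
% (illustration only; mixture parameters, kernel and grid are assumed here).
y   = [randn(50,1) - 1.5; randn(50,1) + 1.5];           % stand-in 'bimod' sample
lik = @(yy, t) exp(-(yy - t).^2 / 2) / sqrt(2*pi);       % N(theta, 1) kernel
ci  = bootci(100, {@(yy) npml_mean(yy, lik), y}, 'type', 'bca')   % 95% BCa interval

function m = npml_mean(yy, lik)
% fit the NPML estimate to the (re)sampled data and return its mean
K = 10;
[theta, w] = npem1_sketch(yy, lik, linspace(min(yy), max(yy), K), ones(1, K)/K, 1e-6);
m = theta(:)' * w(:);
end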
The results are shown in the following table.
True value # of intervals that contain the true value

93
91

Table 1: This table shows the number of 95 percent bootstrapped confidence intervals, out of
100, that contain the true value of the expected value and the standard deviation for the ‘bimod’
data example.
Chapter 3. Computer Simulations for NPML  

Lindsay simple example
This example is given in Lindsay [L83, Ex. 2]. It is a useful example because of its simplicity
and the fact that the optimal solution is known analytically.
Suppose that  where  and for some  
                                                               (3.1)
Assuming that  is the number of support points of  and that  is defined as in (2.11) we
have
                             (3.2)
                                                               
Then for further simplification we write
                                                       (3.4)
Then by (2.12) and (2.13) for some   the updated values of  are
                      (3.5)
and
                                                                 (3.6)
where
                                        (3.7)

It was shown in Lindsay[L83] that the maximum likelihood estimate of  has two support
points at  with weight   and at   with weight  . Notice that the number
of support points is the same as the size of the given data that is .  

Consider now the NPEM1 algorithm described above. It was implemented using MATLAB. The
program Run_NPEM_Ex6_1.m works in the following way:
1. Type the name of the file in the command window and press return. It gives the output as:
>> Run_NPEM_Ex6_1
tolerance =

2. Input the tolerance level, which we denote by  and press return.  The
result is                          
>> Run_NPEM_Ex6_1
           tolerance = 0.00001
theta_0 =
3. Input the initial value for  = [2, 5] and press return.
>> Run_NPEM_Ex6_1
tolerance = 0.00001
theta_0 = [2,  5]
w_0 =
4. Input the initial value for  and press return. Then the output is
>> Run_NPEM_Ex6_1
tolerance = 0.00001
theta_0 =[2,  5]
w_0 =[0.5000,  0.5000]
count =
   13
theta =
   1.0001    4.0006
w =
   0.3333    0.6667
LogLikelihood =
  -1.7919
x =
   4.0000
max_value =
 2.8395e-004

where count is the number of iterations it takes the algorithm to converge; theta and w are the
maximum likelihood estimates of the true values of the vectors  and ; LogLikelihood is the
value of the log-likelihood function evaluated at the resulting values of theta and w; x is the
value of  that maximizes  and max_value is the value of . We can see that the
value of the maximum is within acceptable numerical error from zero. Run_NPEM_Ex6_1.m
also gives the graph of the function   which is shown in Fig.1 below:

Fig.1: The graph of the D function for Lindsay simple example.

We can see that  is non-positive for all values of  and, moreover, the set of values for
which it is equal to zero contains the points 1.0001 and 4.0006; therefore, by Theorem 3, we can
conclude that the algorithm converges to a global maximum and the resulting distribution is the
maximum likelihood estimate of the true distribution of .

Fig.2: The graph of the . Lindsay simple example.

If we compare the results we got with the solution presented by Lindsay we can see that they are
essentially the same up to some rounding error.  











Galaxy Data
We will use the Galaxy Data to illustrate the NPEM1 algorithm. These data were published by
Roeder (1990) and represent the recession velocities, in units of 10^3 km/s, of 82 galaxies from
six well-separated conic sections of space. The error in the calculations is said to be less than
0.05 units.
This problem is a benchmark for practically all density estimation programs, see [EW95].  
The histogram of the data is given below:


Recession velocities in 1000 km/s of 82 galaxies  
9.172   9.350   9.483   9.558   9.775   10.227 10.406 16.084 16.170 18.419 18.552 18.600 18.927
19.052 19.070 19.330 19.343 19.343 19.440 19.473 19.529 19.541 19.547 19.663 19.846 19.856
19.863 19.914 19.918 19.973 19.989 20.166 20.175 20.179 20.196 20.215 20.221 20.415 20.629
20.795 20.821 20.846 20.875 20.986 21.137 21.492 21.701 21.814 21.921 21.960 22.185 22.209
22.242 22.249 22.314 22.374 22.495 22.746 22.747 22.888 22.914 23.206 23.241 23.263 23.484
23.538 23.542 23.666 23.706 23.711 24.129 24.285 24.289 24.366 24.717 24.990 25.633 26.960
26.995 32.065 32.789 34.279

Table 2: Galaxy data
The distribution of velocities could answer the question about clumping of galaxies: if
galaxies are clumped, then the distribution of velocities is multi-modal, and if they are not
clumped, then the distribution would increase at the beginning and then gradually tail off.

In terms of our notation the model can be described as follows. Consider  and ’s from
the above table then for  
                                                 (3.8)
which is  a density of normal distribution with mean  and standard deviation . We will
consider .  Let  be the number of support points of  and  be the vector of
unknowns as it was defined in (2.11) then
                                                                (3.9)
where

                                                             (3.10)
The updated values of  can be found using (2.12) and (2.13). Thus for any  we
have



To find the value of  that maximizes the above function we will use the formula (2.20) with
:    

Thus if we write it all out we get
                                             (3.11)
and

                                             (3.12)
where as before
                               (3.13)
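Before walking through the Run_NPEM_Galaxy.m output, here is an illustrative way to reproduce a K = 3 run with the npem1_sketch routine from Chapter 2. The value sigma = 1 and the evenly spaced starting grid are assumptions made for the sketch; the settings actually used by Run_NPEM_Galaxy.m are not legible in this transcript.

% galaxy_npem_sketch.m -- illustrative K = 3 run on the Galaxy data of Table 2.
v = [ 9.172  9.350  9.483  9.558  9.775 10.227 10.406 16.084 16.170 18.419 18.552 18.600 18.927 ...
     19.052 19.070 19.330 19.343 19.343 19.440 19.473 19.529 19.541 19.547 19.663 19.846 19.856 ...
     19.863 19.914 19.918 19.973 19.989 20.166 20.175 20.179 20.196 20.215 20.221 20.415 20.629 ...
     20.795 20.821 20.846 20.875 20.986 21.137 21.492 21.701 21.814 21.921 21.960 22.185 22.209 ...
     22.242 22.249 22.314 22.374 22.495 22.746 22.747 22.888 22.914 23.206 23.241 23.263 23.484 ...
     23.538 23.542 23.666 23.706 23.711 24.129 24.285 24.289 24.366 24.717 24.990 25.633 26.960 ...
     26.995 32.065 32.789 34.279]';                      % 82 recession velocities (10^3 km/s)
sigma = 1;                                               % assumed kernel width
lik   = @(yy, t) exp(-(yy - t).^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
K     = 3;
[theta, w, L, D] = npem1_sketch(v, lik, linspace(min(v), max(v), K), ones(1, K)/K, 1e-6);
tgrid = linspace(5, 40, 400);                            % plot D to check optimality
plot(tgrid, arrayfun(D, tgrid));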
The program Run_NPEM_Galaxy.m uses the NPEM1 algorithm described above to find the
MLE of the true distribution of . Here we will illustrate the results we got together with the
description of the way the program works.  
1. Type Run_NPEM_Galaxy in the command window and press return. The result is
>> Run_NPEM_Galaxy
number of support points =
2. Input the number of support points, say K = 3 and press return. The output is
>> Run_NPEM_Galaxy
number of support points = 3
count =
   13
theta =
   9.7502   21.4032   32.9443
w =
   0.0859    0.8769    0.0372
LogLikelihood =
-212.6803
max_value =
  29.5478
where, similarly to the previous example, count is the number of iterations it takes the
algorithm to converge, theta and w are the maximum likelihood estimates of the true values of
the vectors  and , LogLikelihood is the value of the log-likelihood function evaluated at the
resulting values of theta and w, and max_value is the maximum value of the function .
If we compare the results we got with those presented in the paper by Murray Aitkin, we can
see that in the case of three support points and  the values are the same. However, if we look at
the graph of the function , which is also produced by Run_NPEM_Galaxy.m and is
presented in Fig. 3 below,

Fig. 3: The graph of the D function for Galaxy Data for K=3.
we can see that it takes positive values and attains its maximum 29.5478 at a point ;
therefore, by Theorem 3, we can conclude that the resulting distribution is not an MLE of the true
distribution of  if . Run_NPEM_Galaxy.m also produces Fig. 4, which represents the
resulting discrete distribution of  and is presented below.

Fig. 4: The graph of the discrete distribution of   for K=3 shows 3 support points. Galaxy data
example (The height of each vertical line is the corresponding probability on the support point)

Moreover, if we run the program for  and the same , the resulting values are:
theta =
 Columns 1 through 8
   9.7420    9.7420    9.7420    9.7420    9.7420    9.7420    9.7420    9.7420
 Columns 9 through 16
   9.7420    9.7420    9.7420    9.7420    9.7420    9.7420    9.7420   21.1342
 Columns 17 through 24
  21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342
 Columns 25 through 32
  21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342
 Columns 33 through 40
  21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342
 Columns 41 through 48
  21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342   21.1342
 Columns 49 through 56
  21.1342   21.1342   21.1342   23.9844   23.9844   23.9844   23.9844   23.9844
 Columns 57 through 64
  23.9844   23.9844   23.9844   23.9844   23.9844   23.9844   23.9844   23.9844
 Columns 65 through 72
  23.9844   23.9844   23.9844   32.9931   32.9931   32.9931   32.9931   32.9931
 Columns 73 through 80
  32.9931   32.9931   32.9931   32.9931   32.9931   32.9931   32.9931   32.9931
 Columns 81 through 82
  32.9931   32.9931
w =
 Columns 1 through 8
   0.0088    0.0089    0.0088    0.0086    0.0084    0.0075    0.0070    0.0062
 Columns 9 through 16
   0.0053    0.0045    0.0039    0.0035    0.0026    0.0013    0.0004    0.0000
 Columns 17 through 24
   0.0004    0.0005    0.0013    0.0026    0.0028    0.0041    0.0055    0.0071
 Columns 25 through 32
   0.0098    0.0112    0.0134    0.0155    0.0203    0.0224    0.0268    0.0283
 Columns 33 through 40
   0.0310    0.0346    0.0361    0.0375    0.0397    0.0407    0.0407    0.0408
 Columns 41 through 48
   0.0403    0.0386    0.0374    0.0349    0.0332    0.0308    0.0283    0.0253
 Columns 49 through 56
   0.0205    0.0184    0.0135    0.0110    0.0117    0.0110    0.0092    0.0080
 Columns 57 through 64
   0.0071    0.0061    0.0046    0.0038    0.0032    0.0023    0.0018    0.0015
 Columns 65 through 72
   0.0010    0.0006    0.0002    0.0006    0.0009    0.0016    0.0018    0.0021
 Columns 73 through 80
   0.0025    0.0027    0.0028    0.0030    0.0032    0.0032    0.0032    0.0032
 Columns 81 through 82
   0.0031    0.0030
LogLikelihood =
-212.1969
Then we can see that the value of the log-likelihood function has increased and the resulting
discrete distribution of  which is presented below shows that we have 4 different support points

Fig. 5: The graph of the discrete distribution of  for K= 82 shows 4 support points. Galaxy data

Indeed if we run the program for  the results are
>> Run_NPEM_Galaxy
number of support points = 4
count =
  972
theta =
   9.7420   21.1342   23.9844   32.9931
w =
   0.0858    0.7941    0.0832    0.0369
LogLikelihood =
-212.1969
max_value =
 1.1422e-006

We can see that the maximum value of the function  is very close to zero and the graph,
which is presented below, shows that the support points are among those for which .
Therefore, by Theorem 3, we can say that the distribution we found is the optimal one.

Fig. 6: The graph of the D function for Galaxy Data with 3 support points. Galaxy data

We also present the graph of the resulting MLE of the true distribution of ,  

Fig. 7: The graph of the discrete distribution of  for K= 4 shows 4 support points. Galaxy data















Pharmacokinetics (PK) example with multiplicative measurement noise
We will apply the NPEM1 algorithm to simulated data which represent a series of
measurements of the drug concentration in the blood of each patient in the group. The drug
concentration  in the organism  hours after administration of the initial dose  can be
described by the following exponential decay function
                                                          (3.14)                                    
where   is the elimination rate constant and  is the volume of distribution.
Suppose that we have  subjects (patients) and they were administered an initial dose  of
some drug. Then we take a series of  measurements of the drug concentration in the blood of
each subject. These measurements can be described by the following  PK model:
                                       (3.15)
where   is the drug concentration for the th subject at time ,  is the volume of distribution
and  is the elimination rate constant for the th subject. The measurement error term  is
assumed to be standard normally distributed and  denotes a possibly unknown scale factor.
The unknown parameters are . We assume that they are
independent and identically distributed given their common but unknown distribution
function .  
Therefore using our earlier notation let  and the objective is to estimate the
distribution given the data .
While illustrating the work of the algorithm on these data we will consider two different
approaches: linear and nonlinear.
Linear approach
Case 1. For simplicity we first assume sigma and the V_i's to be known; therefore in this case theta_i is one-dimensional for each subject, i.e. theta_i = K_i. In order to "linearize" the model we take the log of both sides of equation (3.15); then we get
    log y_{ij} = log(D/V_i) - K_i t_{ij} + log(1 + sigma e_{ij})                        (3.16)
Notice that for small enough values of sigma e_{ij}, log(1 + sigma e_{ij}) is approximately sigma e_{ij}; therefore we have
    log y_{ij} ~ log(D/V_i) - K_i t_{ij} + sigma e_{ij}
and since D and V_i are known constants we can write
    z_{ij} = -K_i t_{ij} + sigma e_{ij},   where z_{ij} = log y_{ij} - log(D/V_i)       (3.17)
If we use our earlier notation we have that the conditional density of z_{ij} given K_i is
    p(z_{ij} | K_i) = (1/(sigma sqrt(2 pi))) exp( -(z_{ij} + K_i t_{ij})^2 / (2 sigma^2) )      (3.18)
Then if we denote z_i = (z_{i1}, ..., z_{iM}) to be the vector of transformed measurements of the ith subject, then by the independence of the e_{ij}'s the conditional density of z_i given K_i is
    p(z_i | K_i) = prod_{j=1}^{M} p(z_{ij} | K_i)                                               (3.19)
If we now consider K to be the number of support points and (theta, w) to be the vector of support points and their corresponding weights as in (2.11), then the conditional density of z_i given (theta, w) is
    p(z_i | theta, w) = sum_{k=1}^{K} w_k p(z_i | theta_k)                                      (3.20)
and the log-likelihood function is
    L(theta, w) = sum_{i=1}^{N} log ( sum_{k=1}^{K} w_k p(z_i | theta_k) )                      (3.21)

Then the updated values of the support points and weights according to (2.21) and (2.13) are



As for the Galaxy data we can find this maximum explicitly by solving the following equation
for  



thus
                                          (3.22)
and

                                                           (3.23)
where
                           (3.24)
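To make the Case 1 iteration concrete, the following MATLAB sketch carries out one EM step in the standard finite-mixture form assumed for the updates above; it is not the thesis code, and the variable names are illustrative (in particular the sampling times t are assumed common to all subjects).

function [theta, w, loglik] = npem1_linear_step(z, t, sigma, theta, w)
    % One EM iteration for the linearized one-dimensional case (theta = K).
    % z     : N-by-M matrix of transformed measurements z_ij = log y_ij - log(D/V)
    % t     : 1-by-M vector of sampling times
    % theta : 1-by-K current support points,  w : 1-by-K current weights
    [N, ~] = size(z);
    K = numel(theta);
    logp = zeros(N, K);                                   % log p(z_i | theta_k), constants omitted
    for k = 1:K
        resid = z + theta(k) * repmat(t, N, 1);           % z_ij + K_k * t_j
        logp(:, k) = -0.5 * sum(resid.^2, 2) / sigma^2;
    end
    lik    = exp(logp) .* repmat(w(:)', N, 1);
    mix    = sum(lik, 2);                                 % p(z_i | theta, w)
    loglik = sum(log(mix));
    gam    = lik ./ repmat(mix, 1, K);                    % E-step conditional probabilities
    w      = mean(gam, 1);                                % weight update
    for k = 1:K                                           % explicit weighted least-squares argmax
        theta(k) = -(gam(:, k)' * (z * t')) / (sum(gam(:, k)) * sum(t.^2));
    end
end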

Case 2. Let us now consider the V_i's to be unknown; therefore in this case the theta_i's are two-dimensional, denoted theta_i = (K_i, V_i). Then the "linearized" PK model can be written as
      where  and                     (3.25)
Therefore for some   conditional density of  given  is

                                  (3.26)
Then similarly if we denote  to be a vector of all measurements for th subject, the conditional
density of  given  is
                                                     (3.27)
                 
Assuming  is the number of support points the conditional density of  given  and log-
likelihood function are
                                                   (3.28)


Also the updated values of  according to (2.12) and (2.13) are



Then the updated values of the support points (K_k, V_k) can be found from equation (2.20) if we write the model (3.25) in matrix notation:





 




Therefore if we write it all out, for all  we have
(3.29)
and  
                                                                                        (3.30)
where
                   (3.31)

Nonlinear approach
Let us now consider a PK model as it was defined at the beginning  
                                                             (3.32)
Case 3. For simplicity, as before, we first assume  and  to be known and unknown parameter
to be one dimensional, i.e. for each , . Therefore for some  if
we denote  then the conditional density of  given  is
                                                                     (3.33)
Then by the independence of the e_{ij}'s the conditional density of y_i given theta is

Letting K be the number of support points, as earlier, the conditional density of y_i given (theta, w) and the log-likelihood function are

Then the updated values of  according to (2.12) and (2.13) are
                                            (3.34)

where .
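Since the argmax in (3.34) has no closed form, it can be computed with a one-dimensional numerical search. The following is a minimal MATLAB sketch, assuming the multiplicative-noise density of this section; the names are illustrative, and gam holds the E-step conditional probabilities.

function theta = update_support_nonlinear(y, t, D, V, sigma, gam, theta)
    % Numerical support-point update for the one-dimensional nonlinear case.
    % y   : N-by-M raw concentrations,  t : 1-by-M sampling times
    % gam : N-by-K E-step conditional probabilities
    [N, ~] = size(y);
    f = @(K_) (D / V) * exp(-K_ * t);                     % model prediction, 1-by-M
    for k = 1:numel(theta)
        Qk = @(K_) gam(:, k)' * ( ...
              -0.5 * sum(((y ./ repmat(f(K_), N, 1) - 1) / sigma).^2, 2) ...
              - sum(log(repmat(f(K_), N, 1)), 2) );       % weighted log-likelihood in K
        theta(k) = fminbnd(@(K_) -Qk(K_), 0.05, 1.0);     % search over a plausible range
    end
end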
Case 4. Let us now consider the more general case when the V_i's are assumed to be unknown. Therefore, keeping the same notation as in the linear approach, we let the parameter theta_i = (K_i, V_i) be two-dimensional, where the first and the second components K_i and V_i represent the unknown elimination rate and the volume of distribution, respectively. Then, with this substitution, the formulas we used in Case 3 stay unchanged. Thus if K is the number of support points of the MLE of the true distribution of theta, the updated values of (theta_k, w_k) are
                 (3.35)

where   and .


Data simulation
As we mentioned before to illustrate the NPEM1 algorithm we use the simulated data. Let us
now describe the simulations.  The observations were simulated from the corresponding models
using the following parameters:

In the case when V is known we set it to a fixed value; otherwise it is unknown. The elimination rate constants K_i are simulated either from a single normal distribution or from a mixture of two normal distributions, that is:
     Model 1:  
     Model 2:  , where  (we use )
We will call these two cases unimodal and bimodal, respectively.
The error term e_{ij} is taken from the standard normal distribution and the scale factor sigma is set to a fixed value.
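A minimal MATLAB sketch of this simulation is given below. The thesis' exact parameter values did not survive into this transcript, so the numbers (D, V, sigma, the sampling times and the mixture settings) are placeholders, not the original settings.

% Sketch of the data simulation for the multiplicative-noise PK example.
N = 20; M = 5;                       % subjects and measurements per subject
D = 100; V = 20; sigma = 0.1;        % dose, volume, noise scale (assumed values)
t = 1:M;                             % sampling times (assumed)
bimodal = true;                      % Model 1 (false) or Model 2 (true)
if bimodal                           % mixture of two normals for K_i (assumed means/sds)
    comp = rand(N, 1) < 0.5;
    K = comp .* (0.3 + 0.05*randn(N, 1)) + (1 - comp) .* (0.6 + 0.05*randn(N, 1));
else                                 % single normal for K_i
    K = 0.5 + 0.05*randn(N, 1);
end
y = zeros(N, M);
for i = 1:N
    f = (D / V) * exp(-K(i) * t);              % noise-free concentrations
    y(i, :) = f .* (1 + sigma * randn(1, M));  % multiplicative measurement noise
end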

Algorithm illustration
Linear approach. One-dimensional case. (Case 1)
The program Run_NPEM_linear.m uses the data simulated by Model 1 and Model 2 and the
algorithm described above to find the MLE of the true distribution of . Here we will illustrate
the results we got together with the description of the way the program works. We will start with
the data for the linear case first.
1. Type Run_NPEM_linear in the command window and press return. The result is
>> Run_NPEM_linear
number of support points =
2. Input the number of support points, say K = N = 20 and press return. The output is
>> Run_NPEM_linear
number of support points = 20
dimension of theta =  
3. Input the dimension of theta, we start with one dimensional case
>> Run_NPEM_linear
number of support points = 20
dimension of theta = 1  
             for unimodal model input “1”, for bimodal model input “2”  
4. Input the type of the distribution from where the elimination rate constant is simulated,
we start with the single normal distribution, i.e. we type 1. The output is
>> Run_NPEM_linear
      number of support points =  20
      dimension of theta =  1
      for unimodal model input “1”, for bimodal model input “2”  1
      count =
  627
theta =
 Columns 1 through 7
   0.3588    0.3588    0.3588    0.3588    0.3588    0.3588    0.3588
 Columns 8 through 14
   0.3588    0.5035    0.5035    0.5035    0.5035    0.5362    0.5362
 Columns 15 through 20
   0.6213    0.6213    0.6213    0.6213    0.6213    0.6213
w =
 Columns 1 through 7
   0.0000    0.0000    0.0000    0.0001    0.0009    0.0158    0.0357
 Columns 8 through 14
   0.0478    0.0355    0.1349    0.1928    0.2606    0.1298    0.0893
 Columns 15 through 20
   0.0071    0.0295    0.0158    0.0034    0.0009    0.0000
LogLikelihood =
 -26.4905
D_max_val =
 -7.5292e-04
We can see that the maximum value of the D function is very close to zero, and the graph presented below confirms that the D function is nonpositive everywhere. Moreover, we also present the graph of the MLE of the true distribution of theta, where we can notice 4 distinct points of support, and they are among those points where the D function is equal to zero.

 
Fig 8: The graph of the simulated data for Case 1 and Model 1.  

Fig 9: The graph of the D function for K = 20, Case 1 and Model 1  

Fig 10: The graph of the estimated distribution for K = 20, Case 1 and Model 1; it shows 4 support points

Fig 11: True distribution of the elimination rate, Model 1

Let us now run the program for K = 4. The way it works is absolutely the same; therefore we just
show the results.
>> Run_NPEM_linear
number of support points =  4
dimension of theta =  1
for unimodal model input “1”, for bimodal model input “2”  1
count =
       1327
theta =
   0.3588    0.5035    0.5362    0.6213
w =
   0.1003    0.6238    0.2191    0.0567
LogLikelihood =
 -26.4905
D_max_val =
 -7.5612e-04
Notice that the value of the log-likelihood did not change and the maximum value of the D function is still very close to zero. These results and the figures presented below allow us to conclude, by Theorem 3, that the distribution we found is the MLE of the true distribution of theta.
 

Fig 12: The graph of the D function for K = 4, Case 1 and Model 1

Fig 13: The graph of the estimated distribution for K = 4, Case 1 and Model 1; it shows 4 support points

Let us now run the program for 20 support points and one-dimensional theta, but with the distribution from which the elimination rate constant is simulated now taken to be a mixture of two normal distributions. Since the program works in the same way as shown before, we only show its output.
      >> Run_NPEM_linear
      number of support points =  20
      dimension of theta =  1
      for unimodal model input “1”, for bimodal model input “2”  2
count =
  600
theta =
 Columns 1 through 8
   0.2168    0.2168    0.2168    0.2997    0.2997    0.2997    0.2999    0.3535
 Columns 9 through 16
   0.3535    0.3535    0.5332    0.5332    0.5332    0.5332    0.6705    0.6705
 Columns 17 through 20
   0.6705    0.6705    0.6705    0.6705
w =
 Columns 1 through 8
   0.0044    0.0302    0.0436    0.0414    0.0958    0.1566    0.1791    0.1515
 Columns 9 through 16
   0.0574    0.0395    0.0100    0.0244    0.0340    0.0319    0.0081    0.0275
 Columns 17 through 20
   0.0294    0.0233    0.0101    0.0018
LogLikelihood =
  -2.4748
D_max_val =
 -4.9737e-08
As we expected, since we used K = N = 20, the maximum value of the D function is close to zero, and the following graphs suggest that the MLE of the true distribution of theta might have 5 distinct points of support.

 
Fig 14: The graph of the simulated data for Case 1 and Model 2.  


Fig 15: The graph of the D function for K = 20, Case 1 and Model 2


Fig 16: The graph of the estimated distribution for K = 20, Case 1 and Model 2; it shows 5 support points

Fig 17: True distribution of the elimination rate, Model 2

Let us test this assumption and run our program for K = 5.
      >> Run_NPEM_linear
      number of support points =  5
      dimension of theta =  1
      for unimodal model input “1”, for bimodal model input “2”  2
count =
  461
theta =
   0.2168    0.2998    0.3535    0.5332    0.6705
w =
   0.0783    0.4732    0.2480    0.1003    0.1002
LogLikelihood =
  -2.4748
D_max_val =
 -2.9370e-08
Indeed, we can see that the maximum value of the D function is very close to zero, and the graph, which is presented below, shows that the support points are among those for which the D function equals zero. Therefore, by Theorem 3, we can say that the distribution we found is the MLE of the true distribution of theta.

Fig 18: The graph of the D function for K = 5, Case 1 and Model 2


Fig 19: The graph of the estimated distribution for K = 5, Case 1 and Model 2; it shows 5 support points
 




Linear approach
Two-dimensional case. (Case 2)
Let us now move to the two-dimensional theta. We will run the program Run_NPEM_linear.m on the data simulated by Model 1, where the elimination rate constant is simulated from a single normal distribution.
>> Run_NPEM_linear
number of support points =  20
dimension of theta =  2
for unimodal model input “1”, for bimodal model input “2”  1
count =
   95
theta =
 Columns 1 through 8
   0.3962    0.3962    0.3962    0.3962    0.3962    0.3962    0.3962    0.3962
   3.0362    3.0362    3.0362    3.0362    3.0362    3.0362    3.0362    3.0362
 Columns 9 through 16
   0.3962    0.4915    0.4915    0.4915    0.5502    0.5502    0.5502    0.5502
   3.0362    2.8859    2.8859    2.8859    3.0786    3.0786    3.0786    3.0786
 Columns 17 through 20
   0.5502    0.5502    0.5502    0.5502
   3.0786    3.0786    3.0786    3.0786
w =
 Columns 1 through 8
   0.0000    0.0000    0.0000    0.0000    0.0000    0.0000    0.0034    0.0215
 Columns 9 through 16
   0.0400    0.0709    0.2125    0.1808    0.1106    0.2297    0.0963    0.0272
 Columns 17 through 20
   0.0070    0.0001    0.0000    0.0000
LogLikelihood =
   3.5667
D_max_val =
  4.6850e-04
The results show that the D function is nonpositive (to within numerical accuracy) everywhere, and moreover we can see that there are approximately three distinct points of support of the distribution we found.

 
Fig 20: The graph of the simulated data for Case 2 and Model 1

Fig 21: The graph of the D function for K = 20, Case 2 and Model 1


We will test our guess by running the program for K = 3. The results are
>> Run_NPEM_linear
number of support points =  3
dimension of theta =  2
for unimodal model input “1”, for bimodal model input “2”  1
count =
  140
theta =
   0.3962    0.4915    0.5502
   3.0362    2.8859    3.0786
w =
   0.0649    0.4642    0.4709
LogLikelihood =
   3.5667
D_max_val =
  4.7606e-04
We can see that there are 3 distinct points of support, as we guessed earlier. Moreover, the maximum value of the D function is close to zero, and the graph of the D function, which is presented below, shows that the points we found are among the zeros of this function. Therefore, by Theorem 3, we can conclude that the distribution we found is the MLE of the true distribution of theta.

Fig 22: The graph of the D function for K = 3, Case 2 and Model 1


In the case when the elimination rate constant is simulated from the mixture of two normal distributions, similar analysis shows that the MLE of the true distribution of theta has 8 points of support and the log-likelihood is -4.7029. The following is the output of Run_NPEM_linear.m for K = 8, dimension of theta equal to 2, and the bimodal case.
>> Run_NPEM_linear
number of support points =  8
dimension of theta =  2
for unimodal model input “1”, for bimodal model input “2”  2
count =
   36
theta =
   0.2957    0.2957    0.2957    0.3466    0.4301    0.5825    0.6666    0.6666
   2.8686    2.8686    2.8686    3.0729    3.1708    2.9520    3.0955    3.0955
w =
   0.0002    0.0462    0.3818    0.3718    0.0493    0.0504    0.0986    0.0017
LogLikelihood =
  -4.7029
D_max_val =
 -3.1112e-04

Fig 23: The graph of the D function for K = 8, Case 2 and Model 2





Nonlinear approach
One-dimensional case. (Case 3)
The program Run_NPEM_nonlinear.m uses the algorithm described above to find the MLE of
the true distribution of . Here we will illustrate the results we got together with the description
of the way the program works.

We will start with the one-dimensional  and the unimodal case.
1. Type Run_NPEM_nonlinear in the command window and press return. The result is
>> Run_NPEM_nonlinear
number of support points =
2. Input the number of support points, say K = N = 20 and press return. The output is
>> Run_NPEM_nonlinear
number of support points = 20
dimension of theta =  
3. Input the dimension of theta, we start with one dimensional case
>> Run_NPEM_nonlinear
number of support points = 20
dimension of theta = 1  
             for unimodal model input “1”, for bimodal model input “2”  
4. Input the type of the distribution from where the elimination rate constant is simulated,
we start with the single normal distribution, i.e. we type 1. The output is
>> Run_NPEM_nonlinear
number of support points =  20
dimension of theta =  1
for unimodal model input “1”, for bimodal model input “2”  1
count =
   86
theta =
 Columns 1 through 8
   0.4377    0.4377    0.4377    0.4377    0.4377    0.4377    0.4377    0.4377
 Columns 9 through 16
   0.4377    0.4377    0.5013    0.5013    0.5675    0.5675    0.5676    0.5676
 Columns 17 through 20
   0.5676    0.5676    0.5676    0.5676
w =
 Columns 1 through 8
   0.0000    0.0000    0.0000    0.0000    0.0000    0.0009    0.0041    0.0117
 Columns 9 through 16
   0.0722    0.0758    0.1622    0.2454    0.1796    0.1651    0.0557    0.0214
 Columns 17 through 20
   0.0055    0.0002    0.0000    0.0000
LogLikelihood =
   8.8038
D_max_val =
  7.1013e-05


Fig 24: The graph of the simulated data for Case 3 and Model 1

Fig 25: The graph of the D function for K = 20, Case 3 and Model 1

Fig 26: The graph of the estimated distribution for K = 20, Case 3 and Model 1; it shows 3 support points

The results indicate that the MLE of the true distribution of theta might have three distinct points of support; the log-likelihood is equal to 8.8038.

Let us run the program for K = 3. The output is
>> Run_NPEM_nonlinear
number of support points =  3
dimension of theta =  1
for unimodal model input “1”, for bimodal model input “2”  1
count =
  100
theta =
   0.4377    0.5013    0.5675
w =
   0.1649    0.4076    0.4275
LogLikelihood =
   8.8038
D_max_val =
 -5.8039e-05


Fig 27: The graph of the D function for K = 3, Case 3 and Model 1


Fig 28: The graph of the estimated distribution for K = 3, Case 3 and Model 1; it shows 3 support points

Indeed, by Theorem 3, the distribution we found is the MLE of the true distribution of theta in this case.

We did a similar analysis for the bimodal case and found that the MLE of the true distribution of theta in this case has 6 points of support and the log-likelihood is equal to -78.5026. The following is the output of Run_NPEM_nonlinear.m for K = 6, dimension of theta equal to 1, and the bimodal case.
>> Run_NPEM_nonlinear
number of support points =  6
dimension of theta =  1
for unimodal model input “1”, for bimodal model input “2”  2
count =
   41
theta =
   0.1663    0.3006    0.3644    0.5021    0.5882    0.6589
w =
   0.0516    0.5757    0.1754    0.0550    0.0562    0.0862
LogLikelihood =
 -78.5026
D_max_val =
  4.3684e-05

Fig 29: The graph of the D function for K = 6, Case 3 and Model 2

Fig 30: The graph of the estimated distribution for K = 6, Case 3 and Model 2; it shows 6 support points

Fig 31: True distribution of the elimination rate, Model 2



Nonlinear approach  
Two-dimensional case. (Case 4)
Let us now present the resulting distributions we found for the two-dimensional theta. We first start with the unimodal case. The distribution we got in this case has 3 points of support and the log-likelihood is equal to 7.4824. The following is the output of Run_NPEM_nonlinear.m for K = 3, dimension of theta equal to 2, and the unimodal case. The code for this program is presented in the appendix.
>> Run_NPEM_nonlinear
number of support points =  3
dimension of theta =  2
for unimodal model input "1", for bimodal model input "2"  1
count =
   61
theta =
   0.3923    0.5028    0.5497
  22.7622   19.1202   23.3316
w =
   0.1520    0.4140    0.4341
LogLikelihood =
   7.4824
D_max_val =
 -1.6324e-04

Fig 32: The graph of the D function for K = 3, Case 4 and Model 1

In the case when the elimination rate constant is simulated from the mixture of two normal distributions, the results show that the MLE of the true distribution of theta has 10 points of support and the log-likelihood is equal to -45.3473. The output of Run_NPEM_nonlinear.m for K = 10, two-dimensional theta, and the bimodal case is presented below.

>> Run_NPEM_nonlinear
number of support points =  10
dimension of theta =  2
for unimodal model input "1", for bimodal model input "2"  2
count =
   95
theta =
 Columns 1 through 8
   0.2740    0.2740    0.2740    0.2986    0.3455    0.4384    0.5830    0.6658
  19.4619   19.4619   19.4619   19.0598   23.0426   26.1566   20.6542   23.7547
 Columns 9 through 10
   0.6658    0.6658
  23.7547   23.7547
w =
 Columns 1 through 8
   0.0001    0.0090    0.2032    0.2289    0.3642    0.0526    0.0430    0.0377
 Columns 9 through 10
   0.0608    0.0005
LogLikelihood = -45.3490
D_max_val = 0.0157

Fig 33: The graph of the simulated data for Case 4 and Model 2

Fig 34: The graph of the D function for K = 6, Case 4 and Model 2
Fig. 35: (a) Smoothed graph of the estimated distribution for K = 6, Case 4 and Model 2; it shows 6 support points. (b) Graph of the true distribution.

Pharmacokinetics (PK) example with additive measurement noise
We will apply the NPEM1 algorithm to simulated data representing a series of measurements of the drug concentration in the blood of each patient in a group. The drug concentration y(t) in the organism t hours after administration of the initial dose D can be described by the following exponential decay function
    y(t) = (D/V) e^{-K t}                                           (3.36)
where K is the elimination rate constant and V is the volume of distribution.
Suppose that we have N subjects (patients) who were administered an initial dose D of some drug. Then we take a series of M measurements of the drug concentration in the blood of each subject. These measurements can be described by the following PK model:
    y_{ij} = (D/V_i) e^{-K_i t_{ij}} + sigma e_{ij}                 (3.37)
where y_{ij} is the drug concentration for the ith subject at time t_{ij}, V_i is the volume of distribution and K_i is the elimination rate constant for the ith subject. The measurement error term e_{ij} is assumed to be standard normally distributed and sigma denotes a possibly unknown scale factor.
The unknown parameters are K_i and V_i, i = 1, ..., N. We assume that they are independent and identically distributed with a common but unknown distribution function F.
Therefore, using our earlier notation, let theta_i = (K_i, V_i), and the objective is to estimate the distribution F given the data y_1, ..., y_N.


NPEM 1 algorithm illustration
Let us now illustrate the main formulas we need to apply the NPEM1 algorithm for this model.  
Let the parameter theta_i = (K_i, V_i) be two-dimensional, where the first and the second components K_i and V_i represent the unknown elimination rate and the volume of distribution, respectively. If we denote y_i = (y_{i1}, ..., y_{iM}), then the conditional density of y_{ij} given theta_i is
                                  (3.38)
Then by the independence of the e_{ij}'s the conditional density of y_i given theta_i is

Letting K be the number of support points, as earlier, the conditional density of y_i given (theta, w) and the log-likelihood function are


Thus if K is the number of support points of the MLE of the true distribution of theta, the updated values of (theta_k, w_k) are
                 (3.39)

where   and .
Data Simulation  
To illustrate the NPEM1 algorithm we described above we used the simulated data.  The
observations were simulated according to (3.37) using the following parameters:

The volume of distribution is taken to be . The elimination rate constant  is
simulated from the mixture of two normal distributions that is
,  
where  (we used ). The measurement error term covariance matrix is taken to be
.
Results: We ran Run_NPEM_nonlinear.m for  support points and got the following
results.
Elapsed time is 1983 seconds.
count = 113
theta =
 Columns 1 through 9
   0.3357    0.1891    0.2925    0.2925    0.6184    0.6184    0.3357    0.2294    0.3357
  23.1324   20.2387   20.9565   20.9565   22.6233   22.6233   23.1324   29.1334    23.1324
 Columns 10 through 18
   0.3543    0.2737    0.3357    0.6184    0.3357    0.3357    0.2925    0.3357    0.3357
  15.2039   18.2366   23.1324   22.6233   23.1324   23.1324   20.9537   23.1324   23.1324
 Columns 19 through 20
   0.6184    0.6184
  22.6233   22.6233
w =
 Columns 1 through 9
   0.0101    0.1015    0.0001    0.0863    0.0387    0.0004    0.0001    0.0479    0.0007
 Columns 10 through 18
   0.0916    0.1246    0.0043    0.0095    0.0000    0.1003    0.2324    0.0000    0.0000
 Columns 19 through 20
   0.0836    0.0678
LogLikelihood = -44.4991
D_max_val = 0.6171

Fig 36: The graph of the D function. PK example with additive noise

(a)

(b)
Fig 37: (a) This figure shows the graphs of (solid line) the true distribution of the elimination rate K, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. (b) This figure shows the graphs of (solid line) the true distribution of the volume of distribution V, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE.



Conclusions:
The results show that the MLE of the true distribution of theta has 7 points of support and the log-likelihood is equal to -44.4991. We can clearly see from Figure 37 (a) and (b) that the estimates are very close to the true distributions.

EM1 +IP algorithm illustration:
Let us now illustrate the way EM1+IP algorithm works for this model.
Step 1 The update on the support points is done by the equation (3.39)
Step 2 The update on the weights is done by the Primal-Dual Interior Point method.
The data was simulated exactly the same way as in the NPEM1 algorithm. Moreover the
EM1+IP algorithm is a part of the Run_NPEM_nonlinear.m algorithm.  
Results: We ran Run_NPEM_nonlinear.m for  support points and got the following
results.
Elapsed time is 1295 seconds.
count = 70
theta =
 Columns 1 through 9
   0.3357    0.1891    0.2937    0.2936    0.6184    0.6184    0.3357    0.2294    0.2938
  23.1324   20.2387   21.0130   21.0106   22.6233   22.6233   23.1324   29.1335   21.0169
 Columns 10 through 18
   0.3543    0.2737    0.3357    0.6184    0.3357    0.3357    0.2925    0.3357    0.3357
  15.2038   18.2364   23.1324   22.6233   23.1324   23.1324   20.9544   23.1324   23.1324
 Columns 19 through 20
   0.6184    0.6184
  22.6233   22.6233
w =
 Columns 1 through 9
   0.0165    0.1015    0.0000    0.0000    0.0400    0.0400    0.0165    0.0479    0.0000
 Columns 10 through 18
   0.0916    0.1246    0.0165    0.0400    0.0165    0.0165    0.3188    0.0165    0.0165
 Columns 19 through 20
   0.0400    0.0400
LogLikelihood = -44.4991
D_max_val = 0.6176

Fig 38: The graph of the D function. EM1+IP


(a)

(b)
Fig 39: (a) This figure shows the graphs of (solid line) the true distribution of the elimination rate K, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. (b) This figure shows the graphs of (solid line) the true distribution of the volume of distribution V, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. EM1+IP


Conclusions:
The results of the EM1+IP algorithm show that the MLE of the true distribution of theta also has 7 points of support and the log-likelihood is equal to -44.4991. We can clearly see from Figure 39 (a) and (b) that the estimates are very close to the true distributions, and moreover they are very similar to those we got for the NPEM1 algorithm. There is one major difference in performance between these two programs, and it is the running time: the EM1+IP algorithm ran in about two-thirds of the time of NPEM1 (1295 versus 1983 seconds).

NPEM2 algorithm illustration:
Let us now illustrate the way NPEM2 algorithm works for this model.
Step 1 Choose a big set of support points.
Step 2 The update on the weights is done by the equation (3.39).
The data was simulated exactly the same way as in the NPEM1 algorithm. The NPEM2
algorithm is also a part of the Run_NPEM_nonlinear.m algorithm.  
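A minimal MATLAB sketch of the weight iteration over a fixed grid is shown below; it assumes the standard multiplicative EM update for mixture weights and that the matrix P, with entries P(i,k) = p(y_i | theta_k), has been precomputed on the chosen grid. The names are illustrative.

% NPEM2 weight iteration over a fixed support grid.
K = size(P, 2);
w = ones(1, K) / K;                 % uniform starting weights
maxIter = 5000;  tol = 1e-8;
for iter = 1:maxIter
    mix  = P * w(:);                                     % p(y_i | F), N-by-1
    wNew = w .* mean(P ./ repmat(mix, 1, K), 1);         % multiplicative EM step
    if abs(sum(log(P * wNew(:))) - sum(log(mix))) < tol
        w = wNew;  break
    end
    w = wNew;
end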
Results: We ran Run_NPEM_nonlinear.m for a random set of support points of size 2129 and
got the following results.
Elapsed time is 1295.925732 seconds.
count = 358
theta =
 Columns 1 through 9
   0.2945    0.3402    0.2723    0.3303    0.2271    0.6104    0.3745    0.3358    0.3546
  20.9180   22.8809   18.2520   15.6299   29.4580   22.9248   15.2124   23.2104   15.2417
 Columns 10 through 15
92
   0.6218    0.2819    0.6241    0.6259    0.1923    0.2802
  22.3901   20.3394   22.5659   22.5220   20.2368   20.9106
w =
 Columns 1 through 9
   0.3056    0.0001    0.1153    0.0559    0.0467    0.0540    0.0413    0.1146    0.0001
 Columns 10 through 15
   0.0931    0.0102    0.0517    0.0013    0.1033    0.0069
LogLikelihood = -44.5343
D_max_val = 0.6695

Fig 40: The graph of the D function. NPEM 2
93

(a)

(b)
Fig 41: (a) This figure shows the graphs of (solid line) the true distribution of the elimination rate K, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. (b) This figure shows the graphs of (solid line) the true distribution of the volume of distribution V, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. NPEM2
94
Conclusions: The algorithm NPEM2 works relatively fast and gives good results. The optimized likelihood in this case depends very much on the set of points chosen at the beginning. However, the main reason that makes this algorithm important to consider is that the EM algorithm converges to the global maximum in this case, according to Theorem 3.

EM2 +IP algorithm illustration:
Let us now illustrate the way EM2+IP algorithm works for this model.
Step 1 Choose a big set of support points.
Step 2 The update on the weights is done by the Primal-Dual Interior Point method.
The data was simulated exactly the same way as in the NPEM1 algorithm. The EM2+IP
algorithm is also a part of the Run_NPEM_nonlinear.m algorithm.  
Results: We ran Run_NPEM_nonlinear.m for a random set of support points of size 2129 and
got the following results.
Elapsed time is 37.027546 seconds.
count =  2
theta =
 Columns 1 through 9
   0.2945    0.2723    0.3303    0.2271    0.6104    0.3745    0.3358    0.6218    0.2819
  20.9180   18.2520   15.6299   29.4580   22.9248   15.2124   23.2104   22.3901   20.3394
 Columns 10 through 12
   0.6241    0.1923    0.2802
  22.5659   20.2368   20.9106
w =
95
 Columns 1 through 9
   0.3063    0.1152    0.0559    0.0467    0.0498    0.0414    0.1146    0.0934    0.0111
 Columns 10 through 12
   0.0568    0.1033    0.0055
LogLikelihood = -44.5343
D_max_val =  0.6696

Fig 42: The graph of the D function. EM2+IP

(a)
96

(b)
Fig 43: (a) This figure shows the graphs of (solid line) the true distribution of the elimination rate K, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. (b) This figure shows the graphs of (solid line) the true distribution of the volume of distribution V, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE. EM2+IP

Conclusions:  
The algorithm EM2+IP works extremely fast, in this example roughly 35 times faster than NPEM2 (about 37 seconds versus about 1296 seconds). However, the answer it gives depends heavily on the initial set of points chosen, since that set does not change during the optimization. This algorithm is very similar to the USC-NPAG algorithm; the difference is that the USC-NPAG algorithm moves the grid while optimizing the likelihood.




97
Rolling thumbtacks example
Beckett and Diaconis [1994] generated binary strings from rolls of common thumbtacks.  For the
data set, 320 thumbtacks were flicked 9 times each.  Conditioned on the tack, it is assumed the
flicks are independent.  Let X denote the number of times each tack landed point up.  The data
are summarized in Table 3 below:
y 0 1 2 3 4 5 6 7 8 9
Count 0 3 13 18 48 47 67 54 51 19

Table 3. Rolling Thumbtacks example data
In this example y_i takes values in {0, 1, ..., 9} and
    p(y_i | theta) = C(9, y_i) theta^{y_i} (1 - theta)^{9 - y_i}                     (3.40)
where theta is the probability that a tack lands point up.
Assuming that K is the number of support points of F and (theta, w) is the vector of support points and weights, we have
                                                 (3.41)

Then for some k the updated values of (theta_k, w_k) are

      (3.42)
To find this argmax we just need to solve the following equation for  


therefore  
                                                 (3.43)
and
                                                                            (3.44)
where
                                                          (3.45)
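A minimal MATLAB sketch of this iteration is given below, assuming the binomial mixture density p(y | theta) = C(9, y) theta^y (1 - theta)^(9 - y); the binomial coefficient cancels in the conditional probabilities, so it is omitted, and the starting grid is illustrative.

counts = [0 3 13 18 48 47 67 54 51 19];                  % Table 3
y = repelem(0:9, counts)';                               % 320 observations, one per tack
N = numel(y);  n = 9;  K = 10;
theta = (1:K) / (K + 1);  w = ones(1, K) / K;            % starting support points and weights
Y = repmat(y, 1, K);
for iter = 1:5000
    TH  = repmat(theta, N, 1);
    lik = (TH.^Y) .* ((1 - TH).^(n - Y)) .* repmat(w, N, 1);   % proportional to w_k p(y_i|theta_k)
    gam = lik ./ repmat(sum(lik, 2), 1, K);                    % E-step conditional probabilities
    theta = sum(gam .* Y, 1) ./ (n * sum(gam, 1));             % closed-form support-point update
    w = mean(gam, 1);                                          % weight update
end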

It is found in [Liu, 1996] that the maximum likelihood estimate of F has three support points. Using our method we got the following results:
>> Run_NPEM_ThEx
number of support points K =10

count = 4202

theta = 0.4385    0.4385    0.6020    0.6020    0.8144    0.4385    0.5997    0.8144    0.4385    
0.4385
w = 0.0531    0.0032    0.1451    0.1449    0.1463    0.1282    0.1504    0.2229    0.0019    0.0040

LogLikelihood = -640.2757
max_value =  -4.0177e-04

Fig 44: The graph of the estimated distribution for K=10. Rolling thumbtacks

Fig 45: The graph of the D function for K=10. Rolling thumbtacks

>> Run_NPEM_ThEx
number of support points K =3
count = 5194

theta = 0.6008    0.4381    0.8144

w = 0.4411    0.1892    0.3697

LogLikelihood = -640.2757
max_value = -8.7616e-04

Fig 46: The graph of the estimated distribution for K=3. Rolling thumbtacks

Fig 47: The graph of the D function for K=3. Rolling thumbtacks



Eye Tracking Example
The 101 measurements are summarized in the following frequency table.

y      0   1   2   3   4   5   6   7   8   9   10   11   12   14   15   17   22   24   34
Count  46  14  9   4   2   3   3   3   1   2    2    2    2    1    2    2    1    1    1

Table 4: Eye Tracking example, measurements
The model is given by Eq. (3.46) with
    p(y_i | theta) = e^{-theta} theta^{y_i} / y_i!                                   (3.46)
There are 101 data points y_1, ..., y_101, but there are only 19 distinct values in the data set. The data are presented in Table 4.
Assuming that K is the number of support points of F and (theta, w) is the vector of support points and weights, we have
                                                 (3.47)
Then for some k the updated values of (theta_k, w_k) are
      (3.48)
To find this argmax we just need to solve the following equation for theta_k

therefore  
                                                 (3.49)
and
                                                                            (3.50)
where
                                                        (3.51)
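A minimal MATLAB sketch of this iteration is given below, assuming the Poisson mixture density p(y | theta) = e^(-theta) theta^y / y!; the frequency table reproduces Table 4, and the starting grid is illustrative.

vals   = [0 1 2 3 4 5 6 7 8 9 10 11 12 14 15 17 22 24 34];
counts = [46 14 9 4 2 3 3 3 1 2 2 2 2 1 2 2 1 1 1];
y = repelem(vals, counts)';                        % 101 observations
N = numel(y);  K = 6;
theta = linspace(0.5, 30, K);  w = ones(1, K) / K;
for iter = 1:2000
    logp  = -repmat(theta, N, 1) + y * log(theta + eps) ...
            - repmat(gammaln(y + 1), 1, K);        % log Poisson density
    lik   = exp(logp) .* repmat(w, N, 1);
    gam   = lik ./ repmat(sum(lik, 2), 1, K);      % E-step conditional probabilities
    theta = sum(gam .* repmat(y, 1, K), 1) ./ sum(gam, 1);   % closed-form support update
    w     = mean(gam, 1);                                    % weight update
end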
We got the following results:
For K=6, eps = 10^(-9)
>> Run_NPEM_EtrackEx
number of support points K =6
count =   747
theta =   0.0000    1.0912    6.0335   12.0468   20.1079   31.7978
w =  0.3266    0.3826    0.1418    0.1094    0.0276    0.0119
LogLikelihood = -218.7495


Fig 48: The graph of the estimated distribution for K=6. Eye Tracking example


Fig 49: The graph of the D function for K=6. Eye Tracking example
For K=101, eps =10^(-9)
>> Run_NPEM_EtrackEx
number of support points K =101
count = 608
theta =
Columns 1 through 12
0.0000    0.0000    0.0000    1.0912    1.0912    1.0912    1.0912    1.0912    1.0912    1.0912    
1.0912    1.0912
Columns 13 through 24
6.0334    6.0334    6.0334    6.0334    6.0334    6.0334    6.0334    6.0334    6.0334    6.0335    
6.0335    6.0335
Columns 25 through 36
6.0335    6.0341   12.0467   12.0468   12.0468   12.0468   12.0468   12.0468   12.0468   12.0468  
12.0468   12.0468
Columns 37 through 48
12.0468   12.0468   12.0468   12.0468   12.0468   12.0468   12.0468   12.0468   12.0468  
12.0468   12.0468   12.0468
Columns 49 through 60
12.0468   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079  
20.1079   20.1079   20.1079
Columns 61 through 72
20.1079   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079   20.1079  
20.1079   20.1079   20.1079
Columns 73 through 84
31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978  
31.7978   31.7978   31.7978
Columns 85 through 96
31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978   31.7978  
31.7978   31.7978   31.7978
Columns 97 through 101
31.7978   31.7978   31.7978   31.7978   31.7978
w =
Columns 1 through 12
0.1667    0.0917    0.0681    0.0754    0.0740    0.0591    0.0502    0.0436    0.0289    0.0219    
0.0182    0.0114
Columns 13 through 24
0.0115    0.0123    0.0129    0.0128    0.0126    0.0122    0.0118    0.0111    0.0101    0.0088    
0.0084    0.0070
Columns 25 through 36
0.0060    0.0042    0.0053    0.0064    0.0067    0.0067    0.0067    0.0066    0.0064    0.0063    
0.0059    0.0056
Columns 37 through 48
0.0055    0.0052    0.0049    0.0045    0.0041    0.0039    0.0037    0.0032    0.0030    0.0028    
0.0024    0.0020
Columns 49 through 60
0.0016    0.0012    0.0017    0.0017    0.0018    0.0017    0.0017    0.0017    0.0016    0.0015    
0.0014    0.0014
Columns 61 through 72
0.0013    0.0012    0.0011    0.0011    0.0009    0.0009    0.0008    0.0007    0.0007    0.0006    
0.0005    0.0005
Columns 73 through 84
0.0003    0.0003    0.0004    0.0005    0.0005    0.0005    0.0005    0.0005    0.0005    0.0005    
0.0005    0.0005
Columns 85 through 96
0.0005    0.0005    0.0005    0.0004    0.0004    0.0004    0.0004    0.0004    0.0004    0.0004    
0.0003    0.0003
Columns 97 through 101
0.0003    0.0003    0.0003    0.0003    0.0002
LogLikelihood = -218.7495


Fig 50: The graph of the estimated distribution for K=101. Eye Tracking example

Fig 51: The graph of the D function for K=101. Eye Tracking example
The results show that the number of support points of  is 6 and they are listed in the K=6
results section.
Chapter 4. NPML Estimation for Stochastic Models

In the previous chapters we discussed estimation of an unknown distribution of the population
parameters. The models we looked at were deterministic, i.e. we only considered measurement
noise
    y_i = h_i(theta_i) + e_i,    i = 1, ..., N,                         (4.1)
where y_1, ..., y_N are known measurements, which we consider to be independent but not necessarily identically distributed, h_i is a known vector function, theta_i is a vector of unknown parameters and e_i is measurement noise, which is considered to be normally distributed with zero mean and some possibly unknown covariance matrix. The vectors theta_1, ..., theta_N are considered to be i.i.d. with common but unknown distribution function F. The question is to estimate F based on the given measurements y_1, ..., y_N.

However, one of the drawbacks of this model is that it does not take into account the uncertainties in the model itself, i.e. process noise. For example, in the case of the PK problem, errors such as
• dose errors
• sample timing errors
• model misspecifications
are not included in the deterministic models. The stochastic models on the other hand can
overcome this problem.

Discrete time Stochastic population model:  
Consider the ith subject in a sample population. The discrete time stochastic system that
corresponds to it can be written as follows:
                                     (4.2)
where at time  
- state vector
- known input
- vector of PK/PD parameters for the ith subject  
- known vector function that defines a structural model
is a process noise vector
The measurements at time are
                                    (4.3)
where  
is the measurement noise  
-  known vector function that defines a measurement model
Population Problem: Now let y_i be the vector of measurements for the ith subject as above. Then, as before, y_1, ..., y_N are independent but not necessarily identically distributed random variables. Assume also that the conditional density of y_i given theta_i is known and denoted by p(y_i | theta_i), where the theta_i's are unobserved random parameters that are independent and identically distributed with a common but unknown distribution function F. The likelihood is
    L(F) = prod_{i=1}^{N} integral p(y_i | theta) dF(theta)
The population problem is as before: maximize L(F) relative to whatever class of distributions is being considered.


Historical Background:  In a number of papers, Mortensen et al considered the above maximum
likelihood problem for the case where  is assumed to be multivariable normal with unknown
mean vector and unknown covariance matrix.  They developed software programs for Matlab
and R [M07]. More recently, [DL11] under the same normal hypothesis, developed software
programs for MONOLIX.  We are going to study this problem using NPML methods.  
In 1991, Mark Welle wrote a MS Thesis, under the supervision of Alan Schumitzky, which
considered NPML for a simple one compartment model with process noise.  We have
generalized this work and made it compatible with the NPEM and IP programs.

It turns out that for linear problems, defined below, the likelihood
can be
calculated exactly using the linear Kalman-Bucy filter. For nonlinear problems, this cannot be
done and the more complicated nonlinear filtering must be used [J70].  We will therefore focus
on the linear case.


Linear case: When both the state equation and the measurement equation are linear functions of the state, the stochastic system is called linear; otherwise it is called nonlinear. In this case the system (4.2)-(4.3) can be written in the following way:
                                         (4.5)

where  , and  are known matrices and given  the vectors and  are
mutually independent for all k.

The log-likelihood function  in this case can be written as follows:

                                                (4.6)
where  
               (4.7)
Then the log-likelihood function now has the following form:

                           (4.8)

Let .

Proposition 1: The density  is normal with mean m and covariance matrix .
Proof: We will drop the i subscript for cleaner notation.
 
Then for  we have  where ,
and .  , , and  are deterministic
vectors. Thus  is a sum of normal random variables and therefore is normally distributed
itself. Then by induction it can be shown that the state at each later time is also normally distributed. The mean m and covariance matrix are calculated using the Kalman-Bucy filter [J70].

                          (4.9)

where   and  are calculated as follows:
Linear Kalman-Bucy filter equations
Prediction step
                                                        (4.10)
                                       (4.11)
Updating step
                   (4.12)
                                  (4.13)
                                                (4.14)
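A minimal MATLAB sketch of evaluating log p(y_i | theta) for one subject by the prediction-error decomposition is shown below. The correspondence with (4.10)-(4.14) is indicated in the comments as an assumption, and the matrices A, B, H, Q, R and the initial moments are assumed to come from the discretized model for the given theta; the names are illustrative.

function loglik = kalman_loglik(y, A, B, u, H, Q, R, x0, P0)
    % y is d-by-T (one column per measurement time); u is the known input.
    x = x0;  P = P0;  loglik = 0;
    for k = 1:size(y, 2)
        x = A * x + B * u(:, k);             % prediction of the state (cf. (4.10))
        P = A * P * A' + Q;                  % predicted covariance (cf. (4.11))
        S = H * P * H' + R;                  % innovation covariance
        e = y(:, k) - H * x;                 % innovation
        loglik = loglik - 0.5 * (log(det(2*pi*S)) + e' * (S \ e));
        G = P * H' / S;                      % Kalman gain (cf. (4.12))
        x = x + G * e;                       % updated state (cf. (4.13))
        P = (eye(size(P)) - G * H) * P;      % updated covariance (cf. (4.14))
    end
end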



Continuous state discrete observations linear stochastic population model:

                                       (4.15)
where
- state vector  
- known input
- vector of PK/PD parameters
, and  are known matrices
- is the white Gaussian process with the covariance matrix , i.e.
with  positive semi-definite matrix
- measurement vector  
is the measurement noise. We consider it to be Gaussian sequence with zero
mean and the covariance matrix .  

Let  on the interval . Then the first equation of (4.2) can be integrated over the
interval :
                                        (4.15)
where  is a fundamental matrix of the first equation of (4.2), that is  

Also  and . It is shown in Jazwinsky
[J70] that  is a zero mean Gaussian sequence with the covariance matrix  
                                        (4.16)


Example 1:  
PK one-compartment model is described by the linear stochastic differential equation for the
state vector  and discrete linear equation for the observations  at time ,  
                                                           ,                                   (4.17)
,                                              (4.18)
                                                                                                                       (4.19)
where
K – is an elimination rate
V – is the volume of distribution
D – is the initial dose of the drug
- is the white Gaussian process with the covariance matrix  
- is a measurement error that is considered to be normally distributed with zero mean and
covariance matrix  

Note: In equation (4.14) we need to know the value of W(t).

Solution: Before we can use the theory we described above we should integrate the stochastic
equation (4.17) over the intervals . We will use Ito formula with the function
. Then

If we integrate now from  to  we get

The bottom equation can be written as follows

                                         (4.20)
where
                                                      (4.21)
is a white Gaussian sequence with zero mean and the covariance  


Note: To integrate the last integral we need to know how W(t) depends on t.
If W(t) is a constant, say  then

                         (4.22)

Therefore now we have a discrete system  
 
                                                (4.23)
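A minimal MATLAB sketch of simulating this discrete system is shown below. It assumes the state transition x(k+1) = e^(-K*dt) x(k) + w(k) with the discretized process-noise variance W0 (1 - e^(-2K*dt)) / (2K) for constant W(t) = W0, and the observation y(k) = x(k)/V + e(k); the parameter values are placeholders, not the thesis settings.

% Simulation of one subject from the discretized stochastic one-compartment model.
K = 0.4;  V = 20;  D = 100;  dt = 1;  T = 5;
W0   = 0.1;                                   % process noise intensity (assumed constant)
Rvar = 0.25;                                  % measurement noise variance (assumed)
Qd   = W0 * (1 - exp(-2*K*dt)) / (2*K);       % discretized process-noise variance
x = D;  y = zeros(1, T);
for k = 1:T
    x    = exp(-K*dt) * x + sqrt(Qd) * randn; % state (drug amount) transition
    y(k) = x / V + sqrt(Rvar) * randn;        % observed concentration
end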
The log-likelihood function that is defined by (4.7) can now be written as follows



where by proposition 1

The linear Kalman-Bucy filter equations are written as:
 
Prediction step
                                                    (4.24)
                              (4.25)
Updating step
                            (4.26)
                                               (4.27)
                                                             (4.28)

Simulation  
To illustrate the algorithm we described above we used the simulated data.  The observations
were simulated according to (4.23) using the following parameters:

The volume of distribution is taken to be . The elimination rate constant  is
simulated from the mixture of two normal distributions that is
,  
where  (we used ). The initial condition on the signal is  where
and  measurement error term covariance matrix is taken to be .


Program Run_NPEM_nonlinear_kalman.m overview:

• NPEM 1 with process noise

The program Run_NPEM_nonlinear_kalman.m uses the simulated data described above to find
the maximum likelihood estimate  of the distribution of unknown parameters  
using Kalman filtering to get  and then it uses NPEM_nonlinear.m program from the
section: Nonlinear approach.  Two-dimensional case. (Case 4) to find . We also modified
the way to choose the initial values of the parameters  for the EM algorithm. We take
to be some random points on the interval (0.1, 0.7) and then the volume of distribution
.
We looked at several different situations in which the process noise was either included in or excluded from the simulation and the program, and we looked at different values of it. The results are described in the following table.

• Hybrid of the NPEM 1 and Interior Point algorithms (EM1+IP):

EM1+IP algorithm is a part of the Run_NPEM_nonlinear_kalman.m. The user just has to switch
between two optimization methods. This method is a combination of NPEM 1 and Interior Point
algorithms that can be described in the following way:
Step 1: Initialization of the algorithm. Initial values for the parameters and weights are set at this point. Usually it is done in the following way: the initial support is a set of at most N random points taken from the interval where the elimination rate is believed to belong. For our example the interval is (0.1, 0.7). Then the volume of distribution is initialized as well. All the constants are defined in this step too.
Step 2: The parameters   are updated according to the EM algorithm, the same way as
it was done in the NPEM 1 algorithm.

Step 3: The weights are updated according to the Interior Point method. The way Interior Point
method works is described in section 2.

Step 4: The difference between log-likelihoods is checked. If the difference is small we found the
maximum likelihood estimate  of the true distribution, otherwise we repeat from step 2.
 


The simulated data sets used for Table 5 (20 subjects, 5 measurements each):

Process noise 0 in the simulation:
y = 3.0318    2.4652    1.4451    0.8318    0.5916
    3.8141    2.5238    1.9795    1.2461    1.5058
    4.1810    3.1880    2.4719    2.1553    1.1058
    3.6645    1.9045    2.2693    1.0860    1.3571
    2.6750    1.6603    0.4935    0.2337    0.6709
    3.4643    2.6637    1.8797    1.4125    0.7581
    3.9512    2.4454    2.1788    1.6535    1.0127
    2.2899    1.1181    0.3411    0.9731    0.4810
    2.8839    1.6842    1.7039    1.8491    1.0013
    4.8104    3.2157    2.3594    2.0940    1.1465
    4.2350    3.3786    3.1700    2.1661    1.9050
    3.4235    2.7739    1.9768    1.4916    1.5822
    2.9950    2.1396    1.6545    1.3703    0.6233
    4.4025    2.7943    1.9540    2.2768    1.7389
    3.2249    2.5655    1.9551    1.3237    0.8217
    3.6158    3.3738    2.0733    1.7889    1.0597
    4.6233    3.2999    1.6755    1.4241    1.2517
    4.1627    3.0668    2.4418    2.4528    2.1717
    2.2047    1.3471    0.7285    0.2067    0.0544
    2.4372    0.9972    0.6792    0.0700    0.5235

Process noise 0.1 in the simulation:
y = 2.9803    2.4120    1.4556    0.7496    0.5644
    3.8929    2.4253    1.9122    1.0439    1.4493
    4.1691    3.2325    2.5893    2.1454    1.0509
    3.5969    1.9100    2.3438    1.1250    1.4337
    2.6276    1.6523    0.3679    0.1338    0.5570
    3.5057    2.6040    1.8024    1.4661    0.8684
    3.9765    2.6862    2.3170    1.6160    0.7539
    2.3401    1.1759    0.1561    1.0024    0.4985
    2.9567    1.7214    1.8198    1.8021    1.0501
    4.8857    3.1586    2.4195    2.2902    1.2349
    4.1081    3.3691    3.0937    2.1834    1.9876
    3.4272    2.6797    2.0425    1.4497    1.5872
    3.0232    2.1404    1.6564    1.2716    0.5444
    4.4719    2.7188    2.0330    2.2849    1.6383
    3.3788    2.6207    2.0040    1.2972    0.8387
    3.5074    3.1263    1.8006    1.6011    0.9607
    4.4034    3.1084    1.4809    1.3076    1.2133
    4.1283    3.0530    2.3440    2.3777    2.1803
    2.3482    1.3492    0.6917    0.0717   -0.0270
    2.4349    1.0532    0.8673    0.1959    0.6842

Process noise 1 in the simulation:
y = 2.5166    1.9335    1.5503    0.0101    0.3196
    4.6022    1.5391    1.3068   -0.7755    0.9402
    4.0621    3.6332    3.6465    2.0563    0.5568
    2.9889    1.9598    3.0141    1.4751    2.1225
    2.2010    1.5806   -0.7630   -0.7650   -0.4680
    3.8784    2.0659    1.1060    1.9480    1.8614
    4.2043    4.8538    3.5603    1.2787   -1.5752
    2.7922    1.6964   -1.5086    1.2665    0.6558
    3.6124    2.0565    2.8629    1.3789    1.4892
    5.5637    2.6447    2.9598    4.0559    2.0308
    2.9656    3.2830    2.4071    2.3385    2.7310
    3.4602    1.8317    2.6336    1.0724    1.6320
    3.2771    2.1478    1.6741    0.3836   -0.1655
    5.0964    2.0388    2.7440    2.3577    0.7330
    4.7638    3.1178    2.4439    1.0595    0.9918
    2.5313    0.8984   -0.6531   -0.0892    0.0700
    2.4242    1.3850   -0.2713    0.2593    0.8681
    3.8182    2.9295    1.4635    1.7018    2.2573
    3.6394    1.3685    0.3606   -1.1433   -0.7595
    2.4142    1.5574    2.5602    1.3291    2.1305

Process noise   Process noise   Log-likelihood           Cov(K, V)
in the data     assumed in      (NPEM 1 / EM1+IP)        (NPEM 1 / EM1+IP)
simulation      the program
0               0                -44.4991 /  -44.4991      0.0739 /  0.0739
0.1             0                -46.9579 /  -46.9579      0.0346 /  0.0346
0.1             0.1              -49.1069 /  -49.1069      0.0511 /  0.0511
1               0               -275.4227 / -274.8439     -2.8045 / -2.8578
1               1               -141.6724 / -141.6647     -0.4788 / -0.5361

Table 5: The table shows the results of the two NPML methods, NPEM 1 and EM1+IP, computed using different levels of process noise. It also compares the results when the process noise was included in the simulation but was or was not assumed in the model.

[Figure panels (a)-(e), each shown for NPEM 1 and for EM1+IP with process noise.]
Fig 1: This figure shows the graphs of (solid line) the true distribution of the elimination rate K, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE, in the case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1


[Figure panels (a)-(e), each shown for NPEM 1 and for EM1+IP with process noise.]
Fig 2: This figure shows the graphs of (solid line) the true distribution of the volume of distribution V, (dashed line '--') the smoothed MLE of the true distribution, and (spikes) the discrete MLE, in the case when
(a) simulated and assumed noise are equal to 0
(b) simulated noise is equal to 0.1 and assumed noise is 0
(c) simulated and assumed noise are equal to 0.1
(d) simulated noise is equal to 1 and assumed noise is 0
(e) simulated and assumed noise are equal to 1
 
Conclusions:  
The above results can be classified into three situations:
1. The data was generated with no process noise and the model did not assume any noise
2. The data was generated with process noise  and
    a) the model assumed no noise
    b) the model assumed process noise    
3. The data was generated with process noise  and
    a) the model assumed no noise
    b) the model assumed process noise    
The random variables that represent the PK parameters K and V are assumed to be independent; therefore the covariance of K and V should be close to zero. In the above simulations we see that, if the data were simulated with process noise, then assuming the process noise in the solution gives a covariance closer to zero than ignoring it.
We can also notice that the more noise we included, the more dispersed and chaotic the marginal distributions of K and V became when we neglected the process noise in the solution. Therefore the estimates produced when the process noise is included in the solution are clearly better.
Chapter 5. Nonparametric Bayesian (NPB) Estimation  

Unlike the NPML approach, in the Nonparametric Bayesian (NPB) approach the distribution F is considered to be an unknown random distribution, and therefore we need to place a prior distribution on F. A convenient prior in our case is the Dirichlet process prior. Let us now define the Dirichlet distribution and the Dirichlet process.

Dirichlet distribution
Let be a finite space and let
and                                     (5.1)
Note that is also called the dimensional simplex.

Then a random vector is said to have a Dirichlet distribution with parameter ,
, which is denoted by , if the density function is
                          (5.2)
for .  

Notice that in the case of two components the density of the Dirichlet distribution is the same as a Beta density. In the case when one of the parameters is zero, the Dirichlet distribution is defined by setting the corresponding coordinate to zero and interpreting the density as a density on the lower-dimensional set.

Properties of the Dirichlet distribution
The properties below are important for motivating, and in some cases proving, corresponding properties of the more general Dirichlet process. Proofs are given in [GR03]. In this thesis the derivations are included for completeness.

1. Let  be independent Gamma random variables with parameters  where  
and , then ~                                         (5.3)

2. If  and  are integers such that  then
~                                                                   (5.4)
3. If  then
                                             
             (5.5)
4. Let where  and given , let  be - valued i.i.d .  Then
the posterior distribution of  given  is where  is the point (delta)
measure such that .


5. Let and let  be the st observation from . Then for some ,  
the predictive probability of  given  is
                                             (5.6)

Dirichlet Distribution through the Polya Urn Model
Let  be the -algebra of all possible subsets of  and let  be a non-null measure on
such that  for some measurable subset .

Consider a bag with  balls of which  are of color ,  (assuming  
for all ). Draw balls at random from the urn, replacing each ball by two balls of the
same color. Then let  if the th ball is of color . Then
                             
(5.7)  
Which is the same as property 5 of the Dirichlet distribution. Therefore Dirichlet distribution
follows the Polya Urn Model




Dirichlet Process  
Let be a subset of  and let  be a -algebra generated by the Borel subsets of . Let
be a set of probability distributions on

 

Definition: Let be a non-null finite measure on ,  and

so that .  Then the Dirichlet Process is a unique distribution over the space
with parameter , such that the relation  
                               (5.8)

holds for any  and any measurable partition  of , i.e. for all
,  for all  and .  The probability measure is
our prior guess of and the parameter  is the “strength” of the guess.  The existence of this
process is shown in Ferguson [F73].

Properties
The only properties of the Dirichlet process that are needed in this thesis are as follows:
(D1)    
 
(D3)


Note: In the above properties, δ_θ is the point (delta) measure such that δ_θ(A) = 1 if θ ∈ A and 0 otherwise. In Property (D4), the l.h.s. is a probability density, while the r.h.s. is a probability distribution (or measure). This “abuse of notation” is common in papers in this research area. Property (D4) should be written as:
P(θ_{n+1} ∈ A | θ_1, …, θ_n) = (α₀ G₀(A) + Σ_{i=1}^n δ_{θ_i}(A)) / (α₀ + n), A ∈ B.
By context it will always be clear which is a density and which is a measure.
Properties (D1)-(D3) are proved in [F73].

Proof (D4):
We give the proof for a discrete space Θ; the general result follows in the same way.
The predictive probability of θ_{n+1} given θ_1, …, θ_n is
P(θ_{n+1} = θ | θ_1, …, θ_n) = E[ P(θ_{n+1} = θ | F, θ_1, …, θ_n) | θ_1, …, θ_n ] = E[ F(θ) | θ_1, …, θ_n ],
since, given F, θ_{n+1} is independent of θ_1, …, θ_n. Then by (D2),
P(θ_{n+1} = θ | θ_1, …, θ_n) = (α₀ G₀(θ) + Σ_{i=1}^n δ_{θ_i}(θ)) / (α₀ + n),
which is (D4).
An important property of the DP also follows from (D2) and (D4):
θ_{n+1} | θ_1, …, θ_n  ~  (α₀/(α₀ + n)) G₀ + (1/(α₀ + n)) Σ_{i=1}^n δ_{θ_i}.                                        (5.9)

Sketch of the proof of Eq. (5.9):
This can be shown as follows. Let A ∈ B; then
P(θ_{n+1} ∈ A | θ_1, …, θ_n) = E[F(A) | θ_1, …, θ_n] = (α₀ G₀(A) + Σ_{i=1}^n δ_{θ_i}(A)) / (α₀ + n)
= (α₀/(α₀ + n)) G₀(A) + (1/(α₀ + n)) Σ_{i=1}^n δ_{θ_i}(A).
Example 1:
Consider the simple model:
Y_i = θ_i, i = 1, …, n,
θ_i | F ~ i.i.d. F,
F ~ DP(α₀ G₀),
so that we observe the PK parameters θ_i directly. Now rewrite eq. (D2) as follows:
E[F | θ_1, …, θ_n] = (α₀/(α₀ + n)) G₀ + (n/(α₀ + n)) F_n, where F_n = (1/n) Σ_{i=1}^n δ_{θ_i},
which is the empirical distribution of θ_1, …, θ_n and is also the nonparametric maximum likelihood estimate F^{ML}.
So (D2) can be written as:
E[F | Y_1, …, Y_n] = (α₀/(α₀ + n)) G₀ + (n/(α₀ + n)) F^{ML},
i.e. a mixture of the prior estimate and the NPML estimate.

It follows that as n → ∞ we have E[F | Y_1, …, Y_n] → F^{ML}, i.e. the expected posterior of F tends to the maximum likelihood estimate of F (the data eventually overcome the prior).

Example 2: Let G₀ be a deterministic discrete distribution such that
G₀ = Σ_{k=1}^K u_k δ_{φ_k}, u_k ≥ 0, Σ_k u_k = 1.
Then the random distribution
F = Σ_{k=1}^K w_k δ_{φ_k}, where (w_1, …, w_K) ~ Dir(α₀ u_1, …, α₀ u_K),                                                (5.10)
belongs to DP(α₀ G₀).
Note: G₀ is a deterministic distribution while F is a random distribution because the w_k are random variables. The proof of this example follows from the properties of the Dirichlet distribution.
For some applications K can be very large, for example when G₀ is used as a non-informative prior.

Property (D4) and Eq. (5.9) are very important in calculations with Dirichlet processes. They give a way of sampling from the posterior distribution of θ_{n+1} given θ_1, …, θ_n as follows:
θ_{n+1} ~ G₀ w.p. α₀/(α₀ + n);  θ_{n+1} = θ_i w.p. 1/(α₀ + n), i = 1, …, n,                                         (5.11)
where “w.p.” means “with probability”.
This last representation implies the “replication” property of the θ_i. It can be shown that the number of distinct values among θ_1, …, θ_n is of order log n, see [A73]. This ultimately limits the computational complexity of the problem.

Equivalence Theorem: Not only does the Dirichlet process imply (D4) and so satisfy a Polya urn model, but if any random sequence {θ_n} satisfies the Polya urn model, then, given F, the θ_n are i.i.d. with distribution F, where F ~ DP(α₀ G₀) for some random distribution F, see [BM73].

Chinese Restaurant Process: Another interpretation of (D4) is given by the Chinese Restaurant Process [A85, P96], which describes a Chinese restaurant with an infinite number of infinitely large round tables. The first customer is seated at an empty table with probability 1. The next customer has two choices: to join the first customer or to pick another empty table. When the (n + 1)st customer arrives, he chooses a new unoccupied table with probability α₀/(α₀ + n), or the occupied table i with probability n_i/(α₀ + n), where n_i is the number of people sitting at that table and α₀ is a scalar parameter of the process.
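The seating rule is easy to simulate. The sketch below (an illustration only; the function name crp_sample is hypothetical) generates table assignments for n customers with strength parameter alpha0; giving each table an additional independent draw from G₀ turns the same code into a simulation of the predictive rule (5.11).

% Sketch: simulate table assignments from the Chinese Restaurant Process.
function tables = crp_sample(n, alpha0)
    tables = zeros(1, n);
    tables(1) = 1;                       % first customer sits at table 1
    counts = 1;                          % counts(i) = number of people at table i
    ntab = 1;                            % number of occupied tables
    for cust = 2:n
        p = [counts, alpha0] / (alpha0 + cust - 1);   % occupied tables, then a new one
        k = find(rand < cumsum(p), 1);                % draw a table index
        if k > ntab                                   % a new table was chosen
            ntab = ntab + 1;
            counts(ntab) = 1;
            tables(cust) = ntab;
        else
            counts(k) = counts(k) + 1;
            tables(cust) = k;
        end
    end
end
% The number of occupied tables grows like log(n), the "replication" property noted above.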

For convenience, in the rest of this chapter and throughout Chapters 5, 6 and 7, we assume the “strength” parameter α₀ = 1.



Bayesian Version of Statistical Problem: Eqs. (2.1)-(2.2) now become:
Y_i | θ_i ~ p_i(Y_i | θ_i), independent, i = 1, …, N,                                                        (5.12a)
θ_i | F ~ i.i.d. F, i = 1, …, N,                                                                              (5.12b)
F ~ DP(α₀ G₀).                                                                                                (5.12c)

The main property of the Dirichlet process that is needed for PK dosage regimen design is:
E[F | Y_1, …, Y_N] = E[ E[F | θ_1, …, θ_N] | Y_1, …, Y_N ] = E[ (α₀ G₀ + Σ_{i=1}^N δ_{θ_i}) / (α₀ + N) | Y_1, …, Y_N ].          (5.13)

Consider the representation (5.13). Suppose we could sample from the posterior probability p(θ_1, …, θ_N | Y_1, …, Y_N). Call these samples (θ_1^{(m)}, …, θ_N^{(m)}), m = 1, …, M. Then it follows that the posterior expectation of F is approximately:
E[F | Y_1, …, Y_N] ≈ (1/(α₀ + N)) [ α₀ G₀ + (1/M) Σ_{m=1}^M Σ_{i=1}^N δ_{θ_i^{(m)}} ].                       (5.14)
This is the main formula needed for PK control applications.
For this thesis, to obtain samples from p(θ_1, …, θ_N | Y_1, …, Y_N) we use the Gibbs sampler (GS) discussed in the next chapter.




Chapter 6. Theoretical Foundations of the Gibbs Sampler

Again in this chapter, we take the “strength” parameter α₀ = 1.
Gibbs Sampler [CG92]:
To obtain samples from a joint distribution p(x_1, …, x_d), proceed as follows. If the t-th sample is (x_1^{(t)}, …, x_d^{(t)}), then draw
x_j^{(t+1)} ~ p(x_j | x_1^{(t+1)}, …, x_{j−1}^{(t+1)}, x_{j+1}^{(t)}, …, x_d^{(t)}), j = 1, …, d.                 (6.1)
The Gibbs sampler generates a dependent Markov chain. It is shown in [CG92] that, under reasonable assumptions, the limiting distribution of the Gibbs samples is p(x_1, …, x_d).

Gibbs Sampler (GS) for the model (5.12a)-(5.12c)
GS requires sampling from the full conditional distributions p(θ_i | θ_{(−i)}, Y_1, …, Y_N), where θ_{(−i)} denotes (θ_1, …, θ_{i−1}, θ_{i+1}, …, θ_N).
By Bayes’ law: p(θ_i | θ_{(−i)}, Y_1, …, Y_N) ∝ p(Y_i | θ_i) p(θ_i | θ_{(−i)}). Now p(Y_i | θ_i) is calculated from Eq. (5.12a), and p(θ_i | θ_{(−i)}) is calculated from property (D4) by rearranging the order of θ_1, …, θ_N.

Gibbs Sampler
Let (θ_1^{(0)}, …, θ_N^{(0)}) be any point in Θ^N.
For t = 0, 1, 2, …, sample θ_i^{(t+1)}, i = 1, …, N, from
θ_i^{(t+1)} ~ p(θ_i | θ_{(−i)}^{(t)}, Y_1, …, Y_N),                          (6.2)
where now θ_{(−i)}^{(t)} = (θ_1^{(t+1)}, …, θ_{i−1}^{(t+1)}, θ_{i+1}^{(t)}, …, θ_N^{(t)}).
(In other words, during a GS cycle the most current updates are used.)

Eq. (6.2) can be written as follows:
p(θ_i | θ_{(−i)}, Y_i) = q_{i0} G₀(θ_i | Y_i) + Σ_{j≠i} q_{ij} δ_{θ_j}(θ_i),                                     (6.3)
where
q_{i0} ∝ α₀ ∫ p(Y_i | θ) dG₀(θ),  q_{ij} ∝ p(Y_i | θ_j) for j ≠ i,  normalized so that q_{i0} + Σ_{j≠i} q_{ij} = 1.        (6.4)

Note that G₀(θ_i | Y_i) ∝ p(Y_i | θ_i) g₀(θ_i) is the posterior density of θ_i given Y_i under the assumed prior G₀.
Sampling from Eq. (6.3) is performed by:
θ_i ~ G₀(· | Y_i) w.p. q_{i0};  θ_i = θ_j w.p. q_{ij}, j ≠ i.                                              (6.5)
Note: GS generates a dependent Markov chain whose limit is p(θ_1, …, θ_N | Y_1, …, Y_N). Implementing GS becomes an art. One usually throws away the first segment of the chain (“burn-in”) and then “thins” out the rest of the chain.

Convergence tests must then be used. One of the simplest and most effective tests for convergence is the method of Brooks and Gelman [BG98]. In this method at least two chains are run independently, and statistical tests are made to see whether the chains converge to the same quantities. In practice, visual evidence is usually sufficient.
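As a minimal illustration of burn-in, thinning, and a visual check (not the full Brooks–Gelman statistic), the sketch below assumes two vectors chain1 and chain2 already produced by independent runs of the Gibbs sampler.

% Sketch: burn-in, thinning and a simple visual convergence check.
burn = 1000; thin = 10;
c1 = chain1(burn+1:thin:end); c1 = c1(:)';    % keep every 10th sample after burn-in
c2 = chain2(burn+1:thin:end); c2 = c2(:)';
run1 = cumsum(c1) ./ (1:numel(c1));           % running mean of chain 1
run2 = cumsum(c2) ./ (1:numel(c2));           % running mean of chain 2
plot(run1); hold on; plot(run2); hold off
legend('chain 1','chain 2')                   % the two curves should settle on the same value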


Chapter 7. Computer Simulations for Gibbs Sampler

Linear model  
                                                              (7.1)
                                                                                           (7.2)

                                                                                 (7.3)

                                                         (7.4)

We will need the following formula for the product of two normal densities.
Let n(x; m, P) be the normal density with mean vector m and covariance matrix P, evaluated at the point x. Then
n(x; m_1, P_1) · n(x; m_2, P_2) = n(m_1; m_2, P_1 + P_2) · n(x; m_3, P_3),
where P_3 = (P_1⁻¹ + P_2⁻¹)⁻¹ and m_3 = P_3 (P_1⁻¹ m_1 + P_2⁻¹ m_2).
Matrix Identity [LS72]
                                                                         (7.5)



For scalar case:
                          (7.6)

1. Galaxy Data
To illustrate the Gibbs sampler (and NPML) and the replication property of DP, we describe
results for the “Galaxy Data” problem.  This problem is a benchmark for practically all density
estimation programs, see [EW95].  The data consists of 82 galaxy velocities ranging from 9 units
to 35 units.  
The histogram of the data is given below.


Frequency Histogram of Galaxy Data (velocities on x – axis)

The model for the Galaxy data is a scalar linear model. In terms of our linear model, Eqs. (7.1)-(7.4), we have:
Y_i = θ_i + ε_i, ε_i ~ N(0, σ₁²), i = 1, …, 82, with σ₁ = 1, and G₀ = N(μ₀, σ₀²).                              (7.7)

Back to Eq. (6.4):
q_{i0} ∝ n(Y_i; μ₀, σ₀² + σ₁²),  q_{ij} ∝ n(Y_i; θ_j, σ₁²), j ≠ i.        (7.8)
To summarize,
G₀(θ_i | Y_i) = N(μ_i*, σ*²),  μ_i* = (μ₀ σ₁² + Y_i σ₀²)/(σ₀² + σ₁²),  σ*² = σ₀² σ₁²/(σ₀² + σ₁²).                   (7.9)
 

Results:
Below is the smoothed graph of E[F | Y_1, …, Y_N] that was obtained using the Gibbs sampler, which generated 10,000 samples (of which the first 5,000 were discarded as burn-in) for the following values of the parameters: μ₀ = 20.8315 (sample mean of the data), σ₀ = 4.5681 (sample standard deviation of the data). To produce this graph we used the following argument.
Write n_σ(θ) for the N(0, σ²) density. Then, from Eq. (5.14), we can approximate each δ_θ by a normal distribution with mean θ and very small standard deviation σ (we used σ = 0.1), so that
E[F | Y_1, …, Y_N] ≈ (1/(1 + N)) [ n(θ; μ₀, σ₀²) + (1/M) Σ_{m=1}^M Σ_{i=1}^N n_σ(θ − θ_i^{(m)}) ]                        (7.10)
would be a sum of normals. Then using Matlab we graph this density as a function of θ.
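A compact version of this smoothing argument is sketched below (our illustration, assuming α₀ = 1, a normal G₀ with parameters mu0 and sigma0, and a matrix gtheta of posterior samples such as the one returned by gsample in Appendix 2).

% Sketch: smoothed E[F | Y_1,...,Y_N] on a grid, Eq. (7.10), with alpha_0 = 1.
sig  = 0.1;                                    % smoothing std replacing the delta masses
grid = 5:0.1:35;                               % velocity grid
[M, N] = size(gtheta);                         % M samples of (theta_1,...,theta_N)
dens = normpdf(grid, mu0, sigma0) / (1 + N);   % alpha_0 * G_0 term
for m = 1:M
    for i = 1:N
        dens = dens + normpdf(grid, gtheta(m,i), sig) / (M * (1 + N));
    end
end
plot(grid, dens)                               % smoothed posterior expectation of F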

Fig. 1: The smoothed graph of E[F | Y_1, …, Y_N]. Galaxy Data
For comparison we also display the graph of the NPML estimate F^{ML}.

Fig. 2: The graph of F^{ML} (6 support points). Galaxy Data


 
2. Linearized PK model
Let us consider the PK model of Eq. (3.15). As before,
y_ij = a e^{−K_i t_ij} (1 + ε_ij), j = 1, …, J, i = 1, …, N,                          (7.11)
where a and the t_ij’s are considered to be known and K_i is an unknown parameter. For simplicity, we consider the “linearized” version of the model, which can be derived as follows. Take the log of both sides of Eq. (7.11); then
log y_ij = log a − K_i t_ij + log(1 + ε_ij).
Notice that for small enough values of ε_ij, log(1 + ε_ij) ≈ ε_ij; therefore we have
log y_ij ≈ log a − K_i t_ij + ε_ij,
and since a and the t_ij are known constants we can write
Y_ij = −t_ij θ_i + ε_ij, where Y_ij = log y_ij − log a and θ_i = K_i.                            (7.12)
If we use the notation of (5.12a)-(5.12c) for this model, then
Y_i | θ_i ~ N(A_i θ_i, σ² I), with A_i = −(t_i1, …, t_iJ)',                                                                   (7.13)
θ_i | F ~ i.i.d. F,                                                                                                           (7.14)
F ~ DP(α₀ G₀), G₀ = N(μ₀, σ₀²),                                                                                              (7.15)
where σ, μ₀, and σ₀ are known.
Then using Eqs. (6.4) and (7.5) we obtain the corresponding weights q_{i0}, q_{ij} and the posterior G₀(θ_i | Y_i).
                                                                                                       (7.16)
Results:
Below is the histogram of the Gibbs sample that was obtained using the above algorithm, which generated samples for the following values of the parameters: σ = 0.3 (standard error term), α₀ = 1, μ₀ = 0.3 (sample mean of the simulated values of the elimination rate constant K_el), and σ₀² = 0.0190 (sample variance of the simulated values of the elimination rate constant K_el). In a real data problem (not simulated) we can find μ₀ and σ₀² by first running NPML and then using the sample mean and covariance of the NPML estimate.

 
Fig 3: The graph of the simulated data for the above values of parameters. PK example

Fig. 4: Frequency Histogram of Gibbs sample for 20 subjects (elimination rate constants on x –
axis). PK example

For comparison we also display the smoothed graph of F^{ML} and the graph of the true distribution of θ (we know it exactly since we simulated the data for this example; in particular θ ~ 0.8·N(0.3, 0.06²) + 0.2·N(0.6, 0.06²)). The value of the smoothing standard deviation is 0.01.

Fig. 5: The smoothed graph of F^{ML} (5 support points). PK example

Fig. 6: The graph of the true distribution of θ. PK example
If we compare Fig. 4 with the true distribution of θ in Fig. 6, we can clearly see that the NPB estimate shows two bumps. They are even more obvious if instead of 20 subjects we use 100 subjects. We present the histogram of the Gibbs sample for 100 subjects below.

Fig. 7: Frequency Histogram of Gibbs sample for 100 subjects (elimination rate constants on x-
axis). PK example
Chapter 8. Stick-Breaking Approach in NPB Estimation

Consider the following statistical model:
Y_i | θ_i ~ p_i(Y_i | θ_i), independent, i = 1, …, N,                                                        (8.1)
θ_i | F ~ i.i.d. F, i = 1, …, N,                                                                             (8.2)
where

Y = (Y_1, …, Y_N) is a vector of independent but not necessarily identically distributed measurements with known conditional density p_i(Y_i | θ_i);
θ = (θ_1, …, θ_N) is a vector of unknown parameters defined on a space Θ in finite dimensional Euclidean space (Θ ⊂ R^d);
F is an unknown probability distribution of the θ_i on Θ.

In general, X ~ G will mean that X has distribution G.

Assume that, given F, the θ_1, …, θ_N are independent and identically distributed random vectors with common (but unknown) probability distribution F on Θ.

Our main problem is to estimate F based on the data Y_1, …, Y_N.
We will use the Nonparametric Bayesian (NPB) approach to estimate F. In the NPB approach, the distribution F is itself a random variable. The first requirement is to place a prior distribution on F. A convenient way of doing this is with the so-called Dirichlet process (DP) prior, see [F73, WW97, WW98]. A Dirichlet process prior on a random distribution F is defined by a probability distribution G₀ and a scale parameter α₀, where G₀ is our prior guess of F and the parameter α₀ is the strength of the guess. We write F ~ DP(α₀ G₀) for the Dirichlet process.

Bayesian Version of Statistical Problem: Eqs. (8.1)-(8.2) now become:
Y_i | θ_i ~ p_i(Y_i | θ_i), independent, i = 1, …, N,                                                        (8.3a)
θ_i | F ~ i.i.d. F, i = 1, …, N,                                                                             (8.3b)
F ~ DP(α₀ G₀).                                                                                               (8.3c)

Now the population analysis problem is to estimate the full conditional distribution of F given the data Y_1, …, Y_N.

Most methods to solve this problem employ a marginal approach: it turns out that F can be “integrated out”, leaving a much simpler problem for the θ_i. The resulting model for the θ_i is given by the Polya urn representation:
p(θ_{n+1} | θ_1, …, θ_n) = (α₀ G₀ + Σ_{i=1}^n δ_{θ_i}) / (α₀ + n).
The marginal approach leads to a Gibbs sampler algorithm, described in Chapter 6, for estimating E[F | Y_1, …, Y_N], i.e. the expected value of F given the data, not the posterior distribution of F itself. This approach is used by practically all researchers, and is sufficient for the problems they consider. However, it does not solve the classical population analysis problem as stated above, i.e. to estimate the full conditional distribution of F given the data.
To solve this problem we propose the Full Conditional Method.
In this method we estimate the full conditional distribution of F given the data Y_1, …, Y_N, not just its expected value.

The Full Conditional method begins with a definition of the Dirichlet process called the Stick-Breaking representation, see [S94] and Ishwaran and James [IJ01]. Consider the random distribution:
F = Σ_{k=1}^∞ w_k δ_{φ_k},                                                 (8.4)
where the random vectors φ_k are independent and identically distributed (i.i.d.) from the known distribution G₀ and the weights w_k are defined by the so-called stick-breaking process as follows:
w_1 = V_1,  w_k = V_k ∏_{j=1}^{k−1} (1 − V_j), k ≥ 2,
where V_1, V_2, … are i.i.d. Beta(1, α₀). Here Beta(a, b) is the Beta distribution with parameters a and b. The name “Stick Breaking” comes from the fact that the V_k are random cuts of a stick of length 1 and the w_k are the lengths of the pieces broken off. This gives an informal proof that the w_k sum to 1. It is shown in [S94] that the random distribution F ~ DP(α₀ G₀) if and only if F can be written in the form of Eq. (8.4).
Intuitively, the “Stick Breaking” construction works as follows. Suppose F is represented as in Eq. (8.4) and suppose there is a stick of length 1. Let V_1, V_2, … be the fractions we take away from the remainder of the stick each time. Then each w_k is the length we take away at step k.

 
Fig 1: Illustration of Stick Breaking (taken from Zoubin Ghahramani’s Tutorial in 2005)

By this construction, we will have Σ_{k=1}^∞ w_k = 1 almost surely. In this way, we get an explicit construction of F ~ DP(α₀ G₀):
F = Σ_{k=1}^∞ w_k δ_{φ_k},  φ_k ~ i.i.d. G₀,  w_k from the stick-breaking process.                             (8.5)
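The construction takes only a few lines of Matlab; the sketch below (an illustration only; betarnd is from the Statistics Toolbox) generates the first K stick-breaking weights for a given strength parameter alpha0.

% Sketch: first K weights of the stick-breaking construction, Eq. (8.4).
K = 50; alpha0 = 1;
V = betarnd(1, alpha0, 1, K);       % V_k ~ Beta(1, alpha0), independent
w = zeros(1, K);
rest = 1;                           % length of the remaining stick
for k = 1:K
    w(k) = V(k) * rest;             % piece broken off at step k
    rest = rest * (1 - V(k));       % what remains of the stick
end
% sum(w) = 1 - rest, which tends to 1 as K grows, so the w_k sum to 1 in the limit.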



Below, we show how to use this Stick-Breaking representation to estimate the posterior distribution of F given the data, not just E[F | Y_1, …, Y_N]. The corresponding estimate of F leads to estimates of all moments and their corresponding credibility intervals.

Truncated Stick-Breaking: Ishwaran and James [IJ01] consider replacing the infinite sum in Eq. (8.4) by a truncated version:
F_K = Σ_{k=1}^K w_k δ_{φ_k},                                             (8.6)
where it is now required that V_K = 1, so that the series of weights sums to one. They show that the truncated version is virtually indistinguishable from F for sufficiently large K. The only problem now is how large K has to be. Ishwaran and James [IJ01] have suggested a value of K that is sufficient. We will show that this number can be dramatically reduced.
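One rough way to gauge the truncation level is to look at the stick length left after K − 1 breaks: since each break removes a Beta(1, α₀) fraction, the expected leftover weight is (α₀/(1 + α₀))^(K−1). The sketch below tabulates this quantity; it is only a back-of-the-envelope check, not the truncation bound of Ishwaran and James.

% Sketch: expected weight left beyond the first K-1 stick-breaking components.
alpha0 = 1;
K = [10 20 50 100];
leftover = (alpha0 / (1 + alpha0)).^(K - 1);
disp([K' leftover'])   % for alpha0 = 1 the leftover mass halves with each extra component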
 
Full Conditional Approach: Let us assume now that we have a sufficient number of components K in Eq. (8.6) to accurately approximate the infinite sum in Eq. (8.5). Then, using the truncated version of the Stick-Breaking definition of the Dirichlet process, we can rewrite our model in the following way:
Y_i | θ_i ~ p_i(Y_i | θ_i), independent, i = 1, …, N,
θ_i | F_K ~ i.i.d. F_K,
F_K = Σ_{k=1}^K w_k δ_{φ_k},  φ_k ~ i.i.d. G₀,  (w_1, …, w_K) from the truncated stick-breaking process.           (8.7)

Eq. (8.7) defines a mixture model with an unknown but finite number of components. Much is known about this subject, see [T06], [TBS08], and [FS10]. For a fixed number of components K, the posterior distribution of the weights w_k and the support points φ_k can be determined by Blocked Gibbs Sampling [IJ01, Sec. 5.2].
Blocked Gibbs Sampling:
The model in (8.7) can be written in the following way:
Y_i | θ_i ~ p_i(Y_i | θ_i), independent, i = 1, …, N,
θ_i = φ_{S_i}, where S_i = k with probability w_k,
φ_k ~ i.i.d. G₀, k = 1, …, K,
(w_1, …, w_K) from the truncated stick-breaking process.                                                      (8.8)

Since θ_i | F_K ~ F_K, we let θ_i = φ_{S_i}, where S_i = k with probability w_k.
Note: The set S = (S_1, …, S_N) may contain values that are not distinct. For example, let N = 7 and S = (1,2,4,3,2,3,4). This set of indices corresponds to the set (φ_1, φ_2, φ_4, φ_3, φ_2, φ_3, φ_4) of values of the θ_i’s. Then let S* be the set of distinct values of S; in the case of the above example S* = (1,2,3,4). Note also that the set S* of distinct values of S is in general not all of {1, …, K}.

The blocked Gibbs sampler allows sampling from the full posterior distribution of (S, φ, w) by iteratively sampling from the conditional distributions of
S | φ, w, Y;   φ | S, w, Y;   w | S, φ, Y.

Then each sample of (S, φ, w) defines a random probability measure
F_K = Σ_{k=1}^K w_k δ_{φ_k},
which eventually gives a sample from the posterior distribution of F_K given the data.
Note: The posterior distribution of F_K is not the same as the posterior distribution of F; however, as K goes to infinity, F_K converges to F, which is a sample from the posterior Dirichlet process, see [IJ01].
Blocked Gibbs Sampling requires sampling from the full conditional distributions, just like the regular Gibbs Sampling. Let us obtain them.
The easiest way to calculate these densities is to consider the joint density p(Y, S, φ, w).
Use Bayes’ law repeatedly:
p(Y, S, φ, w) = p(Y | S, φ, w) p(S | w) p(φ) p(w).
Now the full conditional distribution of w given S turns out to be a generalized Dirichlet distribution, which can be defined as follows.
Definition: The generalized Dirichlet distribution (GD) is a generalization of the Dirichlet distribution with a more general covariance structure and almost twice the number of parameters. We say that the vector (P_1, …, P_K) has a generalized Dirichlet distribution GD(a_1, …, a_{K−1}; b_1, …, b_{K−1}) if the density function of (P_1, …, P_{K−1}) can be written as
f(p_1, …, p_{K−1}) = ∏_{k=1}^{K−1} [Γ(a_k + b_k) / (Γ(a_k) Γ(b_k))] p_k^{a_k−1} (1 − p_1 − … − p_k)^{γ_k},
where γ_k = b_k − a_{k+1} − b_{k+1} for k = 1, …, K−2, γ_{K−1} = b_{K−1} − 1, and Γ is the gamma function.
By independence assumptions,
p(Y, S, φ, w) = [∏_{i=1}^N p_i(Y_i | φ_{S_i})] [∏_{i=1}^N w_{S_i}] [∏_{k=1}^K g₀(φ_k)] π(w),
where g₀ is the density of G₀ and π(w) is the density of the truncated stick-breaking weights.
Now let X be any one of the variables (S, φ, w) and let X_(−) be the remaining variables. Then p(X | X_(−), Y) ∝ p(Y, S, φ, w). Now we can write all the formulas we need.

1. Sample φ_k, k = 1, …, K, in the following way:
p(φ_k | S, w, Y) ∝ g₀(φ_k) ∏_{i: S_i = k} p_i(Y_i | φ_k).                                  (8.10)
Therefore φ_k is sampled from G₀ if k ∉ S* (no observation is currently assigned to component k), and otherwise from the posterior of φ_k given the observations {Y_i : S_i = k} and the prior G₀.
2. Sample S_i, i = 1, …, N, from:
P(S_i = k | φ, w, Y) ∝ w_k p_i(Y_i | φ_k), k = 1, …, K.
Therefore S_i is drawn from the discrete distribution with these (normalized) probabilities.                        (8.11)
3. Sample w, given S and φ, from:
p(w | S, φ, Y) ∝ π(w) ∏_{i=1}^N w_{S_i},
which is a generalized Dirichlet distribution GD(a, b)
with a_k = 1 + M_k and b_k = α₀ + Σ_{r=k+1}^K M_r, where M_k = #{i : S_i = k}.
The way to sample from this distribution is
V_k ~ Beta(1 + M_k, α₀ + Σ_{r=k+1}^K M_r), k = 1, …, K−1,  V_K = 1,
w_1 = V_1,  w_k = V_k ∏_{j=1}^{k−1} (1 − V_j), k = 2, …, K.                                     (8.12)
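This is the same update that the function getw of Appendix 3 implements; a condensed sketch is given below, where M(k) denotes the number of indices with S_i = k, and alfa_0 and K are as in Appendix 3.

% Sketch: draw new weights w from the generalized Dirichlet full conditional, Eq. (8.12),
% given the counts M(k) = #{i : S_i = k}.
V = zeros(1, K);
for k = 1:K-1
    V(k) = betarnd(1 + M(k), alfa_0 + sum(M(k+1:K)));
end
V(K) = 1;                         % the last piece takes all of the remaining stick
w = zeros(1, K); rest = 1;
for k = 1:K
    w(k) = V(k) * rest;
    rest = rest * (1 - V(k));
end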
Consequently, for a fixed K, we will obtain L samples of the support points φ^{(1)}, …, φ^{(L)} and the weights w^{(1)}, …, w^{(L)} by following these steps:
1. Initiate:
Let w^{(0)} = (1/K, …, 1/K) and generate S^{(0)} by drawing each S_i^{(0)} from w^{(0)}.
2. Draw the l-th sample as follows:
Step 1: Pick S*, the distinct values of S^{(l−1)}.
For k ∈ S*, sample φ_k^{(l)} from the posterior p(φ_k | {Y_i : S_i^{(l−1)} = k}) ∝ g₀(φ_k) ∏_{i: S_i^{(l−1)} = k} p_i(Y_i | φ_k);
for k ∉ S*, sample φ_k^{(l)} from G₀.
Step 2: Sample S_i^{(l)}, i = 1, …, N, from P(S_i = k) ∝ w_k^{(l−1)} p_i(Y_i | φ_k^{(l)}), k = 1, …, K.
Step 3: Sample w^{(l)} from the generalized Dirichlet distribution as in Eq. (8.12), using the counts M_k = #{i : S_i^{(l)} = k}.
Note: As opposed to the Gibbs sampler for the Marginal Method, here the Gibbs sampler directly involves the distribution F_K.
Sampling the posterior of F, the expected value E[F | Y_1, …, Y_N], and the moments of F.
Let F_K^{(1)}, …, F_K^{(L)} be L samples of F_K from the Gibbs sampler after the sampler has “converged”. Then we have:
Samples from the posterior of F:
F_K^{(l)} = Σ_{k=1}^K w_k^{(l)} δ_{φ_k^{(l)}}, l = 1, …, L.
Samples from the moments of F: Let m_j(F) = ∫ θ^j dF(θ) be the j-th moment of F.
Then samples of m_j(F) are given by m_j(F_K^{(l)}) = Σ_{k=1}^K w_k^{(l)} (φ_k^{(l)})^j.
In particular, samples from the first moment, or mean, are given by m_1(F_K^{(l)}) = Σ_{k=1}^K w_k^{(l)} φ_k^{(l)}.
A histogram of the values of m_1(F_K^{(l)}) gives the estimated distribution of the mean of F. Bayesian credibility intervals are derived from this.
Samples from the expected value E[F | Y_1, …, Y_N]: for this quantity we calculate (1/L) Σ_{l=1}^L F_K^{(l)}.
Note: To check our algorithm, we can compare our estimate of E[F | Y_1, …, Y_N] with the estimate from the marginalization method.
Galaxy Data
To illustrate the Gibbs sampler (and NPML) and the replication property of DP, we describe
results for the “Galaxy Data” problem.  This problem is a benchmark for practically all density estimation programs, see [EW95].  The data consists of 82 galaxy velocities ranging from 9 units
to 35 units.  
The histogram of the data is given below.


Frequency Histogram of Galaxy Data (velocities on x – axis)

The model for the Galaxy data is a scalar linear model. The equations (8.8) can now be written as:
Y_i = θ_i + ε_i, ε_i ~ N(0, σ²), σ = 1, i = 1, …, 82,
θ_i = φ_{S_i}, where S_i = k with probability w_k,
φ_k ~ i.i.d. N(μ_{G₀}, σ_{G₀}²), k = 1, …, K,
(w_1, …, w_K) from the truncated stick-breaking process with α₀ = 1.                                             (8.13)
Blocked GS for the Galaxy data. Fix some K and repeat the following steps L times; we also discarded the first B samples as “burn-in”.
1. Initiate: Let w^{(0)} = (1/K, …, 1/K) and generate S^{(0)} by drawing each S_i^{(0)} from w^{(0)}.
2. Draw the l-th sample as follows:
Step 1: Pick S*, the distinct values of S^{(l−1)}, where μ_{G₀} = sample mean of the data and σ_{G₀}² = sample variance of the data.
For k ∉ S*, sample φ_k^{(l)} ~ N(μ_{G₀}, σ_{G₀}²);
for k ∈ S*, sample φ_k^{(l)} ~ N(μ_k*, σ_k*²), where
μ_k* = (σ_{G₀}² Σ_{i: S_i = k} Y_i + σ² μ_{G₀}) / (M_k σ_{G₀}² + σ²),  σ_k*² = σ_{G₀}² σ² / (M_k σ_{G₀}² + σ²),
and M_k = #{i : S_i^{(l−1)} = k}.
Step 2: Sample S_i^{(l)}, i = 1, …, N, from P(S_i = k) ∝ w_k^{(l−1)} n(Y_i; φ_k^{(l)}, σ²), k = 1, …, K.
Step 3: Sample w^{(l)} as in Eq. (8.12), using the counts M_k = #{i : S_i^{(l)} = k}.
Results: Below is the smoothed graph of E[F_K | Y_1, …, Y_N], where K = 20, that was obtained using the blocked Gibbs sampler, which generated L = 10,000 samples with the first B = 1,000 samples burned in, for the following values of the parameters: μ_{G₀} = 20.8315 (sample mean of the data), σ_{G₀}² (sample variance of the data). To produce this graph we used the following argument.
Write n_σ(θ) for the N(0, σ²) density. We approximate each δ_φ by a normal distribution with mean φ and very small standard deviation σ (we used σ = 0.1), so that
E[F_K | Y_1, …, Y_N] ≈ (1/(L − B)) Σ_{l=B+1}^L Σ_{k=1}^K w_k^{(l)} n_σ(θ − φ_k^{(l)})                              (8.14)
would be a sum of normals. Then using Matlab (see the program in the Appendix) we graph this density as a function of θ.
The following table shows the comparison of three different methods for estimating the unknown probability distribution F.

Quantity of interest    NPML       Marginal GS    Blocked GS
E(θ)                    20.8315    20.8437        20.7874,  CI [20.7774, 20.7975]
std(θ)                  4.4536     4.3983         4.2160,   CI [4.2057, 4.2262]

Table 6: This table shows the comparison of three different methods for estimating the unknown probability distribution F for the Galaxy Data example.

To obtain the expected value of θ in the case of the NPML estimate we used the formula E(θ) = Σ_l w_l θ_l over the NPML support points and weights; in the case of the marginal GS we used the average of all Gibbs samples of the θ_i; and in the case of the blocked GS estimate we calculated the expected value of the mean distribution F̄ = (1/J) Σ_j F_K^{(j)}, which is (1/J) Σ_j Σ_k w_k^{(j)} φ_k^{(j)}.

To obtain the standard deviation of θ we used the following formulas. In the case of the NPML estimate, std(θ) = ( Σ_l w_l θ_l² − (Σ_l w_l θ_l)² )^{1/2}; for the marginal GS we used the analogous sample formula over the Gibbs samples of the θ_i.
To calculate the standard deviation of θ in the case of the blocked GS estimate we used
std(θ) = ( (1/J) Σ_j (σ_j² + μ_j²) − μ̄² )^{1/2},
where μ_j and σ_j are the mean and standard deviation of the j-th sampled distribution F_K^{(j)} and μ̄ = (1/J) Σ_j μ_j.
Notice that only in the case of the blocked GS are we able to calculate confidence intervals for both quantities of interest, E(θ) and std(θ). We did it in the following way. First we ran the blocked GS algorithm J = 10,000 times; for each of those runs we used 10,000 iterations and considered only the last sample of the w’s and φ’s, which forms a distribution F_K^{(j)}. For each of those distributions we calculated the expected value of θ, call it μ_j, and the standard deviation of θ, call it σ_j. Therefore we now have two samples of J expectations and J standard deviations. If we assume that these samples are normally distributed, then confidence intervals for the true means can be calculated as follows. The confidence interval for the true mean of μ_1, …, μ_J is μ̄ ± t_{0.975, J−1} s_μ / √J, where μ̄ = (1/J) Σ_j μ_j, s_μ is the sample standard deviation of the μ_j, and t_{0.975, J−1} is the 97.5th percentile of the t-distribution with J−1 degrees of freedom. The confidence interval for the true mean of the sample of standard deviations σ_1, …, σ_J is calculated the same way, replacing μ_j with σ_j.
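In Matlab this amounts to one line per interval; the sketch below assumes the vectors mu and sigma of length J produced by the blocked GS runs (as in the code of Appendix 3) and uses tinv from the Statistics Toolbox.

% Sketch: 95% confidence intervals for the mean of the J values of E(theta) and std(theta).
J = numel(mu);
t975 = tinv(0.975, J - 1);                             % 97.5th percentile of t with J-1 d.f.
ci_mu    = mean(mu)    + [-1 1] * t975 * std(mu)    / sqrt(J);
ci_sigma = mean(sigma) + [-1 1] * t975 * std(sigma) / sqrt(J);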


Chapter 9. Maximum Likelihood of Stick-Breaking (Connection between
NPML and NPB)

Let F_K^{(t)} = Σ_{k=1}^K w_k^{(t)} δ_{φ_k^{(t)}} be the distribution at the t-th iteration of the blocked Gibbs Sampler (GS) algorithm.
1) The likelihood of F_K^{(t)} is
L(F_K^{(t)}) = p(Y_1, …, Y_N | F_K^{(t)}) = ∏_{i=1}^N p(Y_i | F_K^{(t)}) = ∏_{i=1}^N ∫ p(Y_i | θ) dF_K^{(t)}(θ) = ∏_{i=1}^N Σ_{k=1}^K w_k^{(t)} p(Y_i | φ_k^{(t)}),

where the last equality holds because F_K^{(t)} is a finite sum of weights and delta functions.
Then the log-likelihood can be written as follows:
l(F_K^{(t)}) = log L(F_K^{(t)}) = Σ_{i=1}^N log [ Σ_{k=1}^K w_k^{(t)} p(Y_i | φ_k^{(t)}) ].                               (9.1)
2) If we let l_max = max_t l(F_K^{(t)}), then we can compare l_max to the value of the log-likelihood function of F^{ML}, the maximum likelihood estimate of the unknown distribution F of θ that we found using the nonparametric maximum likelihood (NPML) approach.
Since F^{ML} maximizes the likelihood function, the following inequality is always true:
l_max ≤ l(F^{ML}).                                            (9.2)
Note: Equality is possible with a large enough number of iterations of the blocked GS.
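Checking the inequality numerically only requires the log-likelihood of a discrete distribution with weights w and support points φ. A sketch for the scalar Galaxy model, where p(Y_i | θ) is the N(θ, σ₁²) density (an assumed example; the function name loglik_FK is hypothetical), is given below.

% Sketch: log-likelihood l(F_K) of a discrete distribution with row vectors w(1:K)
% and phi(1:K), for the scalar model Y_i ~ N(theta_i, sigma1^2), Eq. (9.1).
function ll = loglik_FK(y, w, phi, sigma1)
    ll = 0;
    for i = 1:numel(y)
        ll = ll + log(sum(w .* normpdf(y(i), phi, sigma1)));
    end
end
% The maximum of loglik_FK over the blocked GS iterations never exceeds l(F_ML), Eq. (9.2).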
We decided to check whether this inequality is satisfied by the results of the full nonparametric Bayesian (Full NPB) approach. First we present the results for the Galaxy data example.
Galaxy data
Model:
Y_i = θ_i + ε_i, ε_i ~ N(0, σ²), σ = 1,
θ_i = φ_{S_i}, φ_k ~ i.i.d. N(μ_{G₀}, σ_{G₀}²),
(w_1, …, w_K) from the truncated stick-breaking process with α₀ = 1.

We ran the blocked GS for 10,000 iterations for the following values of the parameters: K = 20, μ_{G₀} = sample mean and σ_{G₀}² = sample variance. The maximum value of the log-likelihood that we obtained is  and the value of the log-likelihood function of F^{ML} that we found before is . We can see that the inequality (9.2) holds in the case of the Galaxy data example, and moreover the two values are very close.
In the two figures below we present the graphs of F^{ML} and F_K^{(t)} at the iteration that attained the maximum.

Fig 1: The graph of F^{ML}; there are 6 support points. Galaxy data

Fig 2: The graph of F_K^{(t)}; there are 20 support points in total but only 7 of them have significant weight. Galaxy data

The distributions are very similar. There are two reasons why F_K^{(t)} has 7 support points while F^{ML} has 6. First, we would need more blocked GS iterations. Second, the solution of the NPML problem is not unique, but the value of the log-likelihood function is. Therefore, even with a large enough number of blocked GS iterations, the two distributions can still look somewhat different.



Bibliography
[A85]  Aldous D.J. Exchangeability and related topics. Volume 1117 of In Ecole d’ Ete de
Probabilites de Saint-Flour XII (Edited by P. L. Hennequin) Springer Lecture Notes in
Mathematics. 1985.
[B06]  Baek, Y. (2006). An Interior Point Approach to Constrained Nonparametric Mixture
Models. PhD Dissertation, Mathematics Department, University of Washington.
[B12] Bell, Bradley (2012) http://moby.ihme.washington.edu/bradbell/non_par/non_par.xml
[BG98] Brooks S and Gelman A. General methods for monitoring convergence of iterative
simulations. Journal of Computational and Graphical Statistics 7: 434–455, 1998.
[BM73]  Blackwell D and MacQueen JB. Ferguson Distributions Via Polya Urn Schemes. The Annals of Statistics, Vol. 1, No. 2 (1973), pp. 353-355.
[BV04] Boyd, Stephen; Vandenberghe, Lieven (2004). Convex Optimization. Cambridge:
Cambridge University Press.
[CG92]  Casella G and George EI. Explaining the Gibbs Sampler. The American
Statistician, 46 (1992), pp. 167-174.
[DK84]   Katz, D., and D.Z. D'Argenio. Discrete approximation of multivariate densities with
application to Bayesian estimation. Computational Stat. Data Anal. 2:27-36, 1984.
[DL11]  Delattre, M.; Lavielle, M. PAGE 20 (2011) Abstr 2054 [www.page-
meeting.org/?abstract=2054]
[E94]  Escobar MD. Estimating Normal Means with a Dirichlet Process Prior. J. of the American
Statistical Association. Vol. 89: 268-277, 1994
[EW95]  Escobar MD, West M. Bayesian Density Estimation and Inference Using Mixtures.
Journal of the American Statistical Association 1995 Jun; 90(430):577-588.
[F73]  Ferguson T. A Bayesian Analysis of Some Nonparametric Problems. Ann.
Statist. Vol. 1, (1973), 209-230.
[G89]  Geweke J. Bayesian Inference in Econometric Models Using Monte Carlo Integration
Econometrica. 57: 1317-1339, 1989.
[GR03]  Ghosh JK and Ramamoorthi RV. "Bayesian nonparametrics."  Springer-
Velag, New York, 2003.
[HO05]  Hjort LD and Ongaro A. Exact Inference for Random Dirichlet Means Statistical
Inference for Stochastic Processes. 8:227–254, 2005.
[IJ01]  Ishwaran H and James LF. Gibbs Sampling Methods for Stick-Breaking Priors. J. of the American Statistical Association. 96: 161-173, 2001.
[IT02]  Ishwaran H, Takahara G. Independent and Identically Distributed Monte Carlo
Algorithms for Semi-parametric Linear Mixed Models, Journal of the American Statistical
Association, Vol. 97, No. 460 Dec., 2002, pp. 1154- 1166
[J70] Jazwinski, Andrew H. (1970), Stochastic processes and filtering theory, Academic Press,
New York.
[KLW94]  Kong A, Liu JS, Wong WH. Sequential Imputations and Bayesian Missing Data
Problems. Journal of the American Statistical Association 1994 Mar; 89(425): 278-288.
[KT51] Kuhn, H. W.; Tucker, A. W. (1951). Nonlinear programming. Proceedings of 2nd
Berkeley Symposium. Berkeley: University of California Press. pp. 481–492.
[L01]  Leary, R., Jelliffe, R., Schumitzky, A., & Guilder, M. V. (2001). An adaptive grid non-
parametric approach to pharmacokinetic and dynamic (PK/PD) population models. Computer-
Based Medical Systems, IEEE Symposium.  
[L83]  Lindsay BG. The Geometry of Mixture Likelihoods: A General Theory. Annals of
Statistics 1983; 11:86-94.
[Liu 96]  Liu JS. Nonparametric Hierarchical Bayes Via Sequential Imputations. The Annals of
Statistics 1996 Jun; 24(3):911-930.
[LJSG01]  Leary R, Jelliffe R, Schumitzky A, Van Guilder M. An adaptive grid non-parametric
approach to population pharmacokinetic/dynamic (PK/PD) population models. Proceedings, 14
th

IEEE symposium on Computer Based Medical Systems 2001, Jul, 389-394.
[Lo84]  Lo AY. On a class of Bayesian nonparametric estimates: I. Density Estimates. Annals of
Statistics 1984; 12: 351-357.
[M07]  S. Mortensen, S. Klim, B. Dammann, N. Kristensen, H. Madsen, R. Overgaard "A
Matlab framework for estimation of NLME models using stochastic differential equations",
Journal of Pharmacokinetics and Pharmacodynamics  vol:34, pages: 623-642, 2007.
[M86]  Mallet A. A maximum likelihood estimation method for random coefficient regression
models. Biometrika 1986; 73:645-656.
[NDYSJ 12] Neely, M, van Guilder, M, Yamada, W, Schumitzky, A & Jelliffe, R (2012), Pmetrics: a non-
parametric and parametric pharmacometric package for R. Theraputic drug monitoring. 34(4), 467–476.
[P96]  Pitman J. Statistics, Probability and Game Theory. (Chapter: Some developments of the
Blackwell-Mac Queen urn scheme, volume 30 of IMS Lecture Notes-Monograph series.)  1996
(pp. 245-267)
[PD04]  Park K and D’Argenio DZ.  Control of Uncertain Pharmacodynamic Systems. In:
Advanced Methods of Pharmacokinetic and Pharmacodynamic Systems Analysis – III. Edited by
DZ D’Argenio.  New York, Plenum Press, 2004; 275-293.  
[Q98]  Fernando A. Quintana, Nonparametric Bayesian Analysis for Assessing Homogeneity in
k x I Contingency Tables with Fixed Right Margin Totals, Journal of the American Statistical
Association, Vol. 93, No. 443 Sep., 1998, pp. 1140- 1149
[S91]  Schumitzky A. Nonparametric EM Algorithms for Estimating Prior Distributions. Applied
Math and Computation 1991; 45:141-157.
[T13] Tatiana Tatarinova, Michael Neely, Jay Bartroff, Michael van Guilder, Walter Yamada,
David Bayard, Roger Jelliffe, Robert Leary, Alyona Chubatiuk, Alan Schumitzky, (2013) Two
general methods for population pharmacokinetic modeling: non-parametric adaptive grid and
non-parametric Bayesian. Journal of Pharmacokinetics and Pharmacodynamics, Volume 40,
Issue 2, pp 189-199
[W91]  Welle M, (1991).   Implementation of a Nonparametric Algorithm for Estimating Prior
Distributions.  MS Thesis, Department of Mathematics. University of Southern California.
[W96]  Wakefield JC. Bayesian individualization via sampling based methods. Journal of
Pharmacokinetics and Biopharmaceutics, 24: 103-131, 1996.
[WSRG94]  Wakefield J, Smith A, Racine-Poon A, Gelfand AE. Bayesian Analysis of Linear
and Nonlinear Population Models. Applied Statistics. 43: 201-222, 1994.
[WW97]  Wakefield J, Walker S. Bayesian nonparametric population models: formulation and
comparison with likelihood approaches. J Pharmacokinet. Biopharm. 1997 Apr; 25(2):235-253.
[WW98]  Wakefield J, Walker S. Population models with a nonparametric random coefficient
distribution. Sankhya Series B 1998;60:196-212.





Appendix

Appendix 1. NPML PK Code –Matlab

% Matlab program for NPML general non-linear PK problem
 
%Main program:
K = input('number of support points =  ');
dim = input ('dimension of theta =  ');
type = input('for unimodal model input "1", for bimodal model input "2"  ');
format compact
 
[y, t, sigma, theta_0, w_0, eps] = initial_data(K, type, dim);
y
% theta_0    
% w_0
 
[count, theta, w, LogLikelihood, PYL] = NPEM_nonlinear(y, t,
w_0, theta_0, eps, sigma);
 
count
theta
w
LogLikelihood
 
if dim == 1
   Dfun = @(theta_parameter)D(theta_parameter, y, t, sigma,
PYL);
   NDfun = @(theta_parameter)(-1)*Dfun(theta_parameter);
   x = fminbnd(NDfun,0.2,0.8);
   D_max_val = Dfun(x)
   fplot(Dfun,[0,1])
   grid on
   plot_as_pulse(theta,w,2,1,'b')
end
 
if dim == 2
   Dfun = @(theta_parameter)D(theta_parameter, y, t, sigma,
PYL);
    K = (0.1:0.01:1);
    V = (10:0.1:50);
       for i = 1:length(K)
            for j = 1:length(V)
                Z(i,j) = Dfun([K(i),V(j)]);
            end
       end
  D_max_val = max(max(Z))
   surf(K, V, Z')
end
function [y, t, sigma, theta_0, w_0, eps] = initial_data(K,
type, dim)
 
rand('seed', 0)
theta_0 = zeros(1,K);
 
[t,y] = simulation(type, dim);
sigma = 0.3;
 
a = 0.1;
b = 0.8;
c = 14;
d = 26;
for l = 1:K
   if dim == 1
       theta_0(l) = a + (l-rand)*((b - a)/K);
   end
   if dim == 2
       theta_0(1,l) = a + (l-rand)*((b - a)/K);
       theta_0(2,l) = c + (l-rand)*((d - c)/K);
   end
end
 
w_0 =(1/K) * ones(1,K);
eps =10^(-8);
function [t,y] = simulation(type, dim)
 
N = 20;
J = 5;
t = zeros(N,J);
k = zeros(1,N);
sigma = 0.3;
rand('seed', 0)
randn('seed', 0)
 
for i = 1:N
    t(i,:) = [1.5 2 3 4 5.5];
end
% Model (1) where K is generated as a single normal distribution
if type == 1  
  mu = 0.5;
  sigma_k = 0.06;
  k = randn(1,N) * sigma_k + mu;
end
% Model (2) where K is generated as a mixture of two normal distributions
if type == 2  
  mu = [0.3 0.6];
  sigma_k = [0.06 0.06];
  p = 0.8;      
   for i = 1:N
       w = rand;  
       if w <= p
           k(i) = randn * sigma_k(1) + mu(1);
       else
           k(i) = randn * sigma_k(2) + mu(2);
       end
    end
end
% main loop, simulation of y  
if dim == 1
  for i = 1:N
       for j = 1:J    
            y(i,j) =5*exp((-1)*k(i)* t(i,j))*(1 + randn *
sigma);  
       end
  end
end
 
if dim == 2
   for i = 1:N
       V(i) = randn * 2 + 20;
       for j = 1:J
           y(i,j) = (100/V(i)) *exp((-1)*k(i)* t(i,j))*(1 +
randn * sigma);  
       end
   end
end
function [count, theta, w, LogLikelihood, PYL] =
NPEM_nonlinear(y, t, w_0, theta_0, eps, sigma)
 
%---matrix: y(NxJ), t(NxJ), w_0(1xK), theta_0(dim x K)
%---scalar: eps,sigma,linar
 
[N,J] = size(y);
[dim,K] = size(theta_0);
old_theta = theta_0;
old_w = w_0;
new_theta = zeros(dim,K);
count = 0;
%  initial likelihood calculation--------------------------
 
PYL  =  p_y_l(y, t, old_theta, old_w, sigma);           %% PYL = p(Yi | lamda) is a vector (1xN)
 
new_likelihood = L(PYL);
likelihood = new_likelihood - 2 * eps;
 
%  main loop-----------------------------------------------
 
while (abs(new_likelihood - likelihood) >= eps)&&(count <= 1000)
   
   PTYL = p_t_y_l(y, t, old_theta, old_w, PYL, sigma);  %% PTYL = p(theta(l) | Yi, lamda) is a matrix (NxK)
   
   % theta_prime calculation------------------------------  
   if dim == 1  
      for l = 1:K  
          fun = @(theta)Q(l, y, t, theta, sigma, PTYL);
          new_theta(l) = fminbnd(fun,0.1,0.8);        
      end
   end
   
   if dim == 2
          for l = 1:K    
              fun = @(theta)Q(l, y, t, theta, sigma, PTYL);
              new_theta(:,l) = fminsearch(fun,[0.2,14]);          
          end
   end
 % w_prime calculation------------------------------------
 
   new_w = (1/N) * sum(PTYL,1);                                
   
% making new parameters old before it goes to the next loop
 
   old_theta = new_theta;
   old_w = new_w;
   
   PYL = p_y_l(y, t, old_theta, old_w, sigma);
   
   likelihood = new_likelihood;
   new_likelihood = L(PYL);
   count = count+1;
end
 
LogLikelihood = new_likelihood;    
theta = old_theta;
w = old_w;
function p = p_y_l(y, t, theta_0, w_0, sigma)
 
N = length(y(:,1));
K = length(theta_0(1,:));
f_lamda = zeros(N,K);
 
for i = 1:N
   for l = 1:K
   f_lamda(i,l) = w_0(l) * prob(y(i,:), t(i,:), theta_0(:,l),
sigma);  
   end
end
p_y_lamda = sum(f_lamda,2);
 
p = p_y_lamda;
function p = prob(y, t, theta, sigma)
 
J = length(y);
dim = length(theta);
z = zeros(1,J);
 
for j = 1:J
   if dim == 1
      m =5*exp((-1)*theta*t(j));
      z(j) = normpdf(y(j),m,m*sigma) ;
   end
   if dim == 2
      m = (100/theta(2))*exp((-1)*theta(1)*t(j));
      z(j) = normpdf(y(j),m,m*sigma) ;
   end
end
p = prod(z);
function p = p_t_y_l(y, t, theta_0, w_0, p_y_lamda, sigma)
 
N = length(y(:,1));
K = length(theta_0(1,:));
F = zeros(N,K);
 
 
for i = 1:N
   if p_y_lamda(i)==0
            F(i,:) = (1/K)*ones(1,K);
   else
         for l = 1:K
              F(i,l) = w_0(l) * prob(y(i,:), t(i,:),
theta_0(:,l), sigma) / p_y_lamda(i);
         end
    end
end
p = F;


function f = Q(l, y, t, theta, sigma, p_theta_y_lamda)
 
N = length(y(:,1));
f_y = zeros(1,N);
 
for i = 1:N    
       z = prob(y(i,:), t(i,:), theta, sigma);
       if z == 0                                       %% done to avoid log of zero
           f = Inf;
       else
           f_y(i) = p_theta_y_lamda(i,l) * log(prob(y(i,:),
t(i,:), theta, sigma));
       end    
end
 
f = -sum(f_y);
function L = L(p_y_l)
 
l_p = log(p_y_l);
 
L = sum(l_p);
function D = D(theta_parameter, y, t, sigma, PYL)
 
N = length(y);
D_components = zeros(1,N);
 
for i=1:N
   
   D_components(i) = prob(y(i,:), t(i,:), theta_parameter,
sigma)/PYL(i);
   
end
D = sum(D_components) - N;
 
end

To plot the 3D spikes graph we used the following code

clear

theta = [0.2779 19.3815; 0.2779 19.3815; 0.2779 19.3815; 0.2779
19.3815; 0.2779 19.3815; 0.2779 19.3815; 0.3051   19.0041;    
0.3456   23.0530;    0.3456   23.0530;    0.3456   23.0530;    
0.4384   26.1566;    0.4384   26.1566;    0.5830   20.6542;    
0.5830   20.6542;    0.6658   23.7547;    0.6658   23.7547;  
0.6658   23.7547;    0.6658   23.7547;    0.6658   23.7547;    
0.6658   23.7547];
w = [0.0000,    0.0001,    0.0008,    0.0105,    0.1002,    
0.1841,    0.1470,    0.0960,    0.1412,    0.1256,    0.0230,    
0.0295,    0.0165,    0.0265,    0.0075,    0.0403,    0.0449,    
0.0060,    0.0004,    0.0000];
 
r=[18:0.1:27];
t=[0.2:0.01:0.7];
 
Z=0;
 
F =@(q)MLF(q,theta',w,0.003);
 
 
for i = 1:length(t)
            for j = 1:length(r)
                Z(i,j) = F([t(i),r(j)]);
            end
end
 
surf(t,r,Z');
function F = MLF(t,theta,w,sigma)
 
[M,N] = size(theta);
 
F = 0;
 
if M==1
   for i = 1:N
     F = F + (w(i)* normpdf(t, theta(i),
sigma))/normpdf(theta(i), theta(i), sigma);
   end
end
 
if M==2
   for i = 1:N
     F = F + (w(i)* normpdf(t(1), theta(1,i),
sigma)*normpdf(t(2), theta(2,i),
(sigma+0.09)))*(2*pi*sigma*(sigma+0.09));
   end    
end

To plot the true distribution of  

trueF =
@(theta)((0.8*normpdf(theta(1),0.3,0.06)+0.2*normpdf(theta(1),0.
6,0.06))*normpdf(theta(2),20,2));
K = (0.2:0.02:0.7);
    V = (18:0.5:27);
     
    Z=zeros(length(K),length(V));
     
       for i = 1:length(K)
            for j = 1:length(V)
                Z(i,j) = trueF([K(i),V(j)]);
            end
       end
       
       surf(K, V, Z')


Appendix 2. Marginal NPB Galaxy Code –Matlab

% The Matlab program for NPB Galaxy Data

% Data
y = [ 9.1720 9.3500 9.4830 9.5580 9.7750 10.2270 10.4060 16.0840 16.1700 18.4190 18.5520
18.6000 18.9270 19.0520 19.0700 19.3300 19.3430 19.3490 19.4400 19.4730 19.5290 19.5410
19.5470 19.6630 19.8460 19.8560 19.8630 19.9140 19.9180 19.9730 19.9890 20.1660 20.1750
20.1790 20.1960 20.2150 20.2210 20.4150 20.6290 20.7950 20.8210 20.8460 20.8750 20.9860
21.1370 21.4920 21.7010 21.8140 21.9210 21.9600 22.1850 22.2090 22.2420 22.2490 22.3140
22.3740 22.4950 22.7460 22.7470 22.8880 22.9140 23.2060 23.2410 23.2630 23.4840 23.5380
23.5420 23.6660 23.7060 23.7110 24.1290 24.2850 24.2890 24.3660 24.7170 24.9900 25.6330
26.9600 26.9950 32.0650 32.7890 34.2790];
 
 
M = 10000;
burn = 5000;
sigma1 = 1;
gtheta = gsample(M,burn,y,sigma1);
 
mu = mean(y);
sigma0 = std(y);
sigma = 0.1;
 
F = @(theta)ExpF(theta,gtheta,mu,sigma0,sigma);
 
theta = 5:0.1:35;
 
K = length(theta);
 
for k = 1:K
   k
   X(k) = F(theta(k));
end
 
plot(theta,X)


function w = weights(i,y,theta,sigma1)
 
n= length(y);
mu = mean(y);
sigma0 = std(y);
 
unw = zeros(1,n);
 
for j=1:n
   if j==i
       % slot i carries the weight of a fresh draw from G0 (the q_{i0} term of Eq. (6.4))
       unw(j) = normpdf(mu, y(i), sqrt(sigma0^(2)+sigma1^(2)));
   else
       unw(j) = normpdf(y(i),theta(j),sigma1);
   end
end
b = sum(unw);
w = (1/b)* unw;

______________________________________________________________________________
function theta = rulet(i,x,w,mu,sigma)
 
% rand('twister', 0)
%randn('state', 0)
n = length(w);
u = rand;
y = 0;
s = 0;
 
for j = 1:n
   if j == i
       if (s < u)&&(u <= (s+w(i)))
           y = sigma*randn+mu;
       end
       s = s+w(i);
 
   else
       if (s < u)&&(u <= (s+w(j)))
       y = x(j);
       end
       s = s+w(j);
   end
end
theta = y;

_____________________________________________________________________________
function gtheta = gsample(M,burn,y,sigma1)
 
n = length(y);
G = zeros((M-burn),n);
mu = mean(y);
sigma0 = std(y);
sigmastar =  sqrt(sigma1^(2) * sigma0^(2)/(sigma1^(2) + sigma0^(2)));   % std of G0(theta_i | Y_i), Eq. (7.9)
 
k = -burn;
 
G(1,:) = y;
 
x = y;
 
 
for m = 1:M
   
   for i = 1:n
       
       w = weights(i,y,x,sigma1);
       
       mustar = (mu*sigma1^(2) + y(i)*sigma0^(2))/(sigma1^(2) + sigma0^(2));
       
       x(i) = rulet(i,x,w,mustar,sigmastar);
   
   end
   
   k = k+1;
   if k > 0
       G(k,:) = x;
   end
   
end
gtheta = G;




function EF = ExpF(theta,GS,mu,sigma0,sigma)
 
[M,N] = size(GS);
 
EF = (1/(N+1))* normpdf(theta,mu,sigma0);
 
for m = 1:M
   for i = 1:N
       EF = EF + (1/M)*(1/(N+1))* normpdf(theta, GS(m,i), sigma);
   end
end

Appendix 3. Full NPB Galaxy code – Matlab

y = [ 9.1720 9.3500 9.4830 9.5580 9.7750 10.2270 10.4060 16.0840 16.1700
18.4190 18.5520 18.6000 18.9270 19.0520 19.0700 19.3300 19.3430 19.3490
19.4400 19.4730 19.5290 19.5410 19.5470 19.6630 19.8460 19.8560 19.8630
19.9140 19.9180 19.9730 19.9890 20.1660 20.1750 20.1790 20.1960 20.2150
20.2210 20.4150 20.6290 20.7950 20.8210 20.8460 20.8750 20.9860 21.1370
21.4920 21.7010 21.8140 21.9210 21.9600 22.1850 22.2090 22.2420 22.2490
22.3140 22.3740 22.4950 22.7460 22.7470 22.8880 22.9140 23.2060 23.2410
23.2630 23.4840 23.5380 23.5420 23.6660 23.7060 23.7110 24.1290 24.2850
24.2890 24.3660 24.7170 24.9900 25.6330 26.9600 26.9950 32.0650 32.7890
34.2790];
 
L = 10000;
 
burn = 1000;
 
K = 20;
 
alfa_0 = 1;
 
mu_G0 = mean(y);
 
sigma_G0 = std(y);
 
sigma_prob = 1;
 
[w,fi,S] = BlockedGS(y,alfa_0,K,mu_G0,sigma_G0,sigma_prob,L,burn);
I =length(w(:,1));
mu = zeros(1,I);
sigma = zeros(1,I);
for i = 1:I
mu(i)  = w(i,:)*(fi(i,:))';
dif = fi(i,:)-mu(i);
wsqdif = zeros(1,K);
for j = 1:K
   wsqdif(j) = w(i,j)*dif(j)^2;
end
sigma(i) = sqrt(sum(wsqdif));
end
avrmu = mean(mu);
stdmu = std(mu);
avrsigma = mean(sigma);
stdsigma = std(sigma);
Std = sqrt((1/I)*(sigma*sigma'+mu*mu')-avrmu^2);
 
ConfIntmu = [avrmu - 1.96*stdmu/sqrt(I) avrmu + 1.96*stdmu/sqrt(I)];
 
ConfIntsigma = [avrsigma - 1.96*stdsigma/sqrt(I) avrsigma +
1.96*stdsigma/sqrt(I)];
 
N = length(S(:,1));
C = zeros(1,N);
 
for i = 1:N
   C(i) = length(unique(S(i,:)));  
end
 
avrC = mean(C);
 
figure(3);
hist(C)

function [w,fi,S] = BlockedGS(y,alfa_0,K,mu_G0,sigma_G0,sigma_prob,L,burn)
 
w_0 = (1/K)*ones(1,K);
N =length(y);
 
fi = zeros((L-burn),K);
S = zeros((L-burn),N);
w = zeros((L-burn),K);
 
k = -burn;
 
for i = 1:N
   S_new(i) = gets(w_0);
end
 
fi_new = zeros(1,K);
w_new = w_0;
 
for l = 1:L
 
   S_old = S_new;
   w_old = w_new;
   
   fi_new = zeros(1,K);
   S_new = zeros(1,N);
   w_new = zeros(1,K);
   
   fi_new = getfi_normal(y, mu_G0,sigma_G0, sigma_prob, S_old, K);
   S_new = step2(y,fi_new,w_old,sigma_prob);
   w_new = getw(alfa_0, S_new,K);
   
   k = k+1;
   if k > 0
       fi(k,:) = fi_new;
       S(k,:) = S_new;
       w(k,:) = w_new;
   end
   
end
 
fi_h = fi(1,:);
S_h = S(1,:);
w_h = w(1,:);
 
if k>1
   for k = 2:K
       fi_h = [fi_h,fi(k,:)];
       S_h = [S_h,S(k,:)];
       w_h = [w_h,w(k,:)];
   end
end
 
 
figure(1)
clf
hist(S_h)
 
sigma = 0.2;
 
figure(2)
clf
F = @(t)(1/(L-burn))*MLF(t,fi_h,w_h,sigma);
fplot(F,[min(fi_h)-3*sigma, max(fi_h)+3*sigma])
 
 
end
%----------------------------------------------------------------------------
---
function F = MLF(t,theta,w,sigma)
 
N = length(theta);
 
F = 0;
 
 
   for i = 1:N
     F = F + (w(i)* normpdf(t, theta(i), sigma))/normpdf(theta(i), theta(i),
sigma);
   end
 
 
 
end


function S = gets(w)
 
K = length(w);
 
u = rand;
y = 0;
s = 0;
 
for k = 1:K
   
       if (s < u)&&(u <= (s+w(k)))
           y = k;
       end
       s = s+w(k);
 
end
S = y;


function fi = getfi_normal(y, mu_G0,sigma_G0, sigma_prob, S, K)
 
N = length(S);
fi = zeros(1,K);
 
for k = 1:K
   ind{k} = {};
end
 
 
for i = 1:N        
    ind{S(i)}{end+1} = i;
end
 
 
for k =1:K
   mu = 0;
   M = 0;
   if isempty(ind{k})
           
       fi(k) = randn*sigma_G0 + mu_G0;
       
   else
       M = length(ind{k});
       for i = ind{k}
           mu = mu+y(i{1});
       end
       fi(k) = randn*sqrt(sigma_G0*sigma_prob/(M*sigma_G0+sigma_prob))+
(mu*sigma_G0+mu_G0*sigma_prob)/(M*sigma_G0+sigma_prob);
   
   end
end
 

function S = step2(y,fi_new,w_old,sigma_prob)
 
N = length(y);
K = length(fi_new);
S = zeros(1,N);
w1 = zeros(1,K);
 
for i = 1:N
   for k = 1:K
       w1(k) = w_old(k)*prob(y(i),fi_new(k),sigma_prob);
   end
   nor = sum(w1);
   w = (1/nor)*w1;
   S(i) = gets(w);
end

function w = getw(alfa_0, S_new,K)
 
M = zeros(1,K);
V = zeros(1,K);
N = length(S_new);
 
 
for k = 1:K
   ind{k} = {};
end
 
for i = 1:N        
    ind{S_new(i)}{end+1} = i;
end
 
for k = 1:K
  M(k) = length(ind{k});
end
 
A = 0;
B = 0;
 
for k = 1:K-1
   A = 1+M(k);
   B = alfa_0;
   for r = k+1:K
       B = B+M(r);
   end    
   V(k) = betarnd(A,B);
end
 
V(K) = 1;
 
w(1)= V(1);
 
for k = 2:K
   w(k) = w(k-1)*(1-V(k-1))*V(k)/V(k-1);
end

function p = prob(y,fi,sigma)          %% f(Y_i | fi_S_i), known conditional density of Y given fi
p =  normpdf(y,fi,sigma);  
end 