SALIENCY BASED IMAGE PROCESSING TO AID RETINAL PROSTHESIS RECIPIENTS
by
Neha Parikh
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)
August 2010
Copyright 2010 Neha Parikh
Acknowledgements
I owe my deepest gratitude to my advisor Dr. James Weiland for being my
guiding support throughout this journey. Dr. Weiland has been the greatest source of
knowledge, encouragement and enthusiasm. Staying positive and striving harder are
important qualities of a good researcher, and I have learnt them from him. I owe sincere
thanks to him for bettering my research and for making me a better person.
I would like to thank Dr. Mark Humayun for his support, insights and great
suggestions towards my work. I also thank Dr. Laurent Itti who has always been a source
of great advice and guidance. I thank Dr. Armand Tanguay and Ben McIntosh for the
simulated vision system that helped me conduct my experiments. I would also like to
acknowledge Dr. Gisele Ragusa, Dr. Manbir Singh and Dr. Bartlett Mel for their inputs.
My colleagues at the Retinal Prosthesis lab have always been supportive of my
work. They have helped me through the ups and downs of this journey - be it as pilot
study subjects or as fellow researchers with useful insights. I would like to acknowledge
Dr. Aditi Ray, Vivek Pradeep, Devyani Nanduri, Alice Cho, Dr. Leanne Chan, Andrew
Weitz, Samantha Cunningham and Navya Davuluri for making my experience such a
memorable one at the lab. All the people involved at the Retinal Prosthesis Lab and at the
BMES-ERC have also contributed immensely towards my success.
I would like to thank my parents and brother for always encouraging me and
instilling in me the motivation, patience and perseverance to work towards my goals.
With the continued love, support and encouragement from my husband, Nishit Rathod,
this journey never felt difficult. I also thank all my family members and friends who
have always shown so much warmth and love for me.
Table of Contents
Acknowledgements ii
List of Tables vi
List of Figures vii
Abstract xi
Chapter 1: Introduction 1
1.1 Background and Motivation 1
1.1.1 Human Eye and the Retina 1
1.1.2 Retinal Blindness 5
1.1.3 Aids for the Visually Impaired 7
1.1.4 Visual Prostheses 12
1.2 Image Processing in the Retina / Cortex 15
1.2.1 Image Processing in the Retina 15
1.2.2 Image Processing in the MidBrain and Visual Cortex 20
1.3 Retinal Prosthesis and Related Image Processing 23
1.3.1 Design of Retinal Prosthesis 23
1.3.2 Image Processing for a Retinal Prosthesis 25
1.4 Organization of this Thesis 33
Chapter 2: Image Processing and Hardware Constraints 35
2.1 Image Processing Approach for Retinal Prosthesis 35
2.2 Filtering and Decimation of Images 36
2.3 Image Contrast Enhancement and Edge Detection 39
2.4 Object Detection and Recognition 45
2.4.1 Visual Attention and Object Detection/Recognition
in the Visual Cortex 45
2.4.2 Object Detection / Recognition Algorithms 46
2.5 Computational Models of Visual Attention 50
2.5.1 Koch and Ullman 50
2.5.2 Milanese 51
2.5.3 Itti and Koch (Neuromorphic Vision Toolkit) 53
2.5.4 VOCUS (Frintrop et al.) 54
2.6 Hardware and Computational Limitations 55
2.6.1 DSP and FPGA 55
2.6.2 Power and Computational Efficiency Trade-Off 56
2.6.3 Benchmarking Basic Image Processing Algorithms on DSPs 59
2.7 Discussion 61
Chapter 3: Algorithm to Detect Salient Regions 64
3.1 Algorithm Model 65
3.1.1 Saliency Detection Model by Itti et al 65
3.1.2 The ‘New’ Model 66
3.1.3 Results 76
3.1.4 New vs. Full Model 76
3.2 Hardware Performance Comparison 79
3.3 Validation of the algorithm 81
3.3.1 Subject Population 81
3.3.2 Methods 82
3.3.3 Data Analysis 84
3.3.4 Results 90
3.4 Modeling for Top Down Information 97
3.4.1 Top-Down Approach by Frintrop et al. 98
3.4.2 The New Model and Top-Down Information 101
3.5 Summary 107
Chapter 4: Simulated Vision Experiments 111
4.1 Background 112
4.1.1 Studies with the Visually Impaired 112
4.1.2 Simulated Vision Studies with the Normally Sighted 116
4.2 Set up for Experimental Designs for Testing the Saliency Algorithm 119
4.3 Experiments 121
4.3.1 Finding Objects on an Uncluttered Table Top 122
4.3.2 Mobility Task with Similar Looking Obstacles 129
4.3.3 Mobility Task in an Office Area 136
4.3.4 Mobility Task in a Corridor 143
4.3.5 Desk Task to Search for a Target 157
4.4 Summary 165
Chapter 5: Summary and Discussion 168
Bibliography 173
List of Tables
Table 2.1 Comparison between different TI DSP processors 58
Table 2.2 Execution rates for basic image processing algorithms on
different processors 61
Table 3.1 Comparison between the full model and the new model 78
Table 3.2 Comparison of execution time for the new and full models on
the DM642 80
Table 3.3 Ratio of medians analysis for the new and full models 91
Table 3.4 Ratio of medians analysis after image shuffling for the new
and full models 92
Table 3.5 Analysis using the NSS method for the new and full models 94
Table 3.6 Analysis using ratio of medians after removing centrally biased
images 96
Table 3.7 Analysis using NSS after removing centrally biased images 96
Table 3.8 Hit numbers and percentage of images in which the target coke
can is found in the Bottom Up (BU), Top Down (TD) and
Global salience maps 103
Table 3.9 Hit numbers and percentage of images in which the target cell
phone is found in the Bottom Up (BU), Top Down (TD) and
Global salience maps 106
List of Figures
Figure 1.1 The adult human eye 3
Figure 1.2 The human retina and its different layers 4
Figure 1.3 Design of an epiretinal retinal prosthesis 24
Figure 1.4 Visual field for a normal human visual system and for a retinal
prosthesis recipient (red box) 25
Figure 2.1 Example of filtering using averaging and Gaussian kernels and
decimation with and without filtering 38
Figure 2.2 Example of contrast enhancement using histogram equalization 40
Figure 2.3 Outputs using different edge detection kernels 45
Figure 2.4 Example of an input image and its object detection output using
thresholding 48
Figure 2.5 Example of input image and corresponding salience map computed
by the visual attention algorithm by Itti et al. 54
Figure 2.6 DM642 Schematics Diagram 59
Figure 3.1 Architecture for the Full Model 67
Figure 3.2 Architecture for the New Model 68
Figure 3.3 An example input image and its saturation and intensity images 69
Figure 3.4 Gaussian pyramids at six successive levels for the saturation, intensity
and edge streams 70
Figure 3.5 Feature maps at Gaussian scales (3-6), (3-7), (4-7) for the input image
in figure 3.3 for each of the 3 information streams 71
Figure 3.6 Effects of Normalization on maps with distinct (a) or similar peaks (b) 74
Figure 3.7 Conspicuity salience maps for each of the 3 information streams
and the final salience map for the input image in figure 3.3 75
Figure 3.8 Examples of input images and corresponding salience maps computed
by the new and full models 77
Figure 3.9 Distribution of gaze fixation and random points 90
Figure 3.10 Gaze distribution (a) and average salience map (b) for the 150 image
data set 92
Figure 3.11 Gaze fixation points distribution for the salience and random maps 93
Figure 3.12 Examples of training images for the coke can and corresponding
bottom-up salience maps 104
Figure 3.13 Images of test cases for the coke can with their bottom-up salience
maps and global salience maps created by combining the
bottom-up and top down salience maps 104
Figure 3.14 Examples of testing images for a cell phone with the corresponding
bottom-up and global salience maps 107
Figure 4.1 Head Mounted System for displaying simulated vision and a scene
camera to capture the real-world information 121
Figure 4.2 Set up for the object finding task 123
Figure 4.3 Head movements in degrees averaged over 7 subjects for each trial
for the no cueing and cueing cases with 1 object 125
Figure 4.4 Head movements in degrees averaged over 7 subjects for each trial
for the no cueing and cueing cases with 2 objects 125
Figure 4.5 Head movements in degrees averaged over 7 subjects for each trial
for the no cueing and cueing cases with 3 objects 126
Figure 4.6 Average head movements and standard error of mean (sem) for the
1, 2 and 3 object cases 126
Figure 4.7 Time in seconds averaged over 7 subjects for each trial for the
no cueing and cueing cases with 1 object 127
Figure 4.8 Time in seconds averaged over 7 subjects for each trial for the
no cueing and cueing cases with 2 objects 127
Figure 4.9 Time in seconds averaged over 7 subjects for each trial for the
no cueing and cueing cases with 3 objects 128
Figure 4.10 Average time and standard error of mean (sem) for the 1, 2 and 3
object cases 128
Figure 4.11 Chair and target set up for mobility task 131
Figure 4.12 Head movement velocity in the horizontal direction for all subjects 133
Figure 4.13 Head movement velocity in the vertical direction for all subjects 133
Figure 4.14 Total head movements in degrees for all the subjects for the no cueing
and cueing trials 134
Figure 4.15 Bar graph representing the average head movements in degrees
over all trials and all subjects for the no cueing and cueing cases
along with the standard error of mean (s.e.m) 134
Figure 4.16 Time in seconds to finish the task for all subjects 134
Figure 4.17 Bar graph representing the average time in seconds over all trials
and all subjects for the no cueing and cueing cases along with the
standard error of mean (s.e.m) 135
Figure 4.18 Experimental setup and simulated vision 138
Figure 4.19 Head movement velocity in the horizontal direction for all subjects 139
Figure 4.20 Head movement velocity in the vertical direction for all subjects 140
Figure 4.21 Total head movements in degrees for all the subjects for the no cueing
and cueing trials 140
Figure 4.22 Bar graph representing the average head movements in degrees over
all trials and all subjects for the no cueing and cueing cases
along with the standard error of mean (s.e.m) 141
Figure 4.23 Time in seconds to finish the task for all subjects 141
Figure 4.24 Bar graph representing the average time in seconds over all trials
and all subjects for the no cueing and cueing cases along with the
standard error of mean (s.e.m) 142
Figure 4.25 Corridor set up for mobility testing 145
Figure 4.26 Scene camera and simulated vision view 146
Figure 4.27 Head movements averaged over all subjects for the corridor
mobility experiment 147
Figure 4.28 Average head movements for session 1 and session 2 of phase 1 147
Figure 4.29 Average head movements for phase 1 and phase 2 148
Figure 4.30 Horizontal head movements for trials 1, 5 and 9 in phase 1 for the
no cueing and cueing groups 150
Figure 4.31 Horizontal head movement velocities for trials 1, 5 and 9 in phase 1
for the no cueing and cueing groups 151
Figure 4.32 Time averaged over all subjects for the corridor mobility experiment 152
Figure 4.33 Average time in seconds for session 1 and session 2 of phase 1 152
Figure 4.34 Average time in seconds for phase 1 and phase 2 152
Figure 4.35 Number of errors averaged over all subjects for the corridor
mobility experiment 153
Figure 4.36 Number of errors for session 1 and session 2 of phase 1 154
Figure 4.37 Average number of errors for phase 1 and phase 2 154
Figure 4.38 Set up for the desk task of finding the coke can and the corresponding
simulated vision 159
Figure 4.39 Phase 1 head movements for the no cueing and cueing groups 160
Figure 4.40 Phase 2 head movements for the no cueing and cueing groups 160
Figure 4.41 Head movements for Phase 1 and Phase 2 for the no cueing and
cueing groups 160
Figure 4.42 Phase 1 time in seconds for the no cueing and cueing groups 161
Figure 4.43 Phase 2 time in seconds for the no cueing and cueing groups 162
Figure 4.44 Time in seconds for Phase 1 and Phase 2 for the no cueing and
cueing groups 162
Figure 4.45 Phase 1 number of errors for the no cueing and cueing groups 163
Figure 4.46 Phase 2 number of errors for the no cueing and cueing groups 163
Figure 4.47 Number of errors for Phase 1 and Phase 2 for the no cueing and
cueing groups 163
Abstract
Diseases like Retinitis Pigmentosa and Age-related Macular Degeneration result in a
gradual and progressive loss of photoreceptors leading to blindness. A retinal prosthesis
device imparts partial, artificial vision to patients in the central 15-20 degrees of the
visual field by electrically activating the remaining healthy cells of the retina through
currents delivered by an electrode array. Many commercially available visual aids help
blind patients with their day-to-day activities and navigation tasks. Most of these
technologies carry equipment or infrastructure overhead and are designed for either indoor
or outdoor use. The retinal prosthesis system design includes an image processing module
that can be utilized to process camera images in indoor and outdoor environments using
algorithms to provide information about the surroundings and aid patients in their daily
activities. This thesis presents work towards developing, validating and testing image
processing algorithms for a retinal prosthesis system that could be used to aid retinal
prosthesis recipients in navigation and search tasks.
A computationally efficient implementation of a saliency detection algorithm is
presented. This is a bottom-up algorithm that can be used to detect the presence of objects
in the peripheral visual field of the patients and direct their attention towards
these objects using cues. Implementing the algorithm on the TMS320DM642 Digital
Signal Processor (DSP) shows that the execution rate is approximately 10 times faster
than an earlier visual attention model. To validate the algorithm outputs, for a set of
images, the areas computed as salient by the algorithm are compared to areas gazed at by
human observers. The results show that the algorithm predicts regions of interest better
than chance. To optimize algorithm performance in scenarios when patients are searching
for an object of interest, the bottom-up model is also integrated with a top-down
information module. The integrated algorithm also uses information about the features of
the objects of interest and enhances the computed salience maps to give greater weight
to those objects. Testing the integrated algorithm with everyday objects like a red
coke can and a black cell phone shows that the integrated model indeed detects the
objects of interest sooner than the bottom-up-only model.
To test the anticipated benefits that could be offered by a saliency based image
processing algorithm to retinal prosthesis recipients in navigation and search tasks,
simulated vision experiments with normally sighted volunteers were conducted. The
subjects were provided with 6x10-pixel vision in the central 15 degrees of their visual
field, and their performance was measured as they performed navigation and search tasks.
Results show that for all tasks, the cumulative head movements and the errors of the
subjects using help from the saliency algorithm are significantly lower than those of
subjects using natural head scanning. Time was significantly lower for the cueing group
subjects only for search tasks. The greatest improvement in the performance of the cueing
group over the no cueing group was observed in the initial trials in new environments,
which implies that such a system may benefit the patients most in new and unfamiliar
surroundings. A cueing system may provide additional confidence to the patients in their
day to day activities.
This thesis discusses the computational limitations for possible image processing
algorithms to be used for a retinal prosthesis system. Generic as well as customized
versions of the saliency-based image processing are discussed for use by the subjects
according to the relevant tasks at hand. The experiments discussed in this thesis are among
the first to explore the advantages of having additional help from image processing
algorithms for retinal prosthesis implant recipients and give insights into ways in which
additional information through such algorithms might benefit users of the system.
Chapter 1
INTRODUCTION
1.1 Background and Motivation
Vision is one of the most important human senses; it enables the perception of the
world and aids in several activities. An absence of visual perception leads to blindness.
Various organs in the human visual system facilitate the conversion of light information
into visual perception. The eyes (in particular the retina), the optic nerve, optic chiasm, optic tract,
the lateral geniculate body and the visual cortex form the major components of the human
visual system.
1.1.1 Human Eye and the Retina
Light entering the human eye through the pupil passes through several structures before
reaching the retina. The iris, which is a circular muscular tissue, adjusts the amount of light
entering the eye based on lighting conditions by controlling the size of the pupil. Many
nerves and blood vessels are carried by the sclera, which is the opaque white part of the eye
that is continuous with a transparent cornea. The transparent cornea covering the iris and a
crystalline lens aid in the production of a focused image at the retina by bending the light
rays. The pupil regulates the amount of light entering the eye while also having a role in
the focusing of light. At the center of the retina is a yellow spot called the macula covering
about 13 degrees of the visual field. In the central region of the macula is the fovea covering
approximately 3 degrees of the visual field. The fovea provides fine and detailed vision
necessary for humans to perform tasks like reading for which visual details are important.
The retinal tissue is a circular disc of about 42 mm diameter and 0.3 mm thickness lying
at the back half of the eye. Central retina is a region of about 6 mm diameter around the
fovea and beyond this is the peripheral retina. Sharp foveal vision forms the central vision
whereas peripheral vision is beyond the central vision in the peripheral retina. The incoming
visual information undergoes a cascade of complex processes in the multi-layered retinal
tissue. Light information is converted into nerve impulses by the photoreceptor layer of the
retina and passed on to the visual centers of the brain for further processing through the
optic nerve fibers exiting the eye. [Kolb et al.]
The retina is made up of 9 distinct layers of cells and neurons. Although the photore-
ceptor layer is responsible for detecting the incoming light, this layer of the retina lies at
the back of the eye near the pigment epithelium and choroid. This implies that light has to
travel through the entire retina before reaching the photoreceptors. After the various layers
process the visual input, this information along with the information of the incoming im-
age is transmitted to the higher visual centers of the brain in the form of spiking discharge
Figure 1.1: The adult human eye
(Figure printed with permission from: Webvision: The Organization of the Retina and Visual
System)
patterns of ganglion cells. The ganglion cells form the innermost layer of the retina. Figure
1.1 shows the diagram of an adult human eye. [Kolb et al.]
Functionally, the retina can be divided into 3 layers of cell bodies and 2 layers of
synapses, namely the outer nuclear layer, the inner nuclear layer and the ganglion cell
layer, and the outer plexiform layer (OPL) and the inner plexiform layer (IPL). The cell
bodies of rods and cones are in the outer nuclear layer, cell bodies of bipolar, horizontal
and amacrine cells are in the inner nuclear layer while the cell bodies of ganglion cells and
displaced amacrine cells are found in the ganglion cell layer. Synaptic connections between
the rods and cones and vertically running bipolar cells and horizontally oriented horizontal
cells occur in the OPL, while the IPL connects the vertically running bipolar cells with the
ganglion cells. Horizontal cells and vertically running amacrine cells also interact to influ-
ence the ganglion cell signals. At the IPL, all the neural processing in the retina culminates
and information about the visual image is transmitted to the brain in the form of spiking
Figure 1.2: The human retina and its different layers
(Figure printed with permission from: Webvision: The Organization of the Retina and Visual
System)
from the ganglion cells via more than a million nerve fibers in the optic nerve. Figure 1.2
shows a diagram of the adult human retina and its different layers. [Kolb et al.]
Rods and cones form the photoreceptor layer of the retina. The photoreceptor density in
the central retina is dominated by the cones whereas rods dominate in the peripheral retina.
A densely packed layer of cone photo receptors in the central retina provides fine vision.
The central retina is thicker than the peripheral retina due to the increased number of photo
receptors which result in an increase in the corresponding bipolar and ganglion cell layers.
The density of cones in the retina ranges from 147,000/mm² to 238,000/mm². The retina
has approximately 6,400,000 cones and 110,000,000 to 125,000,000 rods. Approximately
200,000 cones lie in the fovea, which is free of rods, with around 17,500 cones in the central
1 degree. Rod density is the highest at about 5 mm from the center of the fovea, at 160,000
rods/mm²
[Kolb et al.]. There is not a one-to-one mapping between the number of rods and
cones and the number of ganglion cells. The mapping from cones to ganglion cells ranges from
2.9:1 to 7.5:1 [Bron et al., 1997]. Cones provide the fine and detailed vision in the center.
The convergence rate is much higher for the rods, which results in a loss of resolution,
primarily when the rods are engaged in low-light conditions. About 15-30 rod cells converge
onto one bipolar cell, from where the information passes on to the ganglion cells. The rods
are very sensitive but slow in response to light, whereas cones are less sensitive but fast in
their response and adapt quickly to different brightnesses of light. Because of these different
sensitivities, rods are used for night vision whereas cones facilitate daylight vision.
1.1.2 Retinal Blindness
Retinitis Pigmentosa (RP) and Age-related Macular Degeneration (AMD) are the two lead-
ing causes of blindness. RP and AMD both result in a gradual and irreversible degeneration
of the photoreceptor layer of the retina. Other diseases of the retina like Glaucoma and Di-
abetic Retinopathy can also lead to partial or total blindness. [Kolb et al.]
AMD is a common disease in which the photoreceptors of the macula degenerate lead-
ing to a central vision loss [Margalit and Thoreson, 2006, Klein et al., 2007]. According to
a U.S. Census in 2000, there are 1.75 million patients with AMD in the United States and
the number is expected to increase to more than 3 million by 2020 [Friedman et al., 2004].
AMD has two types - wet and dry. Dry AMD constitutes 80% of all the cases of AMD. Dry
AMD is caused by atrophic and hypertrophic changes in the retinal pigment epithelium and
also because of debris and deposits accumulating in the outer retina whereas wet AMD is
caused by choroidal neovascularization [Rakoczy et al., 2006].
RP is a genetic and hereditary disease in which the rods in the peripheral retina begin
to degenerate. As RP is a disease of the rod photo receptors, gradually, patients lose their
peripheral vision and are left with only the central vision in the foveal area. However, in
most cases with severe degeneration, even the central vision is lost in RP leading to total
blindness. RP affects about 1.5 million people around the world with a prevalence rate of 1
in 4000 people [Berson, 1993].
Treatments for AMD and RP
Wet AMD has an FDA-approved treatment named Lucentis from Genentech to slow or contain
the progression of the disease [Genentech]. However, besides that, there are no available
treatments for RP and Dry AMD. Using gene therapy, success has been achieved in animal
models as well as humans for treating Leber’s disease that has a single mutation in the
RPE65 gene and causes the degeneration of retinal ganglion cells and axons [Preising and
Heegard, 2004, Maguire et al., 2008]. However, with gene therapy, each genetic mutation
needs to be treated with individualized therapy. Transplants for the photoreceptor layer
using immature cells are an alternative treatment to replace the degenerated photoreceptor layer
of the retina if the other layers are relatively intact. The problem with a transplant is how
well it interacts with the degenerating retina and rest of the visual system and how many
useful synaptic connections get established between the host and the donor tissue [Lund
et al., 2001]. If donor cells are taken from a developing retina at the peak of rod genesis,
the cells can integrate with the degenerating or adult retina to form synaptic connections
and improve visual function [MacLaren et al., 2006]. Expressing channelrhodopsin-2 in
the surviving inner retinal neurons of mice can lead to the restoration of the retina's capacity to
encode light signals and to transmit these light signals to the visual cortex [Bi et al., 2006].
However, high light intensity was needed to evoke a response.
1.1.3 Aids for the Visually Impaired
The quality of life of visually impaired patients varies greatly based on the extent of
the loss of their visual field and acuity. Many patients may not have lost enough visual field
to be legally blind but they could still experience difficulty with reading, moving around at
night, continuing with their job and driving [Nutheti et al., 2006]. Studies show that based
on the extent of the loss of vision, activities like walking down the steps, moving about in
stores, avoiding contacts with obstacles etc. may get difficult for the patients [Turano et al.,
2002].
Several assistive devices and aids help blind and low-vision people to have an indepen-
dent lifestyle. Such aids can be useful in avoiding obstacles and for navigation in unknown
surroundings. Numerous kinds of technologies are used in the development of such aids.
These technologies focus on avoidance of immediate obstacles as well as warnings for
possible upcoming obstacles. Many technologies also focus on orienting the users when
they are in unknown areas and surroundings. Device outputs are provided in the form of
varying frequencies of sound, left/right auditory directions, or tactile feedback in the form
of vibrations, and are helpful for obstacle avoidance when navigating. However, all these
technologies are not a substitute for a guide dog or a cane; rather, they are meant to enhance user confidence
for independent travel.
The Sonic Pathfinder [Pathfinder] is a head-mounted pulse-echo sonar system that is
controlled by a micro-computer. Ultrasonic waves are radiated along the path of the user
and, based on the echoes from objects lying in the path, the user is given auditory cues
through ear pieces in the right, left or both ears to signify that an object lies
to the right, left or front of the user, respectively. It has an auditory display design
consisting of musical notes wherein users are given auditory cues in terms of familiar
tones that progress as users approach an object. It thus gives the users warning about the
nearest objects or obstacles lying in their path. Priority is given to the object lying in the
center rather than on the sides. In the absence of obstacles directly ahead in the center,
the device provides information about the presence of objects like walls on the right or left
sides of the subject. This also helps the users in orientation. In order to use the Sonic
Pathfinder effectively, the use of the device has to be demonstrated by a mobility instructor
and the user needs to take five to six training sessions to acquire basic mobility skills. Sonic
Pathfinder is most useful outdoors but could be used in large areas like public
buildings where not many objects are in close proximity to the user. It is not very efficient
for indoor use because of the possibility of several objects and obstacles being present
around the user.
The Laser Cane [Cane and Polaron, a,b] also assists the blind, visually impaired
or the deaf in mobility and in avoiding objects. It emits invisible beams of light which get
reflected from the objects in the path of the user. It has a 3.5 meter range for detecting
objects. The cane scans for the presence of objects to the sides and ahead of the user and
uses audible or tactile cues to warn the user of the objects. Tactile cues are provided using
vibrating stimulators on the index finger of the user. The user needs to hold the cane at an
angle of 50 degrees while walking. When not using the cane for obstacle avoidance, the
user can turn it off and use it as a standard cane. This device is now a discontinued product.
The Polaron [Cane and Polaron, a,b] is a compact design for a mobility aid for the blind,
visually impaired or the deaf. It uses ultrasonic waves to detect objects within 5 meters of
the user. It is designed to be used as a secondary aid to a standard cane or a guide dog. It
can be used as a hand-held device or can be worn on the chest. The Polaron uses tactile or
auditory signals to warn the user about the objects. When the object is within 8 to 16 feet
(about 2.5 to 5 meters) of the user, the vibration feedback remains steady. Within 1.2 to 2.5 meters of the user, the
vibration becomes more pronounced and becomes very intense when the user is within 1.2
meters of the object. This device has also been discontinued.
The BAT K-Sonar [K-Sonar] is designed to be attached to a long cane or to be used
as a hand-held device. It uses ultrasonic waves that bounce off the objects in front
and to the sides of the user. The K-Sonar converts the echoes from the objects into
multiple sounds which the users learn to identify. The users listen to these tones through
ear pieces. The tones change as the distance between the users and the objects changes.
The users can also learn to distinguish between the sounds for different objects and thus
also learn how to identify objects.
Another device, the vOICe [vOICe], designed for the totally blind, uses a camera on a pair of sun-
glasses to capture image information. It then converts these images into complex sound patterns
based on high and low frequencies which are passed on to the user through headphones.
Users learn to decode these complex patterns of sound to understand the visual informa-
tion being passed on to them. Experienced users may even perceive the sounds to feel like
visual information. Users are able to understand the presence of walls, on/off lights etc.
using this device.
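The exact image-to-sound mapping is not described here. A common sonification scheme for devices of this kind scans the image column by column, maps vertical pixel position to pitch and maps brightness to loudness; the sketch below illustrates only that generic idea and is not the vOICe's actual algorithm. The function names and parameters are illustrative assumptions.

```python
import numpy as np

def sonify_column(column, duration_s=0.05, sample_rate=22050,
                  f_low=200.0, f_high=4000.0):
    """Turn one image column into a short audio chord.

    Each pixel row is assigned a frequency (top rows high, bottom rows low)
    and its brightness sets that tone's amplitude. This is a generic
    sonification sketch, not the vOICe's published algorithm.
    """
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    n = len(column)
    freqs = np.geomspace(f_high, f_low, n)   # top of the image maps to high pitch
    amps = column.astype(float) / 255.0      # brightness maps to loudness
    chord = sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(amps, freqs))
    return chord / max(n, 1)

# Scanning a 64x64 image column by column yields one continuous sound stream;
# a single bright horizontal line becomes a steady tone at one pitch.
image = np.zeros((64, 64)); image[8, :] = 255
audio = np.concatenate([sonify_column(image[:, c]) for c in range(image.shape[1])])
print(audio.shape)
```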
For outdoor navigation, GPS (Global Positioning System) based technologies like the
Sendero GPS or the Trekker are available [GPS]. The Sendero GPS can search for points of
interest in the surroundings like restaurants, landmarks, etc. and plan a route for the users
according to the destination of their interest. Users can orient themselves and create routes
from home to work, or library etc. It can also help users get familiarized with street layouts
of new cities or new surroundings. An optional built-in braille display can spell street and
business names. The Sendero Trekker GPS has similar functionalities but also adds a vocal
feature to the GPS mentioned above and offers route planning and recording. It lets the
users create a point of interest using vocal information. It can also provide real-time
information about intersections and points of interest, as well as map browsing and offline map
browsing capability.
Some other technologies need development of separate infrastructure to aid users. The
talking signs system uses invisible infrared light beams which are encoded and transmitted
by permanently installed transmitters at various public places [Signs]. A hand-held device
can decode these light beams and give the users auditory feedback to aid in way finding.
This system can be used effectively in both indoor and outdoor environments. As an ex-
ample, it can help users identify the bus stop to which they want to walk and then identify
the correct bus that takes them to their destination. This system has already been imple-
mented in the San Francisco BART system. This system can help users while crossing
intersections, and also to identify retail stores, restaurants and other commercial places.
However, this system can be expensive to build on a wide basis covering a majority of
public buildings.
A Geographic Information System for guiding a traveler in familiar and unfamiliar sur-
roundings was proposed by [Loomis et al., 1998]. The system is designed to determine the
current orientation and position of the users and use an existing database of the surround-
ings to plan a route for the users. They proposed different types of user interfaces to convey
information about the surroundings through audio. A virtual acoustic dis-
play system provides the users with sounds, such as names of buildings, streets etc. These
sounds appear to the user to come from a certain direction and certain distance and become
louder as the user approaches that region. The databases created by them for the study
included details about the buildings, walkways, streets, intersections, trees etc. of the test
site. For this technology to be used effectively, comprehensive and customized databases
would be required for users in different locations.
Another system for aiding way finding for the blind and visually impaired in unfamiliar
indoor environments was proposed by [Tjan et al., 2005]. GPS technologies are effective
outdoors, but in indoor and unfamiliar environments like buildings, blind subjects do not
have access to maps, signs and other devices for orienting themselves and helping
with navigation. Their system consists of passive retro-reflective tags which are coded
with 16-bit numbers and are designed to be attached and posted onto doors, exit signs, etc.
in buildings. Once a building code book of the tag locations has been established, the
user can use a hand-held device to read the tags in his vicinity and the software can identify
the location of the user in the building based on the codes of the tags. Again, for this system
to be widely used, a majority of buildings would have to be coded with the retro-reflective
tags and building code databases would have to be established.
1.1.4 Visual Prostheses
Different types of visual prostheses involving cortical implants, retinal implants and optic
nerve implants have been proposed as possible ways to impart partial vision to blind
patients.
As a first step towards light perception, small spots of light or phosphenes were elicited
when a point in the exposed occipital pole of a subject with normal vision was stimulated
[Foerster, 1929]. When a subject who had been blind for almost 8 years was given elec-
trical stimulation in the left occipital lobe, well defined sensations of light similar to those
discussed by Foerster were observed [Krause and Schum, 1931]. This showed that even
after so many years of blindness and deprivation of light input, the adult visual cortex had
not entirely lost its functional capacity. Almost 4 decades later, an array of 80 electrodes
was implanted in a 52-year-old patient at the occipital pole of the right cerebral hemisphere
and stimulated using an array of radio receivers connected to the electrodes [Brindley and
Lewin, 1968]. Stimulation of a single electrode most times elicited a single phosphene but
sometimes elicited two or more than two such spots of light. Phosphenes could easily be
distinguished when they were produced by electrodes at least 2.4 mm apart but when the
electrodes were placed closer, a strip of light was perceived by the subject. They inferred
that a patient could be made to see simple patterns by stimulating several electrodes si-
multaneously. Stimulation of the visual cortex during occipital lobe surgery showed that
the light sensation changed little with changes in electrode size and stimulation parameters
[Dobelle and Mladejovsky, 1974]. Thresholds did not vary with electrode size but varied
with stimulation parameters. The currents used in the experiments ranged from 3 to 5 mA.
Phosphenes faded after stimulating continuously for 10-15 seconds and it was observed
that phosphene flicker may or may not occur. This implied that for a cortical prosthesis the
image to be elicited would have to be refreshed at a certain rate.
By the 1990s, cortical prosthesis research started focusing on penetrating microelectrodes
in the visual cortex instead of using surface electrodes on the visual cortex. One of the first
studies with penetrating electrodes showed that at an insertion depth of 3 - 5 mm, the
currents for stimulation and eliciting percepts were very small (20 - 200 µA) [Bak et al.,
1990]. Percepts were elicited for electrode separation as small as 0.7 - 1.0 mm and were
similar to those with surface stimulation except that they did not flicker. On implanting a
38 micro electrode array in a 42 year old woman who had been blind for 22 years due to
glaucoma, responses were evoked even after 22 years of blindness. However, there was no
set of optimal stimulus parameters and the effects of phosphenes changed with currents or
when stimulating multiple electrodes. Subjects adapted differently to the different param-
eters of the stimulus and the thresholds also changed over time. Also, for a commercial
visual prosthesis, a greater number of microelectrodes would be required. A 100-
microelectrode array for recording and stimulating cells in the cortex was well tolerated by the
neural tissue into which it was inserted [Normann et al., 1999]. This device was later used
in a human brain-machine interface implant.
A retinal prosthesis will require that some of the retina remains. In RP and AMD, after
the degeneration of the photo receptors in the retina, the inner nuclear layers of the retina
and the ganglion cell layer are relatively intact [Kim et al., 2002, Santos et al., 1997, Stone
et al., 1992]. The majority of approaches towards a retinal prosthesis aim to electrically stim-
ulate these remaining layers of the retina by implanting an electrode array in the epiretinal
or the sub retinal areas [Humayun et al., 1996, 1999, Chow and Peachey, 1998, Zrenner
et al., 1999, Rizzo et al., 2001]. Initial studies showed that phosphenes can be elicited by
electrically stimulating electrodes positioned on the retina of patients blind due to RP or
AMD [Humayun et al., 1996, 1999]. For the sub retinal approach, an active or passive
device is placed between the inner and the outer retina. The right eyes of six RP patients
were implanted with a device of 5000 microelectrode-tipped microphotodiodes powered by
incident light [Chow and Peachey, 1998, Chow et al., 2004]. The device converted the
incident light to stimulating pulses via photoelectric effects. While improvements in visual
functions like detection of brightness, contrast and shape were observed, many of these were
in areas farther away from the device placement which could have been because of neu-
rotrophic factors induced by implantation. A sub retinal implantation of a microphotodiode
array for a retinal prosthesis was also proposed to directly replace the photoreceptors with
the microphotodiodes so that no external camera or image processing modules are required
[Zrenner et al., 1999]. For the epiretinal approach, an electrode array is tacked onto the
retina and information from an external camera capturing the real-world is converted into
electrical stimulation patterns for the electrode array. Chronic human implants have been
successfully done with the epiretinal prosthesis. Six initial patients were implanted with
the first version epiretinal device having a 4x4 electrode array of 50-500 µm diameter
Pt electrodes. The patients were able to recognize simple shapes, detect the presence of
objects and identify the direction of movement of horizontal and vertical bars [Humayun
et al., 2003, 2005, de Balthasar et al., 2008, Yanai et al., 2007, Horsager et al., 2009].
Recent studies with patients implanted with the second version of the device having 60
electrodes show that patients are able to read letters and words with the device. The letters
were presented in the form of 600 point Century Gothic at a 12 inch distance from the
subjects. The subjects performed significantly better with the device on compared to when
the device was off [da Cruz et al., 2010, Humayun et al., 2010].
1.2 Image Processing in the Retina and the Cortex
The retina is a very complex image processing structure in the eye performing basic pro-
cessing like color, edge, contrast and motion detection on the incoming visual information.
1.2.1 Image Processing in the Retina
Contrast Processing in the Retina
Ganglion cells receive input from the bipolar and amacrine cells in the inner nuclear layer
and strongly reflect the properties of these neurons. Ganglion cell response is triggered
by the illumination of a restricted but relatively large region of the retina. This region is
the receptive field of a ganglion cell. Adjacent ganglion cells have overlapping receptive
fields. Thus, many different ganglion cells each concerned with slightly different parts
of the visual field cover any one region of the retina. Receptive fields of a type of gan-
glion cells show spatial processing of visual information and are sensitive to contrast. This
type of ganglion cells are divided into ON-center and OFF-center cells. Illumination in
the center of the field causes the on-center cells to fire whereas illumination in the antag-
onistic surround causes the inhibition of the cell and decreased activity. A weak response
is evoked when both the center and surround are illuminated simultaneously as both the
regions antagonize each other. The off-center cells are the opposite of on-center cells and
respond vigorously when there is increased illumination in the surround. Because of the
antagonistic nature of the center and surround, these cells respond the best when there is a
maximum contrast between the stimuli for the center and the surround. Illumination, when
confined to a certain region of the receptive field, makes these cells respond vigorously,
which shows the spatial processing of the distribution of light that they carry out.
This center-surround organization of the ganglion cells helps in the extraction of useful
information from a scene by making the cells respond primarily to contrasts in light rather
than absolute intensity of light. [Dowling, 1987, Kandel et al., 2000]
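The center-surround antagonism described above is commonly modeled computationally as a difference-of-Gaussians filter. The sketch below is a generic illustration of that model rather than anything taken from this thesis; it shows a response that is large for local contrast and near zero for uniform illumination.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def on_center_response(image, sigma_center=1.0, sigma_surround=3.0):
    """Difference-of-Gaussians sketch of an ON-center ganglion cell array.

    The narrow Gaussian stands in for the excitatory center, the wide Gaussian
    for the inhibitory surround; their difference responds to local contrast
    rather than to the absolute intensity of light.
    """
    image = image.astype(float)
    center = gaussian_filter(image, sigma_center)
    surround = gaussian_filter(image, sigma_surround)
    return center - surround

# Uniform illumination gives essentially no response; an isolated bright spot does.
uniform = np.full((64, 64), 128.0)
spot = np.zeros((64, 64)); spot[32, 32] = 255.0
print(np.abs(on_center_response(uniform)).max())  # ~0
print(on_center_response(spot).max())             # clearly positive
```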
Retinal Cells Detecting Direction of Motion
The retina has the capability to derive the direction of motion of the light stimulus mov-
ing across its surface using special types of ganglion cells called the directionally sensitive
ganglion cells. This type of cells show evidence of temporal processing of the visual in-
formation. These cells respond vigorously to dark or bright spots of light moving in a
particular direction through the receptive field and are inhibited by spots of light moving in
the opposite direction. When areas of their receptive fields are illuminated with static spots
of light, they respond with one burst of impulses at the onset of illumination and another
burst at the cessation, and so they can also be categorized as on-off cells. They are more responsive
to the temporal aspects of the light stimulus than to the spatial aspects, as they strongly
respond to the onset, cessation or movement of light and their responses do not depend
upon where in the receptive field the spot of light is presented. [Dowling, 1987]
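One classic way to model directional selectivity of this kind is a correlation-type (Reichardt) detector, which compares the signal at one location with a delayed signal from a neighboring location. The sketch below is that textbook model, shown only for illustration; it is not a circuit described in this thesis.

```python
import numpy as np

def reichardt_response(signal_left, signal_right, delay=1):
    """Correlation-type (Reichardt) motion detector for two neighboring inputs.

    The delayed left input is correlated with the right input and vice versa;
    the difference is positive for left-to-right motion and negative for
    right-to-left motion. `delay` is measured in samples.
    """
    left_delayed = np.roll(signal_left, delay)
    right_delayed = np.roll(signal_right, delay)
    left_delayed[:delay] = 0    # avoid wrap-around from np.roll
    right_delayed[:delay] = 0
    return np.sum(left_delayed * signal_right) - np.sum(right_delayed * signal_left)

# A pulse that reaches the left input one sample before the right input
# (left-to-right motion) gives a positive response; the reverse gives a negative one.
left = np.zeros(20); right = np.zeros(20)
left[5] = 1.0; right[6] = 1.0
print(reichardt_response(left, right))   # > 0
print(reichardt_response(right, left))   # < 0
```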
Ganglion Cells for Various Kinds of Processing
The detection of contrast and changes in light intensities of the visual image is carried out
by the retina, but other visual processing like detection of color, form and motion which is
done by the visual cortex is also initiated in the retina in the form of a parallel processing by
ganglion cells. Most of the ganglion cells in the retina can be divided into Magnocellular
(M) or Parvocellular (P) cells that also have the center-surround mechanisms. M cells
primarily respond to large objects and thus the gross features of an image, and also respond
to fast changes of the stimulus and thus movement. The P cells have smaller receptive fields
than the M cells and participate in the perception of form and color and in the analysis of
the fine details of an image. A minority of ganglion cells do not belong to the M or P cell
classes and are thought to work with overall ambient light intensity. [Kandel et al., 2000]
Color Processing in the Retina
Color vision is provided by three different types of cones present in the retina. The three
types of cones, which are the S-cones (short wavelength sensitive), the M-cones (mid-
dle wavelength sensitive) and L-cones (long-wavelength sensitive), all have different but
overlapping spectral sensitivities that aid in color vision. These cones are predominantly
sensitive to blue, green and red colors respectively. The L-cones sensitive to red colored
light are present only in primates. Contrast between these different light wave-
lengths aids in the process of differentiating between objects. [Kandel et al., 2000] The
L and M cones generate highly correlated signals because their spectral sensitivities are
similar over a broad spectral region. There is a lower but still substantial correla-
tion between the signals from the L and M cones and the S cones. Thus, for the efficient
transmission of color information, instead of there being 3 separate pathways for each kind
of cone, there are only 2 pathways that transmit the differences between these signals in
the form of Red-Green and Blue-Yellow color opponent streams. [Dowling, 1987] Cones
receive inhibitory feedback from horizontal cells allowing the signal to be more sensitive in
dim light and less sensitive in bright light and narrowing the action spectrum of the bipolar
cells that synapse with S cones [Kolb et al.]. L and M cones synapse with bipolar cells which
in turn synapse with ganglion cells. The S cones are thought to play a role in only the chro-
matic vision and hence are connected only to the ON bipolar cells whereas L and M cones
that play a role in both chromatic and achromatic vision are connected to the ON and OFF
bipolar cells. Two types of bipolar cells, ON and OFF, are present. ON bipolars depolarize
when their associated cone cells hyperpolarize, thus getting turned on or excited by light.
OFF bipolars depolarize when their associated cone cells depolarize, thus getting excited in
the darkness and getting inhibited by light. These ON and OFF bipolar cells synapse with
the ON and OFF ganglion cells in the inner plexiform layer of the retina. At the ganglion
cell level, the color information is transmitted mostly by the P cell system. Two sub-types
of P cells form the Red-Green and Blue-Yellow color channels whereas the M
cells are mostly concerned with the transmission of achromatic signals and brightness of
the image. M cells are distributed sparsely whereas the P cells are very densely distributed.
The P cells have the capability to respond to both color and brightness in an image signal.
[Kandel et al., 2000]
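As a rough computational analogue of the two opponent streams described above, an RGB camera image can be reduced to red-green and blue-yellow difference channels. The sketch below is only illustrative: it uses R, G and B as stand-ins for the L-, M- and S-cone signals, which is a simplification rather than the retina's actual encoding.

```python
import numpy as np

def opponent_channels(rgb):
    """Crude red-green and blue-yellow opponent channels from an RGB image.

    R, G and B are used as stand-ins for the L-, M- and S-cone signals;
    real cone spectral sensitivities overlap far more than camera primaries do.
    """
    rgb = rgb.astype(float) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    red_green = r - g                 # analogous to an L minus M opponent stream
    blue_yellow = b - (r + g) / 2.0   # analogous to an S minus (L + M) opponent stream
    return red_green, blue_yellow

# A saturated red pixel drives the red-green channel strongly positive.
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
rg, by = opponent_channels(pixel)
print(rg[0, 0], by[0, 0])  # 1.0 -0.5
```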
Edge Processing in the Retina
The exact role of the different retinal cell layers in processing edge information is not
clearly known. Artificial retinal chips (i.e. silicon retinas) implement edge detection as
a combination of horizontal cell, bipolar cell and amacrine cell activity. Horizontal cells
low-pass filter the incoming image scene and, by subtracting this low-pass filtered image
from the incoming image, compress all neuronal activity into a narrow range for
further processing by the bipolar cells. The bipolar cells receive this signal, obtained as
the difference between the high-pass photoreceptor activity and the low-pass horizontal cell
activity, which carries the high-frequency information of the image scene. The function of
amacrine cells is not very well known. They are thought to alter the characteristics of the
visual image during further processing and are assumed to act locally and inhibitorily,
and to be capable of enhancing edges. [Werblin et al., 2001]
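The horizontal cell and bipolar cell interaction described above amounts to subtracting a low-pass filtered copy of the image from the image itself, leaving the high-frequency (edge) content. A minimal sketch of that idea follows, with a Gaussian blur standing in for the horizontal cell layer; the choice of blur and its width are assumptions made only for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bipolar_edge_signal(image, sigma=2.0):
    """Silicon-retina style edge signal.

    The Gaussian blur plays the role of the horizontal cell layer (a spatial
    low-pass of photoreceptor activity); subtracting it from the original
    image leaves the high-frequency content carried by the bipolar cells.
    """
    image = image.astype(float)
    horizontal = gaussian_filter(image, sigma)   # low-pass "horizontal cell" image
    return image - horizontal                    # high-pass "bipolar cell" signal

# A vertical step edge produces large responses only near the boundary.
step = np.zeros((32, 32)); step[:, 16:] = 255.0
edges = bipolar_edge_signal(step)
print(np.abs(edges[:, 14:18]).max() > np.abs(edges[:, :8]).max())  # True
```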
1.2.2 Image Processing in the MidBrain and Visual Cortex
Information from the ganglion cells in the retina is carried to the optic chiasm by the optic
nerve. The optic chiasm divides the information from each eye and transfers it to the
two hemispheres of the brain. From the optic chiasm, the nerve fibers project onto three
pathways towards the pretectum, the superior colliculus and the lateral geniculate nucleus.
The superior colliculus controls saccadic eye movements and the pretectum of the midbrain
controls pupillary reflexes. The lateral geniculate nucleus is the most important pathway
for input to the cortex. The lateral geniculate nucleus and remaining processing stages of
the visual cortex are discussed here. The information presented here is from [Kandel et al.,
2000] except where specified.
Lateral Geniculate Nucleus (LGN)
The LGN is a very important subcortical unit carrying visual information to the cerebral cor-
tex. About ninety percent of the retinal axons terminate in the LGN. If the LGN is not
functional, visual perception is lost although some limited amount of stimulus detection is
possible. More than half of the LGN represents the foveal region of the retina which is
densely packed with ganglion cells. The neurons from the M and P ganglion cells in the
retina remain separated in the LGN and provide input to the Magnocellular and Parvocel-
lular layers respectively of the LGN. On- and off-center cells, just like the on- and off-center
retinal ganglion cells, are present in the Magnocellular and Parvocellular layers. These lay-
ers project into separate layers in the primary visual cortex through two parallel pathways
called the M and P pathways consisting of M cells and P cells. P cells respond to color
contrast whereas M cells respond more to luminance contrast. P cells are important for
color vision and for vision requiring high spatial and low temporal resolution whereas M
cells are critical for vision requiring low spatial and high temporal resolution. Information
from the LGN is transmitted to the primary visual cortex.
Primary Visual Cortex (V1)
The primary visual cortex has a well-defined spatial map of the visual information in the
retina. The spatial relationship between two points represented on the retina is the same as in
the primary visual cortex. However, the area of the primary visual cortex representing the
information can be different. As an example, the foveal region of the retina is represented by
25% of the primary visual cortex even though it covers only about 1 degree of the visual field.
The receptive fields of cells are significantly different in the primary visual cortex compared
to the retina. The primary visual cortex is also known as visual area 1 (V1) or Brodmann's
area 17 or the striate cortex. V1 consists of about 200 million simple and complex cells
compared to about 1.5 million in the LGN [Hubel, 1995]. Simple cells respond strongly
to bars of light with specific orientations and respond weakly or not at all to light at other
orientations. Simple cells have larger on-off zones (excitatory - inhibitory) in their receptive
fields compared to the cells in the LGN. The ’on’ regions represent input from the on-center
LGN cells and the ’off’ regions represent input from the off-center LGN cells. Complex
cells have larger receptive fields than simple cells, and their receptive fields also have a critical
axis of orientation. However, there are no on-off zones in complex cells, because of which
the position of the stimulus in the receptive field is not important for their response; instead,
movement across the receptive field acts as an effective stimulus. Simple and
complex cells transform the contrast information coming from the retinal ganglion and
LGN cells into boundaries and line segments which helps in analyzing the contours of
objects and thus in the identification of objects based on the edges. V1 is organized into
many functional modules like orientation columns that contain neurons that respond to
light bars with specific orientations, blobs that contain cells that are more responsive to
color than orientation and ocular dominance columns that receive inputs from one or the
other eye. These units organize into hypercolumns monitoring small areas of the visual
field and horizontal connections aid the communication between these vertically oriented
hypercolumns.
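The orientation selectivity of simple cells described above is commonly modeled with Gabor filters, an oriented sinusoid under a Gaussian envelope. The sketch below is that generic model with arbitrary illustrative parameters; it is not a claim about any specific filters used in this thesis.

```python
import numpy as np

def gabor_kernel(theta, wavelength=8.0, sigma=3.0, size=21):
    """Gabor filter, a common computational model of a V1 simple cell.

    The filter responds strongly to a bar or grating oriented at `theta`
    (radians) and weakly to other orientations.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_theta = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_theta / wavelength)
    return envelope * carrier

# A vertical grating excites the vertically tuned filter far more than the
# horizontally tuned one, mimicking orientation selectivity.
yy, xx = np.mgrid[-10:11, -10:11]
vertical_grating = np.cos(2.0 * np.pi * xx / 8.0)
resp_vertical = abs(np.sum(gabor_kernel(0.0) * vertical_grating))
resp_horizontal = abs(np.sum(gabor_kernel(np.pi / 2) * vertical_grating))
print(resp_vertical > resp_horizontal)  # True
```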
Extrastriate Cortex and Visual Pathways
Higher visual processing areas succeeding the processing in the visual cortex belong to the
extrastriate cortex that consists of V2, V3, V4, the infero-temporal cortex (IT), the middle
temporal area (MT or V5) and the posterior-parietal cortex (PP). The magnocellular and
parvocellular layers from V1 continue onto the V2 as two separate and parallel pathways
namely the dorsal and ventral pathways respectively. The dorsal pathway, also called the
M-pathway or the 'where' pathway, is responsible for the processing of motion and depth.
This pathway extends from the M cells in the retina and magnocellular layer in the LGN
to the V1, V2, V3, and MT to the parieto-occipital (PO) area. The ventral pathway, also
called the P-pathway or the 'what' pathway, is responsible for the processing of color and
form. It extends from the P cells in the retina and parvocellular layer in the LGN to the V1,
V2 and V4 to the IT. The processing in the visual pathways is bi-directional. Top-down
connections from higher brain areas going as far as the LGN also influence the processing.
1.3 Retinal Prosthesis and Related Image Processing
1.3.1 Design of Retinal Prosthesis
Figure 1.3 shows the concept for an epiretinal prosthesis. A camera mounted on a pair of
glasses captures real-world information in the form of video images. An electrode array
is implanted on the retina with a tack. An inductive wireless link transmits the power and
implant data. The visual information from the camera is converted into stimulation patterns
Figure 1.3: Design of an epiretinal retinal prosthesis
and coded as a serial stream using a custom-built video processing unit. This stream is
then transmitted via the wireless link to the electrode implant. [de Balthasar et al., 2008,
Mahadevappa et al., 2005]
The electrode array has a layout of 6x10 electrodes in the current version of implants.
Previous versions of the implant consisted of a 4x4 electrode array. The end goal is to have
about 1000 electrodes in the array to provide the recipients with better resolution. Studies
using simulated prosthetic vision with normal sighted volunteers suggest that 600 – 1000
electrodes should be enough to provide reading, unaided navigation and face recognition
capabilities to blind patients [Cha et al., 1992b,c,a, Hayes et al., 2003, Thompson et al.,
2003]. Due to surgical and technological limitations, the electrode array cannot occupy
more than the central 15 - 20 degrees of the retina, unless the array is folded for implan-
tation, then unfolded on the retina. This implies that recipients of the implant will have
vision only in the central 15 - 20 degrees of their visual field. The normal human visual
Figure 1.4: Visual field for a normal human visual system and for a retinal prosthesis
recipient (red box)
system has a visual field of view close to 160 (H) degrees x 175 (V) degrees [Kolb et al.].
Figure 1.4 shows the comparison between the fields of view for a normal human and a
retinal prosthesis recipient. This reduction in the field of view results in a loss of informa-
tion from the peripheral areas of the visual field. For visually impaired subjects, a narrow
visual field hampers mobility and continuous scanning using head movements is required
to gather more information about areas in their peripheral visual field.
1.3.2 Image Processing for a Retinal Prosthesis
Image processing for a retinal prosthesis involves two critical factors. Because of the reorganization of the cell layers in the diseased retina and plasticity in the visual cortices, how the stimulation from the device will be routed through the remaining visual pathways is unknown. It is difficult to predict what kinds of shapes and patterns the subjects would see when the electrodes on their array that represent such a shape are stimulated.
One important part of the image processing is to convert the incoming camera image into
a stimulus pattern such that the subjects perceive the actual object shape [Weiland et al.,
2005]. The image processing subsystem will convert image frames captured by a camera
into electrical stimulation patterns and may also be required to handle functions such as zoom, brightness adjustment, and contrast adjustment at the user's request. Different settings might be required for each patient. The exact transformation for converting the visual information into stimulus patterns cannot be known until a high resolution device is implanted in patients. Hence, initial image processing units will have to be designed to account for the lack of knowledge about pattern stimulation of the retina and for the variability in patient response at different stages of the disease.
The image processing module must be wearable and portable, for example a belt-worn system for the processor and a glasses-mounted system for the camera. This will require custom
system design but may be achievable with available low power processors. The module
can be implemented on many types of hardware platforms like general purpose processors
or even on customized chips with hard wired image processing algorithms. Regardless of
what hardware platform is used, it is necessary to have a real-time execution of the image
processing algorithms. The subjects will correlate the location of the perception with the
camera direction and hence the stimulating patterns must update in real-time to keep the
perception in sync with the camera position. A lag in the processing would imply that
the perception at a certain moment in time would be based on the camera position a few
seconds in the past depending on how long the lag is. In this case, subjects would not be
able to reliably associate the location of their perception with the position of the camera.
The main algorithms for the image processor will include decimation of incoming camera
frames and different types of enhancements to improve the perception of vision for the
recipients.
Converting Camera Image Information into Electrical Stimulation Pat-
terns
A retinal encoder (RE) that predicts the output of the ganglion cells in response to a given input, and that could be used to drive the stimulus of an epiretinal device, was developed [Eckmiller, 1997, Eckmiller et al., 2004]. The encoder approximated receptive field properties typical of primate retinal ganglion cells using individually tunable spatiotemporal receptive field filters. It mapped the visual patterns onto spike trains for a
number of contacted ganglion cells. One implementation simulated a part of the central
retina with a 64 x 64 hexagonal array that was an input to an evenly interlaced distribution
of partly superimposed 34×34 receptive field (RF) filters of P-On, P-Off, and M ganglion
cells. Each of the ganglion cell filters included one RF-center pixel and six RF-periphery
pixels. To validate the encoder algorithm, the encoder’s output pulse trains for the ganglion
cell output were sent to a visual decoder that models the central visual system. Using a
learning algorithm, the encoder was able to adjust filter parameters and produce pulse trains
that replicated the input of the encoder at the output of the decoder. The work suggested
that incorporating relative movements between the visual pattern and sensor array into the
generation of RE output codes could possibly be used to improve the visual percept quality
in patients with tunable retina implants. A real-time system with a refresh rate of more than
50 frames/second to convert the stream of incoming images into stimulation patterns that
could be interpreted by the brain has been proposed [Asher et al., 2007]. The algorithm
tracks the position of the implant and accordingly first crops the image falling outside the
implant. Thereafter, a geometrical transformation is applied to distort the cropped image
so that it is appropriate for the geometry of the fovea. This involves stretching the image
geometrically in order to account for the absence of the bipolar cells in the foveal center.
Thus the visual information is re-routed to bipolar cells outside the foveola just as is done
naturally by the photoreceptors in a healthy retina. The visual information that would have
corresponded to the foveal center maps to the bipolar cells at the edge of the foveola and
the visual information that would have arrived near the foveal periphery is sent to bipolar
cells which are farther away. A linear spatio-temporal filter then approximates the visual processing at the photoreceptor-bipolar cell synapse, where horizontal cells feed back onto the photoreceptor synapses, enhancing spatial edges and temporal changes. Center-
surround filters inherited by bipolar cells can be modeled as difference-of-Gaussian (DOG)
filters that sharpen edges and de-emphasize areas of constant intensity. This filtered visual
information is then finally converted into a pattern of electrical stimulation.
When subjects used simulated vision to track a small, high-contrast target, which could be stationary or in motion, overlapping Gaussian kernels provided improved performance [Hallum et al., 2008]. Gaussian kernels proved to be more effective than the com-
monly used uniform-intensity kernels. A method for converting a high resolution camera
captured image to low resolution modulated charge injections for post-implantation device
fitting was also proposed [Hallum et al., 2004]. The method used a mutual-information function that, while accounting for the statistics of the visual stimulus, quantifies the amount of information conveyed to the device implantee observing the phosphene image.
Enhancing Perception and Knowledge of Surroundings
The other major aspect of image processing for a retinal prosthesis is the processing done on each camera frame to determine which image information should be converted into a stimulus pattern for electrically stimulating the electrode array. The required image processing may involve basic functionalities like changing the brightness and contrast of the images and zooming into desired regions of the images. The electrode array size for a prosthesis defines the number of pixels for the vision imparted by the implant. As a comparison, commercially available cameras usually have at least 320x240 pixels in an image frame, compared to the 32x32 pixels of the final version of the electrode array. This limited number of pixels means that the incoming camera information has to be processed using image processing algorithms to give more meaningful information to the patients.
One way to get around the issue of losing information from the peripheral visual field is
to use a wide field camera and then simply reduce the size of the camera image to the size
of the electrode array. However, this will result in a loss of resolution as well as confusion
because of miniaturization of objects. On the other hand, if only the central 15 degrees of a 320x240 image is extracted and stimulated in order to give better resolution, information about the presence of objects of interest and obstacles in the peripheral areas of the visual field is lost. Considering the trade-offs of each scheme, it might be best to extract only the central 15 degrees of the camera frame for stimulating the electrodes, to use image processing algorithms to process the rest of the camera image, and to convey the information about objects lying in the peripheral visual field in the form of audio, visual or tactile cues
to the patients. Image processing to enhance the perception of patients when stimulated
with the low resolution version of the camera information in the central 15 degrees would
also be beneficial.
Application of image processing techniques such as contrast and brightness enhance-
ment, grayscale histogram equalization, edge detection, and grayscale reduction in real-
time to enhance visual perception provided by a retinal implant has been described [Liu
et al., 2005]. They show that image processing can also provide a means to reduce the data
rate to be transmitted to the electrode chip that stimulates the retina. Image processing and
retinal implant chip design are strongly coupled and this provides a way to achieve optimal
power efficiency for an epi-retinal implant.
An image processing alternative to spatial average for decimation was devised [Hal-
lum et al., 2008]. This approach reduces redundant information in the pixelized image.
Simulations and numerical analysis from this study show that arranging pixels in a hexag-
onal geometry and filtering using a Laplacian or Gaussian kernel may provide the most
information to a retinal prosthesis recipient.
A digital object recognition assistant (DORA) was designed to use the incoming camera
information to perform image recognition and to give this information, via an auditory cue,
to a blind individual [Fink et al., 2004]. A retinal prosthesis implantee could utilize this
information more effectively using their prosthetic vision. However, due to the limitations
of the implant, image recognition is difficult for the patients. Color of objects, shading,
shadow and perspective all influence the manner in which an object is recognized. With low resolution vision, the user does not have the ability to distinguish between these dif-
ferent aspects. But with additional information about their surroundings, retinal prosthesis
recipients could still improve their performance. DORA could be used interactively by the
recipient to get multiple answers and make a logical choice. For example, a table viewed
from the side may look like a bridge and if the implant patient was inside a house versus
on a street, he could easily choose a table over a bridge if given both choices.
Hardware and Power
Low power is a requirement for a portable image processing module. At the same time,
hardware capable of performing important image processing computations efficiently is
necessary.
Custom chips for algorithms based on the biological aspects of visual processing have
been developed [Delbrück and Liu, 2004, Zaghloul and Boahen, 2004a,b]. Although the
programmability of such chips is limited, they may represent a good compromise between
computational capability and low power operation. [Zaghloul and Boahen, 2004a,b] de-
scribe the development and testing of a silicon neuromorphic retina chip with 5760 pho-
toreceptor elements and 3600 ganglion cell outputs. The chip is capable of performing
functions like luminance adaptation, bandpass spatiotemporal filtering, temporal adapta-
tion, and contrast gain control. The chip is 3.5 × 3.3 mm² in size and consumes 62.7 mW
of power. The size and power consumption can be reduced substantially if a similar chip is
developed for only a 1000 channel stimulator and if implemented in modern digital fabri-
cation processes.
An external image acquisition and processing system is likely to be used for the ini-
tial generations of a retinal prosthesis. Such external systems will provide programming
flexibility and will not be limited by the computational requirements. Hardware processors
capable of carrying out generalized signal processing computations or hardware proces-
sors specialized for image processing requirements could be used for the external system.
Digital Signal Processors (DSPs) provide flexibility and ease of programming and hence are suitable for the research phase of image processing, when efficient algorithms capable of improving the performance and quality of life of implant recipients are still being developed. DSPs specially designed for image processing, for example the TMS320 DM642 from Texas Instruments Inc., which consumes about 1.05–1.7 W, would require
a significant battery source to maintain operation for long periods of time. Other signal
processing DSPs from Texas Instruments Inc., like the C55x or C5000 processors require
about 65 mW–160 mW of power and could be run on the equivalent of a cell phone bat-
tery. However, as a comparison, DM642 operates at 4000 million instructions per second
(MIPS), whereas the C55 operates at 400 MIPS which means that the C55x may not have
enough computing power for algorithms like object recognition. Signal processors like the
Blackfin, Sharc and SigmaDSP processors from Analog Devices Inc., also offer a range of
functionalities. These processors have a frequency of operation between 333 - 600 MHz
and are used in video surveillance, audio/voice processing etc. The ADSP 21xx micro
controllers from Analog devices provide performance up to 160 MHz and are suitable for
speech processing and real-time control applications, although not for image processing and video applications. Because of their flexibility and ease of programming, DSPs offer a good option for developing algorithms in software and implementing them on these hardware platforms to obtain a benchmark of the computational capabilities and power that would be required for the desired algorithms. [Tsai et al.,
2009] have also proposed an external camera module consisting of a dual-core processor
with DSP (C55x) and ARM chips with a bluetooth and ethernet capability to communi-
cate with other devices. The module can execute basic image processing algorithms like
smoothing and image sampling for pixelization in real-time.
Once efficient image processing strategies have been determined, a customized low-
power chip mimicking biologically inspired algorithms could be developed. The chip could
be miniaturized so that it could be implanted in the eye with the stimulator chip [Humayun
et al.].
1.4 Organization of this Thesis
The work in this thesis focuses on developing computationally efficient image processing
algorithms executable on portable processors to guide the prosthesis recipients in mobility
and search tasks. The implant will provide vision only in the central visual field of the
patients. The goal is to be able to provide the implantees with information about objects
and layout of surrounding regions in their peripheral visual field. The information could be
provided to them in the form of audio, tactile or even visual cues. The users could be made
aware of the presence of important objects around them and also be guided towards objects
of interest. With limited vision, new environments and new arrangements of objects could be confusing for visually impaired patients. Hence, the goal is to be able to process the
visual images coming in from the camera in a manner similar to the normal human visual
system and extract regions and objects of importance present around the patients. Towards
this goal, algorithms like basic edge detection, contrast enhancement and object detection
which represent the kind of visual information processing done by the retinal and cortical
visual areas in the human visual system are explored first. The major part of this thesis then
focuses on the development of a computationally efficient saliency algorithm, based on an existing primate-inspired visual attention algorithm, for the detection of regions that are found interesting by human observers. The performance of the saliency algorithm is eval-
uated by comparing the similarities between the regions computed to be important by the
algorithm and the regions gazed at by human observers. This algorithm is then used with
a simulated vision set up and a set of normal sighted volunteers in various environments to
assess its efficacy in being a guiding aid to visually impaired or retinal prosthesis patients.
Chapter 2
Image Processing and Hardware Constraints
2.1 Image Processing Approach for Retinal Prosthesis
The work in this thesis evaluates various image processing algorithms to find the
right model that can meet the image processing requirements as well as the computational
constraints put forth by hardware processors for a retinal prosthesis system. As discussed
in Chapter 1, the retinal prosthesis implant will provide vision to the implantee only in
the central 15-20 degrees. A scene camera on a pair of glasses will capture a wider field
of view, at least 40-50 degrees. Information from the central visual field of the image
captured by the camera will be downsampled to match the number of electrodes (pixels).
For the information in the peripheral areas of the image frame, it would be useful to cue
the users using audio, tactile or video cues towards important objects or areas. Filtering
and decimation are two important kinds of processing required to be executed by the image
processing module. The retina and the cortex both process images to detect contrast and
edges/contours of objects that help us identify and recognize objects. Algorithms that could
help with object detection or with segregating regions based on their importance to each user
could be used for the peripheral information processing from the image frame. In this
chapter, the most common methods of carrying out such image processing functions and
also their execution frame rates on different hardware processors are discussed.
2.2 Filtering and Decimation of Images
Low-pass filtering or high-pass filtering of image frames can be implemented by convolving
the image with a filter kernel. The averaging kernel is a simple low-pass filtering kernel.
This kernel calculates an average pixel value around the neighborhood of each pixel and
substitutes this value in the output image. The kernel for an averaging filter is stated in
equation 2.1. A smoother low-pass filtered output can be obtained using the Gaussian
kernel. An example for the Gaussian filter kernel is stated in equation 2.2. The degree of
smoothing provided by Gaussian kernels depends on the standard deviation of the kernel.
This kernel weights the pixels in the neighborhood of each pixel and then calculates the
average. This results in the mean values being weighted more towards the value of the
center pixels compared to the uniform weights given by the averaging filter. The Gaussian
filter helps in preserving edges better than the averaging filter. High weights for the center
pixels, with the weights falling off in the periphery, result in a gentler smoothing of the image [Pratt, 1991].
A 5x5 Averaging Filter Kernel

\[
\frac{1}{25}
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
\end{bmatrix}
\tag{2.1}
\]
A 5x5 Gaussian Filter Kernel

\[
\frac{1}{256}
\begin{bmatrix}
1 & 4 & 6 & 4 & 1 \\
4 & 16 & 24 & 16 & 4 \\
6 & 24 & 36 & 24 & 6 \\
4 & 16 & 24 & 16 & 4 \\
1 & 4 & 6 & 4 & 1
\end{bmatrix}
\tag{2.2}
\]
Decimation of an image refers to the resizing of the current image to a new smaller
image in terms of the number of pixels. Based on the ratio of the size of the input and output
images, the required number of rows and columns are dropped from the input image. For
example, if an image is to be reduced to half its current size, every alternate row and column
are dropped and the remaining rows and columns form the new output matrix. Aliasing due to decimation can be avoided by low-pass filtering the input image before decimation.
Figure 2.1: Example of filtering using averaging and Gaussian kernels and decimation
with and without filtering
A computationally efficient implementation would apply the low-pass filter kernel only to those pixels of the input image that will be present in the output image. Figure 2.1 shows an example of an input image, its corresponding gray scale image, the low-pass filtered outputs using the averaging and Gaussian filter kernels, and the decimated outputs with and without low-pass filtering the image before decimation. The Gaussian filter smooths the image more gradually than the averaging filter, and the decimated output obtained with prior filtering is smoother than the one obtained without filtering. The images have been processed in Matlab using the imfilter and
imresize image processing functions in the Image Processing Toolbox.
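As a rough illustration of this filter-then-decimate scheme, the C sketch below applies the 5x5 Gaussian kernel of equation 2.2 only at the pixel locations that survive decimation by a factor of two. The function name, the 8-bit row-major buffer layout and the border handling are illustrative assumptions rather than part of any particular prosthesis implementation.

#include <stdint.h>

/* 5x5 Gaussian kernel of equation 2.2; the weights sum to 256. */
static const int G5[5][5] = {
    { 1,  4,  6,  4, 1 },
    { 4, 16, 24, 16, 4 },
    { 6, 24, 36, 24, 6 },
    { 4, 16, 24, 16, 4 },
    { 1,  4,  6,  4, 1 }
};

/* Low-pass filter and decimate an 8-bit grayscale image by a factor of two
 * in each direction.  Only the pixels that survive decimation are filtered,
 * which avoids computing output values that would be discarded anyway.
 * in: w x h input image, row major; out: (w/2) x (h/2) output image.       */
void filter_and_decimate(const uint8_t *in, int w, int h, uint8_t *out)
{
    for (int oy = 0; oy < h / 2; oy++) {
        for (int ox = 0; ox < w / 2; ox++) {
            int cx = 2 * ox, cy = 2 * oy;      /* kernel center in the input */
            int sum = 0;
            for (int ky = -2; ky <= 2; ky++) {
                for (int kx = -2; kx <= 2; kx++) {
                    int x = cx + kx, y = cy + ky;
                    if (x < 0) x = 0;          /* replicate border pixels */
                    if (y < 0) y = 0;
                    if (x >= w) x = w - 1;
                    if (y >= h) y = h - 1;
                    sum += G5[ky + 2][kx + 2] * in[y * w + x];
                }
            }
            out[oy * (w / 2) + ox] = (uint8_t)(sum / 256);
        }
    }
}

Repeated application of such a step produces the successively smaller, low-pass filtered images used later for the pyramid-based saliency computations.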
2.3 Image Contrast Enhancement and Edge Detection
Several algorithms have been proposed for edge detection and contrast enhancement using
histogram equalization for computer vision and image processing applications.
Contrast Enhancement
Contrast enhancement improves the visual appearance of images that are over-exposed or under-exposed to light, in which the brighter image areas look white and the darker areas look black, carrying mostly the high frequency or low frequency information respectively. Such images have uneven histogram spreads over their gray scale range. Contrast enhancement for such images modifies their histograms using different transfer functions and equalizes the histogram spread over all the gray scale levels. One common method of histogram modification for grayscale images uses the cumulative distribution function (cdf) and probability density function (pdf) of the image histogram to calculate the transfer function [Pratt, 1991]. For an image x with n pixels and gray scale levels ranging from i = 0 to L, the pdf and cdf are calculated as stated in equations 2.3 and 2.4 respectively.
\[
p_x(i) = \frac{n_i}{n} \quad \text{where } 0 \le i < L \tag{2.3}
\]

\[
cdf_x(i) = \sum_{k=0}^{i} p_x(k) \tag{2.4}
\]
Figure 2.2: Example of contrast enhancement using histogram equalization
To have a uniform density function for the output image, a transfer function for the histogram equalization is given in equation 2.5. In the equation, g_max and g_min refer to the maximum and minimum gray scale levels for the output image and g_min ≤ g ≤ g_max. For each gray level i in the input image, the output gray level is given by g.

\[
g = (g_{max} - g_{min}) \cdot cdf_x(i) + g_{min} \tag{2.5}
\]
Figure 2.2 shows an example of the grayscale image after histogram equalization using
the histeq function in the Image Processing Toolbox of Matlab. A contrast increase in the
output image when compared to the input image can be observed.
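A minimal C sketch of this procedure for an 8-bit grayscale image is given below; it follows equations 2.3 - 2.5 directly, with the function name and the in-place buffer interface chosen only for illustration.

#include <stdint.h>

/* Histogram equalization of an 8-bit grayscale image following equations
 * 2.3 - 2.5: build the histogram, accumulate it into a cumulative
 * distribution, and map each input gray level i to
 * g = (g_max - g_min) * cdf(i) + g_min.                                   */
void equalize_histogram(uint8_t *img, int npixels)
{
    int hist[256] = { 0 };
    for (int p = 0; p < npixels; p++)
        hist[img[p]]++;                        /* n_i of equation 2.3 */

    const double g_min = 0.0, g_max = 255.0;
    double cdf = 0.0;
    uint8_t lut[256];
    for (int i = 0; i < 256; i++) {
        cdf += (double)hist[i] / npixels;      /* equation 2.4 */
        lut[i] = (uint8_t)((g_max - g_min) * cdf + g_min + 0.5);  /* equation 2.5 */
    }

    for (int p = 0; p < npixels; p++)          /* apply the transfer function */
        img[p] = lut[img[p]];
}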
Edge Detection
Edges contain important information about visual images and are present at points where
sharp changes in brightness occur. Edges can provide useful information about object
boundaries in an image frame. Many edge detection algorithms process high frequency in-
formation using filter kernels to detect relevant edges. Smoothing the images using Gaus-
sian kernels before edge detection reduces sensitivity to noise. The larger the width of
the Gaussian kernel, the smaller the sensitivity to noise for the edge detection process. A
typical 5x5 Gaussian kernel is stated in equation 2.6. Various methods using different ker-
nels and thresholds for edge detection are discussed here and can be found in [Forsyth and
Ponce, 2002].
Gaussian Kernel

\[
\frac{1}{159}
\begin{bmatrix}
2 & 4 & 5 & 4 & 2 \\
4 & 9 & 12 & 9 & 4 \\
5 & 12 & 15 & 12 & 5 \\
4 & 9 & 12 & 9 & 4 \\
2 & 4 & 5 & 4 & 2
\end{bmatrix}
\tag{2.6}
\]
The sharp changes in brightness that mark an edge can be detected either by taking the second derivative of the image and finding its zero crossings, or by taking the first derivative and finding the gradient magnitudes and orientations at different pixels. The image is smoothed before taking the derivatives. For the first-order derivative, the gradient magnitude and orienta-
tions are calculated. Gradients are calculated by taking running differences between pixels
along rows and columns of the image or for diagonal edge gradients, along diagonal pairs
of pixels. Two-dimensional gradient kernels performing differentiation in one direction
and spatial averaging in the orthogonal direction simultaneously can be used to reduce the
sensitivity of the edge detection process to small luminance fluctuations in the image. The
pixels whose gradient magnitude is maximum along the direction perpendicular to the edge
are edge pixels. Edge detection kernels like the Prewitt and Sobel edge detectors are stated
in equations 2.7 and 2.8. The Prewitt operator is more sensitive to horizontal and vertical
edges while the Sobel operator is more sensitive to diagonal edges.
Prewitt Kernel for the Horizontal and Vertical Directions

\[
\frac{1}{3}
\begin{bmatrix}
-1 & 0 & 1 \\
-1 & 0 & 1 \\
-1 & 0 & 1
\end{bmatrix}
\qquad
\frac{1}{3}
\begin{bmatrix}
1 & 1 & 1 \\
0 & 0 & 0 \\
-1 & -1 & -1
\end{bmatrix}
\tag{2.7}
\]
Sobel Kernel for the Horizontal and Vertical Directions

\[
\frac{1}{4}
\begin{bmatrix}
1 & 2 & 1 \\
0 & 0 & 0 \\
-1 & -2 & -1
\end{bmatrix}
\qquad
\frac{1}{4}
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
\tag{2.8}
\]
Edge detection using a second-order derivative is done by calculating the second-order
Laplacian derivative which is defined as in equation 2.9. Taking the Laplacian can also be
represented as convolving the image with the Laplacian of the kernel used for smoothing.
The image can thus be convolved with a Laplacian of Gaussian and the pixels where the
output has zero crossings can be marked as edge pixels.
\[
(\nabla^2 f)(x, y) = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} \tag{2.9}
\]
The kernel for the Laplacian operator can be defined as stated in equation 2.10. When convolving with this kernel, the difference between the center pixel and the average of its surrounding pixels is computed. This difference will be large at edges and small in other areas, thus
marking the edges in the image frame.
Examples of Laplacian Kernel

\[
\frac{1}{4}
\begin{bmatrix}
0 & 1 & 0 \\
1 & -4 & 1 \\
0 & 1 & 0
\end{bmatrix}
\qquad
\frac{1}{8}
\begin{bmatrix}
1 & 1 & 1 \\
1 & -8 & 1 \\
1 & 1 & 1
\end{bmatrix}
\tag{2.10}
\]
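A minimal C sketch of this second-derivative approach is given below: the image, assumed to be Gaussian smoothed already, is convolved with the first 3x3 Laplacian kernel of equation 2.10 and sign changes between neighboring responses are marked as edge pixels. The function name and buffer layout are illustrative assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Second-derivative edge detection: convolve the (already smoothed) image
 * with the first 3x3 Laplacian kernel of equation 2.10 and mark sign
 * changes between neighbouring responses as edge pixels.  The 1/4
 * normalization is omitted because it does not affect the zero crossings. */
void laplacian_zero_cross(const uint8_t *in, int w, int h, uint8_t *edges)
{
    int *lap = malloc(sizeof(int) * (size_t)w * h);
    memset(edges, 0, (size_t)w * h);

    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            lap[y * w + x] = in[(y - 1) * w + x] + in[(y + 1) * w + x]
                           + in[y * w + x - 1] + in[y * w + x + 1]
                           - 4 * in[y * w + x];

    for (int y = 1; y < h - 2; y++)
        for (int x = 1; x < w - 2; x++) {
            int c = lap[y * w + x];
            int r = lap[y * w + x + 1];        /* right neighbour */
            int d = lap[(y + 1) * w + x];      /* neighbour below */
            edges[y * w + x] = ((c > 0) != (r > 0)) || ((c > 0) != (d > 0)) ? 255 : 0;
        }

    free(lap);
}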
An optimal edge detection algorithm that improves upon the already existing edge detection methods was proposed by Canny [Canny, 1986]. Its several stages add to the computational complexity of the algorithm but also give better results than the filter masks described above used on their own. The first stage of Canny's edge detection removes noise
using a low-pass Gaussian filter typically represented by a 5x5 kernel mask. The second
stage calculates the edge gradient and direction using the first derivative values obtained
after applying the Sobel, Prewitt or other edge detecting kernel masks on the low-pass
filtered image. The gradients are the square root of the summation of the squared first
derivatives in the vertical and horizontal direction at each pixel value as stated in equation
2.11. The edge direction is the inverse tangent of the ratio of the first derivative values in the
vertical direction and first derivative values in the horizontal direction as stated in equation
2.12. Edge direction angles are rounded to 0, 45, 90 or 135 degrees. A process of non-
maximum suppression then determines if the gradient magnitudes are the local maximum
in the respective gradient directions. Following this is a process of hysteresis that uses two
thresholds - high and low. Large intensity gradients are more likely to be the edges and
hence the image values are first thresholded using the high threshold. Pixels with values
higher than the threshold are marked as edges. After this, the pixels connected to these
edge pixels and having a value greater than the lower threshold are also marked as edges.
Using only one threshold makes the edge markings susceptible to noise. The problem with
this implementation lies in finding the correct threshold values. Very high thresholds can
miss important edges whereas very low thresholds can detect noisy pixels as edges. It is
also difficult to find generalized thresholds that would work for all kinds of images.
\[
G = \sqrt{G_x^2 + G_y^2} \tag{2.11}
\]

\[
\theta = \arctan\left(\frac{G_y}{G_x}\right) \tag{2.12}
\]
Figure 2.3 shows the edge detection outputs using the Prewitt, Sobel, Laplacian and
Canny edge filters. The outputs for the Prewitt, Sobel and Canny filters have been gen-
erated with the ’edge’ function in the Image Processing Toolbox of Matlab using default
parameters. The Laplacian output has been generated using the second Laplacian kernel
specified in equation 2.10.
Figure 2.3: Outputs using different edge detection kernels
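The C sketch below illustrates the first-derivative approach of equations 2.11 and 2.12 using the Sobel kernels of equation 2.8, with a single fixed threshold on the gradient magnitude in place of the non-maximum suppression and hysteresis stages of the Canny detector. The function name, buffer layout and border handling are illustrative assumptions.

#include <stdint.h>
#include <math.h>

/* First-derivative edge detection with the Sobel kernels of equation 2.8.
 * The gradient magnitude (equation 2.11) is compared against a single
 * fixed threshold; the direction (equation 2.12) would be needed for the
 * non-maximum suppression stage of the Canny detector, which is omitted
 * here.  Border pixels are left unmarked.                                 */
void sobel_edges(const uint8_t *in, int w, int h, uint8_t *edges, int thresh)
{
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            const uint8_t *p = &in[y * w + x];
            int gx = -p[-w - 1] + p[-w + 1]
                     - 2 * p[-1] + 2 * p[1]
                     - p[w - 1]  + p[w + 1];
            int gy = -p[-w - 1] - 2 * p[-w] - p[-w + 1]
                     + p[w - 1]  + 2 * p[w] + p[w + 1];
            double mag = sqrt((double)gx * gx + (double)gy * gy);  /* equation 2.11 */
            double dir = atan2((double)gy, (double)gx);            /* equation 2.12 */
            (void)dir;
            edges[y * w + x] = (mag > thresh) ? 255 : 0;
        }
    }
}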
2.4 Object Detection and Recognition
2.4.1 Visual Attention and Object Detection/Recognition in the Visual
Cortex
All objects can be described using a combination of several features. In the human vi-
sual processing system, these features could be a part of the dorsal or ventral streams in
the visual cortex. To perceive an object with multiple features and properties, the cell
types processing each feature of the object, e.g. color, form, motion etc., have to be
associated through a binding mechanism. Psychophysical studies have shown that focused
attention on different aspects of the visual field is required for such associations [Treisman
and Gelade, 1980]. Visual perception involves a pre-attentive process also called bottom-
up processing that searches for objects based on basic features like color, orientation, size
or direction of motion in parallel. This is followed by a slow attentive process that is serial
and top-down and is required when the object is represented by a conjunction of the basic
features. If the object of interest differs from its distractors in one feature and the distractors
are homogeneous, it can be detected in parallel and quickly. However, if the object differs
in more than one feature from its distractors, serial search is required. Treisman proposed
that these different features are coded in the form of feature-maps in different regions of the
brain and that there could be a master map that codes for the conjunctions of these features
in the image. The master map receives inputs from all feature maps but retains only those
features that are the distinguishing factor for the object of attention.
2.4.2 Object Detection / Recognition Algorithms
Several object detection and recognition algorithms have been implemented and developed
in the field of computer vision. Object detection for a retinal prosthesis can detect objects
in the visual field and cue subjects towards their location. Object recognition can aid the
subjects by giving them extra information about the type of object along with its location.
Object Detection using Adaptive Thresholding
One of the simplest and most computationally efficient object detection algorithms implements adaptive thresholding for an image frame [Shapiro et al., 2002]. Using thresholding,
the image pixels are segmented to belong to either the background or the object. A binary
image is created by assigning a value 0 to the background pixels and value 1 or 255 to the
object pixels. Choosing the correct threshold is key to getting an accurate and desired
segmentation of the object and the background. Thresholds can be either chosen manually
or can be calculated using adaptive thresholding. Extracting the mean or the median values
of the image pixels to use as a threshold is the simplest manner of choosing a threshold.
However, most times this does not result in a very good segmentation. Manual thresholds
may not work very well for all types of images and hence adaptive thresholding is a better
option. Adaptive thresholding is an iterative process with the following stages.
1. Any pixel value is chosen as an initial threshold T
2. For an image I, pixels with values > T (PG) and values < T (PS) are segregated
3. Mean of pixel values PG and PS is calculated
4. New threshold T’ = (mean(PG) + mean(PS))/2
5. Compare T’ to T. If T’ and T are different, a new iteration starts from step 2 with T
= T’
This algorithm converges in very few iterations and hence is computationally efficient to
implement. Figure 2.4 shows an example of segmentation using adaptive thresholding. The
output image shows the segmentation of image regions like the sofa, door way, books, etc.
marked as the objects and the floor, walls and ceiling marked as the background.
Figure 2.4: Example of an input image and its object detection output using thresholding
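A minimal C sketch of the iterative threshold selection described in steps 1 - 5 above is shown below; the returned threshold would then be used to binarize the image into object and background pixels. The initial guess and the function interface are illustrative assumptions.

#include <stdint.h>

/* Iterative threshold selection as described in steps 1 - 5 above: split
 * the pixels around the current threshold, move the threshold to the
 * midpoint of the two group means, and stop when it no longer changes.   */
uint8_t adaptive_threshold(const uint8_t *img, int npixels)
{
    int T = 128;                               /* step 1: arbitrary initial guess */
    for (;;) {
        long sum_hi = 0, sum_lo = 0, n_hi = 0, n_lo = 0;
        for (int p = 0; p < npixels; p++) {    /* step 2: segregate around T */
            if (img[p] > T) { sum_hi += img[p]; n_hi++; }
            else            { sum_lo += img[p]; n_lo++; }
        }
        long mean_hi = n_hi ? sum_hi / n_hi : T;    /* step 3: group means */
        long mean_lo = n_lo ? sum_lo / n_lo : T;
        int T_new = (int)((mean_hi + mean_lo) / 2); /* step 4: new threshold */
        if (T_new == T)                        /* step 5: converged */
            return (uint8_t)T;
        T = T_new;
    }
}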
Viola - Jones Classifier for Object Recognition
Rapid face detection using simple features of gray scale images, which can easily be generalized to object detection and recognition, was proposed by Viola and Jones [Viola and Jones, 2001]. The concept of integral images was introduced, where the value of the integral image at a location (x,y) is the sum of the pixels above and to the left of (x,y), including (x,y). Using these integral image representations at various scales, a simple set of features similar to the Haar basis functions can be computed easily. Rectangular features are computed using two, three or four adjacent or diagonally arranged rectangles. Using integral images, the sum over any rectangle can be calculated with only four references, which speeds up the feature computation process. The Gentle AdaBoost algorithm [Freund and Schapire, 1997] is used as
the learning mechanism to select a set of the simple features and train the classifier because
the training error for this classifier exponentially approaches zero as the number of rounds
increases. For a 24x24 image sub-window, the number of rectangular features is 180,000
which is much larger than the number of pixels. A cascade of classifiers decreases the com-
putation time and increases detection performance by having a series of boosted classifiers
that reject negative sub-windows. In the cascade, a positive result from one classifier trig-
gers the next classifier and so on until the sub-window is rejected by one of the classifiers.
If there’s a negative result at any stage of the classifiers, the sub-window is completely
rejected there on. This leads to efficient processing and high quality recognition by the
classifier. The learned cascade for each target object is then used for recognizing the object
in test images. A search window started from the upper-left corner pixels of the image is
checked for the presence of the target object. If the object is not present, the search window
is moved by a few pixels. The sub-window is moved over the entire image until the object
is detected.
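The C sketch below illustrates the integral image idea that underlies these rectangular features: the integral image is built in a single pass, after which the sum over any rectangle is obtained with four look-ups. The function names and the unsigned 32-bit accumulator are illustrative assumptions and do not constitute the detector itself.

#include <stdint.h>
#include <stdlib.h>

/* Integral image as used by the Viola-Jones detector: ii(x, y) holds the
 * sum of all pixels above and to the left of (x, y), inclusive.  The
 * caller is responsible for freeing the returned buffer.                 */
uint32_t *build_integral_image(const uint8_t *img, int w, int h)
{
    uint32_t *ii = calloc((size_t)w * h, sizeof(uint32_t));
    for (int y = 0; y < h; y++) {
        uint32_t row = 0;                      /* running sum of the current row */
        for (int x = 0; x < w; x++) {
            row += img[y * w + x];
            ii[y * w + x] = row + (y > 0 ? ii[(y - 1) * w + x] : 0);
        }
    }
    return ii;
}

/* Sum over the rectangle with inclusive corners (x0, y0) and (x1, y1),
 * obtained with only four references into the integral image.            */
uint32_t rect_sum(const uint32_t *ii, int w, int x0, int y0, int x1, int y1)
{
    uint32_t a = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;
    uint32_t b = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;
    uint32_t c = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;
    uint32_t d = ii[y1 * w + x1];
    return d - b - c + a;
}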
Scale Invariant Feature Transforms (SIFT) for Object Recognition
SIFT uses local image features that are invariant to image scaling, translation, rotation and
partially invariant to illumination changes and affine or 3D projections for object recogni-
tion [Lowe, 2001, 2004]. Difference of Gaussian images are computed for the scale space.
Locations invariant to image scaling, translation, rotation and minimally affected by noise
are identified from these difference of Gaussians as the key points which are the maxima
and minima in scale-space. To detect the maxima and minima, each point is compared to
its 8 neighbors and if it is a maximum/minimum then it is compared to its corresponding
neighbors one scale below and above. If it is the maximum/minimum of all these points,
it is an extremum. About 1000 key points are calculated for a typical 512 x 512 image. Key
points that are poorly localized on an edge or have low contrast are eliminated. Image gra-
dients and orientations are calculated for the key points in the Gaussian smoothed image at
the scale of the key point. Localized key points are assigned dominant orientations. A SIFT
descriptor is created by taking the pixels in a region around the keypoint and blurring and
resampling of local orientation image planes. A nearest-neighbor indexing method uses the
SIFT keys to identify matching objects. Results show that robust object recognition can be
achieved with a computation time of less than 2 seconds in cluttered and partially occluded
images.
2.5 Computational Models of Visual Attention
Based on the psychophysical models of visual attention, computer vision algorithms have
been developed and implemented. Such computer vision algorithms are used widely in
robotics and other applications like search and surveillance. Based on the theory of feature
integration by Treisman and Gelade, a first computational model of visual attention was
proposed by Koch and Ullman which forms the basic foundation of the visual attention
models that followed after and are used today. The model by Koch and Ullman and several
other derivatives are discussed here.
2.5.1 Koch and Ullman
Koch and Ullman proposed one of the first models for visual attention based on the fea-
ture integration theory by Treisman [Koch and Ullman, 1985]. It was a theoretical model
offering a possible approach for the development and implementation of visual attention
models. They proposed the computation of several different features like orientation, color or motion in parallel, in the form of topographical maps, as part of an early representation. For
each of the different features like orientation, color or motion, selected locations in the vi-
sual field that differ significantly from their surround are represented in the feature map for
that particular feature. This forms a part of the early pre-attentive mechanisms for visual
attention depiction in the model. Succeeding the early representation is the central repre-
sentation or the non-topographical representation. Selective attention combines the feature
maps to form a master saliency map which represents the relative salience for the differ-
ent locations in the visual field. A Winner Take All (WTA) mechanism extracts the most
salient region from this master map and the properties of this location are sent to the central
representation. After this, the WTA mechanism inhibits this region and shifts focus to the
next most conspicuous region and so on. The most salient region at any point can be de-
termined using proximity to the previous region or certain preferences like similarity. This
model focuses only on the bottom-up implementation for visual attention. The selective
attention in a normal human visual system can also be influenced by top-down processes
based on the kind of information in the visual field and the task.
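A crude C sketch of the winner-take-all selection with inhibition of return is given below: the most active location of a saliency map is repeatedly selected and then suppressed together with its neighborhood, so that attention shifts to the next most conspicuous region. The map dimensions, the inhibition radius and the function interface are arbitrary illustrative choices and do not model the proximity or similarity preferences mentioned above.

#include <stdio.h>

#define MAP_W 40
#define MAP_H 30

/* Winner-take-all with inhibition of return: pick the most active map
 * location, report it, then suppress a square neighbourhood around it so
 * that the next shift of attention selects the next most conspicuous
 * region.                                                                 */
void attend(float sal[MAP_H][MAP_W], int n_shifts, int radius)
{
    for (int s = 0; s < n_shifts; s++) {
        int bx = 0, by = 0;
        for (int y = 0; y < MAP_H; y++)        /* winner-take-all: global maximum */
            for (int x = 0; x < MAP_W; x++)
                if (sal[y][x] > sal[by][bx]) { by = y; bx = x; }

        printf("attention shift %d: location (%d, %d)\n", s, bx, by);

        for (int y = by - radius; y <= by + radius; y++)   /* inhibition of return */
            for (int x = bx - radius; x <= bx + radius; x++)
                if (y >= 0 && y < MAP_H && x >= 0 && x < MAP_W)
                    sal[y][x] = 0.0f;
    }
}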
2.5.2 Milanese
Based on the model by Koch and Ullman, Milanese proposed a first implementation to
integrate bottom-up and top-down information from an image frame to detect regions of
interest [Milanese et al., 1994]. This model mimics the pre-attentive and attentive processes
that reduce the computational cost of tasks like object recognition in the human visual
attention system. Feature and conspicuity maps from the image are created using filtering
techniques for the bottom-up information whereas a priori models are used for top-down
information like for object recognition. Local curvature, 16 different orientations and two
color opponencies Red-Green and Blue-Yellow form the different types of features. These
feature maps are analyzed using a conspicuity operator that compares their local values
to their surrounds. This is based on the center-surround mechanism in the visual cortex.
A relaxation process integrates the conspicuity maps into a final binary salience map by
identifying a few convex regions of interest. A simple averaging of various conspicuity
maps could be used to form the salience map. However, such a salience map would be noisy
and may give ambiguous results. Top-down information processing is incorporated using
Distributive Associative Memories (DAM) which are used to learn objects from training
samples. Top-down parameters of interest can be generated by using DAMs to identify
targets during the recognition phase. Object recognition is applied to a small set of regions
detected in the bottom-up salience map. A top-down map is created that highlights objects
that are known and recognized. This map then competes with the bottom-up saliency map
effectively resulting in the final map in which the unknown objects are suppressed and the
known objects are detected as the regions of interest.
2.5.3 Itti and Koch (Neuromorphic Vision Toolkit)
The model proposed by Itti et al. is very popular and widely used in various implementa-
tions of visual attention based models [Itti et al., 1998, Itti and Koch, 2000, Itti, 2000]. The
code is freely available as part of the Neuromorphic Vision Toolkit which facilitates further
research and implementation of computational models of visual attention.
This model is mainly based on the first model proposed by Koch and Ullman and in
parts is based on the model by Milanese. It implements the concepts of feature maps,
conspicuity maps, WTA and region of interest. It also uses filtering to calculate center-
surround contrasts for creating feature maps. Itti et al., in this model, propose an 8-level
pyramid scheme of low-pass filtering and decimation. The different scales of the pyramids
can then be used for the center-surround calculations. They propose a new scheme for
combining the different feature maps and conspicuity maps to form the final salience map.
This scheme is termed Normalization and is computationally less expensive than the relax-
ation process proposed by Milanese. This model uses 7 different streams of feature-maps
namely the Red-Green and Blue-Yellow color streams, the Intensity stream and the 0, 45,
90 and 135 degree orientation streams. The algorithm development work in this thesis is
based on this model by Itti et al. and hence detailed description of this model is provided in
Chapter 3. Figure 2.5 shows an example of an input image and the corresponding salience
map computed by the algorithm.
Figure 2.5: Example of input image and corresponding salience map computed by the
visual attention algorithm by Itti et al.
2.5.4 VOCUS (Frintrop et al.)
Based on the model by Itti et al., Frintrop et al. proposed a bottom-up implementation as
part of their visual attention system named VOCUS [Frintrop, 2005]. This implementa-
tion extracts 10 different feature streams from the input image. There are 6 color streams
for red, yellow, green, cyan, blue and magenta, 2 intensity streams for the ON and OFF
streams and 4 orientation streams for 0, 45, 90 and 135 orientations. Gaussian pyramids at
4 levels are created for these feature streams by successively low-pass filtering and down-
sampling the image. Conspicuity maps for the intensity and color streams are created using
center-surround interactions. The center-surround computations are carried out at 6 dif-
ferent scales creating a total of 42 feature maps for the 2 intensity and 6 color streams.
The center-surround computation method proposed in this model is computationally more
intensive than the one proposed by Itti et al. Differences between each pixel value and
the average of the surrounding pixels are calculated as part of the center-surround calcu-
lations. Gabor filters are used at 3 different scales to create 12 orientation feature maps
for the 4 different orientations. All these feature maps then combine to form conspicuity
maps of the respective features and after a process of normalization, these are combined
to form the final saliency map. They also propose a normalization method different from
the first method proposed by Itti et al. The first normalization method proposed by Itti et
al. involved normalizing the maps to a fixed range and then multiplying the maps with the
squared difference between the global maximum and average of the local maxima. This
method was not biologically plausible and was biased towards enhancing those feature
maps that had one particular peak of activity significantly more conspicuous than the other
peaks of activity in the map [Itti and Koch, 1999]. To this end, Itti et al. proposed another
iterative normalization method for enhancing peaks of strong activity and suppressing other
peaks with lower activity. Based on Itti’s first normalization method, Frintrop et al. pro-
posed a normalization method where each map is divided by the square root of the number
of local maxima in a pre-specified range from the maximum.
As part of the VOCUS system, Frintrop et al. also proposed a top-down approach for
object detection to work with the bottom-up saliency [Frintrop et al., 2005]. This work is
discussed more in detail in Chapter 3.
2.6 Hardware and Computational Limitations
2.6.1 DSP and FPGA
The image processing module for a retinal prosthesis system would consist of a low-power
battery operated image processing unit that is portable and wearable on a belt. A require-
ment for a battery operated image processing unit can put significant constraints on the
kind of processors that can be used. Possible options are low-power DSP (Digital Signal
Processors) or FPGA (Field-Programmable Gate Array) chips. FPGA chips used to be
very power hungry. However, with advances in technology, their power requirements have been substantially reduced. DSPs come in a range of chips based on different
functionalities and power consumptions. DSP chips have an added advantage of ease of
programmability. DSP chips can be programmed using a C/C++ software interface and the
code can be ported onto the target chip for execution on the hardware. With FPGA chips,
the algorithm has to be programmed using a hardware description language (HDL) like Verilog or VHDL. The algorithm, once programmed, is synthesized to create its hardware logic. FPGAs offer more flexibility with hardware by allowing more gates and hardware logic to be added as needed according to the project requirements. However, complex and high-level designs with more logic will consume more power. During the
research phase, the kind and complexity of image processing algorithms is unknown and
the parameters and models are continuously changing. Hence, for research purposes, DSPs
offer more flexibility with programming. Once algorithms are finalized and power require-
ments for the chip are clear, it should be simpler to move to FPGAs. With the programming
flexibility in mind, DSPs from Texas Instruments Inc. were used for the programming, im-
plementation and benchmarking of various image processing algorithms.
2.6.2 Power and Computational Efficiency Trade-Off
Here we discuss a range of DSPs from Texas Instruments (TI) that could be used for the
image processing algorithm development for this project. TI offers a wide range of DSPs
in the C5000 series, C6000 series and the DaVinci family for video and image processing
applications. The DaVinci series of video processors ranges from ARM solutions to DSP
based System on Chips (SoCs) that provide a variety of video encoding and decoding ca-
pabilities. Some of these processors combine a C64x DSP core with an ARM processor
for increased performance. Image processing DSP kits like the one for the DM642 have
input ports for taking in camera information in real-time and output ports for displaying the
processed image frames. For real-time applications, it is necessary that the processor have
the capability of executing the required computations and displaying the output before the
next input frame from the camera is obtained.
Table 2.1 lists the number of instructions per second in terms of MIPS (Millions of Instructions Per Second) or MFLOPS (Millions of Floating-Point Operations Per Second) for the different processors mentioned above. Processors can be either fixed-point or floating-point
and hence the number of instructions per second are either in MIPS or MFLOPS. Fixed-
point and floating point precision can both be used to implement numbers on fixed-point
processors. However, using fixed-point precision where possible makes the implementa-
tions more efficient.
As can be observed from table 2.1, the power consumption of processors increases
as their frequency of operation and the number of instructions per second increase. The
C5000 series of processors offer the least power consumption but also offer the least MIPS
which would put constraints on the complexity of the algorithms. As discussed before, for
a real-time algorithm, the execution of each frame of data should be fast enough so that the
frame is processed before another frame is inputted, i.e. each frame should be processed in 1/f seconds, where f is the frame rate of the camera input, e.g. 60 frames/second.

Table 2.1: Comparison between different TI DSP processors

Processor   Frequency of Operation (MHz)   Power Consumption (Watts)   MIPS / MFLOPS        Time per instruction cycle (ns)
DM642       500 / 600 / 720                1.3 / 1.9 / 2.15            4000 / 4800 / 5760   2 / 1.67 / 1.39
C6711       150                            1.1                         900 MFLOPS           6.7
C641x       500 / 600                      0.64 / 1.04                 4000 / 4800          2 / 1.67
C54x        50 - 160                       0.04 - 0.09                 50 - 532             0.02 - 6.25
C55x        144 - 200                      0.065 - 0.16                288 - 400            6.94 - 2.5

For
computationally complex algorithms, processors from the C6000 and DM642 (DaVinci)
families might be better fits with high MIPS/MFLOPS but also with a higher power con-
sumption. As the algorithm model complexities and computational requirements were un-
known at the start of the project, the DM642 Imaging Developer’s Kit (IDK) was chosen
for the research and development of image processing algorithms for the retinal prosthesis
imaging module.
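As a rough budget, a camera running at 30 frames/second leaves 1/30 ≈ 33 ms per frame; a 600 MHz DM642 executing up to 4800 MIPS could then spend on the order of 160 million instructions per frame, whereas a C55x at 400 MIPS would have only about 13 million instructions available in the same window. These are back-of-the-envelope figures that ignore memory stalls and I/O overhead, but they give a sense of the computational headroom offered by the DM642 for exploratory algorithm development.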
The DM642 (600 / 720 MHz) IDK is a specialized kit for video and image processing
applications. It has 1024 MBytes of external memory and 256 KBytes of on chip memory.
It also has 3 video port peripherals that provide interface to video encoder and decoder
devices and support video capture and display modes. Each video port also has a 5120
byte capture and display buffer with 2 channels amongst which the buffer capacity could
be split. The DM642 has a VLIW (Very Long Instruction Word) architecture that is highly
parallel and makes the execution of multiple instructions per cycle possible. A schematic
of the IDK kit with the DM642 chip is shown in figure 2.6.
Figure 2.6: DM642 Schematics Diagram
2.6.3 Benchmarking Basic Image Processing Algorithms on DSPs
As a first step towards algorithm development, the capability of the DSP chips in terms
of the computational load for different image processing algorithms was evaluated. Basic
image processing algorithms like edge detection, decimation, object detection etc. were
implemented on the DSP kits. Initial development and implementation was done on the
TMS320C6711 floating point processor. Later, the development was switched to the DM642
board. The results here are stated for the benchmarking of the different algorithms on the
respective boards.
Programming and User Interface for the DSP Boards
The C6711 and DM642 kits provide a Code Composer Studio software interface for the
user to program in C. Basic software codes for the capture and display of video information
are a part of the software package that comes with the kit. Once the code is programmed
and built, a JTAG emulator interface enables porting it onto the hardware kit. Connecting
the output ports to a video display unit like a monitor enables the display of the video
output from the image processing algorithm. Simulink from Mathworks Inc., offers a wide
range of toolboxes that can do signal processing, image and video processing and many
other mathematical and input output functions. These toolboxes consist of different blocks
and blocksets that can perform different functions. A working model can be created by
connecting many such blocks according to the flow of the algorithm model. In the absence
of certain specific blocksets required by the algorithm, user code for the missing blocksets
can be implemented by utilizing blocksets for customized MATLAB and C code. Simulink
toolboxes can interface with Code Composer Studio and can convert the Simulink blockset
model into C code to target the hardware processor.
For the initial implementation of basic image processing algorithms, programming was
carried out using C in the Code Composer Studio environment. The later, more complex image processing algorithms in the thesis were implemented using Simulink models.
Performance of Different Algorithms on the DM642 and C6711
The algorithms that were implemented on the C6711 are filtering using the averaging and
Gaussian filters, the edge detection and the object detection algorithm using thresholding.
Of these, the object detection algorithm was also implemented on the DM642. The results
of the performance of these algorithms on the two processors are shown in table 2.2.
Table 2.2: Execution rates for basic image processing algorithms on different processors

Algorithm                                            Number of frames/sec
Low-Pass Filtering and Decimation on the C6711       30
Edge Detection on the C6711                          22
Object Detection using Thresholding on the C6711     12
Object Detection using Thresholding on the DM642     24
The table lists the number of frames processed in one second for each of these algorithms. Implementing the 2-D filter kernels as two separable 1-D filter kernels for the horizontal and vertical directions speeds up the computation and makes the filtering implementation efficient. Convolution with an MxN 2-D filter kernel requires MxN multiplications and accumulations for each sample pixel, whereas convolution with separable 1-D filter kernels requires only M+N operations per sample pixel, making separable convolution implementations much more efficient than 2-D kernel implementations.
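As an illustration of the separable approach, the C sketch below smooths an image with the 5x5 Gaussian of equation 2.2 by applying the 1-D kernel [1 4 6 4 1]/16 first along the rows and then along the columns, so that each pixel costs 5 + 5 rather than 25 multiply-accumulate operations. The function name, buffer layout and border handling are illustrative assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Separable 5x5 Gaussian smoothing: the 2-D kernel of equation 2.2 is the
 * outer product of the 1-D kernel [1 4 6 4 1]/16, so one horizontal and
 * one vertical pass replace the direct 2-D convolution.  Border pixels
 * are copied through unfiltered to keep the sketch short.                 */
void gaussian_separable(const uint8_t *in, int w, int h, uint8_t *out)
{
    static const int k[5] = { 1, 4, 6, 4, 1 };     /* weights sum to 16 */
    uint8_t *tmp = malloc((size_t)w * h);
    memcpy(tmp, in, (size_t)w * h);
    memcpy(out, in, (size_t)w * h);

    for (int y = 0; y < h; y++)                    /* horizontal pass */
        for (int x = 2; x < w - 2; x++) {
            int s = 0;
            for (int i = -2; i <= 2; i++)
                s += k[i + 2] * in[y * w + x + i];
            tmp[y * w + x] = (uint8_t)(s / 16);
        }

    for (int y = 2; y < h - 2; y++)                /* vertical pass */
        for (int x = 0; x < w; x++) {
            int s = 0;
            for (int i = -2; i <= 2; i++)
                s += k[i + 2] * tmp[(y + i) * w + x];
            out[y * w + x] = (uint8_t)(s / 16);
        }

    free(tmp);
}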
2.7 Discussion
For a retinal prosthesis, the processing of the incoming camera frames to stimulate the elec-
trode array with the visual information has to be real-time. A lag between the camera
capture and the stimulated information to the implantee will create confusion as the visual
information that is being perceived by the user will not have the correct spatial correspon-
dence to the user’s surroundings. This lag may build up over time resulting in incoherent
visual information.
Image processing algorithms can put significant load on the processors depending on their complexity. As observed from their implementation, basic algorithms like filtering, decimation and edge detection can run nearly in real-time on the C6711 and DM642 DSPs. However, the object detection algorithm using thresholding runs only at 12 frames/sec on the C6711 and nearly in real-time at 24 frames/sec on the DM642.
These algorithms are very basic and simple in their functioning. For a system that aims
to provide useful information and aid blind people with their navigation and search tasks,
these algorithms may not suffice. Even the object detection algorithm using thresholding
has its limitations. With images having more than a few objects and crowded environments,
the segmentation with this algorithm may not be the best available. Also, when looking for
specific objects in an image frame, added processing will have to be done on the segmented
image to find the area where the object of interest is present. A cueing system is being
considered to guide the attention of subjects towards the area where an object of interest
is present. Subjects could ask for cues when in new and unknown surroundings to get
an idea of the objects and the layout around them. These cues could be audio, visual or
tactile cues providing information about the direction and region of the objects and areas
of interest. A real-time algorithm would have to process all the information in 1/30th of a second when the cues are being provided continuously. Also, providing these cues to
subjects in real-time may lead to confusion due to continuous information. Thus, the goal
is to provide cues to users on an on-demand basis. Once a user asks for cues, cues can
be provided within a second. This 1 second window allows for extra processing time for
computationally complex image processing algorithms on the DSP.
To implement the cueing system that provides information about important areas and objects in the visual field, a visual-attention-based algorithm is used for the project. Based on the popular biologically inspired model
proposed by Itti et al. which was discussed earlier in this chapter, a simpler implemen-
tation of a saliency detection model is proposed. The rest of this thesis focuses on the
development, implementation and testing of such an algorithm.
Chapter 3
Algorithm to Detect Salient Regions
This chapter discusses the biologically inspired visual attention model by Itti et al. in
detail and presents a computationally efficient model for saliency detection that executes
on the DM642 processor at a much faster rate than the original model. The outputs of both
these models for a set of images are compared with human gaze movements for the same
set of images to analyze the correspondence between the regions computed as salient and
regions gazed at by human observers. The possibility of adding top-down information to
the bottom-up model to improve the performance of the model when searching for specific
target objects at minimal additional computational cost is explored and results for test cases
are discussed.
3.1 Algorithm Model
3.1.1 Saliency Detection Model by Itti et al.
This previously reported algorithm models primate vision by using intensity, color and
orientation information of an image frame to create salience maps [Itti et al., 1998, Itti
and Koch, 1999, 2000, Itti, 2000, Itti and Koch, 2001]. Seven information streams (intensity,
red-green color opponency, blue-yellow color opponency, 0 degree orientation, 45 degree
orientation, 90 degree orientation and 135 degree orientation) are extracted from the input
image. For each of the information streams, 9 spatial scales of dyadic Gaussian pyramids
are formed by successively low pass filtering and down sampling the images by a factor of
two [Greenspan et al., 1994]. The input image for each pyramid is at level 0 after which
8 levels of Gaussian pyramids are created. Center-surround mechanisms observed in the
visual receptive fields of the primate retina are then implemented computationally to create
feature-maps for each information stream. Center scales are at levels c ∈ {2, 3, 4} while
surround scales are at levels s = c + δ, where δ ∈ {3, 4}. Feature-maps are created by taking a
point by point subtraction between the finer center scales and the coarser surround scales
after interpolating the coarser scales to the finer scales. Feature-maps undergo a process of
normalization before they are combined to form 3 conspicuity maps for color, intensity and
orientation. Normalization is a process to promote those maps which have a small number
of peaks with strong activity and suppress the maps which have many peaks with similar
activity. For the process of iterative normalization, each feature-map is first normalized to
a fixed range of 0 to 1. Thereafter, the map is iteratively convolved with a two dimensional
Difference of Gaussian (DoG) filter. The original map is summed with this result and neg-
ative values are set to zero. Feature maps of each kind of information sum up to create the
conspicuity map for that information. For example, red-green color opponency and blue-yellow
color opponency feature-maps are summed together to create a color conspicuity map and
similarly, the feature-maps for the 4 orientations and intensity form the orientation and in-
tensity conspicuity maps respectively. These three conspicuity maps are then normalized
and weighted by the number of feature-maps that formed the particular conspicuity map.
These weighted conspicuity maps on summation form the final salience map which is then
normalized. The region around the pixel with the highest grayscale value signifies the most
salient region. The new algorithm model presented in this chapter is inspired from this
model and uses many of the methodologies of this model. These methodologies are de-
scribed more in detail when the new algorithm model is discussed. For the remainder of
this thesis, the algorithm model by Itti et al. will be referred to as the ‘full model’ and a
different and reduced model is proposed which will be referred to as the ‘new model’. The
algorithm model for the full model is shown in figure 3.1.
3.1.2 The ’New’ Model
As will be discussed in later sections, the full model by Itti et al. is computationally very
expensive to implement on the DM642. Hence, a saliency detection algorithm referred
to as the ’new model’ is proposed, based on the ’full model’ with a few key differences.
The algorithm model for the new model is shown in figure 3.2. The information presented
66
Figure 3.1: Architecture for the Full Model
(Figure printed with permission from: A saliency-based search mechanism for overt and covert shifts of
visual attention, Vision Research, 40(10-12):1489–1506, 2000)
in sections 3.1.2 through 3.3 about the new model algorithm, validation data and DSP
computational complexity is discussed and published in the Journal of Neural Engineering
[Parikh et al., 2010].
The new model uses only 3 information streams, namely color saturation, intensity and
edge. The input image is converted from the RGB (Red-Green-Blue) color space to the HSI
(Hue-Saturation-Intensity) color space. This conversion can be carried out in several differ-
ent ways. For the results in this thesis, the rgb2hsv function in Matlab (Image Processing
Toolbox) from Mathworks Inc. was used for the conversion.
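For readers reproducing this step outside Matlab, the hedged sketch below uses scikit-image's rgb2hsv as a stand-in and treats the HSV value channel as the intensity image; the helper name extract_saturation_intensity and the random test frame are illustrative assumptions, not part of the thesis implementation.

```python
# A minimal stand-in for the Matlab rgb2hsv step (assumes scikit-image is available).
import numpy as np
from skimage.color import rgb2hsv

def extract_saturation_intensity(rgb_image):
    """rgb_image: float array in [0, 1], shape (H, W, 3)."""
    hsv = rgb2hsv(rgb_image)            # channels: hue, saturation, value
    saturation = hsv[..., 1]
    intensity = hsv[..., 2]             # value channel used here as a proxy for intensity
    return saturation, intensity

rgb = np.random.rand(480, 640, 3)
S0, I0 = extract_saturation_intensity(rgb)   # level-0 images for the S and I pyramids
```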
9 scales of dyadic Gaussian pyramids for the Saturation (S), Intensity (I) and Edge (E)
streams are created by successively low pass filtering and down sampling images by a fac-
tor of 2. The Saturation and Intensity images extracted using the conversion from RGB to HSI space form the 0th level of their respective pyramids.
Figure 3.2: Architecture for the New Model
Edge pyramids are generated
using the Laplacian pyramid generation scheme [Burt and Adelson, 1983]. For Laplacian
pyramid generation, the image at each scale of the pyramid is created as a point-by-point
subtraction between the intensity image at that level and the interpolated intensity image
from the next level of the intensity Gaussian pyramids as stated in equation 3.1. The ’ex-
pand’ function in the equation relates to the interpolation of the image at the next level to
make it the same size as the image at the current level. ’I’ and ’L’ refer to the Intensity and
Laplacian pyramids respectively at level i which could range from 0 to 9 for the intensity
pyramids and 0 to 8 for the laplacian pyramids. Figure 3.3 shows an example input image
and the extracted saturation and intensity images from which the Gaussian pyramids for the
saturation, intensity and edge streams are computed.
Figure 3.3: An example input image and its saturation and intensity images
Figure 3.4 shows the Gaussian pyramids for the same example input image for each of the 3 streams. For the sake of clarity of
the images, only six levels of the 8 levels of Gaussian pyramid construction are shown for
all the streams.
L_i = I_i − expand(I_{i+1}),   where i = 0 to 8   (3.1)
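A minimal Python sketch of the pyramid construction just described is given below, assuming a simple 5-tap binomial low-pass kernel and scipy.ndimage.zoom for the expand() interpolation; these choices, and the 512x512 test image, are illustrative assumptions rather than the thesis or DSP code.

```python
# Sketch of dyadic Gaussian and Laplacian pyramid construction (illustrative only).
import numpy as np
from scipy.ndimage import convolve1d, zoom

def blur(img, kernel=np.array([1., 4., 6., 4., 1.]) / 16.0):
    tmp = convolve1d(img, kernel, axis=0, mode='nearest')
    return convolve1d(tmp, kernel, axis=1, mode='nearest')

def gaussian_pyramid(img, levels=9):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(blur(pyr[-1])[::2, ::2])      # low-pass filter, then downsample by 2
    return pyr

def expand(img, target_shape):
    fy = target_shape[0] / img.shape[0]
    fx = target_shape[1] / img.shape[1]
    return zoom(img, (fy, fx), order=1)          # bilinear interpolation to the finer scale

def laplacian_pyramid(intensity_pyr):
    # L_i = I_i - expand(I_{i+1}) as in equation 3.1, for every level with a coarser neighbour
    return [intensity_pyr[i] - expand(intensity_pyr[i + 1], intensity_pyr[i].shape)
            for i in range(len(intensity_pyr) - 1)]

I_pyr = gaussian_pyramid(np.random.rand(512, 512), levels=9)
E_pyr = laplacian_pyramid(I_pyr)                 # edge-stream pyramid
```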
Once the Gaussian pyramids are created, center-surround interactions as observed in
the visual receptive fields of the retina are implemented. Computationally these are imple-
mented in a manner similar to the full model. Levels 1-4 form the center scales and levels
5-8 form the surround scales. For each of the information streams, feature maps are created
using a point-by-point subtraction between different levels of the pyramids.
Figure 3.4: Gaussian pyramids at six successive levels for the saturation, intensity and edge streams
The difference between the two models lies in the levels of the pyramids used to create feature maps. For
the full-model, a total of 42 feature maps for the 7 information streams are created using a
combination of center-surround levels (2-5), (2-6), (3-6), (3-7), (4-7) and (4-8). In the new
model, using only 4 different scales, a total of 9 feature maps are created for the 3 streams
using the center scales ’c’ at levels (3, 4) and surround scales ’s’ at levels s = c + δ, where
δ ∈ {3, 4} and s < 8. The feature-maps are created as a point-by-point subtraction between
levels (3-6), (3-7) and (4-7) for each information stream as stated in equations 3.2, 3.3 and
3.4 for the Saturation, Intensity and Edge feature-maps respectively. The point-by-point
subtraction is done after interpolating the images at the coarser surround scales to the finer
center scales using bilinear interpolation. For the intensity and saturation pyramids, absolute values of the subtraction are calculated. Following these calculations, all the resulting feature map images are scaled by decimation to be 1/16th the size of the original image.
Figure 3.5: Feature maps at Gaussian scales (3-6), (3-7), (4-7) for the input image in figure 3.3 for each of the 3 information streams
Figure 3.5 shows the 3 feature maps for each of the 3 streams.
S(c, s) = |S(c) − S(s)|   (3.2)
I(c, s) = |I(c) − I(s)|   (3.3)
E(c, s) = E(c) − E(s)   (3.4)
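The center-surround step of equations 3.2-3.4 can be sketched as below (an illustrative stand-in, not the thesis code); the dummy 9-level pyramid and the bilinear zoom-based upsampling are assumptions made so the snippet runs on its own.

```python
# Sketch of center-surround feature maps at the scale pairs (3,6), (3,7), (4,7).
import numpy as np
from scipy.ndimage import zoom

def expand(img, target_shape):
    # bilinear upsampling of a coarse surround level to the finer center scale
    return zoom(img, (target_shape[0] / img.shape[0],
                      target_shape[1] / img.shape[1]), order=1)

def feature_maps(pyr, pairs=((3, 6), (3, 7), (4, 7)), absolute=True):
    maps = []
    for c, s in pairs:
        diff = pyr[c] - expand(pyr[s], pyr[c].shape)
        maps.append(np.abs(diff) if absolute else diff)   # abs for saturation/intensity only
    return maps

# Dummy 9-level dyadic pyramid standing in for a real saturation pyramid
dummy_pyr = [np.random.rand(512 // 2**i, 512 // 2**i) for i in range(9)]
S_maps = feature_maps(dummy_pyr, absolute=True)           # use absolute=False for the edge stream
```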
Conspicuity maps for each of the three different information streams are then formed
by a linear summation of the 3 feature-maps of each stream as in equations 3.5, 3.6 and
3.7. We refer to these conspicuity maps as S_c for color saturation, I_c for intensity and E_c for edge.

S_c = Σ_{c=3..4} Σ_{s=c+3..c+4, s<8} S(c, s)   (3.5)

I_c = Σ_{c=3..4} Σ_{s=c+3..c+4, s<8} I(c, s)   (3.6)

E_c = Σ_{c=3..4} Σ_{s=c+3..c+4, s<8} E(c, s)   (3.7)
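The across-scale summation of equations 3.5-3.7 then reduces, per stream, to resampling the three feature maps onto one common grid and adding them; the short sketch below assumes the common grid is 1/16th of the input along each dimension, which is one reading of the decimation step described earlier.

```python
# Hedged sketch of equations 3.5-3.7: sum a stream's feature maps on a common grid.
import numpy as np
from scipy.ndimage import zoom

def conspicuity_map(feature_maps, common_shape):
    acc = np.zeros(common_shape)
    for fm in feature_maps:
        acc += zoom(fm, (common_shape[0] / fm.shape[0],
                         common_shape[1] / fm.shape[1]), order=1)
    return acc

# three feature maps of mixed resolutions summed on an assumed 32x32 grid
dummy_maps = [np.random.rand(64, 64), np.random.rand(64, 64), np.random.rand(32, 32)]
S_c = conspicuity_map(dummy_maps, (32, 32))
```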
Conspicuity maps then undergo an iterative process of normalization which is a com-
petition between peaks of activity in the maps. Normalization is implemented based on the
full model. The difference here is the number of iterations and the number of times normal-
ization is used. The full model normalizes each feature map prior to the linear summation
that forms the conspicuity map and also normalizes the conspicuity maps. The number
of iterations for these normalizations is typically 5. The new model only normalizes the
conspicuity map and uses 3 iterations for the intensity and saturation map normalization
and 1 iteration for the edge map normalization. The number of iterations is chosen based
on the computational load for the processor and pilot studies that were done to examine the
effects of different iterations of normalization on the image maps.
The iterative normalization process promotes maps with a small number of peaks of
strong activity and suppresses maps that have many peaks of similar activity. Figure 3.6
shows this process and is adapted from [Itti and Koch, 2000]. Normalization is referred
to by the operator N in the equations that follow. As observed in the figure, the spatial
competition between the different peaks in the maps enhances the peak corresponding to
the brighter circle in figure 3.6(a) in about 10 iterations. In figure 3.6(b), all the circles
correspond to similar peaks resulting in all the peaks being suppressed at the end of the it-
erative process. The normalization process represents the Winner Take All (WTA) network
suggested by Koch and Ullman [Koch and Ullman, 1985]. It enhances the maps to have a
few peaks of activity depicting a sparse distribution of winner areas in a visual scene.
The normalization process is implemented using a Difference of Gaussians (DoG) filter
[Itti and Koch, 1999, 2000]. The conspicuity maps are first normalized to a fixed range
between 0 and 1. A 2-D DoG filter is then convolved iteratively with the map. The output
of this convolution is summed with the original map and negative values are set to zero.
The DoG filter is constructed as a difference between an excitation gaussian and an inhi-
bition gaussian. The filter results in the excitation of each pixel with inhibition from its
neighboring pixels. Equation 3.8 shows how the DoG filter is constructed. The standard
deviation for the excitation and inhibition lobes is calculated as 2% and 25% of the input
image width respectively.
DoG(x, y) = (0.5 / (2π σ_ex²)) · exp(−(x² + y²) / (2σ_ex²)) − (1.5 / (2π σ_inh²)) · exp(−(x² + y²) / (2σ_inh²))   (3.8)
where σ_ex = 2% and σ_inh = 25% of the input image width
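A hedged Python sketch of this normalization step is shown below: it builds the DoG filter of equation 3.8 from the 0.5/1.5 lobes and the 2%/25% widths given above, then iterates "map + map*DoG" with negative values clamped to zero. The filter truncation radius, the FFT-based convolution and the small epsilon are assumptions for illustration only.

```python
# Illustrative sketch of the iterative DoG-based normalization (not the DSP code).
import numpy as np
from scipy.signal import fftconvolve

def dog_filter(image_width, radius=None):
    sig_ex, sig_inh = 0.02 * image_width, 0.25 * image_width
    radius = radius or int(3 * sig_inh)
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    r2 = x**2 + y**2
    ex = 0.5 / (2 * np.pi * sig_ex**2) * np.exp(-r2 / (2 * sig_ex**2))
    inh = 1.5 / (2 * np.pi * sig_inh**2) * np.exp(-r2 / (2 * sig_inh**2))
    return ex - inh

def iterative_normalize(conspicuity, iterations=3):
    m = conspicuity / (conspicuity.max() + 1e-12)    # normalize map to [0, 1]
    dog = dog_filter(m.shape[1])
    for _ in range(iterations):
        m = m + fftconvolve(m, dog, mode='same')     # self-excitation / lateral inhibition
        m[m < 0] = 0                                 # set negative values to zero
    return m
```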
Figure 3.6: Effects of Normalization on maps with distinct (a) or similar peaks (b)
(Printed with permission from: A saliency-based search mechanism for overt and covert shifts of visual
attention, Vision Research, 2000)
Figure 3.7: Conspicuity salience maps for each of the 3 information streams and the final
salience map for the input image in figure 3.3
Once normalized, the three conspicuity maps are linearly summed, averaged and nor-
malized to form the final salience map. The highest valued pixel signifies the most salient
region. After inhibiting the region around this pixel, the region around the next pixel with
the highest grayscale value is the second most salient region and so on. Inhibiting a region
around each pixel computationally means setting those pixel values to zero and then repeat-
ing the highest pixel search. The three normalized conspicuity maps and the final salience
map for the example image in figure 3.3 are shown in figure 3.7.
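Reading successive cues off the final salience map can be sketched as below: find the maximum pixel, record it, zero out a neighbourhood around it, and repeat. The square inhibition window and its radius are illustrative assumptions; the thesis does not prescribe this exact shape.

```python
# Minimal sketch of extracting successive salient regions with inhibition of return.
import numpy as np

def top_salient_regions(salience_map, n_regions=3, inhibit_radius=8):
    smap = salience_map.copy()
    regions = []
    for _ in range(n_regions):
        y, x = np.unravel_index(np.argmax(smap), smap.shape)
        regions.append((y, x))
        y0, y1 = max(0, y - inhibit_radius), y + inhibit_radius + 1
        x0, x1 = max(0, x - inhibit_radius), x + inhibit_radius + 1
        smap[y0:y1, x0:x1] = 0            # inhibit the already-reported region
    return regions

cues = top_salient_regions(np.random.rand(32, 32), n_regions=3)
```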
3.1.3 Results
Figure 3.8 shows several examples of input images and corresponding salience maps gen-
erated by the new and the full models. On comparing the outputs from both the new and
full models, it can be observed that the regions computed as salient by both the models
are similar. The computed salience maps show the detection of objects like doors, tables,
curbs, walls, cars, signs, chairs etc.
3.1.4 New vs. Full Model
The new model is based on concepts from the full model but also differs from the full
model in a few key areas which makes it computationally much less intensive compared to
the full model. The new model uses only 3 types of information streams vs. 7 in the full
model, 4 scales of Gaussian pyramids to construct feature maps vs. 7 in the full model,
a total of 9 feature-maps vs. 42 in the full model, and implements normalization only 4
times compared to 46 times in the full model. This comparison is tabulated in table 3.1.
The new model substitutes one color saturation stream for the two color opponent
streams in the full model. The color opponent streams are implemented based on the
visual information processing in the primate retina. However, for the image processing
application in consideration, the color saturation stream indicating purer hues with higher
grayscale values and impure hues with lower grayscale values might suffice. This one
stream will then represent all the different hues and their respective grayscale values for the
image frame.
Figure 3.8: Examples of input images and corresponding salience maps computed by the new and full models
Table 3.1: Comparison between the full model and the new model
                                       | Full Model                                            | New Model
Number of information streams          | 7 (RG, BY, Intensity, 0 deg, 45 deg, 90 deg, 135 deg) | 3 (Color Saturation, Intensity, Edge)
Number of Gaussian pyramid scales used | 7 scales at 2,3,4,5,6,7,8                             | 4 scales at 3,4,6,7
Number of feature-maps per stream      | 6                                                     | 3
Total number of feature-maps           | 42                                                    | 9
Number of iterations for normalization | 5                                                     | 3 for Saturation and Intensity, 1 for Edge
Again, instead of the primate attention-based orientation processing, where 4 orientation streams in the full model represent the different cells that are tuned to many ori-
entation directions, we only take one edge stream which signifies the objects in the image
with prominent edges. Also, for the center-surround interactions which create the feature
maps, the new model uses only 4 scales from the Gaussian pyramids. The coarser scale
images from the center and surround levels are used for the feature map creation. These
scales represent the low spatial frequency information for the center and surround levels.
This information may relate more to larger objects with low frequency details in the image
frame. Finer details in images are represented by higher frequencies. This may not be re-
quired for the retinal prosthesis application as the subjects are more likely to be interested
in the low frequency information which would represent bigger objects like doors, furniture
etc.
3.2 Hardware Performance Comparison
The new and the full models were first modeled in Simulink from Mathworks Inc., and then
ported onto the DM642 720 MHz DSP. The full model was only partially implemented
because of its computational complexity. The intensity stream which forms about 14%
of the full model algorithm was implemented and its performance was compared to the
implementation of the complete new model. For computational efficiency, all the 2-D
filters were implemented as separable 1-D filters for both the models. Table 3.2 shows the
comparison between the two models for the time taken by several modules of the models
to process one image frame on the DM642 IDK. Incoming camera information is in the
YCbCr format which has to be first converted to RGB and then to HSI format for the new
model. This adds a couple of modules to the processing stream. However, in the system
implementation, these modules can easily be replaced with a customized hardware chip
separate from the DSP in order to reduce the computational requirements from the DSP for
the new model. The models in this thesis are implemented using single precision floating
point and are not optimized for execution. Fixed-point hardware can implement numbers
in both floating-point and fixed-point precision. Since the DM642 is a fixed-point processor,
using fixed-point precision might make the execution of the algorithms more efficient.
The results in table 3.2 state the execution times for each of the modules and each frame
for the saliency processing by the new and full models. The processing times for each mod-
ule as stated in the table will not add up to the execution time for each frame.
Table 3.2: Comparison of execution time for the new and full models on the DM642
Module                                                             | Execution Time in seconds for New Model | Execution Time in seconds for Intensity Stream of Full Model
YCbCr -> RGB                                                       | 0.1250                                  | -
RGB -> HSI                                                         | 0.0320                                  | -
Gaussian Pyramids                                                  | 0.0647                                  | 0.0695
Laplacian Pyramids (New Model)                                     | 0.0303                                  | -
Feature Maps                                                       | 0.0027                                  | 0.0158
Normalization (at Different Scales of Feature Maps for Full Model) | 0.0096                                  | 0.6062 / 0.0757 / 0.0113
One Entire Frame                                                   | 0.8416                                  | 1.5373
Frames/Second                                                      | 1.1882                                  | 0.6505
Repetitive use of individual modules, other intermediate processing by the algorithm and the processor, and the time for all of this have to be considered when calculating the execution time for
each frame. The results in the table for each frame of processing are computed directly by
benchmarking on the DSP unit. The execution time for one image frame for just the inten-
sity stream which is 14% of the full model is 1.5373 seconds. The estimated execution time
for one frame using the entire full model can be approximated by multiplying the execu-
tion time of the intensity stream with a factor of 7 as the intensity stream is one of 7 similar
streams in the full model. This implies that using the complete full model, one image frame
would process in about 11 seconds on the DM642. In contrast, the execution time for the
new model to process one image frame is 0.84 seconds which implies that the DM642
could process camera information at just about 1 frame per second for this model. This
processing rate is not real-time. However, for an on-demand cueing system as discussed
before, this execution rate is acceptable. When asked for cues by the implant recipients
about the direction of regions or objects in the visual field, the information computed by
the algorithm can be provided to the recipients in less than a second. The results show that
the new model is computationally much more efficient and approximately 10 times faster
than the full model.
3.3 Validation of the algorithm
The goal behind using a saliency algorithm to aid retinal prosthesis implantees is to direct
their attention to important areas in the visual field just like a normal human visual system
would. Besides having a computationally efficient algorithm capable of running on portable
processors, it is necessary to have agreement between regions computed as salient by the
algorithm and areas found interesting by normally sighted human volunteers. The algorithm
performance is validated by comparing algorithm outputs to human gaze data for a set of
images. The details of this experimental design are discussed below.
3.3.1 Subject Population
After the approval from the Institutional Review Board (IRB) at the University of Southern
California (USC), 5 normally sighted volunteers were enrolled for this study and a signed
informed consent was obtained from each participant. Subjects were required to be English speaking with reading knowledge, at least 18 years of age, with no history of vertigo, motion sickness or claustrophobia, no cognitive or language/hearing impairments, and a visual acuity of 20/30 or better with normal or corrected vision (with lenses). Visual acuity
testing was carried out for each participant of the study using a Snellen visual acuity eye
chart in the lab.
3.3.2 Methods
Equipment Set-up
An eye tracking system from Arrington Research, Inc., Scottsdale, Az. acquired gaze data
of the human subjects. The system consists of a Z800 3D Visor Head Mounted Display
(HMD) having a diagonal field of view of 40 degrees. Gaze data was recorded using pupil
tracking at a frequency of 60 Hz by the Viewpoint eye tracking software from Arrington
Research. A set of images were displayed on the HMD at a resolution of 800x600 pixels.
Subjects rested their head on a chin-rest and were seated at a table. A calibration process
using a 12 point rectangular grid preceded the gaze data collection. Subjects looked at the
center of 12 different squares appearing successively on the HMD screen and their gaze
data was recorded by the software. The calibration process used this data to match the
subject’s gaze to spatial locations in the image frame. A good calibration would result in
a fine rectangular grid mapped from the 12 gaze points of the subject looking at the center
of the squares. Accuracy of the calibration was tested by displaying a test image with an
image of a circle in the center of the screen. Subjects were asked to look at the center of
the circle. With accurate calibration, the gaze of the subjects would fall right in the center
of the circle. Whereas with inaccurate calibration, the gaze points of the subjects would
not align with the center of the circle. In such a scenario, the calibration process would be
repeated again. Gaze data recording with images was not done until good calibration was
obtained.
Data Collection
150 natural images consisting of indoor and outdoor environments were each displayed for
3 seconds on the HMD. Subjects were instructed to gaze at the images freely.
No other instructions were given to the subjects in order to avoid biasing them before
the experiment. Between two images, a test image with a white circle was displayed for
subjects to rest their eyes. They were not required to look at the circle when this image
appeared. After every 3 natural images, an image with a red colored circle was displayed.
When this image appeared, subjects were instructed to look at the center of the circle. This
was done to keep a check on the calibration. When subjects looked at the center of the
circle, the calibration was noted. This helped account for calibration drifts at different
stages of the gaze data recording.
Using the calibration offsets during post-processing, an accurate gaze data set corrected
for calibration drifts was created. Fixations and saccades were extracted from the corrected
gaze data set using a custom fixation and saccade filtering software which is a part of the
Neuromorphic Vision Toolkit (NVT) and is available freely on http://ilab.usc.edu. Data
analysis was done using the fixation points from the data set. Drifts in the eye movements
may not be accounted for by the fixation data points. But effects of drifts and minor calibration offsets can be avoided by taking a circular aperture around each fixation point during
data analysis.
3.3.3 Data Analysis
Many studies ascertain the role of basic saliency features in guiding attention to complex
scenes. Methods to compare the computed salient region outputs of the visual attention
algorithms with gaze data from normal human volunteers have been proposed before.
The salient region outputs obtained computationally from the model by Itti et al. were
analyzed using human gaze points obtained for free viewing conditions by Parkhurst et
al. [Parkhurst et al., 2002]. In this method, for all subjects, the salience values from the
salience maps for the k-th fixation location after stimulus onset for all images in the data set are extracted and the mean calculated (s_k). The same process is repeated after randomly
choosing a fixation location on the map for all the images and the mean is calculated (s).
This mean acts as the mean for chance values of salience. If the gaze behavior of humans is
not guided by the bottom-up mechanisms, this mean would be similar to the mean extracted
using fixation locations. The difference between these two means as stated in equation 3.9
is called the chance-adjusted salience (s_a). The probability of human gaze fixations on regions of high salience is high if s_a is positive and the probability of fixating on regions of low salience is high if s_a is negative.

s_a = s_k − s   (3.9)
This study used images belonging to different categories like fractals, natural land-
scapes, building and city scenes and home interiors. The results showed that for all subjects,
the effect of bottom-up mechanisms in guiding attention in natural viewing conditions was
statistically significant for eye movements immediately following the onset of stimulus.
The effect was the most significant for early fixations but remained above chance levels
throughout the trials.
Another study also used the computational model by Itti et al. to compare the behavior
with human gaze movements [Ouerhani et al., 2004]. This method proposed creating a
human attention map from the human gaze data points. For creating the map, each gaze
data point is represented by a Gaussian distributed activity patch of gray-scale values. The
standard deviation of the Gaussian activity is equal to the size of the fovea. Another pa-
rameter a affects the Gaussian amplitude depending on the fixation duration. If a is set
to 1, the amplitude of the Gaussian distribution is proportional to the duration of fixations
at the points. If a is set to 0, the Gaussian distribution amplitude is the same regardless
of the amount of duration of fixations at that point. A correlation between the human at-
tention map (M_h(x)) and the computational salience map (M_c(x)) is computed to observe the correspondence between the two. The correlation factor r is calculated as stated in equation 3.10. m_h and m_c represent the mean values of M_h(x) and M_c(x) respectively. Seven subjects participated in the experiments. Each image was shown for 5 seconds to them with
the instructions to “Just look at the image”. The results showed that the correlation values
were highly variable depending on the different subjects. Overall, when considering the
mean of all subjects, the correlations showed encouraging results between the human gaze
maps and saliency maps.
r = Σ_x [(M_h(x) − m_h)(M_c(x) − m_c)] / √( Σ_x (M_h(x) − m_h)² · Σ_x (M_c(x) − m_c)² )   (3.10)
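Equation 3.10 is a standard normalized cross-correlation between the two maps; a compact sketch (illustrative only, not the cited authors' code) is:

```python
# Pearson-style correlation between a human attention map and a model salience map.
import numpy as np

def map_correlation(human_map, model_map):
    h = human_map.ravel() - human_map.mean()
    c = model_map.ravel() - model_map.mean()
    return np.sum(h * c) / np.sqrt(np.sum(h**2) * np.sum(c**2))

rho = map_correlation(np.random.rand(60, 80), np.random.rand(60, 80))
```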
The contribution of low-level saliency features to human attention and gaze when view-
ing complex dynamic scenes was studied by Itti [Itti, 2005]. He also studied the different
kinds of low-level features that contribute more to attention. Another study showed that
increasing the realism of simulations significantly affected the quantitative measures for
comparing outputs from three variants of the saliency model with human gaze data [Itti,
2006]. Peters et al. used a dynamic range based on the variance of the salience values in
the salience maps and proposed a normalized scanpath saliency method to establish corre-
spondence between salience maps and human gaze data [Peters et al., 2005].
The new saliency model was validated using the methods proposed by Itti [Itti, 2005]
and Peters [Peters et al., 2005], namely the ratio of medians method and normalized scan-
path salience method respectively. These methods are discussed here with the analysis for
our model. Gaze data points from all the subjects were pooled together for each image for
the analysis. For the same set of 150 input images that was shown to the human subjects
on their HMD, salience maps computed by the new and full models were used for a comparison with
the regions gazed at by human volunteers.
Ratio of Medians
This method uses the fixation points extracted from human gaze data and calculates a ratio
of the salience values at the fixation locations and salience values at random locations
(chance values).
S_h = highest value of saliency within a circular aperture of diameter 5.6 degrees centered at the fixation point. High values of S_h signify that human observers fixated at highly salient regions.
S_r = highest value of saliency within a circular aperture of diameter 5.6 degrees centered at a random point chosen from a uniform distribution.
S_max = maximum value of saliency in the computational salience map of the image.
Combining gaze data points from all subjects, each image consists of about 20-40 gaze
points. For each image, the same number of random points are chosen from a uniform
distribution as the number of fixation points to calculate S_r values. For each random point, a set of 100 random points are generated to obtain more accurate estimates of S_r. S_r values for each of these points are calculated. The median value Sm_r of this data set is used for further analysis. For each of the gaze data points in an image, a value of S_h is calculated. Ratios S_h/S_max and Sm_r/S_max are calculated for all the gaze and random points in the image. The median values of these ratios (S_hm and S_rm) are then calculated. The ratio of these medians S_hm and S_rm (equation 3.11) gives a comparison of the salience values for human gaze points to the chance salience values calculated as the salience values for random points.
Ratio = S_hm / S_rm   (3.11)
(3.11)
Higher ratios imply that the saliency values around the fixation points in the computa-
tional maps are greater than the saliency values around random points. This means that the
computational model can predict human gaze locations in an image better than expected by
chance.
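The ratio-of-medians computation described above can be sketched as follows; the 5.6-degree aperture is expressed as a pixel radius and the random-point bookkeeping is simplified, so this is an illustrative approximation of the analysis rather than the exact script behind the thesis results.

```python
# Hedged sketch of the ratio-of-medians metric (equation 3.11).
import numpy as np

def aperture_max(smap, y, x, radius):
    y0, y1 = max(0, y - radius), y + radius + 1
    x0, x1 = max(0, x - radius), x + radius + 1
    return smap[y0:y1, x0:x1].max()

def ratio_of_medians(smap, fixations, radius=10, n_random=100, rng=None):
    rng = rng or np.random.default_rng(0)
    s_max = smap.max()
    s_h = [aperture_max(smap, y, x, radius) / s_max for y, x in fixations]
    sm_r = []
    for _ in range(len(fixations)):                  # one random-point estimate per fixation
        vals = [aperture_max(smap,
                             rng.integers(smap.shape[0]),
                             rng.integers(smap.shape[1]), radius)
                for _ in range(n_random)]
        sm_r.append(np.median(vals) / s_max)
    return np.median(s_h) / np.median(sm_r)          # S_hm / S_rm

ratio = ratio_of_medians(np.random.rand(60, 80), [(10, 20), (30, 50), (45, 70)])
```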
Image Shuffling: Image shuffling is used as a control analysis. With shuffling, instead of
using the human gaze fixation points corresponding to the image being analyzed, the gaze
fixation points corresponding to another randomly chosen image are used. The ratio of
medians analysis is then carried out using the saliency maps of the image being analyzed
and gaze points from another image. These results can then be compared to the results
obtained when using the gaze point data set corresponding to the same image that is being
analyzed.
A statistical sign test analysis with a significance level of 0.0001 was done between the
S_h and Sm_r values for the salience map outputs from both the new and the full models for the cases with and without shuffling. For the shuffling case, the same statistical test was also carried out between the S_h values with and without shuffling. The S_h values without shuffling would be expected to be significantly greater than the S_h values with shuffling for the model outputs to have correspondence with human gaze patterns.
Normalized Scanpath Salience (NSS)
[Peters et al., 2005] proposed a normalized scanpath salience (NSS) method to accommodate
the high inter-subject variability observed in gaze behaviors. The salience map of each
image is normalized to have a zero mean and unit standard deviation. From this normal-
ized salience map, salience values corresponding to gaze fixation locations of the human
observers are extracted. The mean of these salience values is called the Normalized Scan-
path Salience (NSS). A value of NSS greater than zero implies greater correspondence
between the salience maps and the gaze fixation locations than expected by chance alone.
A value of NSS close to zero implies no correspondence between the normalized compu-
tational salience maps and human gaze behavior. If the value of NSS is less than zero, it
represents anti-correspondence between the salience maps and human fixations. For cal-
culating chance values, a random map with a uniform distribution is created at the same
resolution as the salience map. The NSS values for this random map are calculated in the
same manner as stated above. NSS values calculated from the salience maps would be
expected to be significantly greater than zero while the NSS values calculated using the
random maps would be expected to be less than or close to zero.
For the validation experiments for the new model, the NSS values for all gaze data
points were calculated by taking the highest salience values in a region of diameter 5.6
degrees around each fixation point. This was done to account for minor fixation drifts and
calibration offsets. For this analysis also, each random map generation was carried out 100
times for each image to have an accurate chance value analysis.
Figure 3.9: Distribution of gaze fixation and random points
A statistical analysis using a paired t-test with a significance level of 0.0001 was done between the NSS values using
salience maps and NSS values using random maps. NSS values from salience maps would
be expected to be positive and significantly greater than the NSS values using random maps.
3.3.4 Results
Analysis using Ratio of Medians
Figure 3.9 shows an example input image and corresponding salience maps from the new
and full models depicting the gaze fixation points, random points from a uniform distribu-
tion and random points obtained after shuffling images. It can be observed that the gaze
fixation points from human observers correlate well with the salient regions whereas the
random points are as expected randomly distributed.
Table 3.3 shows the ratio of medians analysis for the new and full models using gaze
data corresponding to the images and randomly distributed gaze points. Results show that
using a sign test with a significance level of 0.0001 to compare the S_h and Sm_r values for
Table 3.3: Ratio of medians analysis for the new and full models
           | S_hm   | S_rm   | Ratio of medians | Sign Test (S_hm and S_rm)
New model  | 0.3647 | 0.1020 | 3.5767           | p < 0.0001
Full model | 0.4352 | 0.2457 | 1.7714           | p < 0.0001
all the images and all subjects, the ratios of both the new and full models are significantly
greater than chance (chance => ratio = 1). This indicates that both the models can predict
human gaze behavior better than chance. For the set of images used for the experiment, the
ratio of medians for the new model is higher than for the full model. This shows that in this
particular case, the new model outperforms the full model. As can be seen from figure 3.8, the maps for the full model are slightly more dense than the maps for the new model. This might be a reason why the overall median values of the full model are greater than those of the new model.
Table 3.4 shows the analysis results using the ratio of medians with image shuffling.
Comparing these values with those in table 3.3 shows that the values of the ratios are
lower with shuffling, which is expected. A statistical sign test with a significance level of 0.0001 between the S_h values with and without shuffling indicates that the S_h values without shuffling are significantly higher than the S_h values with shuffling, which is also
expected. However, a statistical sign test with a significance level of 0.0001 shows that the
ratios of the medians after shuffling are significantly higher than one which means better
than chance. This is unexpected but the discrepancy could be explained by the center-bias
effects in the salience maps as well as human gaze behavior.
Table 3.4: Ratio of medians analysis after image shuffling for the new and full models
           | S_hm   | S_rm   | Ratio of medians | Sign Test (S_hm and S_rm)
New model  | 0.2275 | 0.1059 | 2.1481           | p < 0.0001
Full model | 0.3256 | 0.2511 | 1.2970           | p < 0.0001
Figure 3.10: Gaze distribution (a) and average salience map (b) for the 150 image data set
There is a central bias in the gaze patterns of subjects. Subjects have a natural tendency
to start looking at unfamiliar images from the center and then gradually look at the periph-
eral areas of the image. Also, as mentioned in the set-up section, subjects were asked to
look at the circle in the center of a test image after every 3 natural images from the data set.
This was done for calibration purposes but this could also centrally bias the initial fixations
of the subjects. Also, because of the possibility of a photographer bias in the images which
results in interesting objects being in the center of the image, the average salience map
could have a center-bias. The average salience map computed as the mean of the salience
maps for the 150 image data set has a center-bias effect as seen in figure 3.10(b). The
center-bias in the gaze data is shown in figure 3.10(a).
Figure 3.11: Gaze fixation points distribution for the salience and random maps
Analysis using Normalized Scanpath Salience (NSS)
Figure 3.11 shows another example image and corresponding salience maps from the new
and full models along with uniform distribution random maps depicting the gaze fixation
points. It can be observed that the gaze fixation points from human observers correlate well
with the salient regions whereas the random maps are quite noisy.
The results for the NSS analysis with the new and full models are shown in table 3.5.
The table shows the NSS values with the standard error of mean (sem). For both the models,
the NSS values from the salience maps are greater than zero and the values for the random
maps are close to zero as would be expected to have for a good correspondence between the
computational salient regions and human gaze behavior. A paired t-test with a significance
level of 0.0001 shows that the NSS values from the salience maps are significantly greater
than those from the random maps implying that there is a greater correspondence between
the regions detected as salient by the computational algorithm and human fixations than
would be expected by chance.
Table 3.5: Analysis using the NSS method for the new and full models
           | Salience map NSS (s.e.m.) | Random map NSS (s.e.m.) | Paired t-test
New model  | 0.4310 (0.0113)           | -0.0004 (0.0005)        | p < 0.0001
Full model | 0.4758 (0.0098)           | -0.0005 (0.0005)        | p < 0.0001
Center Bias Analysis
To investigate further the result from the image shuffling anal-
ysis with the ratio of medians method that saliency and gaze were correlated even after im-
age shuffling, a center-bias analysis was carried out. Based on the analysis done by [Tatler,
2007], a center-bias analysis for the average salience map and the gaze fixation points of
all the subjects was done. Tatler studied the eye movements of 22 subjects for a set of 120
images to study the center-bias effect. The study showed that irrespective of the presence
of the photographer bias in the image, there was a tendency of subjects to start looking at
the image displayed on the monitor from the center. This is attributed to an initial orienting
response that brings the eye closer to the center of the screen in the initial response with the
first saccade. The correlation between the different features and gaze patterns was observed
but only after the initial centering response.
For the experimental analysis with the new model, all the gaze points from all subjects
and all images are first pooled together as shown in figure 3.10(a). Next, the total number of
gaze points falling into the central 15 degrees of the image is compared to the total number
of gaze points falling into the other areas of the image which we refer to as peripheral areas.
The gaze data is said to have a center-bias if the total number of gaze points in the central
15 degrees is greater than the total gaze points in the peripheral areas. The results show that
from the set of 150 images, 26% of the images have a center-bias in the subject gaze data.
Similarly, for evaluating the center-bias in the salience maps, the total number of pixels
whose grayscale level is the maximum value of the average salience map are found in the
central and peripheral areas of the average salience map. If the total of such maximum
valued pixels is greater in the center than in the peripheral areas, the average salience map
is said to be biased centrally. As shown in figure 3.10(b), there does exist a central bias
in the average salience map. The bias in the gaze data and the photographer bias might
be a reason behind the ratio of medians being greater than 1 even when image shuffling is
used. If the shuffled gaze data set has fixation points which correspond to non-zero salience
regions in the image being analyzed, the ratio will be greater than 1. If there is a constant
presence of the gaze or the photographer bias, some images will most likely correspond
much better with even a shuffled gaze data set. However, such number of images should be
low because of which the overall values of S
hm
and S
rm
in table 3.4 are lower than those in
table 3.3.
Also, in the presence of a constant center-bias in the form of gaze or photographer bias,
it is likely that for some of the images, the correspondence between the salient regions
and human gaze behavior was influenced, making the values of S_hm and thus the ratios go
higher. Another analysis to see how the results of the data set were influenced by a few
gaze or photographer biased images was carried out. From the 150 image data set, images
having either a central gaze-data bias or a central salience map bias were removed and a
ratio of medians analysis was done on the rest of the images with their corresponding gaze
data sets. Results of this analysis are shown in table 3.6.
Table 3.6: Analysis using ratio of medians after removing centrally biased images
           | S_hm   | S_rm   | Ratio of medians | Sign Test (S_hm and S_rm)
New model  | 0.3490 | 0.1020 | 3.4231           | p < 0.0001
Full model | 0.4278 | 0.2519 | 1.6985           | p < 0.0001
Table 3.7: Analysis using NSS after removing centrally biased images
           | Salience map NSS (s.e.m.) | Random map NSS (s.e.m.) | Paired t-test
New model  | 0.4310 (0.0113)           | -0.0004 (0.0005)        | p < 0.0001
Full model | 0.4758 (0.0098)           | -0.0005 (0.0005)        | p < 0.0001
It can be observed that the values of S_hm and S_rm are very similar to the values in table 3.3 which states the results from the
entire image data set.
The new image data set after removing the images biased centrally either because of
photographer bias or gaze data was also analyzed using the NSS method. The results for
this analysis are shown in table 3.7. Again, it is observed that the NSS values are very
similar to those obtained when the entire image data set was used. Here also, a paired t-test
with a significance level of 0.0001 shows that the NSS values obtained using salience maps
are significantly different and higher than the NSS values obtained using random maps,
meaning there is greater correspondence between salient regions detected by the salience
maps and human fixations than expected by chance.
The analysis results for both methods are similar with both the image data sets - the
complete data set and the data set after removing images influenced with center bias. This
implies that the center-bias in either the gaze or the salience map does not affect the correspondence between the computational salience maps and human gaze behavior for the
analyzed image data set.
3.4 Modeling for Top Down Information
The model for saliency detection discussed before is a bottom-up approach. Every incom-
ing frame is processed in the same manner without using any a priori information about the
contents of the image frame. In a normal visual human system, besides the early processes
that guide attention to basic features in the visual field, many top-down processes influence
attention. As a computational algorithm, the bottom-up approach performs very well for
the detection of regions that would be important to human observers. However, when a
human subject is searching for a particular object or a defining feature in the visual field,
combining top-down information about the object or feature being searched for with the
bottom-up maps from the algorithm might make the search more efficient. The bottom-up
algorithm may be able to find the object that is being searched for, but it may not be the most
salient object in the visual field for the algorithm and hence may not be the first directional
cue provided by the algorithm. Combining top-down feature information about the object
with the bottom-up maps can enhance the regions of the target object in the final salient
map and thus detect that object as one of the most salient objects. This enhancement to the
saliency detection algorithm is critical for the retinal prosthesis as visually impaired sub-
jects might sometimes find it difficult to find things like their drinks (e.g., a coke can), cell phones, or books that are hard for them to see because of their limited visual field.
Top down information integration with the bottom-up algorithm is implemented based on
the approach proposed by Frintrop et al. which is discussed in detail below.
3.4.1 Top-Down Approach by Frintrop et al.
As part of VOCUS [Frintrop, 2005], Frintrop et al. propose a target-specific saliency de-
tection approach [Frintrop et al., 2005]. They combine their bottom-up model consisting
of 10 information streams, namely on-off and off-on contrasts; green, blue, red and yellow color; and 0, 45, 90 and 135 degree orientation streams, with a weighting scheme that is
based on top-down information about the target-specific features. This model takes a set
of training images for target objects and their respective feature and conspicuity maps for
each of the information streams and computes target-specific weights which could be used
with the bottom-up saliency maps to enhance the detection of the target object in the final
salience map. This top-down integration is discussed below as it applies to the bottom-up
model in this thesis.
Learning Mode
The model is provided with each training image for the target object and a set of coordinates for the region-of-interest (ROI) that defines the object area in the image frame.
A bottom-up saliency map for the image is computed by the model from which the most
salient region (MSR) inside the ROI is detected. For the 3 information streams, the conspicuity maps for intensity, color and orientation (a total of 3 maps) are used to
calculate weights. The weight w_i for each map X_i is calculated as the ratio of the mean
saliency value in the target region to the mean saliency value in the background as stated in
equation 3.12.
w_i = m(MSR) / m(image − MSR),   where i ∈ {1, ..., 3}   (3.12)
For a training set of n images, the average weight for each of the maps is
calculated as a geometric mean of the weights calculated above. This is stated in equation
3.13.
w_{i,(1...n)} = ( Π_{j=1..n} w_{i,j} )^{1/n},   where i ∈ {1, ..., 3}   (3.13)
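The learning-mode weights of equations 3.12 and 3.13 can be sketched as below; representing the most salient region by a boolean mask and the small epsilon guard are simplifying assumptions made for illustration, not details of the cited implementation.

```python
# Hedged sketch of target-specific weight learning (equations 3.12 and 3.13).
import numpy as np

def map_weight(conspicuity_map, target_mask):
    inside = conspicuity_map[target_mask].mean()      # mean saliency in the target region
    outside = conspicuity_map[~target_mask].mean()    # mean saliency in the background
    return inside / (outside + 1e-12)                 # equation 3.12

def learned_weights(per_image_weights):
    # per_image_weights: shape (n_images, n_maps); geometric mean over images (eq. 3.13)
    w = np.asarray(per_image_weights)
    return np.exp(np.log(w + 1e-12).mean(axis=0))
```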
Search Mode
In the search mode, a global saliency map is created by integrating the top-down and
bottom-up saliency maps using the weights calculated in the learning mode. Maps (X_i) with weights > 1 are important to the target detection and are weighted and summed up to form an excitation map. The maps (X_i) with weights < 1 are also weighted and summed
to form an inhibition map which shows more features of the background than the target.
Excitation and inhibition maps are calculated as stated in equation 3.14.
E = Σ_i (w_i · X_i)   ∀ i : w_i > 1
I = Σ_i ((1/w_i) · X_i)   ∀ i : w_i < 1   (3.14)
The top-down saliency map S_td is computed by subtracting the inhibition map from the excitation map (equation 3.15). It is normalized to the same range as the bottom-up salience map S_bu. These are then combined as stated in equation 3.16 to form the global salience map S, combining the influences from both the bottom-up and top-down maps similar to the mechanisms in the normal human visual system. In equation 3.16, t ∈ [0, 1]. When t = 1, the model uses only the top-down map with target-relevant features, whereas when t = 0, the model uses only the bottom-up map with no information about the target. It is difficult to find a generic value of t and hence t = 0.5 is used to give equal weight to the contributions of the bottom-up and top-down maps.

S_td = E − I   (3.15)

S = (1 − t) · S_bu + t · S_td   (3.16)
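The search-mode combination of equations 3.14-3.16 can be sketched as follows (illustrative only; it assumes all maps share one resolution, and the min-max rescaling used to match the bottom-up map's range is an assumption about how the normalization is done).

```python
# Hedged sketch of excitation/inhibition maps and the global salience map.
import numpy as np

def global_salience(maps, weights, bottom_up, t=0.5):
    excite = np.zeros_like(bottom_up)
    inhibit = np.zeros_like(bottom_up)
    for w, m in zip(weights, maps):
        if w > 1:
            excite += w * m             # maps that carry target features (equation 3.14)
        elif w < 1:
            inhibit += (1.0 / w) * m    # maps that describe the background
    s_td = excite - inhibit                           # equation 3.15
    span = s_td.max() - s_td.min()
    if span > 0:                                      # rescale to the bottom-up map's range
        s_td = (s_td - s_td.min()) / span * bottom_up.max()
    return (1 - t) * bottom_up + t * s_td             # equation 3.16
```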
Frintrop et al. showed in their experiments that using the global salience map, the
average hit number for the detection of the target from the salience map and the number of
images from which the target was successfully extracted improved compared to the results
when only the bottom-up map was used.
3.4.2 The New Model and Top-Down Information
Based on the implementation by Frintrop et al., top-down information was integrated with
the new model for saliency detection. The performance of the new algorithm when search-
ing for target objects like a red colored coke can and a cell phone was evaluated. Each of
these cases are discussed below.
Coke Can Detection
A red colored coke can was used as the target object for this implementation. This imple-
mentation was carried out only in Matlab. A few changes were made to the implementation
of the new model to accommodate the top-down information processing. The algorithm
was modified to include the extraction of the red colored hue information which is a feature
of the target.
The new implementation combined the hue and saturation streams instead of using
only the saturation stream. The hue image was extracted from the RGB to HSV conversion
of the original image and, for the saturation image, the values of only the regions corre-
sponding to the red hue map were extracted. The intensity, edge and saturation streams
were processed as in the new model to create the bottom-up saliency maps for the train-
ing and testing images. In the training phase, weights for each of the intensity, edge and
101
hue+saturation streams were calculated from a set of training images. For testing, using
the weights obtained from the training images, global saliency maps were created using
the top-down maps and bottom-up saliency maps. The most salient region in the global
map is extracted as a region around the highest gray scale valued pixel. To find the next
most salient region, the pixels for the most salient region are inhibited and the process is
repeated.
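The hue-masking used here for the red coke can might look like the sketch below, which keeps saturation values only where the hue falls inside a red band; the specific hue thresholds (and the use of HSV rather than HSI) are illustrative assumptions, not the thesis parameters.

```python
# Hedged sketch of extracting a red-hue-masked saturation image for the target stream.
import numpy as np
from skimage.color import rgb2hsv

def red_masked_saturation(rgb_image, low=0.95, high=0.05):
    hsv = rgb2hsv(rgb_image)
    hue, sat = hsv[..., 0], hsv[..., 1]
    red_mask = (hue >= low) | (hue <= high)   # hue wraps around 0 for red
    return np.where(red_mask, sat, 0.0)

masked_sat = red_masked_saturation(np.random.rand(480, 640, 3))
```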
A set of 92 images was used as the training dataset to calculate weights for weighting the
intensity, saturation and edge feature-maps. For testing, a set of 87 images was used. Figure
3.12 shows a set of training images and the corresponding salience maps. It can be observed
that the coke can is usually one of the salient regions but not always the most salient. Figure 3.13 shows a set of test images and their bottom-up salience maps along with the global salience maps created after combining the top-down and bottom-up salience maps. The coke
can is more salient in the global salience maps than in the corresponding bottom-up only
salience maps. The weights calculated by the training model were [3.0224, 0.4472, 0.4940]
for Saturation, Intensity and Edge maps. The weights show that the saturation stream with
the red hue is the stream contributing the most to the detection of the coke can from the
salience maps. For the testing dataset, the target was detected using the global salience
maps as well as only the bottom-up and top-down salience maps and the average hit number
of the target was compared. The hit number refers to the rank at which the object is detected in a salience map. If the coke can is the most
salient object in the map, its hit number will be 1 and so on. The percentage of test images
from which the target object was successfully extracted was also calculated.
Table 3.8: Hit numbers and percentage of images in which the target coke can is found in the Bottom Up (BU), Top Down (TD) and Global salience maps (t = 0.5)
Hit Number | 1     | 2     | 3     | 4    | 5 - 10 | Average Hit Number
BU         | 26.4% | 8%    | 11.5% | 5.7% | 33.4%  | 4.0135 (85.0575%)
TD         | 43.7% | 13.8% | 5.7%  | 3.4% | 20.7%  | 2.8289 (87.3563%)
Global     | 47.1% | 11.5% | 4.6%  | 5.7% | 17.2%  | 2.6000 (86.2069%)
(Columns 1 through 5-10 give the percentage of images for each hit number.)
The results are stated in table 3.8. The results show that the bottom-up-only model, which does not have any a priori information about the target object, detects the coke can to be the most salient object in about 26% of the images and detects it to be between the 5th and 10th most salient object in approximately 33% of the test images. Using top-down salience maps, the coke can was detected to be the most salient object in approximately 44% of the images and between the 5th and 10th most salient object in approximately 21% of the test images. Using global salience maps, the coke can was detected to be the most salient object in approximately 47% of the images and between the 5th and 10th most salient object in approximately 17% of the test images. On average, the purely bottom-up implementation of the new model detects the coke can in about 85% of the images to be the 4th most salient region in the test image data set. The top-down and global maps detect the coke can in about 86% of
the images to be within the first 2-3 most salient regions.
Figure 3.12: Examples of training images for the coke can and corresponding bottom-up salience maps
Figure 3.13: Images of test cases for the coke can with their bottom-up salience maps and global salience maps created by combining the bottom-up and top down salience maps
Cell Phone Detection
As another test case, a black colored cell phone was used as a target and the weights were
learnt using the training images and tested on another set of images. Using the same top-
down approach as for the coke can, the basic bottom-up saliency algorithm without any additional color information was used to evaluate weights for the intensity, saturation and
edge information streams. The number of training images was 61 and the number of test
images was 108. The weights that were obtained after the training phase were [0.2259,
2.3804, 0.1940] for the saturation, intensity and edge streams respectively. It can be ob-
served that for the black cell phone, the intensity stream provides the most information
about the target features to the saliency detection algorithm. The results for the percentage
of images in which the cell phone was successfully found and the corresponding hit num-
bers are stated in table 3.9. The results again show that the bottom-up-only model, which does not have any a priori information about the target object, detects the cell phone to be the most salient object in about 17% of the images and detects it to be between the 5th and 10th most salient object in approximately 33% of the test images. Using top-down salience maps, the cell phone was detected to be the most salient object in approximately 50% of the images and between the 5th and 10th most salient object in approximately 13% of the test images. Using global salience maps, the cell phone was detected to be the most salient object in approximately 42% of the images and between the 5th and 10th most salient object in approximately 14% of the test images. On average, the purely bottom-up implementation of the new model detects the cell phone in about 86% of the images to be the 4th most salient region in the test image data set. The top-down and global maps detect the cell
phone in about 88 - 91% of the images to be within the first 2-3 most salient regions. It can
be observed that the detection results using the top-down maps are better than those using
the global maps.
Table 3.9: Hit numbers and percentage of images in which the target cell phone is found in the Bottom Up (BU), Top Down (TD) and Global salience maps (t = 0.5)
Hit Number | 1     | 2     | 3     | 4    | 5 - 10 | Average Hit Number
BU         | 16.5% | 17.6% | 12.1% | 6.6% | 33%    | 4.0513 (85.7143%)
TD         | 49.5% | 7.7%  | 13.1% | 7.7% | 13.2%  | 2.6265 (91.2088%)
Global     | 41.8% | 15.4% | 9.9%  | 6.7% | 14.3%  | 2.6750 (87.9121%)
(Columns 1 through 5-10 give the percentage of images for each hit number.)
It can be inferred that for this particular test data set, the bottom-up maps are adding
noise to the top-down maps when creating the global maps and thus the performance dete-
riorates. The new model uses only 3 information streams which could be a reason behind
the noise problem. With only three streams, as seen for both the coke can and the cell
phone, only one stream is the excitation stream and the others are inhibitory. For the cell
phone case, there is added competition from black colored objects frequently found in nat-
ural surroundings that also contribute to the intensity stream. With more feature streams it
would be easier to distinguish between objects that are similar in a few features but differ
from each other in at least one feature. A limited number of feature streams does not provide the bandwidth to distinguish between objects that share the feature that is the excitation
feature for the object of interest. Modifying the value of t to 0.1 in order to reduce the influence of the bottom-up maps improves performance, but again, evaluating a generic value of t that could be used in various test cases is difficult. Figure 3.14 shows a few more
examples of test images for the cell phone and the corresponding bottom-up and global
salience maps. It can be observed here also that the regions where the cell phone is present
are more prominent in the global maps than in the salience maps.
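To make the combination step concrete, a minimal Python sketch of how the bottom-up and top-down maps could be blended into a global map is given below. It assumes, consistent with the discussion above but not necessarily identical to the exact implementation, that the global map is a convex combination in which t weights the bottom-up contribution, so that lowering t from 0.5 to 0.1 reduces the influence of the bottom-up map; the function name and the per-map normalization are illustrative.

import numpy as np

def global_salience(bu_map, td_map, t=0.5):
    """Blend a bottom-up and a top-down salience map into a global map.

    Sketch only: each map is normalized to [0, 1] and the global map is
    assumed to be a convex combination in which t weights the bottom-up
    contribution (t = 0.1 de-emphasizes the bottom-up map).
    """
    bu = bu_map / (bu_map.max() + 1e-12)
    td = td_map / (td_map.max() + 1e-12)
    return t * bu + (1.0 - t) * td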
Figure 3.14: Examples of testing images for a cell phone with the corresponding bottom-
up and global salience maps
3.5 Summary
As part of the algorithmic design for a retinal prosthesis image processing module, a concise and computationally efficient model (the 'new model') for detecting salient regions in an image frame is proposed. This model is based on the widely used visual attention model by Itti et al (the 'full model'). The new model has a few key differences when compared to the full model. The new model uses intensity, saturation and edge information for processing the image. Overall, the number of feature maps generated and the number of streams used for processing are far fewer for the new model than for the full model. This makes the implementation of the new model much more computationally efficient than using the full model for saliency detection. Experiments tracking human gaze data for a set of 150 images show a statistically significant correspondence, greater than chance, between the regions gazed at by human observers and the regions detected as salient by both the new and full models. The quantitative analysis shows that the new model outperforms the full model for this image data set in one of the two analysis methods. The execution results for the unoptimized implementation of the new model and only 14% (one stream) of the full model on the TMS320DM642, 720 MHz DSP show that the new model is approximately 10 times faster than the full model. This is important because the image processing module for the retinal prosthesis is planned to be a wearable module. Improvements in processor speeds and power consumption, along with optimization of the code implementation, can make the model run more efficiently. Increased algorithm efficiency will also result in lower power consumption. However, even an unoptimized version of the new model executes in less than 1 second, which is an acceptable response time when a user submits a request to the algorithm.
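To illustrate the structure of the new model (this is a simplified Python sketch, not the implementation profiled on the TMS320DM642 DSP), the three streams and their combination into a single salience map can be written as follows. The center-surround step is approximated here by a difference of Gaussian-blurred maps, and the stream weights default to equal values; all names are illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def new_model_saliency(rgb, weights=(1.0, 1.0, 1.0)):
    """Simplified sketch of the three-stream salience map.

    Streams: saturation, intensity and edge magnitude, each passed through
    an approximate center-surround operator (difference of two Gaussian
    blurs), normalized, and combined with per-stream weights.
    """
    rgb = rgb.astype(np.float64) / 255.0
    intensity = rgb.mean(axis=2)
    cmax, cmin = rgb.max(axis=2), rgb.min(axis=2)
    saturation = (cmax - cmin) / (cmax + 1e-12)          # HSV-style saturation
    edges = np.hypot(sobel(intensity, axis=0), sobel(intensity, axis=1))

    def center_surround(m, sigma_c=2, sigma_s=8):
        return np.abs(gaussian_filter(m, sigma_c) - gaussian_filter(m, sigma_s))

    streams = [center_surround(m) for m in (saturation, intensity, edges)]
    streams = [s / (s.max() + 1e-12) for s in streams]   # normalize each stream
    w_sat, w_int, w_edge = weights                       # e.g. learned top-down weights
    return w_sat * streams[0] + w_int * streams[1] + w_edge * streams[2]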
The algorithm is designed to provide information to the implantees about the areas in their peripheral visual field. Users could use this algorithm during navigation to avoid obstacles or to search for objects of interest. For navigation and mobility related activities, the basic bottom-up implementation of the algorithm may suffice to detect large objects like tables, doors etc. that may be lying in the path of the subjects. When looking for objects of interest, e.g. an exit sign, a book or a drink, combining the algorithm with top-down information can improve the performance of saliency detection. The top-down approach can easily be used to calculate weights for a set of objects that are of interest to the user, so that they may use the algorithm to look for those objects. Modifying the algorithm to use the hue information of the object of interest and calculating weights for a set of images in the database is not very difficult to implement, nor is it computationally expensive. A set of weights for a large database of possibly interesting objects can be calculated beforehand and, based on the object of interest being looked for, the algorithm can easily be coded to include the particular hue in the processing. This approach of incorporating top-down information about objects of interest does not significantly increase the computational requirements of the image processing module.
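As an illustration of how such per-stream weights could be computed from a labeled training set, a minimal Python sketch is shown below. This is not the exact training rule used in this work; it simply estimates each weight as the mean feature response inside the labeled target region divided by the mean response elsewhere, averaged over the training images, so that the excitatory stream for an object (e.g. intensity for the black cell phone) receives a weight greater than one and competing streams receive weights below one.

import numpy as np

def learn_stream_weights(feature_maps_per_image, target_masks):
    """Estimate one top-down weight per feature stream from training images.

    feature_maps_per_image: list (one entry per image) of lists of 2-D arrays,
        e.g. [saturation_map, intensity_map, edge_map].
    target_masks: list of boolean 2-D arrays marking the target object.
    """
    n_streams = len(feature_maps_per_image[0])
    ratios = np.zeros(n_streams)
    for maps, mask in zip(feature_maps_per_image, target_masks):
        for k, fmap in enumerate(maps):
            inside = fmap[mask].mean()
            outside = fmap[~mask].mean() + 1e-12
            ratios[k] += inside / outside
    return ratios / len(feature_maps_per_image)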
In general, a bottom-up algorithm does not require a priori information about the image scene and also does not require any training. It may be useful for the subjects to get directional cues from the algorithm and use those to familiarize themselves with new surroundings. To make object detection or recognition more generic, complex object recognition algorithms could be implemented at an additional computational cost. Such algorithms are considerably more complex and do not eliminate the need to train on a set of images of the object of interest. Most recognition/detection algorithms work through small portions of the image frame to detect the object of interest, which adds a large computational load to a system. However, if combined with saliency, this computational load can be limited by applying the object recognition/detection algorithm only to the salient regions to ascertain whether the desired object is present in those regions.
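The following Python sketch illustrates this strategy of confining recognition to the salient regions. The recognizer itself is left abstract (the classifier argument is a hypothetical callable returning True when the target appears in a crop); the point is that only a handful of windows around the salience peaks are examined instead of an exhaustive sliding-window search over the frame.

import numpy as np

def detect_in_salient_regions(image, salience_map, classifier, top_k=5, win=64):
    """Run an object recognizer only on the top-k salient regions.

    A crop is taken around each salience peak and tested; the neighborhood
    of each picked peak is suppressed before selecting the next one.
    """
    sal = salience_map.astype(np.float64).copy()
    hits = []
    for _ in range(top_k):
        r, c = np.unravel_index(np.argmax(sal), sal.shape)
        crop = image[max(0, r - win): r + win, max(0, c - win): c + win]
        if classifier(crop):
            hits.append((r, c))
        # Non-maximum suppression around the picked peak.
        sal[max(0, r - win): r + win, max(0, c - win): c + win] = 0.0
    return hits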
For image processing algorithms to be effective in guiding retinal prosthesis implant patients, more understanding of what kinds of functions and features are important to these patients is required. Patients will also require training to utilize the information provided by the algorithms effectively. It is not known whether patients will perform better by utilizing such additional information provided by the algorithm or whether they will prefer to have only the unfiltered video data and make their own decisions about the objects and information that they see in their visual field. With these questions still unanswered, a task dependent processing approach (bottom up or top down based on the task at hand) might be better when designing the image processing modules for the prosthesis.
In summary, a computationally efficient image processing algorithm that can be used to identify important regions and objects lying in the peripheral visual field of retinal prosthesis implantees is implemented. This algorithm could be implemented on a wearable computing platform with a camera on a pair of glasses. It could also potentially be used by low-vision patients having restricted visual fields. However, critical questions remain, such as how quickly people can learn to use the algorithm to guide them and how much benefit they will gain. To answer these questions, human subject testing was conducted with normally sighted volunteers who were provided with restricted simulated vision and who used the algorithm to perform several different tasks. These experiments and their outcomes in terms of the benefits from a saliency based cueing algorithm are discussed in the next chapter.
Chapter 4
Simulated Vision Experiments
To be an important component of the image processing module of a retinal prosthesis, it
is essential to evaluate the performance of the saliency based image processing algorithm
with respect to the requirements of the prosthesis recipients. Retinal prosthesis implantees
will only have partial vision in the central 15-20 degrees of the visual field. The loss of
peripheral information may hamper their mobility. Also, the device imparts vision using a
finite number of electrodes that act as individual pixels in image processing terminology.
This implies that the resolution of the vision that the implantees will be offered through
such a device will be very low and image perception may not be high quality. So far, visual
acuities of up to 2.2 logMAR have been recorded in retinal prosthesis implant recipients
[Caspi et al., 2009]. Several studies are being conducted with implant recipients of such a device to understand how they perceive the stimulated visual information and what kinds of tasks they can perform with this vision. Simultaneously, the research community is also focusing on studies that use human volunteers with normal vision but provide them with simulated vision through a pair of display glasses and ask them to perform tasks.
These studies are useful in studying the optimum number of pixels/electrodes required
to provide basic mobility and search skills to implantees of the prosthesis. Since the number of implantees is limited and regulated by an FDA protocol, simulated vision studies overcome the lack of a large implanted subject pool and can be conducted on a larger scale with normally sighted volunteers. Several studies conducted with visually impaired subjects, normally sighted subjects and prosthesis implantees, aimed at understanding the parameters that affect how such subjects perceive an image, are discussed here. A set of
simulated vision experiments designed to test the performance and benefits of a saliency
algorithm when normal sighted volunteers perform various tasks in different environments
are presented in this chapter.
4.1 Background
4.1.1 Studies with the Visually Impaired
A questionnaire was devised to assess how the visually impaired perceive their abilities for
moving independently in different surroundings [Turano et al., 2002]. The 127 subjects enrolled in the study rated 35 different scenarios according to how difficult or easy they found those tasks to be. The scenarios were a list of daily activities like ‘moving about in the home’, ‘walking down stairs’, ‘walking through door-ways’, ‘avoiding bumping into knee-high obstacles’, ‘walking at night’, ‘walking in crowds’, ‘stepping onto and off curbs’, ‘moving about in stores’, etc. The perceived ability of the subjects to perform a particular task can relate more closely to their actual performance than the functional ability of the eye does. The
results of the study showed that ‘moving about in the home’ was perceived by the subjects
to require the least amount of visual ability, whereas ‘walking at night’ was perceived to be the activity requiring the most visual ability. Many patients who were not legally blind, and thus had a fair amount of remaining vision, said that they had a fear of falling, and a large percentage had already had a fall. Thus, regardless of the severity of visual field loss, patients' perception of their abilities greatly influences their performance in certain tasks. The study suggests that subjects who may not be legally blind but who perceive their mobility abilities to be reduced and find certain activities to be extremely difficult may benefit from rehabilitation services.
In another study, the gaze direction of visually impaired as well as normal sighted
volunteers when both groups performed the same task was studied [Turano et al., 2001].
Knowing the preferences of a particular group of subjects when they fixate while perform-
ing different activities can help in understanding why and in what situations a particular kind of information or parameter is used more. This could help in establishing a relationship between the amount of visual information, the strategies used by the subject group and their mobility. This study included 6 visually impaired subjects with varying progressions of retinitis pigmentosa (RP) and 3 normally sighted subjects. Subjects were asked to navigate through an unfamiliar and obstacle-free route. Results showed that the RP patients scanned 3 times the area of the visual field compared to the normally sighted subjects. 87% of the fixations of the RP subjects were directed downward, at objects on the walls, or at the layout of the environment, such as intersections of the walls and the floor. In contrast, 75% of the fixations of the normally sighted subjects were straight ahead or at the final goal, which was a door. RP destroys peripheral vision first, and this study suggests that the loss of peripheral vision may hamper mobility; accordingly, the visually impaired subjects fixated more on regions of the path that would give them information about the layout of the environment, nearby obstacles and thus the safest routes, rather than on the goal.
The gaze patterns of both the visually impaired and normally sighted subjects while
they performed a complex and high-risk task of crossing intersections or streets were also
studied by Geruschat, Turano and colleagues [Geruschat et al., 2003]. The patients in
the visually impaired group suffered from age-related macular degeneration (AMD) and
glaucoma. Activities were divided into 3 parts of ‘walking to the curb’, ‘standing at the
curb’ and ‘crossing the street/intersection’. The study showed that in the 4 second time
period before crossing, the visually impaired subjects who crossed early or on time fixated
primarily on vehicles whereas normally sighted subjects crossing early fixated primarily
on vehicles but those who crossed on time by waiting for the light to change fixated on the
light. This study showed that the areas fixated on in the visual field change as vision status is affected by disease and its severity.
Velikay-Parel and colleagues studied the average time taken and the number of contacts
made when 3 groups of visually impaired subjects navigated through 3 similar mazes with
an equal number of obstacles [Velikay-Parel et al., 2007]. The 3 groups were divided on
the basis of the visual acuity. The subjects in the group with the highest visual acuity had
the largest visual field. The study showed that there was a significant difference in the
average time taken by the 3 groups, and this was influenced by the extent of the visual field and the visual acuity. Subjects with the lowest visual acuity and smallest visual field took the longest to
navigate through the maze. However, there was no significant difference observed in the
number of contacts made by the three groups with the obstacles.
Studies with RP Implantees
Performance of three subjects implanted with the epiretinal prosthesis while they performed
simple visual tasks was studied by Yanai et al. [Yanai et al., 2007]. The subjects had an epiretinal implant with a 4 x 4 electrode grid, controlled wirelessly by a head-worn camera or a computer. The tasks comprised locating and counting objects, discriminating the orientation of the letter 'L' and differentiating between 4 different directions of motion of a rectangular white bar. For object detection, subjects were required to report the presence or absence of a white object. If the object was present, they were required to identify whether its location was in the right or left visual field. For the object counting task, subjects reported whether there was one object (in the right or left visual field), two objects or no objects in their visual field. All the subjects performed
significantly better than expected by chance in 83% of the test cases. The results from the
study are very encouraging for an epiretinal prosthesis. The vision with an array of 4 x
4 electrodes is very crude and yet sufficient for the subjects to perform so many different
tasks accurately. This gives more hope for the future of the epiretinal prosthesis with arrays
having a higher number of electrodes and providing better resolution.
4.1.2 Simulated Vision Studies with the Normally Sighted
Simulated vision can be used to conduct experiments with normally sighted volunteers
by giving them partial vision when they perform different tasks. This helps get an idea
about how visually impaired subjects might perform similar tasks. Simulated vision also
offers the flexibility for researchers to work with a bigger group of normal sighted subjects
compared to a limited group of prosthesis implantees. Many studies have worked with
normal sighted volunteers to determine the ideal number of electrodes/pixels that would be
required by retinal or cortical visual prostheses to provide basic visual abilities to the blind.
Some experiments have also studied how the performance of normal sighted volunteers
degrades with changes in parameters like contrast, number of electrodes/pixels etc. Some
of these studies are discussed here.
A series of psychophysical experiments using a phosphene simulator to estimate pa-
rameters like number of electrodes and spacing between electrodes for a visual prosthesis
were conducted by Cha et al. [Cha et al., 1992a,b]. Their study suggested that an array of 600-625 electrodes placed in a 1 cm² area, corresponding to approximately 30 degrees of the visual field, might suffice to provide limited but useful and functional vision to the blind. They conducted reading experiments with an array of 25 x 25 pixels and showed that reading rates of up to 170 words/min with scrolling text and 100 words/min with fixed text can be achieved when 4 letters of text are projected onto a 1.7 degree visual field [Cha et al., 1992c]. The reading material was of grade level 4-8 and the visual stimulus was black on
white. The fixed text case required the subjects to make voluntary head movements to read
the text which reduced the overall reading speed.
Experiments were conducted for facial recognition tasks by varying different simulated vision parameters like the pixel grid size, the size of the pixel dots on the grid, the gap between the dots, the gray-scale resolution and the dot drop out rate [Thompson et al., 2003]. The dot drop out rate simulates possible electrode failures on the implanted arrays in a prosthesis. All the stimulus parameters affected the performance of subjects, who were asked to identify a set of unfiltered images using simulated vision. The subjects were shown the set of images before being provided with simulated vision. Subjects performed highly accurately regardless of the contrast level (high or low) of the simulated vision. Performance degraded substantially when the drop out rate for the pixels was set to 70%. This study showed that even with the crude vision imparted by a prosthesis, reliable facial recognition may be achieved. In another study, the possibility of prosthesis recipients
having reading abilities after the implant was investigated [Dagnelie et al., 2006]. Normally sighted subjects used simulated vision to read text paragraphs of a sixth-grade level. Reading speeds of up to 60 words/minute were observed for optimum parameter conditions. Reading speeds were influenced by all the different parameters like grid size, dot size, dot gap, gray-scale resolution and dot drop out rate. The results suggested that, despite retinal reorganization, if the subjects are able to perceive distinct phosphenes, 16 x 16 electrodes in a 3 x 3 mm² prosthesis might allow paragraph reading. In yet another study, the performance
of normal sighted subjects for mobility tasks in real world office environments as well as
virtual environments using simulated vision was studied [Dagnelie et al., 2007]. The subjects used grids of sizes 4 x 4, 6 x 10 and 16 x 16 to perform the tasks. The parameters
recorded in the study were the time, navigation errors and number of contacts with obsta-
cles. The best performance in this study was achieved when subjects used the 16 x 16 grid, but the
findings suggested that with practice and learning, a 6 x 10 grid may also suffice to provide
basic way-finding abilities to the prosthesis implantees. A study investigating the ability
of normal sighted and visually impaired subjects to adapt to phosphene images for a cor-
tical prosthesis showed that patients might be able to perform tasks that involve eye-hand
coordination, visual inspection and way finding with as few as 325 electrodes/phosphenes, given practice [Srivastava et al., 2009].
The performance of normal sighted subjects was evaluated when they performed tasks
requiring eye-hand coordination, object identification and reading using simulated vision
[Hayes et al., 2003]. Again, the 4 x 4, 6 x 10 and 16 x 16 grids were used for simulated
vision. The eye-hand coordination tasks included pouring candy from one cup into an empty cup and cutting around the edges of a white center on a black square. The
object identification task required the subjects to identify whether the object placed in front
of them on a table was a spoon, cup, plate or pen. An orientation identification task required
the subjects to discriminate the orientation of a tumbling ’E’. For the reading tasks, subjects
could read fonts as small as 36 point with a 16 x 16 grid. Overall, the best performance
was achieved with a 16 x 16 grid, but a 4 x 4 grid sufficed for identifying simple objects
and symbols.
A number of studies on simulated vision and portable image processing modules exe-
cuting image processing algorithms in real-time have been done by Lovell and colleagues.
The effects of practice on the visual fixations, saccades and smooth pursuits of subjects performing a visual tracking task with simulated vision were assessed, with the aim of improving the subjects' perception of the image [Hallum et al., 2005]. Subjects were provided
with artificial phosphene vision using different sampling and filtering schemes. The results
from the study suggested that the performance of the subjects improved when they were
using overlapping Gaussian kernels for the filter scheme. A study with simulated vision
experiments for identifying the orientation of the Landolt 'C' showed that the performance of subjects was significantly better using a hexagonal arrangement of the phosphene grid rather than the rectangular arrangement traditionally used in other studies [Chen et al., 2009a]. Another study suggested that the phosphenes should be round and Gaussian filtered, should have about 8-16 gray levels, and should be rendered dynamically, refreshing with head scanning movements [Chen et al., 2009b]. The study also showed that interactions between neighboring phosphenes should be represented, that an HMD should be used, and that ambient light should be blocked for a more accurate visual model.
4.2 Set up for Experimental Designs for Testing the Saliency Algo-
rithm
For all the simulated vision experiments discussed in this thesis, subjects wore an eMagin
Z800 Head Mounted Display (HMD) from Arrington Research Inc., USA, on which the
simulated vision pixels are displayed. A scene camera with a field of view close to 60
degrees is also mounted on the HMD. The diagonal field of view of the HMD is about
40 degrees. Figure 4.1 shows a subject wearing the HMD system. Subjects also wore a
shroud to block their natural peripheral vision when performing experimental tasks. For a
majority of the experiments, one of the eyes of the subjects was patched. Incoming camera
image information was converted into a grid of pixels for simulated vision using custom software. Simulated vision was provided in the central diagonal 14 degrees of the HMD in the form of a 6 x 10 pixel grid to emulate the current version of epiretinal implants having 60 electrodes. The rest of the display pixels were set to zero or black. The central 14 degrees of visual information from the image captured by the scene camera was extracted and reduced to an arrangement of 60 circular pixels. The gap between two pixels was 0.5 degrees in the horizontal direction and 0.8 degrees in the vertical direction. The number of gray levels for the simulated vision representation was set to 8. Random electrode drop outs of 30% were simulated to account for failed electrodes in actual prosthesis implants. In any one session of testing, the randomly dropped pixels did not change, but they varied between sessions. A study using simulated vision for navigation tasks showed that for normally sighted subjects, performance begins to degrade slightly at a 30% electrode drop out [Dagnelie et al., 2007]. With a healthy retina, normally sighted volunteers can adapt quite quickly to the reduced and restricted vision; to make the tasks slightly more challenging for such volunteers, a drop out rate of 30% was chosen.
The IS 1200 VisTracker from Intersense Inc., was attached onto the HMD module to record
the head movements of the subjects while they performed the different tasks.
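For reference, a minimal Python sketch of this conversion from an already cropped camera frame to the simulated phosphene grid is given below. It approximates, rather than reproduces, the custom software used in the experiments: block averaging stands in for the actual pixelization, the 8 gray levels are obtained by uniform quantization, and the fixed 30% drop out is reproduced by seeding the random number generator once per session. Function and parameter names are illustrative.

import numpy as np

def simulate_phosphenes(gray_frame, rows=6, cols=10, levels=8, dropout=0.3, seed=0):
    """Reduce a cropped grayscale frame to a 6 x 10 simulated phosphene grid."""
    h, w = gray_frame.shape
    bh, bw = h // rows, w // cols
    blocks = gray_frame[:rows * bh, :cols * bw].astype(np.float64)
    grid = blocks.reshape(rows, bh, cols, bw).mean(axis=(1, 3))       # block averages
    grid = np.floor(grid / 256.0 * levels) * (255.0 / (levels - 1))   # 8 gray levels
    rng = np.random.default_rng(seed)          # fixed seed -> fixed drop outs within a session
    grid[rng.random((rows, cols)) < dropout] = 0.0                    # ~30% dead "electrodes"
    return grid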
Figure 4.1: Head Mounted System for displaying simulated vision and a scene camera to
capture the real-world information
For the cueing system, the scene camera image frames were input to the image pro-
cessing algorithm and salient regions were detected. Based on the location of these salient
regions with reference to the central vision, directional cues in the form of visual cues were
provided to the users. The cues were given in one of 8 directions: top, down, left, right,
top-right, top-left, bottom-left or bottom-right. The cues were in the form of blinking dots
outside the periphery of the simulated vision. The algorithm detected up to 5 salient regions
for the users. The cueing system was implemented as an on-demand mechanism. The sub-
jects were required to ask for cues and keep their head relatively steady once they requested
the cues. The algorithm automatically provided them with the visual cue in the form of the
blinking dot in the direction of the most salient region.
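A minimal Python sketch of the directional cueing rule is given below. It encodes only what is described above: the offset of the most salient region from the center of the simulated-vision field is quantized into one of the 8 cue directions. The names and the 45-degree binning are illustrative.

import math

DIRECTIONS = ["right", "top-right", "top", "top-left",
              "left", "bottom-left", "bottom", "bottom-right"]

def cue_direction(salient_xy, center_xy):
    """Quantize the offset of a salient region into one of 8 cue directions.

    Image y grows downward, so the sign of dy is flipped before the angle
    is computed; the angle is then binned into 45-degree sectors.
    """
    dx = salient_xy[0] - center_xy[0]
    dy = center_xy[1] - salient_xy[1]                 # flip: up is positive
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    return DIRECTIONS[int((angle + 22.5) // 45) % 8]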
4.3 Experiments
Results from 5 different experimental designs evaluating the performance of the saliency algorithm, used as a cueing aid for subjects performing tasks with simulated vision, are presented here.
4.3.1 Finding Objects on an Uncluttered Table Top
Subject Population
After approval from the Institutional Review Board (IRB) at the University of Southern California (USC), 7 normally sighted volunteers were enrolled and a signed informed consent was obtained from each participant. Subjects were required to be English speaking with reading knowledge, at least 18 years of age, with no history of vertigo, motion sickness or claustrophobia, no cognitive or language/hearing impairments, and a visual acuity of 20/30 or better with normal or corrected vision. Visual acuity testing for each participant was done in the lab using a Snellen visual acuity eye chart.
Methods
7 subjects participated in the study and were given simulated vision of 6 x 10 pixels with
a 30% electrode drop out and 8 levels of gray. For this experiment, the subjects were
given vision in the central 10 degrees of the HMD and were seated at a desk. As part of
the experiment, 1, 2 or 3 objects were placed on the desk. Subjects were asked to find the
objects on the desk by first using head movements to scan around the desk and then by using
cues from the algorithm. 6 trials were conducted for each of the 1, 2 and 3 object cases
for a total of 18 trials for each of the no cue and cueing cases. The parameters measured
were the total head movements (summed up for the horizontal and vertical directions) and
the time taken to finish the task. The set-up for the experiment is shown in figure 4.2. The
area within the red square is the central 10 degree area of the scene camera image frame.
Figure 4.2: Set up for the object finding task
This information is then down sampled to form a 6 x 10 pixel formation. The yellow box
depicts the field of view of the HMD. The circular roll of tape in the center of the desk
acted as a reference for the subjects to start off and finish each trial. For the cueing case,
subjects came back to the tape after finding each object on the desk. For cueing, subjects
were required to wait for a cue before finding each object and then follow the direction of
the cue to find the object. For the no cueing case, subjects did not have any help from the
algorithm and were required to use free scanning head movements across the desk to find
the object(s).
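For clarity, a short Python sketch of how the measured head movement parameters can be derived from the orientation tracker data is given below. It assumes the tracker supplies yaw (horizontal) and pitch (vertical) angles in degrees at a fixed sample rate; the exact logging format of the VisTracker is not reproduced, and the names are illustrative.

import numpy as np

def head_movement_summary(yaw_deg, pitch_deg, sample_rate_hz):
    """Total head movement (degrees) and per-second horizontal displacement."""
    yaw, pitch = np.asarray(yaw_deg, float), np.asarray(pitch_deg, float)
    # Total head movement: summed absolute angular change in both directions.
    total = np.abs(np.diff(yaw)).sum() + np.abs(np.diff(pitch)).sum()
    n = max(int(sample_rate_hz), 1)
    seconds = len(yaw) // n
    # Per-second "velocity" taken here as the absolute displacement within each second.
    vel_x = [abs(yaw[(i + 1) * n - 1] - yaw[i * n]) for i in range(seconds)]
    return total, vel_x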
Results
The results for the total head movements in degrees and the time taken to finish each of the
1, 2 and 3 object tasks are shown in figures 4.3 to 4.10.
Figures 4.3 to 4.5 show the head movements in degrees averaged over all subjects for
each trial, for the 1, 2 and 3 object cases. It can be seen from the graphs that for all three cases, the head movements for the no cueing case are much higher than for the cueing case. The average head movements of all subjects for the no cueing and cueing trials, for each of the 1, 2 and 3 object cases, are shown in the form of a bar graph in figure 4.6. It
can be observed that the average head movements are much higher for the no cueing trials
than the cueing trials for all the 1, 2 and 3 object cases. A statistical paired t-test analysis
(p<0.05) for the no cueing and cueing trials in each of the 1, 2 and 3 object cases shows
that the total head movements are significantly higher for the no cueing trials than for the
cueing trials.
Figures 4.7 to 4.9 show the time taken in seconds averaged over all subjects for each
trial for the 1, 2 and 3 object cases. For all three cases, the time taken for the no cueing trials is also much higher than for the cueing trials. The average time taken by all subjects for
the no cueing and cueing trials in each of the 1, 2 and 3 object cases is shown in the form
of a bar graph in figure 4.10. Again, it can be observed that the average time taken is much
higher for the no cueing trials than for the cueing trials for all the 1, 2 and 3 object cases. A
statistical paired t-test analysis (p<0.05) for the no cueing and cueing trials in each of the
1, 2 and 3 object cases shows that the time taken is significantly higher for the no cueing
trials than for the cueing trials.
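The within-subject comparisons reported above can be written compactly; the following Python sketch runs the paired t-test on one value per subject per condition (for example, head movements or time averaged over that subject's trials), using scipy. The data values in the commented example are hypothetical placeholders.

from scipy import stats

def compare_conditions_paired(no_cue_per_subject, cue_per_subject, alpha=0.05):
    """Paired t-test between the no cueing and cueing conditions (one value per subject)."""
    t_stat, p_value = stats.ttest_rel(no_cue_per_subject, cue_per_subject)
    return t_stat, p_value, p_value < alpha

# Example with hypothetical per-subject averages (degrees of head movement):
# compare_conditions_paired([410, 380, 450, 395, 420, 405, 430],
#                           [150, 175, 160, 140, 155, 170, 165])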
Discussion
Although this experiment is designed to have a very controlled environmental set up, it
gives a basic idea of the usefulness of a cueing system in guiding subjects to find different
objects. Blind subjects usually keep their personal areas relatively uncluttered. In such
Figure 4.3: Head movements in degrees averaged over 7 subjects for each trial for the no
cueing and cueing cases with 1 object
Figure 4.4: Head movements in degrees averaged over 7 subjects for each trial for the no
cueing and cueing cases with 2 objects
Figure 4.5: Head movements in degrees averaged over 7 subjects for each trial for the no
cueing and cueing cases with 3 objects
Figure 4.6: Average head movements and standard error of mean (sem) for the 1, 2 and 3
object cases
Figure 4.7: Time in seconds averaged over 7 subjects for each trial for the no cueing and
cueing cases with 1 object
Figure 4.8: Time in seconds averaged over 7 subjects for each trial for the no cueing and
cueing cases with 2 objects
Figure 4.9: Time in seconds averaged over 7 subjects for each trial for the no cueing and
cueing cases with 3 objects
Figure 4.10: Average time and standard error of mean (sem) for the 1, 2 and 3 object cases
uncluttered environments, a cueing system that can point subjects to objects or regions of interest can be greatly beneficial in reducing the time taken to find the objects and the total head movements. Head scanning is found to be beneficial for blind and
visually impaired subjects to understand the object(s) that they are looking at. However, in
unknown surroundings, head scanning can lead to confusion and fatigue. The videos from
this experiment show that when performing the task without the help of cues, subjects tend
to scan the entire desk several times in several directions to gather information about the
desk set up and the objects. The scanning strategies adopted by the subjects also differed from one another. A minority of the subjects had organized head movements; the majority did not, sometimes scanning the same area repeatedly while not scanning other areas at all. In such scenarios,
subjects had to continuously scan for a longer time until they found the objects. A cue-
ing mechanism can benefit the subjects by restricting the span of their scanning area to a
few different regions of interest. Different scanning patterns of subjects might affect their
performance for mobility related tasks also.
4.3.2 Mobility Task with Similar Looking Obstacles
Subject Population
This study was a pilot study conducted with 3 normal sighted volunteers from the retinal
prosthesis lab (who were familiar with this project) and 1 naive normal sighted volunteer.
Subjects had a visual acuity of 20/30 or better with normal or corrected vision. Visual
acuity testing for each subject was done in the lab using a Snellen visual acuity eye chart.
Methods
4 subjects participated in the study and were given simulated vision in the form of 6 x
10 pixels with a 30% electrode drop out and 8 levels of gray, in the central 10 degrees
of the HMD. Subjects also wore a shroud to block their natural peripheral vision. For this
experiment, the subjects had to navigate past an arrangement of chairs in an otherwise empty room and find a target on the wall at the end of the chair arrangement. There were a total of 13 chairs, and the room was a relatively empty 15 x 15 m² space. The target on the wall was a red rectangle about 61 x 72 cm² in size. Subjects started the trial on one side of
the chair arrangement and navigated through the path. The subjects were not allowed to
see the arrangement in between trials. The chair arrangement was not changed between
trials but the target was moved around on the wall. The starting position of the subjects
was also changed for every trial. The subjects were allowed to familiarize themselves with
the simulated vision system for a few minutes before the trial began. For the initial trials,
only time taken to complete the task was recorded for the subjects. For the later trials, head
movements in the horizontal (X) and vertical (Y) direction were also recorded. The number
of trials varied between 8 and 15 for the different subjects. Figure 4.11 shows an example
of the environmental set up (top image) for this experiment as well as the corresponding
simulated vision image (bottom image). The red square shows the central 10 degrees of the
Figure 4.11: Chair and target set up for mobility task
scene camera image which is reduced to 60 pixels for the simulated vision as shown in the
image below.
Results
Besides the time taken to finish the task, the head movement velocities in the horizontal
(X) and vertical (Y) directions and the total head movements (X + Y) were also recorded.
Because the number of trials differed between subjects, the data was not averaged across subjects for each trial. The graphs show the data for all the subjects for each trial
separately. Figures 4.12 and 4.13 show the head movement velocities in the X and Y
directions when subjects performed the task with and without cues. Figure 4.14 shows the
total head movements in degrees for the no cueing and cueing trials. Figure 4.16 shows
the time in seconds taken by the subjects to navigate past the chairs and find the target
on the wall with and without cues. For figure 4.12, the velocities with values greater than
45 degrees/second have been clipped in the graph to a value of 45 degrees/second. This
has been done to give more clarity to the other data points with lower values in the graph.
Figures 4.15 and 4.17 are the bar graphs for the head movements in degrees and time in
seconds for the no cueing and cueing trials averaged over all subjects. The error bars
represent the standard error of mean of the data sets.
It can be observed from figures 4.12, 4.13 and 4.14 that for both the horizontal and vertical head velocities and for the total head movements, the inter-subject variability is quite high. No learning trend is observed for the mobility task in the head velocities in either direction or in the total head movements, for either the cueing or the no cueing case. A paired t-test analysis (p<0.05) suggests that the velocities in both the horizontal (X) and vertical (Y) directions are significantly lower for the cueing trials than for the no cueing trials.
A paired t-test (p<0.05) analysis shows that the total head movements (sum of horizontal
and vertical head movements in degrees) are significantly less for the cueing trials than for
the no cueing trials.
Figure 4.16 shows that again, none of the subjects exhibit a learning curve for the time
taken to finish the trials. A paired t-test (p<0.05) between the no cueing and cueing trials
for all the subjects shows that there is no significant difference in the time taken between
the cueing and no cueing trials.
Figure 4.12: Head movement velocity in the horizontal direction for all subjects
Figure 4.13: Head movement velocity in the vertical direction for all subjects
Figure 4.14: Total head movements in degrees for all the subjects for the no cueing and
cueing trials
Figure 4.15: Bar graph representing the average head movements in degrees over all trials
and all subjects for the no cueing and cueing cases along with the standard
error of mean (s.e.m)
Figure 4.16: Time in seconds to finish the task for all subjects
Figure 4.17: Bar graph representing the average time in seconds over all trials and all
subjects for the no cueing and cueing cases along with the standard error of
mean (s.e.m)
Discussion
For this experiment also, the head movement behavior of the subjects showed that subjects
adopted different scanning strategies to perform the trials. Using a cueing system compared
to natural head scanning movements may not be as intuitive to normal sighted subjects
because of the ability of their healthy retina to adapt very quickly to the reduced vision.
How quickly subjects become acquainted with the cueing system also differs significantly, which could be another factor behind the inter-subject variability observed. The room in this experimental set up was a big open space and the chairs were arranged in only one part of the room; as a result, subjects often wandered out of the boundaries of the chair arrangement and needed more time and more head movements to realize that they were not on the right path. Such trials also contributed to the variability observed in the head movement velocities and time. With cueing, when the subjects wandered off, they would realize within 2-3 cues that they were not near the chairs and were heading in the wrong direction, and they would not waste much more time in the off-boundary areas. The cueing mechanism and the processing related to it also takes a finite amount of
time. The subject stops walking when a cue is required, then asks for the cue, the image
processing algorithm processes the camera frame and gives out the preferred direction of
interest in the form of a visual cue and the subject then decides if he/she wants to follow
the cue or ask for more cues. This entire process adds time to the cueing trials compared to
the no cueing trials.
4.3.3 Mobility Task in an Office Area
Subject Population
The same subjects who participated in the previous experiment participated in this study.
The subjects were 3 normal sighted volunteers from the retinal prosthesis lab (who were
familiar with this project) and 1 naive normal sighted volunteer. Subjects had a visual
acuity of 20/30 or better with normal or corrected vision. Visual acuity testing for each
subject was done in the lab using a Snellen visual acuity eye chart.
Methods
4 subjects were given simulated vision in the form of 6 x 10 pixels in the central 10 degrees
of the HMD with a 30% electrode drop out and 8 levels of gray. Subjects also wore a shroud
to block their natural peripheral vision. For this experiment, subjects were required to find
4 targets in a lobby environment. The set up area had 6 doors, plants, furniture like sofa sets and tables, and the target objects. Subjects were required to find the 4 targets after entering the testing area. The targets were 2 doors, a black rectangular target about 50.6 x 26.4 cm² in size and a table about 2 feet in height. The set up was not shown to them at any time. Once in the lobby environment, subjects decided for themselves how much to walk around the area. With a healthy retina, normally sighted volunteers would quickly learn the arrangement of objects for a set-up in a small and confined area even when using simulated vision. Hence, one reason behind choosing doors as targets was to create a certain degree of orientation loss for the subjects. The arrangement of certain pieces of furniture and of the targets, except the doors, was changed in every trial. The time taken to find the targets and finish the task, the head movement velocities in the horizontal (X) and vertical (Y) directions and the total head movements in degrees were recorded. The number of trials varied between 1 and 5 for the
different subjects. Figure 4.18 shows an example of the environmental set up (top image)
for this experiment as well as the corresponding simulated vision image (bottom image).
The red square shows the central 10 degrees of the scene camera image which is reduced
to 60 pixels for the simulated vision in the image below.
Results
The graphs in figures 4.19, 4.20 and 4.21 show that the head velocities in both directions
and the total head movements are much higher for the no cueing trials than for the cueing
trials. Subject 3 was able to finish only 2 trials because of confusion and fatigue. As
seen from the timing graph in figure 4.23, the subject took almost 10 minutes to finish
Figure 4.18: Experimental setup and simulated vision
the second trial without cues and could not continue with the trials because of the fatigue
resulting from that trial. The head movement data for one of the trials for subject 3 did not
get recorded and hence that data point is not available. The data points in the graph for the
horizontal head velocities with values greater than 45 degrees/second have been clipped to
a value of 45 degrees/second for the sake of clarity of the graph for the rest of the data
points. Figures 4.22 and 4.24 are the bar graphs for the head movements in degrees and
time in seconds for the no cueing and cueing trials averaged over all subjects. The error
bars represent the standard error of mean of the data sets.
Statistical analysis in the form of a paired t-test (p < 0.05) for the head velocities in
both the horizontal and vertical direction and the total head movements shows that the head
velocities and head movements for the no cueing trials are significantly higher than for the
cueing trials. A statistical paired t-test analysis (p<0.05) for the time shows that there is
Figure 4.19: Head movement velocity in the horizontal direction for all subjects
no significant difference between the time taken by the subjects to finish the no cueing and
cueing trials.
Discussion
This experiment tested the performance of subjects when they were asked to find a set of
targets in a real-world environment. Amongst all the experimental designs discussed in this
chapter, this design was one of the first in a real-world set up to evaluate the benefits of a
saliency based cueing algorithm to the subjects. The number of trials in this design was limited, but again, no learning was observed for the subjects in the few trials that they per-
formed. For the set-up in this study, subjects found it more difficult to distinguish between
objects. In the previous experiments, there were limited objects or easily distinguishable
objects like white chairs. But in this design, the targets of interest blended very well with
the natural environment which made it difficult for the subjects to understand the objects
Figure 4.20: Head movement velocity in the vertical direction for all subjects
Figure 4.21: Total head movements in degrees for all the subjects for the no cueing and
cueing trials
Figure 4.22: Bar graph representing the average head movements in degrees over all trials
and all subjects for the no cueing and cueing cases along with the standard
error of mean (s.e.m)
Figure 4.23: Time in seconds to finish the task for all subjects
Figure 4.24: Bar graph representing the average time in seconds over all trials and all
subjects for the no cueing and cueing cases along with the standard error of
mean (s.e.m)
that they were looking at, especially in the no cueing trials. With no cueing, subjects kept
scanning continuously at times without being able to distinguish between the different ob-
jects, their sizes and locations and thus found some of the trials to be very confusing. With
the cueing trials, the subjects would feel confident that the algorithm was cueing them to some object, and they would try to pay more attention to understand what they were being cued to. However, it was also observed that, with the cueing being based on a bottom-up saliency algorithm, sometimes the targets were not the first cue but instead were ranked between 2 and 5. In such scenarios, if the first cue at a given position did not point towards a target and the subjects moved away without asking for more cues, they would miss the object and take longer to find it. This experiment suggested that when subjects are searching for specific
objects, it would be best to combine the bottom-up algorithm with some sort of top-down
processing for object recognition. This would help in ranking the cues such that priority is
given to the direction of the targets of interest.
4.3.4 Mobility Task in a Corridor
Subject Population
After the approval from the Institutional Review Board (IRB) at the University of Southern
California (USC), 10 normally sighted volunteers were enrolled and a signed informed
consent was obtained from each participant. Subjects were required to be English speaking with reading knowledge, at least 18 years of age, with no history of vertigo, motion sickness or claustrophobia, no cognitive or language/hearing impairments, and a visual acuity of 20/30 or better with normal or corrected vision. Visual acuity testing for each participant was done in the lab using a Snellen visual acuity eye chart. As part
of the informed consent, subjects were informed about the kind of tasks that they would be
performing. However, the subjects were not provided with any information about the kind
of parameters that were being recorded, the related analysis and the end goal of the study.
Subjects were naive to the set-up of the study as well as to the aim of the study.
Methods
Subjects were required to navigate past obstacles in a corridor and find a target sign on the
wall at the end of the navigation path. The corridor was about 9.5 x 2.5 m². Simulated
vision was in the form of a 6 x 10 pixel grid in the central 14 degrees of the HMD and
the subjects wore a shroud to block their natural peripheral vision. One of the eyes of the
subjects was patched to simulate retinal prosthesis implant subjects who are likely to have
the implant only in one eye. Figure 4.25 shows the corridor set up for the experiment and
figure 4.26 shows an image from the scene camera and corresponding simulated vision. All
the subjects were given one practice session in another completely different environment
where they familiarized themselves with using simulated vision and also with the cueing
mechanism. For testing on the mobility course, the starting position of all the subjects
remained the same for each trial. Some of the obstacles and the sign were moved around
and re-arranged at the beginning of a new trial. The only information provided to the
subjects about the mobility course was that the end of the path was indicated by the presence
of a very bright object. The subjects were instructed to stop at the bright object and then
look at the wall to find the sign. They were provided with this information because it was
difficult with the simulated vision for them to know that they were walking into or towards
a wall especially because they did not use a cane to guide their path. If the subjects started
walking towards a wall, they were immediately stopped.
This experiment was conducted in 2 phases. For the first phase, the 10 subjects were
randomly divided into two groups. One group performed the task without the help of any
cues and the other performed the task using cues from the bottom-up saliency algorithm.
Each subject came for two sessions within a week and completed 15 trials in each session
for a total of 30 trials. Thus, for phase 1, data for a total of 150 trials was recorded for each of the no cueing and cueing groups. All the subjects were given a break for about 10-15
minutes half-way through the session.
Figure 4.25: Corridor set up for mobility testing
The second phase was conducted at least 3 weeks after the completion of the first phase
for each subject. For the testing in this phase, the groups were reversed. Subjects who had
performed the task with no cueing in phase 1, were now in the cueing group and vice versa.
In this phase, only 1 session was conducted, with 15 trials for each subject. The data set for phase 2 comprised 75 trials for each group of subjects. Again, the subjects were given a
break for about 10-15 minutes half-way through the session.
The measured parameters were the cumulative head movements in the horizontal and
vertical directions, the time taken to finish the task and the number of errors. Errors were counted when subjects bumped into objects, ran into walls, identified the sign incorrectly, or asked a question such as 'Is this the bright box?' to which the answer was no.
Results
Figures 4.27, 4.28 and 4.29 show the cumulative head movements (the sum of head move-
ments in horizontal and vertical directions) for this experiment. Figure 4.27 shows the total
Figure 4.26: Scene camera and simulated vision view
head movements for the no cueing and cueing cases for each trial in phase 1 and phase 2
averaged over the 5 subjects. In phase 1, learning is evident after session 1 (15 trials), after which the performance plateaus. An unpaired t-test (p<0.05) between the no cueing
and cueing trials for phase 1 shows that the head movements for the cueing case are sig-
nificantly lower than for the no cueing case. Individual analysis of session 1 and session
2 from phase 1 shows that the head movements are significantly less for the cueing case
in session 1 but for session 2, there is no significant difference between the no cueing and
cueing trials. The averages of head movements over all trials and subjects for session 1
and session 2 of phase 1 are shown in the form of a bar graph in figure 4.28. The standard
error of mean (s.e.m) values are also lower for the cueing case than for the no cueing case.
Because of the learning from phase 1, when the subjects come back for phase 2, the overall
Figure 4.27: Head movements averaged over all subjects for the corridor mobility experi-
ment
Figure 4.28: Average head movements for session 1 and session 2 of phase 1
values of the head movements are lower compared to phase 1 but again, an unpaired t-test
(p<0.05) shows that the total head movements for the cueing case are significantly lower
than for the no cueing case. Analyzing the data in groups of 5 trials, the difference between
the no cueing and cueing groups is the most significant in the first 5 trials of phase 1 and the
first 10 trials of phase 2. An overall comparison between the no cueing and cueing averages
for all the trials over all the subjects and the respective standard error of means (s.e.m) for
phase 1 and phase 2 are shown in the bar graph in figure 4.29.
Figure 4.29: Average head movements for phase 1 and phase 2
Figure 4.30 shows the raw data for the horizontal head movements from trials 1, 5 and
9 of phase 1 stacked for the 5 subjects in the cueing and no cueing groups. The head
movement data is relative to the starting positions of the subjects. For the no cueing trials,
there are frequent right to left and left to right head movements compared to the cueing
trials. This confirms the qualitative observation that with no cues, subjects continuously scan different areas of the corridor to gather information about objects lying on either side of them, and they do so frequently, which adds to the total head movements. Figure
4.31 shows the horizontal head velocities stacked from trials 1, 5 and 9 of phase 1 for the
5 subjects in each of the cueing and no cueing groups. Head velocities are calculated for
every second and the graph thus represents the absolute horizontal head displacement of the
subjects for every second. The horizontal head velocities, and hence the head displacements every second, are greater for the no cueing trials than for the cueing trials, implying that subjects might be trying to gather more information than required when performing the trials without cueing. The results in figures 4.30 and 4.31 are for trials 1, 5 and 9 in
phase 1. As the number of trials increases, subjects get familiar with the environmental
setup and the variations in the movements decrease, making the graphs smoother for the
no cueing trials. The graphs are shown only for the horizontal head movements and velocities because the movements and velocities in the vertical direction are not as large or as frequent as in the horizontal direction. The reason is that the subjects look downwards for most of the trial, knowing that the objects, except the sign on the wall,
are placed on the floor. Subjects also do not make huge vertical head movements in the
upward direction because of the presence of the ceiling. They certainly do make some
head movements vertically to gauge the height of the obstacles but these movements are
not as high and variable as the horizontal head movements made across the room to find
the different objects that are present.
Results for the time taken by the no cueing and cueing groups to finish the trials are
shown in figures 4.32, 4.33 and 4.34. Figure 4.32 shows that the time taken for the no cueing trials is less than for the cueing trials in both phase 1 and phase 2. As was observed for the head movement data, learning of the environment and learning how to identify different objects result in the times in phase 2 being lower than in phase 1 for both the no cueing and cueing trials. A statistical unpaired t-test (p<0.05) analysis
between the no cueing and cueing trials for session 1 and session 2 of phase 1 and also
the same test between phase 1 and phase 2 shows that the time taken to finish the no
cueing trials is significantly less than the time for the cueing trials (Figures 4.33 and 4.34).
Analyzing the data in groups of 5 trials, the difference between the no cueing and cueing
groups is the most significant in all the trials in phase 1 and the last 10 trials in phase 2.
Results for the number of errors made by the subjects for the no cueing and cueing
trials are shown in figures 4.35, 4.36 and 4.37. For both the sessions 1 and 2 in phase 1,
Figure 4.30: Horizontal head movements for trials 1, 5 and 9 in phase 1 for the no cueing
and cueing groups
Figure 4.31: Horizontal head movement velocities for trials 1, 5 and 9 in phase 1 for the
no cueing and cueing groups
Figure 4.32: Time averaged over all subjects for the corridor mobility experiment
Figure 4.33: Average time in seconds for session 1 and session 2 of phase 1
Figure 4.34: Average time in seconds for phase 1 and phase 2
Figure 4.35: Number of errors averaged over all subjects for the corridor mobility experi-
ment
a statistical unpaired t-test (p<0.05) analysis shows that the number of errors made by the
cueing group is significantly lower than the errors made by the no cueing group. However,
when subjects come back for phase 2 and reverse their groups, the same analysis shows that
there is no significant difference between the number of errors made by the no cueing and
cueing groups. This is an expected result because by the end of phase 1, the subjects are
very familiar with the environmental set up and have learnt to recognize different objects, walls, carpets etc. This results in fewer occurrences, irrespective of the no cueing or cueing group, of subjects running into walls, bumping into objects they cannot distinguish, or identifying objects incorrectly. Analyzing the data in groups of 5 trials,
the difference between the no cueing and cueing groups is the most significant for the first
10 trials in phase 1 and is insignificant for all the trials in phase 2.
Figure 4.36: Number of errors for session 1 and session 2 of phase 1
Figure 4.37: Average number of errors for phase 1 and phase 2
On average, the number of cues required by the subjects in the cueing group to finish
the trials was 8 in both phase 1 and phase 2.
Discussion
The strategies for both the no cueing and cueing groups were different for this experiment
also. The subjects in the no cueing group followed a disorganized scanning pattern in the initial trials, scanning very fast and in a random fashion, which caused confusion for a few subjects. Subjects also could not make out the difference between obstacles, walls, the carpet, darker obstacles etc. With no cueing, the subjects quickly learnt that the darker pixels in their simulated vision for a majority of the areas represented the carpet. Most subjects, once they realized this, stopped searching for the obstacles and instead kept walking along the black patches on the path. The black patches did not always correspond to the carpet areas, occasionally leading the subjects to bump into obstacles. Because the obstacles were chairs and boxes that would not hurt the subjects on contact, the subjects did not have a fear of bumping into something. With cueing, the subjects asked for cues, followed the object
towards which the cue was directed and then asked for the next cue. They kept moving
from one obstacle to the next till they found the bright box at the end of the path. Then
looking at the wall, they asked for more cues to find the target sign. This path was one of
the longest paths in all the experimental designs and required subjects in the cueing group
to wait frequently and ask for cues. Using cues, the subjects typically followed the path
indicated by the cues and this added substantial time to the cueing trials. Cueing had its
advantage also because of the forced path. Subjects did not get lost as much and were more
confident about the objects and their placement and hence had fewer errors and contacts
right from the start of the experiment. Also, because of the organized walking path due to
the cues, the subjects were not required to scan unnecessarily around the whole corridor
but only around the areas indicated by the directional cues where they were headed to. This
resulted in the significantly less head movements observed for the cueing group. For the no
cueing group, subjects did not have a clear idea and reference of the areas where they were
scanning and sometimes unknowing started scanning in the direction of the ceilings and the
walls. A few times subjects even turned around and started walking in the reverse direction
without realizing it. Qualitatively, most subjects preferred the cueing trials to the no cueing
trials because of the confidence given to them by the algorithm cues. Some subjects who
performed the cueing trials in phase 1 were a little anxious about the no cueing trials in
155
phase 2. In some of the no cueing trials they even wished that they had the help of cues to
guide them and avoid confusion.
The results from the head movement analysis for this experiment offer some interesting insights. As subjects repeatedly performed more trials, the difference between the performance of the no cueing and cueing groups became insignificant, specifically in session 2 of phase 1. This can be attributed to the learning of environmental factors along with the adaptation to simulated vision and to the cueing or no cueing mechanism of performing the task. For head movements, time and errors, the phase 2 values of the respective measured parameters were lower than the phase 1 values because of the environmental and task learning effects in phase 1. For the head movement analysis in phase 2, the cueing group again showed significantly fewer head movements than the no cueing group. If the performance of subjects were influenced only by environmental learning, the difference between the total head movements of the no cueing and cueing groups would not have been significant even in phase 2. In phase 2, the no cueing group consisted of the subjects who had performed the first phase with the help of cues and who had followed paths directed by the cueing algorithm. In phase 2, when these subjects did not have any help from the algorithm, they felt quite lost and had greater head movements than the cueing group. The cueing group in phase 2 consisted of the subjects who had performed the trials in phase 1 without the help of cues. Many of these subjects had even figured out tricks to avoid obstacles by following only the darker patches, which in most cases corresponded to the carpeted areas. But when they performed the trials in phase 2 with the help of the cueing algorithm, they had fewer head movements than the no cueing group in phase 2. We can infer from this experiment
that when subjects are in new surroundings, a cueing system might be helpful in organizing their mobility path and head movements, resulting in less confusion about the new areas. A cueing system may also benefit users by reducing the number of errors that they make. However, cueing may actually add time to a mobility task without a corresponding benefit, because such a task requires multiple cues and the algorithmic and cognitive processing time associated with each cue adds to the trial duration.
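To make that penalty concrete with a rough, hedged estimate based only on figures reported in this work (an average of about 8 cues per cueing trial and the roughly 1 second per-frame processing time of the on-demand DSP implementation; the per-cue waiting and interpretation time, denoted t_wait here, was not measured separately):

added time per trial ≈ 8 × (1 s + t_wait) ≥ 8 s,

before any time spent stopping and re-orienting while following a cue is counted.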
4.3.5 Desk Task to Search for a Target
Subject Population
After the approval from the Institutional Review Board (IRB) at the University of Southern
California (USC), 6 naive normally sighted volunteers were enrolled and a signed informed
consent was obtained from each participant. Subjects were required to be English speaking with reading knowledge, at least 18 years of age, with no history of vertigo, motion sickness or claustrophobia, no cognitive or language/hearing impairments, and a visual acuity of
20/30 or better with normal or corrected vision. Visual acuity testing for each participant
of the study was carried out in the lab using a Snellen visual acuity eye chart. As part of
the informed consent, subjects were informed about the kind of tasks that they would be
performing. However, the subjects were not provided with any information about the kind
of parameters that were being recorded, the related analysis and the end goal of the study.
Subjects were naive to the set-up of the study as well as to the aim of the study.
Methods
The 6 subjects were again randomly and equally divided into two groups - no cueing and
cueing. The cueing was based on the saliency algorithm with the top-down information
about the target of interest as discussed in Chapter 3. The target for this experiment was
a red colored coke can. Subjects were given simulated vision in the form of a 6 x 10 pixel grid in the central 14 degrees of the HMD and wore a shroud. One eye of each subject was also patched. The desk set-up consisted of various objects such as books, notepads, a computer monitor, a computer keyboard, a computer mouse, a pen stand, a box and a cell phone. The objects had a wide range of colors and contrasts. There was more than one object on the desk with a red hue similar to that of the coke can. The subjects were given time to get familiarized with the perception of the coke can using simulated vision and were also given one practice trial with cues. This familiarized them with the cueing mechanism and gave them an idea about how to distinguish the coke can from the other objects. The
experiment again had 2 phases but both the phases were conducted in the same session. For
phase 1, the subjects performed 10 trials with no cueing or cueing based on their respective
groups. For phase 2, after the first 10 trials in phase 1, the subjects were asked to perform
the same task in the reverse manner - subjects who had used cues before, now performed
the task without cues and vice versa. The subjects performed 10 more trials after reversing
the groups. Only 1 session was conducted with a total of 20 trials per subject and 10 trials
per group per subject. The set-up of the desk and an example of the corresponding simulated vision are shown in figure 4.38.
Figure 4.38: Set up for the desk task of finding the coke can and the corresponding simulated vision
Results
Figures 4.39, 4.40 and 4.41 show the head movements for the no cueing and cueing groups
for phases 1 and 2. An unpaired t-test (p<0.05) analysis between the no cueing and cueing trials for both phases 1 and 2 shows that the total head movements made by the subjects are significantly lower for the cueing trials than for the no cueing trials. Subjects learn the desk layout and the extent of the camera's field of view after the first 10 trials in phase 1. This leads to the overall values of head movements being lower in phase 2
compared to phase 1. Head movements for the cueing trials in phase 2 are still significantly
lower than the head movements for the no cueing trials.
The graphs for time in figures 4.42, 4.43 and 4.44 show the results for the time taken by
the two groups to find the coke can in phase 1 and 2. A statistical analysis with an unpaired
t-test (p<0.05) shows that the time taken by the subjects for the cueing trials is significantly
lower than for the no cueing trials in phase 1. In phase 2, because of the learning from the
Figure 4.39: Phase 1 head movements for the no cueing and cueing groups
Figure 4.40: Phase 2 head movements for the no cueing and cueing groups
Figure 4.41: Head movements for Phase 1 and Phase 2 for the no cueing and cueing
groups
Figure 4.42: Phase 1 time in seconds for the no cueing and cueing groups
first 10 trials in phase 1, the overall time required goes down for both the cueing and no
cueing groups and statistically, there is no significant difference between the no cueing and
cueing trials. An important observation here is that for phase 2, the subjects performing
no cueing are those who have already performed cueing before. With the cueing trials, the subjects learn how far to look for the coke can on the desk and also learn better how to recognize the coke can in the cluttered environment. With the cueing trials, because of the added confidence provided by the algorithm, subjects learn to distinguish the object more quickly as they pay more attention to what they are being cued towards. With the no cueing trials, once subjects start looking for the coke can, the responses are often based on guesswork or elimination (for example, "this object is too big to be a coke can"). Subjects in the no cueing group in phase 1 improve their performance significantly in phase 2 when they use cues to find the object. However, because the no cueing group for phase 2 has already done cueing in phase 1, they are better adapted to the task, and the two groups show no significant difference in terms of time for phase 2.
Figure 4.43: Phase 2 time in seconds for the no cueing and cueing groups
Figure 4.44: Time in seconds for Phase 1 and Phase 2 for the no cueing and cueing groups
The number of errors made by the subjects in recognizing the coke can during the no cueing and cueing trials is shown in figures 4.45, 4.46 and 4.47. An unpaired t-test
(p<0.05) for phases 1 and 2 shows that for phase 1, the number of errors for the cueing trials
is significantly lower than for the no cueing trials, but, for phase 2, there is no significant
difference between the two. Again, this could be attributed to the fact that after the first 10
trials in phase 1, subjects have learnt how to identify the coke can and distinguish it from
the other objects in the desk layout.
Figure 4.45: Phase 1 number of errors for the no cueing and cueing groups
Figure 4.46: Phase 2 number of errors for the no cueing and cueing groups
Figure 4.47: Number of errors for Phase 1 and Phase 2 for the no cueing and cueing groups
Discussion
It was discussed in experiment 3 that when subjects are looking for a particular target,
combining the bottom-up saliency algorithm with some sort of object recognition/detection
information might be more useful for the subjects and may help optimize performance of
the algorithm and the subjects. In the current experiment, the regions of interest for cueing
were detected using the bottom-up algorithm combined with the top-down information for
the coke can. The subjects performing the cueing trials were instructed that the first or second cue would most likely direct them towards the coke can. This greatly improved the performance of the subjects in the cueing trials because they were confident that the algorithm was also trying to find the coke can for them. This led to the subjects focusing and concentrating more when scanning the region towards which the cues directed them, which helped them quickly learn how to distinguish the coke can from the surroundings and the other objects on the desk. If the algorithm could always show the direction of the object of interest in the first cue, the task of finding search objects would become much simpler for the subjects. However, the implementation of top-down and bottom-up information integration presented in this thesis uses a limited number of streams and weights for the streams, which can lead to other objects with similar features being detected before the actual object of interest. The algorithm picks out the coke can within the first two cues most of the time, but not always with the first. This algorithm can be combined with an object recognition algorithm to extract the object regions at the first and second cues and compare the features in those regions to the features of the search object, e.g. the coke can. The desired object region can then be found from this small subset of object regions and the direction of the search object can be shown to the subjects as the first cue. This implementation would add computational complexity to the algorithm. With improvements in processor speeds and power requirements, this could become a possibility in the future.
4.4 Summary
In this chapter we weigh the advantages and disadvantages of having a saliency based cueing system to guide the subjects towards the direction of important objects, comparing performance with that obtained when the subjects use only their natural head scanning movements. The performance of normal sighted subjects using simulated vision was evaluated in various test scenarios, such as searching for objects on a table and walking through a room or corridor while avoiding obstacles. A cueing system may help organize the
manner in which subjects sample their visual field and thus lead to a reduction in the head
velocities and total head movements. Head movements are important for blind subjects to
understand the layout of the room and the different objects but unguided head movements
could lead to the scanning of unnecessary areas and thus cause confusion or disorientation
to the blind subjects.
The results showed that there is a huge variability in inter-subject performance mainly
because of the difference in the approaches that each subject has towards the different tasks.
Certain parameters like a subject’s height would change the viewing angle of each subject
relative to the ground, which in turn would result in the camera's captured visual field being
different for each subject. This results in different directional cues for different subjects
performing the same task and leads to the variability in performance. With a healthy retina,
many subjects adapt very quickly to the simulated prosthetic vision in a few trials (about 2-
5) and do not feel the need to use the cueing system. However, most subjects find the cueing
system to be useful for orienting themselves in the new environment and in being confident
that they are looking at something important which could be an object of interest. This was
especially true in the real-world environments like the lobby area and the desk task with
the coke can because in such environments consisting of naturally occurring objects, all the
objects blended in such a way that it was difficult for the subjects to distinguish between
different objects and areas.
Overall, the cueing system reduced the head velocities and head movements of the
subjects and also the number of errors that they made. The process of asking for cues,
getting them on the display and then deciding whether to follow them or not added a finite
amount of time to the cueing trials, effectively increasing the time for the cueing groups
especially for mobility tasks. The results of the various experimental designs suggest that in
unfamiliar environments, subjects might be at an advantage when using the cueing system
to guide their path or to search for objects of interest. When searching for objects of
interest, combining the bottom-up nature of the saliency algorithm with certain features of
the targets of interest can improve performance of the subjects.
All of these experiments were carried out with normal sighted volunteers provided with
simulated vision. Because the normal sighted volunteers have a healthy retina, it may be relatively easy and quick for them to adapt to the restricted and simulated vision. It may not be possible to easily extrapolate results from studies done with normal sighted volunteers to the visually impaired. More studies on patients implanted with the prosthesis will be required to understand how the results obtained with normal sighted volunteers translate to retinal prosthesis implantees. However, these studies provide a first, simulation-based step towards analyzing what kind of benefit such a cueing system can offer when subjects or patients have a restricted field of view and artificial or partial vision.
Chapter 5
Summary and Discussion
The work presented in this thesis focused on the development and implementation of image processing algorithms to provide guidance to retinal prosthesis recipients for navigation and search tasks. A computationally efficient implementation of a saliency detection algorithm based on the visual attention model by Itti et al. was developed; it runs approximately 10 times faster than the model by Itti et al. The algorithm is a bottom-up model envisioned to provide implant recipients, on request, with information about possible objects of interest in the peripheral areas of their visual field. The execution rate of the algorithm on a Texas Instruments 720 MHz DSP evaluation kit is about 1 frame per second, which implies that users will be provided with directional information within 1 second of requesting it. This meets the goal for the on-demand system envisioned for the application. Top-down information processing was also integrated with the algorithm to optimize performance for search tasks. Results showed that the performance of the algorithm when searching for objects such as a cell phone or a coke can improved when using the top-down integrated model rather than the bottom-up model only.
Several mobility and search task experiments were conducted with normal sighted volunteers performing the tasks with restricted simulated vision in the central visual field. The
studies showed that using cueing help from the saliency detection algorithm significantly
reduced the amount of head movements and errors made by the subjects. The time was
significantly reduced with algorithmic help for search tasks but no benefit was observed
for mobility tasks. The studies support the idea that image processing algorithms can be
used to provide information to the prosthesis implant recipients and possibly to visually
impaired subjects in a way that can provide them with added confidence and improve their
performance in various tasks.
Performance evaluation of the subjects in different environmental settings when using cueing help from the algorithm showed that the bottom-up only model can be most useful for gathering information about new surroundings, avoiding obstacles and planning a navigation route. When searching for objects of interest, integrating the top-down and bottom-up models improves performance. The bottom-up model processes the image information based on basic image features like color saturation, intensity and edge information and gives equal weight to these different features. Using top-down integration, the set of features that represent the search object are given higher weights and other features are suppressed. This effectively inhibits other objects and facilitates an early detection of objects of interest. The algorithm is designed to provide cues indicating the direction of possible objects of interest. When searching for objects of interest, it is desirable to have the first 2-3 cues directed towards those objects. The top-down integration in this thesis has been implemented at minimal additional computational cost. To make the algorithm more robust and able to provide the directional cue towards the object of interest in the first cue, additional information processing will have to be implemented at additional computational cost. The implementation was in the form of an on-demand system providing the direction of possibly interesting objects at the request of the user. The on-demand feature was designed partly to reduce the confusion that too much real-time information could cause the subjects. It also allows the use and implementation of computationally intensive algorithms on portable processors by requiring each frame to be processed in about 1 second instead of in 1/30 of a second, which would be the case for a real-time system. With improvements in the processing power and power consumption of image processing chips, a real-time implementation of computationally intensive algorithms may be a possibility in the near future. With powerful processors, an on-demand feature would allow the implementation of algorithms with even higher computational requirements and possibly superior performance compared to the algorithms discussed in this thesis.
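A minimal sketch of the feature weighting just described, assuming the per-feature conspicuity maps (for example color, intensity and edges) are already available as 2-D arrays; the weight values and function names are hypothetical and the snippet is an illustration, not the DSP implementation developed in this work:

```python
# Illustrative combination of normalized feature maps with optional top-down weights.
import numpy as np

def combine_feature_maps(feature_maps, weights=None):
    """feature_maps: dict of name -> 2-D array, e.g. keys 'color', 'intensity', 'edges'.
    weights: dict of name -> float; leaving it out reproduces the equal-weight (bottom-up) case."""
    names = sorted(feature_maps)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}   # bottom-up: equal weights
    saliency = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name in names:
        fmap = feature_maps[name].astype(float)
        fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-9)  # normalize to [0, 1]
        saliency += weights.get(name, 0.0) * fmap
    return saliency

def most_salient_location(saliency):
    """Row/column of the peak, i.e. where a directional cue would point."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)

# Hypothetical top-down weighting for a strongly colored target such as the coke can:
# weights = {'color': 0.7, 'intensity': 0.15, 'edges': 0.15}
```

With equal weights this reduces to the bottom-up combination; biasing the weights towards the dominant features of the search object is what allows the earliest cues to favor that object.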
The simulated vision experiment studies give an insight into the possible benefits that could be offered to retinal prosthesis recipients or visually impaired users when they utilize information from an image processing algorithm to perform their tasks. The studies show that once subjects become familiar with the environmental layout, the benefit offered by the cueing information to subjects performing mobility tasks decreases. This is not surprising considering that even totally blind subjects are very comfortable and confident when navigating in familiar surroundings such as their home. The algorithm can help in significantly reducing the head movements and errors made by the subjects when performing mobility tasks. When navigating through a mobility course restricted to a corridor area, the multiple cues required from the algorithm increase the total time taken by the subjects to perform the trials. However, when navigating in a room that did not have a pre-defined course, the time taken to finish the trials was not significantly different for the subjects using help from the algorithm compared to the subjects not using any help. For search tasks, improvement in performance was observed for all three measured parameters, namely total head movements, time and errors, when subjects used cueing information from the algorithm. In the absence of algorithm guidance, search tasks involve a certain level of random searching: depending on where the object of interest lies and in which direction the subjects start scanning, the object may be found sooner or later. But with cueing help, the algorithm directs the subjects toward the object of interest usually within the first couple of cues, and as a result the measured values of the parameters are significantly lower for the subjects using cueing than for the subjects who did not use cueing help. This might suggest that even if subjects do not find it convenient to wait and ask for multiple cues when navigating through an environment, they might still benefit greatly from such an algorithm in search tasks. How retinal prosthesis implantees prefer to have information provided to them will have to be explored once a greater number of patients are implanted. Studies with retinal prosthesis implantees and visually impaired subjects, when conducted on larger scales, might help to evaluate whether the benefits of a cueing system to these groups of subjects are similar to those seen for normal sighted subjects using simulated vision.
The work presented in this thesis was among the first to evaluate the possible benefits to retinal prosthesis recipients of a cueing system based on the detection of regions of interest. Future work could focus on carrying out more experiments with visually impaired patients who are not implanted with a prosthesis and evaluating the kind of benefits offered to them. In the experiments with normal sighted volunteers, it was observed that a certain level of boredom sets in after the subjects perform the same tasks repeatedly. Future experiments with normal sighted volunteers can be designed to include competition and rewards to keep the subjects interested in the tasks for the entire testing period. Follow-up interviews can be planned to evaluate the qualitative and psychological benefits of a cueing system to the subjects and their overall experience. The image processing algorithms can be made more useful and robust for search tasks by implementing feature detection (such as SIFT features) for the objects of interest. These features can be matched to the various salient regions computed by the algorithm combining top-down and bottom-up information, and the region with which the features match can be identified as the most salient region. This may help subjects confidently detect the objects of interest at the first directional cue rather than within the first 2-3 cues. This extra processing will involve extra computational cost, but advancements in the power consumption and processing capabilities of chips may make this possible in the near future.
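A minimal sketch of that proposed extension, assuming the standard OpenCV feature-matching API (cv2.SIFT_create, detectAndCompute and BFMatcher.knnMatch); the function name and its inputs are hypothetical, and the snippet is illustrative rather than part of the system implemented in this thesis:

```python
# Match SIFT features of a stored target template against the candidate regions
# returned by the first few cues, so the best-matching region can be cued first.
import cv2

def best_matching_region(target_gray, region_grays, ratio=0.75):
    """Return the index of the candidate region (grayscale image crops) whose SIFT
    features best match the target template, or None if nothing matches well."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher()
    _, target_desc = sift.detectAndCompute(target_gray, None)
    best_index, best_count = None, 0
    for i, region in enumerate(region_grays):
        _, region_desc = sift.detectAndCompute(region, None)
        if target_desc is None or region_desc is None:
            continue
        # Lowe's ratio test keeps only distinctive matches.
        pairs = matcher.knnMatch(target_desc, region_desc, k=2)
        good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) > best_count:
            best_index, best_count = i, len(good)
    return best_index
```

The cue whose region produces the most ratio-test matches against the target template would then be presented to the subject first, at the cost of the extra feature extraction and matching noted above.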
Bibliography
Alon Asher, William A Segal, Stephen A Baccus, Leonid P Yaroslavsky, and Daniel V
Palanker. Image processing for a high-resolution optoelectronic retinal prosthesis. IEEE
Trans Biomed Eng, 54(6 Pt 1):993–1004, Jun 2007.
M. Bak, J. P. Girvin, F. T. Hambrecht, C. V. Kufta, G. E. Loeb, and E. M. Schmidt.
Visual sensations produced by intracortical microstimulation of the human occipital
cortex. Med Biol Eng Comput, 28(3):257–259, May 1990.
E. L. Berson. Retinitis pigmentosa. the friedenwald lecture. Invest Ophthalmol Vis Sci,
34(5):1659–1676, Apr 1993.
Anding Bi, Jinjuan Cui, Yu-Ping Ma, Elena Olshevskaya, Mingliang Pu, Alexander M
Dizhoor, and Zhuo-Hua Pan. Ectopic expression of a microbial-type rhodopsin restores
visual responses in mice with photoreceptor degeneration. Neuron, 50(1): 23–33, Apr
2006.
G. S. Brindley and W. S. Lewin. The sensations produced by electrical stimulation of the
visual cortex. J Physiol, 196(2):479–493, May 1968.
J.B. Bron, R. Tripathi, and B. Tripathi. Wolff's Anatomy of the Eye and Orbit: 8th Edition. A Hodder Arnold Publication, 1997.
Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code.
IEEE Transactions on Communications, 1983.
Laser Cane and Polaron. http://www.abledata.com, a.
Laser Cane and Polaron. http://www.deafblind.com/dbequipm.html, b.
J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Analysis and
Machine Intelligence, 8:679–714, 1986.
Avi Caspi, Jessy D Dorn, Kelly H McClure, Mark S Humayun, Robert J Greenberg, and
Matthew J McMahon. Feasibility study of a retinal prosthesis: spatial vision with a 16-
electrode implant. Arch Ophthalmol, 127(4):398–401, Apr 2009.
K. Cha, K. Horch, and R. A. Normann. Simulation of a phosphene-based visual field:
visual acuity in a pixelized vision system. Ann Biomed Eng, 20(4):439–449, 1992a.
K. Cha, K. W. Horch, and R. A. Normann. Mobility performance with a pixelized vision
system. Vision Res, 32(7):1367–1372, Jul 1992b.
K. Cha, K.W. Horch, R. A. Normann, and D. K. Boman. Reading speed with a pixelized
vision system. J Opt Soc Am A, 9(5):673–677, May 1992c.
Spencer C Chen, Gregg J Suaning, John W Morley, and Nigel H Lovell. Simulating
prosthetic vision: Ii. measuring functional capacity. Vision Res, 49(19):2329–2343, Sep
2009a.
Spencer C Chen, Gregg J Suaning, John W Morley, and Nigel H Lovell. Simulating
prosthetic vision: I. visual models of phosphenes. Vision Res, 49(12):1493–1506, Jun
2009b.
A. Y. Chow and N. S. Peachey. The subretinal microphotodiode array retinal prosthesis.
Ophthalmic Res, 30(3):195–198, 1998.
Alan Y Chow, Vincent Y Chow, Kirk H Packo, John S Pollack, Gholam A Peyman, and
Ronald Schuchard. The artificial silicon retina microchip for the treatment of vision loss
from retinitis pigmentosa. Arch Ophthalmol, 122(4):460–469, Apr 2004.
L. da Cruz, B. Coley, P. Christopher, F. Merlini, V. Wuyyuru, J.A. Sahel, P. Stanga, E.
Filley, G. Dagnelie, and Argus II Study Group. Patients blinded by outer retinal
dystrophies are able to identify letters using the Argus II retinal prosthesis system.
The Association for Research in Vision and Ophthalmology Annual Meeting, Fort
Lauderdale, Florida, US, 2010.
Gislin Dagnelie, David Barnett, Mark S Humayun, and Robert W Thompson. Paragraph
text reading using a pixelized prosthetic vision simulator: parameter dependence and task
learning in free-viewing conditions. Invest Ophthalmol Vis Sci, 47(3):1241–1250, Mar
2006.
Gislin Dagnelie, Pearse Keane, Venkata Narla, Liancheng Yang, James Weiland, and
Mark Humayun. Real and virtual mobility performance in simulated prosthetic vision. J
Neural Eng, 4(1):S92–101, Mar 2007.
Chloé de Balthasar, Sweta Patel, Arup Roy, Ricardo Freda, Scott Greenwald, Alan
Horsager, Manjunatha Mahadevappa, Douglas Yanai, Matthew J McMahon, Mark S
Humayun, Robert J Greenberg, James D Weiland, and Ione Fine. Factors affecting
perceptual thresholds in epiretinal prostheses. Invest Ophthalmol Vis Sci, 49(6): 2303–
2314, Jun 2008.
Tobi Delbrück and Shih-Chii Liu. A silicon early visual system as a model animal.
Vision Res, 44:2083–2089, 2004.
W. H. Dobelle and M. G. Mladejovsky. Phosphenes produced by electrical stimulation of
human occipital cortex, and their application to the development of a prosthesis for the
blind. J Physiol, 243(2):553–576, Dec 1974.
J. E. Dowling. The Retina: An Approachable Part of the Brain. Belknap Press of Harvard
University Press, 1987.
R. Eckmiller. Learning retina implants with epiretinal contacts. Ophthalmic Res, 29(5):
281–289, 1997.
R.E. Eckmiller, D. Neumann, and O. Baruth. Specification of single ganglion cell
stimulation codes for retina implants. in The Association for Research in Vision and
Ophthalmology (ARVO) Conference, 2004.
W. Fink, M.A. Tarbell, J. Weiland, and M. Humayun. Dora: Digital object recognition
audio-assistant for the visually impaired,. Association for Research in Vision and
Ophthalmology (ARVO ) Conference, 2004.
O. Foerster. Beiträge zur Pathophysiologie der Sehbahn und der Sehsphäre. J Psychol
Neurol, 39:435–463, 1929.
D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall; US ed
edition, 2002.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences, 55,
1997.
David S Friedman, Benita J O’Colmain, Beatriz Muñoz, Sandra C Tomany, Cathy
McCarty, Paulus T V M de Jong, Barbara Nemesure, Paul Mitchell, John Kempen, and
Eye Diseases Prevalence Research Group. Prevalence of age-related macular
degeneration in the united states. Arch Ophthalmol, 122(4):564–572, Apr 2004.
Simone Frintrop. VOCUS: A Visual Attention System for Object Detection and Goal
directed Search. PhD thesis, University of Bonn, 2005.
Simone Frintrop, Gerriet Backer, and Erich Rome. Goal-directed search with a top down
modulated computational attention system. Proceedings of the Annual Meeting of the
German Association for Pattern Recognition (DAGM), 2005.
Genentech. New Lucentis ranibizumab injection: http://www.lucentis.com.
Duane R Geruschat, Shirin E Hassan, and Kathleen A Turano. Gaze behavior while
crossing complex intersections. Optom Vis Sci, 80(7):515–528, Jul 2003.
Sendero GPS. http://www.senderogroup.com/products/shopgps.htm.
H. Greenspan, S. Belongie, R. Goodman, P. Perona, S. Rakshit, and C. H. Anderson.
Overcomplete steerable pyramid filters and rotation invariance. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 1994.
L. E. Hallum, G. J. Suaning, D. S. Taubman, and N. H. Lovell. Towards photosensor
movement–adaptive image analysis in an electronic retinal prosthesis. Conf Proc IEEE
Eng Med Biol Soc, 6:4165–4168, 2004.
L. E. Hallum, S. L. Cloherty, and N. H. Lovell. Image analysis for microelectronic retinal
prosthesis. IEEE Trans Biomed Eng, 55(1):344–346, Jan 2008.
Luke E Hallum, Gregg J Suaning, David S Taubman, and Nigel H Lovell. Simulated
prosthetic visual fixation, saccade, and smooth pursuit. Vision Res, 45(6):775–788, Mar
2005.
Jasmine S Hayes, Vivian T Yin, Duke Piyathaisere, James D Weiland, Mark S Humayun,
and Gislin Dagnelie. Visually guided performance of simple tasks using simulated
prosthetic vision. Artif Organs, 27(11):1016–1028, Nov 2003.
Alan Horsager, Scott H Greenwald, James D Weiland, Mark S Humayun, Robert J
Greenberg, Matthew J McMahon, Geoffrey M Boynton, and Ione Fine. Predicting visual
sensitivity in retinal prosthesis patients. Invest Ophthalmol Vis Sci, 50(4):1483–1491,
Apr 2009.
D. H. Hubel. Eye, brain and vision. W. H. Freeman; 2nd edition, 1995.
M. S. Humayun, E. de Juan, G. Dagnelie, R. J. Greenberg, R. H. Propst, and D. H.
Phillips. Visual perception elicited by electrical stimulation of retina in blind humans.
Arch Ophthalmol, 114(1):40–46, Jan 1996.
M. S. Humayun, E. de Juan, J. D. Weiland, G. Dagnelie, S. Katona, R. Greenberg, and S.
Suzuki. Pattern electrical stimulation of the human retina. Vision Res, 39(15): 2569–
2576, Jul 1999.
Mark S Humayun, James D Weiland, Gildo Y Fujii, Robert Greenberg, Richard
Williamson, Jim Little, Brian Mech, Valerie Cimmarusti, Gretchen Van Boemel, Gislin
Dagnelie, and Eugene de Juan. Visual perception in a blind subject with a chronic
microelectronic retinal prosthesis. Vision Res, 43(24):2573–2581, Nov 2003.
M.S. Humayun, J.D. Weiland, B. Justus, C. Merrit, J. Whalen, D. Piyathaisere, S.J. Chen,
E. Margalit, G. Fujii, R.J. Greenberg, E de Juan Jr., D. Scribner, and W. Liu. Towards a
completely implantable, light-sensitive intraocular retinal prosthesis. Annu. Int. Conf.
IEEE Eng. Med. Biol. Soc., 23rd, Istanbul, Turkey.
M.S. Humayun, R. Freda, I. Fine, A. Roy, G. Fujii, RJ Greenberg, J Little, B Mech, J
Weiland, and E de Juan J. Implanted intraocular retinal prosthesis in six blind subjects. in
The Association for Research in Vision and Ophthalmology (ARVO) Conference, 2005.
M.S. Humayun, L. da Cruz, G. Dagnelie, S. Mohand-Said, P. Stanga, R.N. Agrawal, R.J.
Greenberg, and Argus II Study Group. Interim performance results from the Second Sight Argus II retinal prosthesis study. The Association for Research in Vision and
Ophthalmology Annual Meeting, Fort Lauderdale, Florida, US, 2010.
L. Itti and C. Koch. A saliency-based search mechanism for overt and covert shifts of
visual attention. Vision Res, 40(10-12):1489–1506, 2000.
L. Itti and C. Koch. Computational modelling of visual attention. Nat Rev Neurosci, 2
(3):194–203, Mar 2001.
Laurent Itti. Models of Bottom-Up and Top-Down Visual Attention. PhD thesis,
California Institute of Technology, Pasadena, California, 2000.
Laurent Itti. Quantifying the contribution of low-level saliency to human eye movements
in dynamic scenes. Visual Cognition, 12(6):1093–1123, 2005.
Laurent Itti. Quantitative modelling of perceptual salience at human eye position. Visual
Cognition, 14:959–984, 2006.
Laurent Itti and Christoph Koch. A comparison of feature combination strategies for
saliency-based visual attention systems. In Proceedings of SPIE Human Vision and
Electronic Imaging IV (HVEI’99), 3644:473–482, 1999.
Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention
for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 20(11), 1998.
The Bat K-Sonar. http://www.batforblind.co.nz/.
ER Kandel, JH Schwartz, and TM Jessel. Principles of Neural Science: 4th Edition.
McGraw-Hill Professional Publishing, New York, 2000.
S. Y. Kim, S. Sadda, J. Pearlman, M. S. Humayun, E. de Juan, B. M. Melia, and W. R.
Green. Morphometric analysis of the macula in eyes with disciform age-related macular
degeneration. Retina, 22(4):471–477, Aug 2002.
Ronald Klein, Barbara E K Klein, Michael D Knudtson, Stacy M Meuer, Maria Swift,
and Ronald E Gangnon. Fifteen-year cumulative incidence of age-related macular
degeneration: the beaver dam eye study. Ophthalmology, 114(2):253–262, Feb 2007.
C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural
circuitry. Hum Neurobiol, 4(4):219–227, 1985.
H. Kolb, E. Fernandez, and R. Nelson. Webvision: The Organization of the Retina and
Visual System.
F. Krause and H. Schum. Die epileptischen Erkrankungen. In Neue Deutsche Chirurgie, ed. Kunter H. Stuttgart, 1931.
Wentai Liu, Wolfgang Fink, Mark A. Tarbell, and Mohanasankar Sivaprakasam. Image
processing and interface for retinal visual prostheses. In ISCAS (3), pages 2927– 2930.
IEEE, 2005.
J. Loomis, R. Golledge, and R. Klatzky. Navigation system for the blind: Auditory
display modes and guidance. Presence, 7(2):p. 193–203, 1998.
David G. Lowe. Local feature view clustering for 3d object recognition. IEEE
Conference on Computer Vision and Pattern Recognition, 2001.
David G. Lowe. Distinctive image features from scale-invariant keypoints. International
Journal of Computer Vision, 2004.
R. D. Lund, A. S. Kwan, D. J. Keegan, Y. Sauvé, P. J. Coffey, and J. M. Lawrence. Cell
transplantation as a treatment for retinal disease. Prog Retin Eye Res, 20(4):415–449, Jul
2001.
R. E. MacLaren, R. A. Pearson, A. MacNeil, R. H. Douglas, T. E. Salt, M. Akimoto, A.
Swaroop, J. C. Sowden, and R. R. Ali. Retinal repair by transplantation of photoreceptor
precursors. Nature, 444(7116):203–207, Nov 2006.
Albert M Maguire, Francesca Simonelli, Eric A Pierce, Edward N Pugh, Federico
Mingozzi, Jeannette Bennicelli, Sandro Banfi, Kathleen A Marshall, Francesco Testa,
EnricoMSurace, Settimio Rossi, Arkady Lyubarsky, Valder R Arruda, Barbara Konkle,
Edwin Stone, Junwei Sun, Jonathan Jacobs, Lou Dell’Osso, Richard Hertle, Jian xing
Ma, T. Michael Redmond, Xiaosong Zhu, Bernd Hauck, Olga Zelenaia, Kenneth S
Shindler, Maureen G Maguire, J. Fraser Wright, Nicholas J Volpe, Jennifer Wellman
McDonnell, Alberto Auricchio, Katherine A High, and Jean Bennett. Safety and efficacy
of gene transfer for leber’s congenital amaurosis. N Engl J Med, 358(21): 2240–2248,
May 2008.
Manjunatha Mahadevappa, James D Weiland, Douglas Yanai, Ione Fine, Robert J
Greenberg, and Mark S Humayun. Perceptual thresholds and electrode impedance in
three retinal prosthesis subjects. IEEE Trans Neural Syst Rehabil Eng, 13(2):201– 206,
Jun 2005.
E. Margalit and W. B. Thoreson. Inner retinal mechanisms engaged by retinal electrical
stimulation. Invest Ophthalmol Vis Sci, 47:2606–2612, 2006.
Ruggero Milanese, Harry Wechsler, Sylvia Gill, Jean-Marc Bost, and Thierry Pun.
Integration of bottom-up and top-down cues for visual attention using non-linear
relaxation. International Conference on Computer Vision and Pattern Recognition, pages
781–785, 1994.
R. A. Normann, E. M. Maynard, P. J. Rousche, and D. J.Warren. A neural interface for a
cortical vision prosthesis. Vision Res, 39(15):2577–2587, Jul 1999.
Rishita Nutheti, Bindiganavale R Shamanna, Praveen K Nirmalan, Jill E Keeffe,
Sannapaneni Krishnaiah, Gullapalli N Rao, and Ravi Thomas. Impact of impaired vision
and eye disease on quality of life in andhra pradesh. Invest Ophthalmol Vis Sci,
(11):4742–4748, Nov 2006.
Nabil Ouerhani, Roman von Wartburg, Heinz Hügli, and René Müri. Empirical
validation of the saliency-based model of visual attention. Electronic Letters on
Computer Vision and Image Analysis, 3(1):13–24, 2004.
N. Parikh, L. Itti, and J. Weiland. Saliency-based image processing for retinal prostheses.
J Neural Eng, 7(1):16006, Feb 2010.
Derrick Parkhurst, Klinton Law, and Ernst Niebur. Modeling the role of salience in the
allocation of overt visual attention. Vision Res, 42(1):107–123, Jan 2002.
Sonic Pathfinder. http://web.aanet.com.au/tonyheyes/pa/pf_blerb.html.
Robert J Peters, Asha Iyer, Laurent Itti, and Christof Koch. Components of bottom-up
gaze allocation in natural images. Vision Res, 45(18):2397–2416, Aug 2005.
W. K. Pratt. Digital Image Processing. John Wiley & Sons; 2 edition, 1991.
Markus N Preising and Steffen Heegard. Recent advances in early-onset severe retinal
degeneration: more than just basic research. Trends Mol Med, 10(2):51–54, Feb 2004.
P. Elizabeth Rakoczy, Meaghan J T Yu, Steven Nusinowitz, Bo Chang, and John R
Heckenlively. Mouse models of age-related macular degeneration. Exp Eye Res, (5):741–
752, May 2006.
J. F. Rizzo, J. Wyatt, M. Humayun, E. de Juan,W. Liu, A. Chow, R. Eckmiller, E.
Zrenner, T. Yagi, and G. Abrams. Retinal prosthesis: an encouraging first decade with
major challenges ahead. Ophthalmology, 108(1):13–14, Jan 2001.
A. Santos, M. S. Humayun, E. de Juan, R. J. Greenburg, M. J. Marsh, I. B. Klock, and A.
H. Milam. Preservation of the inner retina in retinitis pigmentosa. a morphometric
analysis. Arch Ophthalmol, 115(4):511–515, Apr 1997.
Linda G. Shapiro and George C. Stockman. Computer Vision. Prentice Hall; ISBN 0-13-030796-3, 2002.
Talking Signs. http://www.talkingsigns.com.
Nishant R Srivastava, Philip R Troyk, and Gislin Dagnelie. Detection, eye-hand
coordination and virtual mobility performance in simulated vision for a cortical visual
prosthesis device. J Neural Eng, 6(3):035008, Jun 2009.
J. L. Stone, W. E. Barlow, M. S. Humayun, E. de Juan, and A. H. Milam. Morphometric
analysis of macular photoreceptors and ganglion cells in retinas with retinitis pigmentosa.
Arch Ophthalmol, 110(11):1634–1639, Nov 1992.
Benjamin W Tatler. The central fixation bias in scene viewing: selecting an optimal
viewing position independently of motor biases and image feature distributions. J Vis,
7(14):4.1–17, 2007.
Robert W Thompson, G. David Barnett, Mark S Humayun, and Gislin Dagnelie. Facial
recognition using simulated prosthetic pixelized vision. Invest Ophthalmol Vis Sci,
44(11):5035–5042, Nov 2003.
B. Tjan, P. Bechmann, R. Roy, N. Giudice, and G. Legge. Digital sign system for indoor
wayfinding for the visually impaired. 2005.
A. M. Treisman and G. Gelade. A feature-integration theory of attention. Cogn Psychol,
12(1):97–136, Jan 1980.
D. Tsai, J. W. Morley, G. J. Suaning, and N. H. Lovell. A wearable real-time image
processor for a vision prosthesis. Comput Methods Programs Biomed, 95(3):258– 269,
Sep 2009.
K. A. Turano, D. R. Geruschat, F. H. Baker, J. W. Stahl, and M. D. Shapiro. Direction of
gaze while walking a simple route: persons with normal vision and persons with retinitis
pigmentosa. Optom Vis Sci, 78(9):667–675, Sep 2001.
Kathleen A Turano, Robert W Massof, and Harry A Quigley. A self-assessment
instrument designed for measuring independent mobility in rp patients: generalizability to
glaucoma patients. Invest Ophthalmol Vis Sci, 43(9):2874–2881, Sep 2002.
M. Velikay-Parel, D. Ivastinovic, M. Koch, R. Hornig, G. Dagnelie, G. Richard, and A.
Langmann. Repeated mobility testing for later artificial visual function evaluation. J
Neural Eng, 4(1):S102–S107, Mar 2007.
Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple
features. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
vOICe. http://www.seeingwithsound.com.
James D Weiland, Wentai Liu, and Mark S Humayun. Retinal prosthesis. Annu Rev
Biomed Eng, 7:361–401, 2005.
F. Werblin, B. Roska, D. Balya, Cs. Rekeczky, and T. Roska. Implementing a retinal
visual language in cnn: a neuromorphic study. in Proc. IEEE International Symposium on
Circuits and Systems, 2:333–336, 2001.
Douglas Yanai, James D Weiland, Manjunatha Mahadevappa, Robert J Greenberg, Ione
Fine, and Mark S Humayun. Visual performance using a retinal prosthesis in three
subjects with retinitis pigmentosa. Am J Ophthalmol, 143(5):820–827, May 2007.
Kareem A Zaghloul and Kwabena Boahen. Optic nerve signals in a neuromorphic chip ii:
Testing and results. IEEE Trans Biomed Eng, 51(4):667–675, Apr 2004a.
Kareem A Zaghloul and Kwabena Boahen. Optic nerve signals in a neuromorphic chip i:
Outer and inner retina models. IEEE Trans Biomed Eng, 51(4):657–666, Apr 2004b.
E. Zrenner, A. Stett, S. Weiss, R. B. Aramant, E. Guenther, K. Kohler, K. D. Miliczek,
M. J. Seiler, and H. Haemmerle. Can subretinal microphotodiodes successfully replace
degenerated photoreceptors? Vision Res, 39(15):2555–2567, Jul 1999.