COMPUTER VISION AIDED OBJECT LOCALIZATION FOR THE VISUALLY IMPAIRED

by

Nii Tete Mante

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BIOMEDICAL ENGINEERING)

December 2016

Copyright 2016 Nii Tete Mante

Epigraph

The senses deceive from time to time, and it is prudent never to trust wholly those who have deceived us even once.
- René Descartes

Acknowledgments

Throughout my Ph.D., I have grown not only as a scholar, but also as a person. This massive amount of growth should not and cannot be attributed to myself alone. Without the support of the amazing people around me, I could not have achieved this accomplishment. My family, my professor, my committee members, my friends and my colleagues are all to thank for this educational endeavor.

I firmly believe that true growth occurs when we experience periods of adversity. These periods show us something about ourselves. They show us how we react to adversity. They show us whether or not we have the resilience to break down the obstacles in front of us or succumb to them. I surely experienced these obstacles, and I was able to figure out ways to break past these barriers. However, there were some barriers that felt too great for me to knock down. They seemed impossible to get around. In the third year of this program, I was really struggling. I felt like I was not good enough to complete the Ph.D. It was the first time that I had ever felt unsure of myself. It was the first time I did not feel supremely confident in my abilities as an engineer. I consulted my mother and father, as well as my advisor, Dr. James Weiland. I didn't realize it at the time, but this was a defining moment in my life. It was a moment where the trajectory of my future could be changed based on how my parents and advisor would react to my situation, as well as how I would react to their advice.

Upon telling my parents that I was struggling, they gave me life-altering advice. It was the first time that I had come to my parents and told them I was struggling with something academically. Their responses were perfect nonetheless. My parents made me understand that their pride in me would not be determined by whether or not I completed the degree. They were and still are proud of me no matter the circumstances. This alone lifted a weight from my shoulders. They also made me understand that any person who is striving to do something great will experience their limits. Ultimately, there are only two choices: yielding to your limits, or pushing past them. Either choice is fine, but if you're interested in becoming the best version of yourself, the only possible choice is to ignore the invisible barriers in your mind and push through them. For this strength and support I thank my mom, Zenaida, and my dad, Charles, deeply. Time and time again you have changed my life for the better.

My advisor's advice was also priceless. After discussing with James the struggle I was experiencing, his advice was very comforting. He didn't pressure me into choosing one way or another. He allowed me to realize two things. The first was that it is natural to feel this internal struggle. Oddly enough, hearing this from him, a distinguished professor, was another weight lifted off of my shoulders. The second fact that he made me realize was that doing the Ph.D. was my choice, and my choice alone.
This fact made me realize why I wanted to pursue this academic endeavor in the first place. It also made me realize that it was a privilege to be in this prestigious program with some of the most brilliant minds around. These conversations are possibly innocuous to my parents and professor, but the results of those moments have changed my life forever. For that, I thank my Mom, Dad and James deeply.

The last group of people I would like to thank are my siblings and my friends. As the oldest in my family, it has been a pleasure to watch my siblings grow up. Seeing them grow and watching over them has inadvertently made me a stronger and better person. I strive to treat people well, and to be someone that people look up to. I'm absolutely positive that I have these traits because, while growing up, I had to be someone that my siblings could love and respect. So to Naa Adei and Nii Kojo, I thank you for always trying to better yourselves; by striving to be great in your own lives, you make me push myself even more. To the large group of friends that I have met throughout this program: I thank you for the long discussions about all possible topics. I thank you for the great times we have had exploring this amazing university and city. You made a great experience at a great school even better.

I would also like to thank my committee members for the amazing professional advice and deeply intellectual conversations. These interactions helped shape my research and, as a result, this thesis you are now reading. To Dr. Armand Tanguay, Dr. Mark Humayun, Dr. Gisele Ragusa and Dr. Michael Khoo, I thank you all deeply.

Lastly, I would like to thank the Braille Institute, and Judy Hill, for allowing me to use their facilities for my experiments. I would also like to thank all of the braille students who participated in our research studies, and my research assistants Ashley Moy and Shadi Bahool for assisting me during my experiments at the Institute. I would also like to thank my grant colleagues Dr. Aminat Adebiyi, Dr. Furkan Sahin, Kaveri Thakoor, Dr. Patrick Nasiatka, Dr. Christian Siagian, Dr. Gerard Medioni and Dr. Laurent Itti for the amazing interdisciplinary work that we did. Finally, a big thank you to my colleagues Dr. Steven Walston, Dr. Karthik Murali and Dr. Boshuo Wang for the great discussions and debates we've had over the years.

Table of Contents

Epigraph
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Vision Aids for the Visually Impaired
    1.1.1 Crowd Sourced Based Systems
    1.1.2 Sensory Substitution Based Systems
      1.1.2.1 vOICe sensory substitution system
      1.1.2.2 BrainPort Sensory Substitution Device
    1.1.3 Computer Vision Based Systems
      1.1.3.1 GroZi
      1.1.3.2 OrCam
  1.2 Proprioceptive Capabilities of the Visually Impaired
    1.2.1 Reach and Point Tasks
    1.2.2 Reach and Orient Tasks
    1.2.3 Viability of a Head Mounted System
  1.3 Dissertation Organization
    1.3.1 Specific Aims
    1.3.2 System Overview
      1.3.2.1 Object Tracking and Physical Feedback Analysis
      1.3.2.2 Real Time Recognition Module Design
      1.3.2.3 System Integration
Chapter 2: Auditory and Vibrotactile Feedback Analysis
  2.1 Introduction
    2.1.1 Problem
  2.2 System Overview
    2.2.1 Wide-Field-of-View Camera
    2.2.2 Context Tracker - Computer Vision Algorithms
    2.2.3 Sensory Map - Auditory and Vibrotactile Feedback Algorithms
      2.2.3.1 Central Visual Angle
  2.3 Experiments
    2.3.1 Localization, Reach and Grasp Experiments
    2.3.2 Tracker Evaluation Methods
  2.4 Results
    2.4.1 Localization, Reach and Grasp Results
    2.4.2 Subject Comments
    2.4.3 Tracker Results
  2.5 Summary
Chapter 3: Real Time Scene Text Recognition and Object Recognition
  3.1 Introduction
    3.1.1 Problem
    3.1.2 Neural Networks
      3.1.2.1 Perceptron
      3.1.2.2 Artificial Neural Network
      3.1.2.3 Convolutional Neural Network
    3.1.3 Scene Text Recognition
      3.1.3.1 Neural Network Based Approach
      3.1.3.2 Class Specific Extremal Regions
    3.1.4 Object Detection and Recognition
      3.1.4.1 Scale Invariant Feature Transform
      3.1.4.2 Speeded Up Robust Features
      3.1.4.3 Attention Biased Speeded Up Robust Features (AB-SURF)
      3.1.4.4 Color Occurrence Histograms
      3.1.4.5 Support Vector Machines
  3.2 Methods and Approach
    3.2.1 Scene Text Pipeline
      3.2.1.1 Text Detection
      3.2.1.2 Text Recognition
    3.2.2 Object Recognition Pipeline
      3.2.2.1 Data Collection and Augmentation
      3.2.2.2 Neural Network Formulation
      3.2.2.3 Stochastic Gradient Descent Learning Algorithm
  3.3 Results
    3.3.1 Object Recognition
      3.3.1.1 In Vitro Data Accuracy
      3.3.1.2 In Vivo Data Accuracy
  3.4 Summary
    3.4.1 Neural Network Limitations and Remedies
Chapter 4: Full System Integration
  4.1 Introduction
    4.1.1 Problem
    4.1.2 Network Programming
    4.1.3 Asynchronous Programming
    4.1.4 Object Oriented Programming
  4.2 Methods and Approach
    4.2.1 System Controller Application
    4.2.2 Computer Vision Backend
      4.2.2.1 Speech Synthesis Module
      4.2.2.2 Object Recognition Module
      4.2.2.3 Text Recognition Module
      4.2.2.4 Master Module
  4.3 Experiments
    4.3.1 Blindfolded Sighted Demo
    4.3.2 Proposed Experiments
  4.4 Summary
    4.4.1 Previous System Improvements
    4.4.2 System Limitations
Chapter 5: Conclusion
  5.1 Conclusion and Future Work
    5.1.1 Summary of Contributions
    5.1.2 Directions for Future Work
      5.1.2.1 Scene Detection
      5.1.2.2 Improvement of Recognition Module
      5.1.2.3 Port Algorithms to Mobile Devices
      5.1.2.4 Full System Human Subjects Testing
Glossary
  Acronyms
References

List of Tables

2.1 The 3 experimental groups for the feedback experiments. Each subject was assigned to one group. Experiments were done in order from top to bottom for the angles listed.
2.2 The subjects that participated in this study, and the causes of their visual impairment.
3.1 The Top-1 and Top-5 accuracies for the cereal/kellogg class. Both in vitro and in vivo accuracies are reported.
4.1 The left column represents codes used within the software. For example, if an item was found within the camera's field of view, then the Recognition module would send an itemFound request to the Synthesis module. Then, the Synthesis module would generate a command such as "I couldn't find <Pasta Roni>". The text within the angled brackets <> can be varied depending on what item or name should be said.
4.2 The left column represents code used for communication between the app and the server. The right column represents an actual item that can be sent to the server.

List of Figures

1.1 The schematic diagram for the BrainPort system (Danilov and Tyler 2005).
1.2 Full system diagram. (1) Subject chooses item via "Talk-Back" accessibility app. Different color rows correspond to items (e.g., Pasta, Cereal, etc.). (2) The camera sends visual input to a computer. (3) The recognition module finds the item, or any item from the list within the Field of View (FOV), and outputs a bounding box. (4) The tracking module takes the initial bounding box, and follows the object continuously. (5) The feedback generates speech or vibration that correlates to the position of the item.
1.3 System setup. The system is comprised of a head mounted camera, bone conduction headphones, head mounted vibrotactile motors, and a personal computer. The computer contains the processing algorithms.
2.1 Object Localization and Tracking System. The user wears the head-mounted wide-field-of-view camera. (1) The camera sends visual input to a computer. (2) The computer determines the position of a chosen object. (3) The bone conduction headphones/motors play sounds/vibrate, and the subject incrementally turns their head. (4) If the object is centered in the camera's field-of-view (FOV), then the test subject reaches and grasps for the correct object. (5) Otherwise, the program repeats its loop.
2.2 (Right) Diagram shows the location of the pancake motors. Two motors, one just below the temporal lobe and the other behind the ear, are placed on both sides of the head. Thus, a total of four vibration motors were used to convey haptic feedback to the subject. (Left) Diagram shows bone conduction headphones used for auditory feedback.
2.3 Sensory Map for the auditory and vibrotactile feedback mechanisms. The grid represents the camera's field-of-view (FOV). An object's position in 3D space can be mapped to the 2D FOV above. Once the object is mapped to this FOV, it will fall into one of the 9 grid locations. Depending on the location of the object within the grid, the computer will generate the corresponding word/vibration for the subject. Auditory feedback is comprised of computer-generated speech, and vibrotactile feedback is achieved by triggering 4 vibration motors in different combinations.
2.4 This figure shows the relation between the initial image and the sensory map. The central visual angle (CVA) is the size of the central region (in degrees). The calculation of the central visual angle is made based on the distance of items from the camera and the camera's field of view (FOV) size.
2.5 Experimental Setup. (Left) The subject is seated, and the camera is mounted on glasses worn by the subject. (Right) The vision algorithm detects the object of interest (object highlighted by green bounding box). The user is instructed to grasp the object once they receive the center command from the computer. In the case of auditory feedback, a "center" command is a speech-synthesized voice saying "Center." The "center" command for vibrotactile feedback is no vibration. In this case, the center region was 31.2°.
2.6 (Top Left) Time to first grasp vs. Central Visual Angle. The first grasp occurs when the subject grasps any object.
(Top Right) Time to completion vs. Central Visual Angle. Completion occurs when the subject grasps the correct object. (Bottom Left) Number of attempts vs. Central Visual Angle. (Bottom Right) First grasp rate vs. Central Visual Angle. First grasp success rate is a percentage; it counts trials in which the subject grasps the item on the first attempt. All times are in seconds. Fast times, low numbers of reaches and high success rates are optimal. Each graph plots the average time/attempts for all subjects. Additionally, each graph plots the results from subjects using auditory and vibrotactile feedback.
2.7 These plots show the time to 1st grasp, time to completion, number of reaches and 1st grasp success rate for the second round of vibrotactile experiments. Each graph plots the average time/attempts for 5 out of the 12 subjects. Additionally, each graph plots the results from subjects using auditory and vibrotactile feedback.
2.8 Each graph shows the tracking path of an object during a single localization task. Each row corresponds to two experiments from one subject. Specifically, one plot represents how the object moves within the camera's field of view (FOV). The x and y axes for the graphs are in pixels. Each 100 pixels correspond to a horizontal distance of approximately 15 cm. Each point represents a snapshot of the position of the object's centroid at a certain video frame/time. The green dot shows the position of the object at the start of the experiment. The red dot shows the final position, where the subject grasped the object. Thus, the red dot's position should ideally be close to the center of the plot (320, 240). Rows 1, 2 and 3 correlate to localization experiments from visual angles 7.8°, 15.6° and 39°, respectively. The left column shows experiments in which the subject struggled to find the object. The right column shows experiments that went smoothly for the subject.
3.1 Two examples of grocery items within the same class "sugar".
3.2 Example images from the ImageNet challenge. Images can contain one or more items from the categories provided in the challenge. This particular dataset contains 1000 categories (i.e. leopard, dalmatian, grape, etc.) and 1.2 million images. Directly below each image is the correct category/label. Below the correct category are 5 guesses as well as a horizontal bar depicting the confidence score of the algorithm.
3.3 The apple and banana represent two classes within a set of three training examples. The training examples are delineated by a decision boundary (blue line). Two values, size and color, are used as an example of input features. The goal of the perceptron is to find this blue linear boundary. Ultimately, for future predictions of test data, the perceptron will classify test data by seeing which side of the decision boundary it rests on.
3.4 The Perceptron. Each input value x_i is modulated by a weight value w_i. The sum of these products (i.e. the dot product between w and x) is then thresholded via f(x) (Equation 3.2). The final output y is a value of 0 or 1.
3.5 An artificial neural network. The network contains an input layer, two hidden layers and an output layer. Each node, aka perceptron, is fully connected to all nodes in the following layer.
3.6 An image of the updated neuron used in Artificial Neural Networks (ANN).
The difference between the ANN neuron and a perceptron is the activation function (f) used for the output of each neuron.
3.7 Three popular examples of activation functions used in artificial neural networks.
3.8 A 1-dimensional version of a convolutional neural net. The output nodes (yellow blocks) only connect to nodes in their vicinity.
3.9 The result of passing a scene text recognition algorithm over an image. The algorithm finds and recognizes the text as "Triple Door".
3.10 Examples from the Wang training set for text recognition (Wang et al. 2012). (Left) From the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset. (Right) Synthetic data.
3.11 The text detection neural network used in approach 1 (Wang et al. 2012).
3.12 For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process is repeated.
3.13 Here the maxima and minima keypoints are detected. This is done by comparing a pixel (marked with X) in a DoG image to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked with circles).
3.14 This figure shows an 8x8 region around a keypoint (center of left image). Each 1x1 grid in the Image Gradients figure has an orientation. In the right Keypoint Descriptor image, the 16 grids in one corner of the Image Gradients are reduced to 1 quadrant in a 2x2 grid. Each quadrant contains an 8-bin histogram. In this diagram the descriptor generated would be a 32-dimension vector (2x2x8). Most implementations utilize 4x4 grids with 8 bins each (i.e. 128 dimensions).
3.15 (a) and (b) are Gaussian second order partial derivatives in the y-direction and xy-direction, respectively. (c) and (d) represent approximations of (a) and (b). Using the box filter approximation improves the overall computational speed of the algorithm.
3.16 A graphical representation of one pair in a Color Cooccurrence Histogram (CCH). Two pixels with colors C_1 and C_2 are separated by a vector (x, y).
3.17 A support vector machine (SVM) determines the hyperplane between a set of examples. Each example belongs to one of two classes. The SVM maximizes the margin between the nearest examples in each class.
3.18 A high level overview of the text recognition process. Given an original image, the text detection algorithm detects possible word bounding boxes in an image. Once the possible word regions are found, the regions are passed through a recognition algorithm to predict the actual characters inside of the bounding box. Word guesses are compared to the lexicon and partially matching words are updated to dictionary words.
3.19 A flow diagram for the automated web crawler. The only manual intervention required is a list of queries/categories. The web crawler uses a list of queries to automatically request and download images from a grocery store website.
3.20 An actual original image from the dataset being preprocessed.
In this figure, the original image is translated, warped, and rotated. Additional operations done to the original images are brightening, darkening and scaling.
3.21 The manual collection pipeline. For each frame of the web camera video, multiple items within the image were cropped and labeled. These cropped and labeled images were used in subsequent testing of the object recognition pipeline.
3.22 One version of the network used in this research. The network takes the Red, Green and Blue channels of the image, and ultimately produces a guess for the brand of the item. The outputs are of the format <category>/<brand> (e.g., rice/rice-roni, or sugar/splenda).
3.23 (Left) An example of an in vitro web image downloaded from the internet. (Right) An example of an in vivo image frame taken from a web camera.
3.24 The first version of the object recognition neural network. It consists of 1 input layer, 3 hidden layers and an output layer.
3.25 This plot shows the accuracy of the neural network in classifying examples from 46 different classes. The classes were represented as brands. These results are from testing the first version of the neural net with the first version of the dataset (v1-dataset). The v1-dataset consisted of 855 images (599 training, 256 testing).
3.26 The second version of the object recognition neural network. It consists of 1 input layer, 3 hidden layers and an output layer. In this version, two of the hidden layers are convolutional, and 1 of the layers is fully connected. This is in contrast to version 1 (v1 had 1 convolutional and 2 fully connected).
3.27 This plot shows the accuracy of version 2 of the neural network in classifying examples from version 2 of the dataset (v2-dataset). The v2-dataset consisted of 41,660 images (29,162 training, 12,498 testing) from 46 brands and 1 background class.
3.28 Actual images of four brands/classes taken from live web camera video.
4.1 (Left) The first version of the system controller app. This Android app (Zhang, Weiland) was developed to integrate with the first version of this system, the Wearable Visual Aid. (Right) The second version of the app (Mante, Weiland) was implemented to integrate with the new system discussed in this chapter. The new app allows the subject to hear the list of items they have left as well as read text in front of the camera.
4.2 The Synthesis module (within light blue box) communicates with Apple's speech libraries through a custom bridge. The custom bridge was developed for this thesis. The purpose of the bridge was to allow code from a different programming language to be called within our system. The code within this project's system is C++, whereas Apple's libraries are implemented in one of their programming languages, Objective-C.
4.3 The App/Client waits for a subject to click an item (e.g., Pasta). In parallel, the server is started and running. The server then invokes the receiveData function and waits for data in a loop. Once a click on the iOS app occurs, the phone sends JSON data containing the command and the item name to the Computer/Server. Once the computer receives a piece of data, the appropriate modules are invoked.
4.4 Given an image, the sliding window approach creates smaller windows/bounding boxes (yellow squares) and slides them over the entire image. A processing algorithm can then be applied to any window within the image.
4.5 The Text Recognition Module Pipeline. Image channels are processed in parallel by the text detection algorithm. Once possible character regions are detected, they are passed to multiple character merger processes. The result is a list of M possible word regions. The list of word regions is split into M/N sized batches, where N is the number of Optical Character Recognition processes. The final result is a list of recognized text (e.g., Pasta, Roni, Splenda, etc.).
4.6 The full system flow chart. (1) The subject chooses an item via smartphone. This tells the computer to ask for one video frame from the (2) camera.
4.7 (Left) A blindfolded subject standing in front of a shelf of 14 grocery store items. The subject is wearing a head mounted web camera and holding the smartphone application. (Right) An image showing the blindfolded sighted subject's point of view. The image also shows the bounding box localization of the recognition module (Chapter 3) for Kellogg's Rice Krispies, as well as the subject reaching and grasping for the object.

Abstract

The role of vision in accomplishing most daily tasks is priceless. For visually impaired people, many of the tasks deemed simple by sighted individuals are difficult or impossible. Tasks such as navigating through complex environments or finding items are useful problems to solve to aid the autonomy of visually impaired individuals. In this thesis, a grocery item finding assistant has been developed. The system utilizes a head mounted camera for seeing the world, computers for processing camera information, and simplistic feedback for efficient communication between the system and the visually impaired person.

The work done in this thesis included developing a real time tracking and feedback module. This module was thoroughly tested with visually impaired people to determine the effectiveness of a head mounted system for tracking items in real time. The results show that subjects were able to utilize intuitive feedback commands to guide their center of vision towards a desired object and eventually reach and grasp for that item.

The work also resulted in a step towards object recognition for grocery items. By building a lightweight, real time, neural network based object recognition system, and by exploring the use of grocery web images for the recognition of images from a web camera, the research was able to determine the pitfalls and possible limitations of a web-image-only dataset.

Additionally, a fully closed loop real time recognition system with contextual feedback is demonstrated. This closed loop system was a multi-threaded, asynchronous program that linked the tracking, feedback, text recognition, and object recognition modules. The early human subjects testing with the feedback module, coupled with the demonstration of the closed loop system with a blindfolded sighted subject, suggest that this system can improve the lives of visually impaired people.

Chapter 1

Introduction

The World Health Organization has estimated that 285 million people around the world are visually impaired.
Of these, 39 million are blind and 246 million have low vision. Visually impaired individuals are required to do simple daily tasks in a manner different from their sighted peers. In a study done by Gold and Simson (2005), tasks such as maneuvering through obstacles and navigating to desired destinations are given as examples of this. Several studies have detailed modern, technology-based solutions to the problems of way-finding and navigation (Jihong and Xiaoye 2006; Pradeep, Medioni, and Weiland 2010; Bourbakis et al. 2008; Ladetto and Merminod 2002). However, there are fewer solutions for recognizing, finding and grasping objects of importance. For a visually impaired person, tasks such as grocery shopping, or finding items in their environment, often require aid from a sighted person or a deep memorization of their environment. This raises an important question: can an intelligent system be implemented to aid the visually impaired in finding items of interest? In addressing this question, the work presented in this dissertation explores the interaction between the human and the computer, as well as the development and integration of intelligent software modules.

The goal of this chapter is to review previous vision based systems and prior studies that examined the ability of the visually impaired to reach and grasp for items. In Section 1.1, vision aids from three different categories are discussed in detail. Crowd sourced based systems are discussed in Section 1.1.1, sensory substitution based systems in Section 1.1.2, and computer vision based systems in Section 1.1.3. This will give insight into the previous methodologies by respected people in the field. It will also serve to offer advantages and disadvantages of the different approaches a wearable system can follow. In Section 1.2, proprioception is covered. The purpose of discussing proprioception is to determine whether or not a visually impaired person can orient their body and arms in fine grained directions. In Section 1.2.1, the reaching and pointing ability of the blind in comparison to blindfolded sighted individuals is covered. A similar overview for reaching and orienting is covered in Section 1.2.2. Discussing the reaching and pointing/orienting ability is crucial to justify the design of the system proposed in this thesis. Lastly, the actual system is discussed briefly in Section 1.3.2.

1.1 Vision Aids for the Visually Impaired

Creating devices or systems to aid the visually impaired in finding and grasping objects remains a difficult problem. Being able to find an object requires an understanding of one's surroundings. In addition to understanding the environment, one must have an idea of their position relative to an object. This problem is referred to as object localization.

For a visually impaired person, the means to understand an environment are extremely limited due to their lack of sight. Therefore, an object localization system must be able to interpret the environment and also communicate that interpretation to the person. In regards to object localization systems, there are three main branches that current devices fall under: crowd sourced systems, sensory substitution systems, and computer vision based systems. Crowd sourced systems utilize the mobile application ecosystem and large groups of sighted people to assist with understanding the visual world. The second branch, sensory substitution, utilizes one or more alternative sensory modalities to act in place of a damaged sensory modality.
Visual impairment can occur as a result of disease or damage to the eye or to the neurons in the visual pathways. Although vision is impaired, other senses remain intact. Sensory substitution leverages this notion. The premise is to use a different sensory pathway to stimulate the brain area responsible for sight. Many of these systems utilize audition or touch as substitution methods. The last branch, computer vision, utilizes algorithms and image processing to interpret the environment. Computer vision is a subdivision of computer science that deals with processing and understanding both natural and synthetic images. The ultimate goal is to produce algorithms that understand images at a level near or above human understanding. In the sections below, each of the three types of systems is explained in more depth.

1.1.1 Crowd Sourced Based Systems

Crowd sourcing is a methodology that allows a large group of people to solve a singular task. When crowd sourcing, the individual needing assistance sends the query or task out via a middleman service, and volunteers take it upon themselves to solve the problem at hand. Be My Eyes (Wiberg 2015) is a mobile application that utilizes the principles of crowd sourcing. Specifically, a visually impaired person can open the mobile application and start a video query. The video query includes video of the scene or item that they need help interpreting. Once the visually impaired person has finished with the video, they can send it to Be My Eyes' servers. Any sighted volunteer can then examine the query/video and describe the scene in plain words to the visually impaired person. At the time of writing, there are approximately 380,000 sighted volunteers and 29,322 blind individuals on the service. Additionally, 162,000 individuals have been assisted at the time of writing.

This system offers a great service, and the possibilities for how it can be extended are positive. The data acquired from labeling scenes and video sequences could be priceless for algorithmic processing. In this context, computer vision algorithms could leverage the labeled images and scenes for training and testing. Essentially, the systems would improve as more and more queries are answered by sighted individuals. This is important because in computer vision, good data is a primary indicator of the performance of the processing algorithm. The topics of computer vision and data are discussed briefly in Section 1.1.3 and in more detail in Chapter 3. With any service such as this, spam becomes a possibility. It is likely that, as a service like this grows, malicious volunteers or bots will purposely give wrong information or answers. In this case, it may become absolutely necessary to detect spammers. This is one drawback of a human based system.

1.1.2 Sensory Substitution Based Systems

As discussed in Section 1.1, the main premise of sensory substitution is to use undamaged existing sensory pathways for interpreting information in place of damaged pathways. Ultimately, the goal is to allow a patient to perceive information they would otherwise be unable to interpret. Much research has gone into the notion of utilizing the tactile and/or auditory pathways for sensory substitution. In the context of this research, the damaged sensory modality is sight, thus it is worth discussing examples of sensory substitution systems. The vOICe and BrainPort are two prominent examples of sensory substitution based systems.
The vOICe converts video from a head mounted camera into a "soundscape", i.e. a complex sound that corresponds to the video. The user must interpret this soundscape in order to understand what is in the view of the camera (Meijer 2012).

1.1.2.1 vOICe sensory substitution system

The vOICe system takes a 176 by 64 pixel region in a greyscale image and converts it into a "soundscape". The soundscape corresponds to a series of complex tones and beeps that are played at a specific rate. The specific rate corresponds to the image snapshot speed. To elaborate, when processing video, vOICe takes image snapshots at one image per second. The system uses three rules to generate the aforementioned soundscape.

The first rule pertains to left-to-right scanning. Specifically, each tone/sound byte played can be thought of as a continuous tone that lasts for approximately one second. The earlier sound sequences of this tone correspond to patterns towards the left portion of the image. The later sound sequences correspond to the right portion of the image. Thus, left-right scanning is time based.

The second rule of vOICe is related to up and down scanning. Up and down scanning is pitch based. Pitch is the actual frequency of the sound; a high frequency sound will sound sharper, and a low frequency sound will have more bass. For example, a person with a deeper speaking voice speaks with a lower pitch. As it pertains to the vOICe system, a pixel near the top of an image leads to a high pitch tone. A pixel near the bottom of an image yields a low pitch sound.

The third rule ties the first two together. Both the time and the pitch of a tone have been mentioned in regards to left-right and up-down position, respectively. However, the crucial portion to mention is the volume or loudness of a tone. Specifically, brighter pixels or patterns yield louder tones, and darker ones correspond to silence. Thus, a bright white pixel near the top left of a black background will yield a loud, high pitch tone. Furthermore, that tone will be played at the beginning of the one second sound sequence. The end result is a system that plays tones varying in pitch and loudness depending upon the brightness and location of patterns in the image. To summarize, different image patterns will yield different sound codes. The act of recognizing the sound code now depends on the visually impaired subject. The question now becomes: can visually impaired people hear these sounds, and associate objects or their surroundings with these sounds? (A short code sketch illustrating this image-to-sound mapping is given at the end of Section 1.1.2.)

1.1.2.2 BrainPort Sensory Substitution Device

The BrainPort is another sensory substitution device used for orientation, mobility and object identification. It is a non-invasive system that consists of a camera, a tongue pad electrode array and a portable microcontroller. It translates video from a head mounted camera into electrical stimulation patterns on the surface of the tongue. The oral unit contains circuitry to convert the controller signals from the base unit into individualized zero to +18 Volt monophasic pulsed stimuli on the 10 by 10 electrode array (Danilov and Tyler 2005). Specifically, this system generates an "electrotactile" pattern on the tongue in real time based on what the camera sees. Essentially, the subject can feel different tactile patterns for different images. Thus, the system relies on the subject's ability to process the raw electrotactile patterns to determine the shape, size, location and motion of objects. The schematic diagram for the BrainPort can be seen in Figure 1.1 (Danilov and Tyler 2005).
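To make the three vOICe rules of Section 1.1.2.1 concrete, the sketch below converts a greyscale frame into a roughly one second soundscape: columns are played left to right in time, row position sets pitch, and pixel brightness sets loudness. The sample rate, frequency range, and function name are illustrative assumptions and are not taken from the actual vOICe implementation.

```cpp
// Sketch of a vOICe-style image-to-soundscape mapping (Section 1.1.2.1):
// time = column (left to right), pitch = row (top = high), volume = brightness.
// All numeric parameters here are assumptions chosen for illustration.
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<float> imageToSoundscape(const std::vector<std::uint8_t>& grey,
                                     int width, int height,
                                     int sampleRate = 16000,
                                     double fLow = 500.0, double fHigh = 5000.0) {
    const double twoPi = 2.0 * std::acos(-1.0);
    const int samplesPerColumn = sampleRate / width;       // ~1 s sweep across x
    std::vector<float> audio(samplesPerColumn * width, 0.0f);

    for (int x = 0; x < width; ++x) {                       // left-to-right scan
        for (int y = 0; y < height; ++y) {
            double loudness = grey[y * width + x] / 255.0;  // brightness -> volume
            if (loudness == 0.0) continue;                  // dark pixels are silent
            // Top rows map to high frequencies, bottom rows to low frequencies.
            double freq = fLow + (fHigh - fLow) * (1.0 - double(y) / (height - 1));
            for (int s = 0; s < samplesPerColumn; ++s) {
                int n = x * samplesPerColumn + s;
                double t = double(n) / sampleRate;
                audio[n] += float(loudness / height * std::sin(twoPi * freq * t));
            }
        }
    }
    return audio;   // mono samples in roughly [-1, 1] for any audio output sink
}
```

For the 176 by 64 pixel frames mentioned above, each snapshot becomes about one second of layered tones, which the listener must then learn to associate with objects and scenes.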
1.1.3 Computer Vision Based Systems

Computer vision systems leverage advanced algorithms to process images and understand them. Before discussing the aforementioned systems, it is worth briefly discussing the topic of computer vision and some of its applications. Human and animal visual systems are able to understand the rich three dimensional world in a highly efficient and accurate manner. Computer vision aims to approach, and possibly surpass, this level of understanding for the purpose of automating certain tasks. Two extremely important tasks in vision are recognition and motion analysis. Recognition deals with determining if specific objects are in an image or video sequence.

Figure 1.1: The schematic diagram for the BrainPort system (Danilov and Tyler 2005).

In the context of computer vision, some of the major tasks within recognition are object recognition, object detection and Optical Character Recognition (OCR):

• Object Recognition - denoting all occurrences of previously learned objects in an image
• Object Detection - finding occurrences of a specific item in an image
• Optical Character Recognition - determining what characters or words are in an image

Certain examples of these major tasks are being used in practice today (Szeliski 2010). Applications of computer vision include:

• Machine inspection - rapid parts inspection for quality assurance
• Automotive safety - detecting obstacles or pedestrians on the street
• Automatic number plate recognition
• Medical imaging - registering pre-operative and intra-operative imagery
• Handwritten digit recognition

The tasks listed above are well structured situations. For example, in the case of Automatic Number Plate Recognition (ANPR), license plates all share similar characteristics. Their dimensions and fonts are fairly consistent. ANPR falls under a category of computer vision called OCR. The ultimate goal of OCR is to recognize characters and words in images. Automotive safety is another application gaining traction in consumer products. Pedestrian and obstacle detection are two of the computer vision tasks currently used in advanced car systems.

Although all of these examples are different applications, these systems share a common trend. That trend is to reduce images into numerical models so that computers can use these models for future recognition. In a sense, computers must first learn condensed representations of these items (e.g., license plates, handwritten digits) and store them in their memory. This high level approach is extremely similar to the human visual system as well. We learn representations of objects and recall these representations when necessary. Thus, the premise of computer vision is twofold: (1) what is the best representation for items, and (2) what is the most efficient way to store and access these representations? Computer vision has taken many different approaches to this problem. There have been a handful of systems that focus on applying algorithms to assisting the visually impaired. GroZi, developed at the University of California San Diego, and OrCam are examples of such systems.

1.1.3.1 GroZi

GroZi, a handheld grocery shopping assistant for the visually impaired, offers object recognition, via computer vision algorithms, for finding items, and haptic feedback for grasping the items. It also utilizes scene text recognition for overhead aisle signs (Belongie 2012). The camera is located on the handheld device itself, however, requiring scanning, as if with a flashlight.
Overall, the GroZi system is very similar to the system proposed in this project. The system proposed in this dissertation employs a head mounted camera instead of a handheld camera. The handheld approach may be less intuitive than using head position to guide scanning.

1.1.3.2 OrCam

The OrCam offers object recognition and optical character recognition via a camera that attaches to eyeglasses. It gives voice feedback via a bone conduction earphone (Orcam 2016). However, its input modality is different; the OrCam recognizes or reads whatever the user points at with their hand. The input modality, while simple, requires the user to point at the desired item. Pointing at the desired item implies that the user must know where the object is to begin with. For people with little or no vision, knowing the location of the desired object remains a significant problem.

1.2 Proprioceptive Capabilities of the Visually Impaired

Proprioception is the sense of the relative position of neighboring parts of the body and of the strength of effort being employed in movement (Mosby 1994). The system proposed in this research relies on a visually impaired individual's reaching and grasping ability. Thus, it is worthwhile to examine whether or not visually impaired people have an adequate sense of their arms during these sorts of tasks. Several studies have examined this notion of proprioception as it pertains to the visually impaired. A few of these studies are highlighted below.

1.2.1 Reach and Point Tasks

Several studies have examined the proprioceptive capabilities of the congenitally blind as well as the late blind. One study in particular (Gosselin-Kessiby, Kalaska, and Messier 2009) conducted a series of reach-and-orient experiments to see how congenitally blind (n=5), late blind (n=7) and blindfolded-sighted (n=18) participants compared to each other in reaching tasks. Specifically, subjects were required to match the orientation of their right hand to that of their left hand by placing a rectangular peg into a vertically mounted board. The board contained two slots: one slot for a rectangular peg in the left hand, and another slot in which to place another rectangular peg with their right hand. Ultimately, the researchers found that there were essentially no differences between congenitally blind and postnatal blind subjects in the reach-orient tasks. Additionally, they found that blind subjects and blindfolded sighted subjects performed extremely similarly (Gosselin-Kessiby, Kalaska, and Messier 2009).

1.2.2 Reach and Orient Tasks

Another study (Rossetti, Gaunet, and Thinus-Blanc 1996) compared blind and blindfolded sighted subjects in arm movement tasks. These tasks, however, were different from the aforementioned proprioceptive study. Instead of reaching and orienting, subjects were instructed to reach and point at a location in space. Specifically, both visually impaired and blindfolded sighted subjects were instructed to point at target locations in front of them. The subjects' left hands were pointed at a target location for 300 ms, then immediately dropped and returned to the table; they were then instructed to point to the target locations with their right hand. These locations rested on a transparent vertical plane; the locations were on an arc 30 cm away from their right hand. Lastly, the targets ranged in position from 17° left of the vertical midline to 43° right of the vertical midline.
The results of this study actually showed that the blind subjects outperformed normally sighted subjects under some conditions (Rossetti, Gaunet, and Thinus-Blanc 1996).

1.2.3 Viability of a Head Mounted System

The work discussed in the previous sections (1.2.1 and 1.2.2) shows that the blind have the ability to perform proprioceptive tasks accurately, and perform just as well as their sighted counterparts. There are several other studies that support the notion that the blind can perform equally well in such tasks (Heller 1989; Heller and Kennedy 1990; Castiello, Bennett, and Stelmach 1993; Vanlierde and Wanet-Defalque 2004). Additionally, another group of papers shows that the blind even outperform their sighted counterparts (Gaunet and Rossetti 2006; Alary et al. 2008). This suggests that a system relying on visually impaired people reaching and grasping can be viable. To defend that claim, this research project will explore whether a Head Mounted Camera (HMC) system is a viable way to orient a blind user towards a desired object, and thereby enable a reaching and grasping task.

1.3 Dissertation Organization

1.3.1 Specific Aims

This research proposes a wearable camera system that finds grocery items and generates simplistic auditory or vibrotactile feedback to visually impaired subjects based on item location. The subject group consists of congenitally or late blind individuals suffering from either low vision or complete blindness. Because the project consists of work pertaining to Human Computer Interaction, Computer Vision, and Application Development, the work done has been split into three specific aims.

The first specific aim is to implement and test an object tracking and feedback module with visually impaired subjects. This aim acts as the interface between the computer and the subject. The module to be designed must track an item within the camera's field of view (FOV), and simplistically communicate object location to the subject in real time. Ultimately, the module should guide subjects towards reaching and grasping for items. In this aim, the researcher provides an initial image region/object for the computer program to track.

The second specific aim is to implement and test a real time recognition module to automatically detect objects and their locations within images. This aim removes the researcher from the loop. The module relies on computer vision techniques such as scene text and/or object recognition for identifying items. The module designed must be lightweight, reasonably accurate and run quickly.

The third and final specific aim is to integrate the recognition, tracking and feedback modules into a completely closed loop computer program for reaching and grasping tasks.

1.3.2 System Overview

A block diagram showing the flow and modules of the system is shown in Figure 1.2. The final system is a computer vision based program for visually impaired people. It consists of a head mounted camera for seeing the visual world, a personal computer for the processing algorithms, and a physical feedback module.

Figure 1.2: Full system diagram. (1) Subject chooses item via "Talk-Back" accessibility app. Different color rows correspond to items (e.g., Pasta, Cereal, etc.). (2) The camera sends visual input to a computer. (3) The recognition module finds the item, or any item from the list within the FOV, and outputs a bounding box. (4) The tracking module takes the initial bounding box, and follows the object continuously.
(5) The feedback generates speech or vibration that correlates to the position of the item.

The physical feedback module contains bone conduction headphones or vibrotactile motors for guided feedback. The system's objective is to find objects in the camera's FOV using the intelligent processing algorithms. Finally, it relays simplistic information to the visually impaired subject. Figure 1.3 shows the camera setup as well as the bone conduction headphones. In order to accomplish this task, object recognition and OCR based algorithms were utilized and implemented. Additionally, a custom software and hardware based feedback module was developed for interfacing with the visually impaired subject.

Figure 1.3: System setup. The system is comprised of a head mounted camera, bone conduction headphones, head mounted vibrotactile motors, and a personal computer. The computer contains the processing algorithms.

1.3.2.1 Object Tracking and Physical Feedback Analysis

A crucial layer in any computer program of this nature is the feedback modality. The information provided via feedback can be raw information, as in the case of the BrainPort or vOICe from Sections 1.1.2.2 and 1.1.2.1, respectively. Providing raw information to the visually impaired person relies on them learning how to decode it. Another approach is to provide information in the form of simplistic commands. This project examines the notion of simplistic commands; to thoroughly test this premise, a software and hardware based real time feedback module was developed. In Chapter 2, the design, implementation and testing of this module, the Object Localization and Tracking System (OLTS), is discussed. Prior to designing the system, human subjects questioning was done to help inform the design decisions for building the prototype. The final module consisted of an object tracking algorithm, a spatially differentiated feedback algorithm, and bone conduction headphones/vibrotactile motors. The object tracking algorithm was included to make the system semi-autonomous. To elaborate, the final system aims to recognize/detect items and offer real time feedback. Recognition of items is a complex task (discussed in Aim 2, Chapter 3). However, tracking a provided region within an image is a partially solved task in computer vision. Thus, to expedite the process of testing the feedback module, the researcher acted as an "ideal" recognition system by initially providing the locations of objects to the tracking algorithm for live video tracking. This enabled validity testing of the feedback module via human subjects experiments with visually impaired people.

1.3.2.2 Real Time Recognition Module Design

As alluded to in previous sections, in order to utilize simplistic feedback, a system must first be able to do advanced processing to interpret the raw information provided. In this case, the raw information is an image or live video. This work included developing a recognition module to do the advanced processing. The recognition module consists of a scene text recognition module and an object recognition module. Scene text recognition is a subdivision of OCR that deals with detecting and recognizing text in natural images. The difficulty in scene text recognition arises from lighting conditions, non-standard fonts and partial occlusion. The work included choosing an appropriate text recognition algorithm, and also implementing it. Additionally, work was done to develop a novel object recognition module.
The object recognition module is a neural network based module that attempts to localize items within the image. In Chapter 3, the scene text and object recognition work are discussed in more detail.

1.3.2.3 System Integration

In Chapter 4, the design methodology used to integrate all of the software and hardware modules is discussed. A prototype of the fully closed loop system is tested by a blindfolded sighted individual to give initial insights on the validity of the system. The validity of the system, as well as possible drawbacks, are highlighted. Additionally, proposed experiments and evaluation methods are discussed in detail.

Chapter 2

Auditory and Vibrotactile Feedback Analysis

2.1 Introduction

2.1.1 Problem

One of the most crucial parts of this project was testing the link between the system itself and the human. The goal of this specific aim was to determine an efficient feedback methodology for communicating with the subject. In order to efficiently test the feedback mechanisms, an experimental system called the OLTS was developed. The OLTS is divided into two main modules: (1) a tracking algorithm called the Context Tracker (Section 2.2.2), and (2) a custom-built feedback module called the Sensory Map (Section 2.2.3).

2.2 System Overview

The OLTS sees the outside world by using a head-mounted wide-field-of-view camera, tracks an object's position by means of computer vision algorithms, and provides feedback on how to reach and grasp the object to the visually impaired user. The system generates auditory feedback via bone conduction headphones, or vibrotactile feedback via pancake motors positioned on a glasses frame. The bone conduction headphones used in these experiments were Gamechanger LLC Audiobone 1.0 headphones. The pancake motors used were Beam Bristlebot Robot 14 mm x 3.6 mm pancake micro motors (applied voltage 1.3 V - 3.0 V). Tracking and feedback are performed in real time. The system flow chart is shown in Figure 2.1.

Figure 2.1: Object Localization and Tracking System. The user wears the head-mounted wide-field-of-view camera. (1) The camera sends visual input to a computer. (2) The computer determines the position of a chosen object. (3) The bone conduction headphones/motors play sounds/vibrate, and the subject incrementally turns their head. (4) If the object is centered in the camera's field-of-view (FOV), then the test subject reaches and grasps for the correct object. (5) Otherwise, the program repeats its loop.

2.2.1 Wide-Field-of-View Camera

The first stage utilizes a head-mounted wide-field-of-view camera for visual input. The camera employed in these experiments was a KT&C KPC-E23NUB wide-dynamic-range NTSC camera that captures a video stream at 30 frames per second. The camera has a resolution of 640 x 480 pixels, and is mated to a wide-angle lens that provides a field-of-view of 92°. The output of the camera is fed to a Sensoray s2255 NTSC-to-USB converter box, and then interfaced to a computer via a USB interface for implementation of the computer vision and auditory feedback algorithms described below.

2.2.2 Context Tracker - Computer Vision Algorithms

The next stage involves the use of a computer vision algorithm for tracking objects. This stage allows the system to track an item, given only a bounding box of an image patch, in real time. The algorithms were run on an Apple Inc. 2010 MacBook Pro (4 GB RAM, 2.4 GHz i5 processor). The computer vision algorithm was developed by Dinh and Medioni (Dinh and Vo 2011).
This algorithm, the Context Tracker, utilizes contextual information as a fundamental guiding principle that enhances object tracking, and is built upon the principles of the Tracking-Learning-Detection (TLD) tracker (Kalal, Mikolajczyk, and Matas 2011). When provided with a bounding box, the Context Tracker first learns features about the object within the box. This feature description is then stored and used for detection in the next frame. The algorithm's learning process runs continually while the object is being tracked.
In addition to learning while tracking and detecting, the Context Tracker uses important features surrounding the object to allow for robust tracking. These important features are referred to as supporters and distractors. Supporters are key-points within the frame that share a similar motion correlation with the bounding box/object. Distractors are objects that have a similar feature description to the object of interest (Dinh and Vo 2011). The algorithm remembers distractors so that they may be continually ignored in the upcoming frames of the camera input, thus improving the tracking performance.
Figure 2.2: (Right) Diagram shows the location of the pancake motors. Two motors, one just below the temporal lobe and the other behind the ear, are placed on both sides of the head. Thus, a total of four vibration motors were used to convey haptic feedback to the subject. (Left) Diagram shows the bone conduction headphones used for auditory feedback.
2.2.3 Sensory Map - Auditory and Vibrotactile Feedback Algorithms
The final stage of the OLTS is the feedback algorithm. The algorithm uses the object position from the computer vision algorithm to give auditory or vibrotactile feedback to the visually impaired subject. The auditory feedback is played via bone conduction headphones, and the vibrotactile feedback is generated via pancake motors. Bone conduction headphones have the benefit of not obstructing the ear canal, since visually impaired individuals use their sense of hearing for navigation and other tasks. The speakers are positioned just anterior to the ear. This allows the sound to be conducted through the temporal bone. The acquisition of the object's position and the feedback generation are done in real time.
Once the feedback algorithm has the object position, as provided by the vision algorithms, the position is passed to our Sensory Map (Figure 2.3). The Sensory Map translates the 2D position of the object to a discrete "code." There are nine discrete codes, and each corresponds to a region within the visual field of the camera. The values of the code are as follows: UL, UR, U, L, C, R, DL, D, DR. The letters U, D, L, R and C correspond to "Up", "Down", "Left", "Right" and "Center", respectively. For example, the two letter code UL means "Up and Left". Depending upon the code provided, a spoken word or phrase is played to the subject, or one or two motors are activated.
Figure 2.3: Sensory Map for the auditory and vibrotactile feedback mechanisms. The grid represents the camera's field-of-view (FOV). An object's position in 3D space can be mapped to the 2D FOV above. Once the object is mapped to this FOV, it will fall into one of the 9 grid locations. Depending on the location of the object within the grid, the computer will generate the corresponding word/vibration for the subject. Auditory feedback is comprised of computer-generated speech, and vibrotactile feedback is achieved by triggering 4 vibration motors in different combinations.
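The mapping of Figure 2.3 can be summarized in a short sketch. The function below is purely illustrative: the function name, the 640 x 480 frame size from Section 2.2.1, and the pixel width of the central region are stated assumptions, not the OLTS source code.

```python
# Hypothetical sketch of the Sensory Map: maps an object's centroid in a
# 640x480 frame to one of the nine feedback codes described above.
def sensory_map_code(cx, cy, frame_w=640, frame_h=480, center_px=256):
    """Return one of UL, U, UR, L, C, R, DL, D, DR for a centroid (cx, cy).

    center_px is the width/height of the central region in pixels; varying
    it corresponds to varying the central visual angle discussed below.
    """
    half = center_px / 2.0
    left, right = frame_w / 2.0 - half, frame_w / 2.0 + half
    top, bottom = frame_h / 2.0 - half, frame_h / 2.0 + half

    horiz = "L" if cx < left else "R" if cx > right else ""
    vert = "U" if cy < top else "D" if cy > bottom else ""
    return (vert + horiz) or "C"

# Example: an object in the lower-right part of the FOV yields "DR",
# which would trigger the spoken phrase "Down and Right" (or motor M4).
print(sensory_map_code(580, 420))  # -> "DR"
```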
For example, if the object resides within the lower right portion of the camera's FOV, the DR code is passed to the speech synthesizer and the words "Down and Right" are spoken to the subject. Similarly, the fourth motor, M4, would vibrate when given the DR code. The subjects are instructed on the appropriate response for each cue. In regards to the center region, once the subject hears/feels the center command, they are instructed to reach and grasp for the object.
2.2.3.1 Central Visual Angle
As mentioned before, the sensory map contains a central region. Once a visually impaired subject has centralized the object within the camera's FOV, the sensory map will generate a center command for the headphones/vibration motors. Upon hearing/feeling this center command, the subject reaches and grasps for the object.
Figure 2.4: This figure shows the relation between the initial image and the sensory map. The central visual angle (CVA) is the size of the central region (in degrees). The calculation of the central visual angle is based on the distance of items from the camera and the size of the camera's field of view (FOV).
The center region size was varied to determine if an optimal central region existed for the sensory map. Specifically, the central region size was represented as the Central Visual Angle (CVA) (Figure 2.4). The central visual angle, \theta_{cva}, can be calculated based upon the distance of items from the camera and the camera's FOV. The equations are described below:

\theta_{cva} = \tan^{-1}(w / d)   (2.1)
w = F p_r / P_{total}   (2.2)
0.15\ \text{m} < d < 0.66\ \text{m}   (2.3)

where d represents the straight line distance of an item from the camera, w represents the width of the central region (in feet), F = 3 represents the width of the camera's field of view (in feet), p_r represents the width of the central region (in pixels) and P_{total} = 640 represents the width of the camera's field of view (in pixels). Given these constant values, 5 different central visual angles were used for experimentation in the localization experiments (7.8°, 15.6°, 23.4°, 31.2°, 39°).
2.3 Experiments
2.3.1 Localization, Reach and Grasp Experiments
Visually impaired subjects were seated at a table with 3 - 5 objects placed in front of them. While seated, the subjects wore a set of glasses with a small camera attached to the glasses and headphones (as shown, for example, in Figure 2.5). The objects' straight-line distance (SLD) from the subject ranged from 0.15 m to 0.66 m, based on the subject's reach. The subject's goal was to centralize the correct object using auditory/vibrotactile feedback from the OLTS, and then reach and grasp for the object. The first step of the experiments was a training phase. In this phase, the computer-given feedback was supplemented by the researcher giving verbal instructions and manually guiding the subject, helping to move their head and hands to complete an object localization task. When subjects became familiar with the system, the testing phase began. The length of the training stage varied between subjects; however, training normally lasted for 30 minutes. The testing phase required the subject to localize and grasp the correct food item out of a set of three to five similar items with the help of the system.
Figure 2.5: Experimental Setup. (Left) The subject is seated, and the camera is mounted on glasses worn by the subject. (Right) The vision algorithm detects the object of interest (object highlighted by green bounding box).
The user is instructed to grasp the object once they receive the center command from the computer. In the case of auditory feedback, a "center" command is a speech-synthesized voice saying "Center." The "center" command for vibrotactile feedback is no vibration. In this case, the center region was 31.2°.
Similar objects were used since some of the subjects had a small amount of residual vision. Objects were placed at different positions along the table (see Figure 2.5) in random order in front of the test subject. The experimenter started the OLTS software and chose an item in the visual field by drawing a bounding box around the desired image patch (see Figure 2.5, right) on the computer. Once the box is drawn, the system runs autonomously. If the subject selects the wrong object, they are instructed by the operator to keep searching. Ten trials were done for each of the five central visual angles (Table 2.1), for a total of 50 trials per subject for each feedback modality. Thirteen subjects (Table 2.2) were tested using both the vibrotactile (n = 12) and auditory (n = 13) feedback modalities. Subject EB was not available for testing vibrotactile feedback. The testing phase normally lasted for 1.5 hours. To avoid biases related to the order in which the visual angles were tested, the subjects were split into 3 groups, each with a different order of visual angles (Table 2.1). Several types of data were recorded, including the time to completion, number of reaches per task, object tracking path, the video stream during each task, and the commands provided to the subject.
Group 1   Group 2   Group 3
7.8°      7.8°      39°
15.6°     31.2°     31.2°
23.4°     23.4°     23.4°
31.2°     15.6°     15.6°
39°       39°       7.8°
Table 2.1: The 3 experimental groups for the feedback experiments. Each subject was assigned to one group. Experiments were done in order from top to bottom for the angles listed.
In a second round of experiments, we extended the set of visual angles to include 3.9° for vibrotactile feedback, since we did not see diminished performance at 7.8° (see Results).
All testing was approved by the University of Southern California Institutional Review Board, and performed at the Braille Institute in Los Angeles, California. Subjects were read the informed consent form, and then signed the form to enroll in the study. Background medical information was obtained on their eye condition both from their ophthalmologist and from a questionnaire, under HIPAA regulations.
Patient ID   Cause of Visual Impairment
RA           Advanced Glaucoma
EV           Optic Nerve Dysplasia
RT           Optic Nerve Dysplasia
ON           Retinitis Pigmentosa
JV           Cataracts
HF           Retinopathy of Prematurity
RP           Cytomegalovirus Retinitis
TT           Retinitis Pigmentosa
AD           Glaucoma & Retinitis Pigmentosa
CB           Over-oxygenation at Birth
GD           Physical Injury
JVA          Not Disclosed
WB           Glaucoma
Table 2.2: The subjects that participated in this study, and the causes of their visual impairment.
2.3.2 Tracker Evaluation Methods
Further analysis was done in order to determine the performance of our OLTS program. The main metric of interest was the effective average frames per second (FPS) of the program. This analysis was inspired by two different studies of tracking algorithms (Wu, Lim, and Yang 2013; Dutilh et al. 2011).
2.4 Results
2.4.1 Localization, Reach and Grasp Results
The results shown in Figure 2.6 depict the four measurements vs. the central visual angle. The four measurements were time to first grasp, time to completion, number of attempts, and first grasp success rate. The time to first grasp indicates the amount of time it took the subject to grasp any object.
Time to completion indicates the amount of time it took for the subject to successfully grasp the correct object. The number of attempts indicates how many "reaches" it took for the subject to grasp the correct object. Lastly, the first grasp success rate indicates how often the subject grabbed the correct object on the first attempt. All four graphs show the average for each measurement across all 13 subjects for the five "center" visual angles.
For auditory feedback, the results showed that there was a significant difference in the time to completion and the time to first grasp between the five central visual angles (p < 0.05). Post-hoc tests showed that the center angle of 7.8° (the smallest angle) was the reason for the statistical difference. A center angle of 7.8° yielded very slow completion times and first grasp times in comparison to the other angles. There were no statistically significant differences amongst the remaining 4 visual angles (15.6°, 23.4°, 31.2°, 39°) for any of the other measures.
Figure 2.6: (Top Left) Time to first grasp vs. Central Visual Angle. The first grasp occurs when the subject grasps any object. (Top Right) Time to Completion vs. Central Visual Angle. Completion occurs when the subject grasps the correct object. (Bottom Left) Number of attempts vs. Central Visual Angle. (Bottom Right) First grasp rate vs. Central Visual Angle. First grasp success rate is a percentage; a success occurs when the subject grasps the item on the first attempt. All times are in seconds. Fast times, low numbers of reaches and high success rates are optimal. Each graph plots the average time/attempts for all subjects. Additionally, each graph plots the results from subjects using auditory and vibrotactile feedback.
Our results showed no statistically significant difference between the five visual angles for any of the measures when using vibrotactile feedback. To determine if a lower performance threshold exists for vibrotactile feedback, additional object localization experiments were done in 5 of the original 12 subjects. Three angles (3.9°, 15.6°, 39°) were tested. Two of the angles (15.6°, 39°) were retested along with the new angle (3.9°). Including two angles from the original testing was done to eliminate the possibility that performance could vary from day to day. These additional results (Figure 2.7) showed that the difference in time to 1st grasp and completion time between the new smallest visual angle, 3.9°, and the remaining two angles was statistically significant (p < 0.05). There was no statistically significant difference in the number of reaches or the 1st grasp success rate between the different visual angles (p < 0.05). Additionally, the old and new data from these 5 subjects for angles 15.6° and 39° were compared. Analysis showed that there was no statistically significant difference in performance between these different days (p < 0.05).
Figure 2.7: These plots show the time to 1st grasp, time to completion, number of reaches and 1st grasp success rate for the second round of vibrotactile experiments. Each graph plots the average time/attempts for 5 out of the 12 subjects. Additionally, each graph plots the results from subjects using auditory and vibrotactile feedback.
Figure 2.8: Each graph shows the tracking path of an object during a single localization task. Each row corresponds to two experiments from one subject. Specifically, one plot represents how the object moves within the camera's field of view (FOV). The x and y axes for the graphs are in pixels. Each 100 pixels correspond to a horizontal distance of approximately 15 cm.
Each point represents a snapshot of the position of the object's centroid at a certain video frame/time. The green dot shows the position of the object at the start of the experiment. The red dot shows the final position, where the subject grasped the object. Thus, the red dot's position should ideally be close to the center of the plot (320, 240). Rows 1, 2 and 3 correspond to localization experiments with visual angles of 7.8°, 15.6° and 39°, respectively. The left column shows experiments in which the subject struggled to find the object. The right column shows experiments that went smoothly for the subject.
As the subject orients their head to centralize the object within the camera's FOV, the object moves along a trajectory within the FOV. The graphs in Figure 2.8 show this object trajectory for six different experiments. The optimal visual angle (15.6°), the largest visual angle (39°) and the smallest visual angle (7.8°) were examined. The tracking paths for two experiments are plotted for each visual angle: one set of plots shows an ideal tracking path/head movement, and the other set of plots shows an experiment where the subject struggled to find the object. The subjects were trained prior to testing, so there was not a consistent trend showing that subjects performed differently at the beginning of testing vs. at the end of testing.
2.4.2 Subject Comments
In order to gain insight on the system design, subjects were questioned after using both feedback mechanisms. These questions pertained to improvements that could be made and which feedback mechanism was preferred. Based on the subjects' feedback, there was no definitive trend suggesting that more people would prefer sound feedback versus vibrotactile feedback. However, most subjects stated that vibration is better suited to louder environments, since the auditory feedback would be difficult to hear. Subjects also mentioned that they liked the direct and specific nature of the auditory feedback. Their impression was that the auditory feedback directly "tells you what to do", whereas with vibrotactile feedback "you have to feel it out." One common statement among patients was the possibility of switching between feedback modalities for different situations via a physical switch.
2.4.3 Tracker Results
The average FPS for the OLTS program was calculated and reported. Additionally, the FPS for the standalone Context Tracker program was obtained from the literature as a baseline (Wu, Lim, and Yang 2013). The FPS for our OLTS program (5.73 fps) was lower than the FPS for the standalone Context Tracker (15.3 fps). This is due to the OLTS running multiple algorithms on multiple threads; thread-1 ran the Context Tracker algorithm and thread-2 generated speech synthesis/vibrotactile feedback. Thus, CPU processing time was divided between two algorithms, and as a result the overall FPS dropped. This rate is probably acceptable for this application, particularly if auditory feedback is used, since auditory commands are generated at approximately two voice commands per second. Vibrotactile feedback generated a continuous stream of vibrations, so increased FPS may provide better performance.
2.5 Summary
The results demonstrate that blind individuals can use their sense of proprioception, that is, their sense of body position in space, to guide, reach and grasp for an object. By directing the subjects to point their head at an object, the system enables them to grasp it, even though they cannot see the object.
While most of the subjects had vision into adulthood before becoming blind, 5 were blind from birth, so they had never used vision to "look at" an object. The fact that they too could learn to use this system is significant. We believe that an effective localization system should leverage natural movements and simplistic feedback. Natural movements correspond to using head orientation for locating and pointing towards objects, as well as a reach/grasp for picking up the objects. Simplistic feedback corresponds to using a minimal set of codes for auditory/vibrotactile feedback to localize and grasp an object. The GroZi system has simplistic feedback and natural movement for grasping, but does not utilize natural head movements due to the hand held camera. The OrCam has simplistic feedback and a head mounted camera for natural head movement, but does not locate items; it instead informs the visually impaired subject what they are pointing at. The vOICe system uses natural head movement via a head mounted camera, but the user must interpret the information. Our approach addresses the shortcomings of the other systems.
For auditory feedback, the subjects' performance significantly worsened when using an angle of 7.8°. This same decrease in performance did not arise at an angle of 7.8° for vibrotactile feedback. Thus, we felt the need to determine if a performance threshold existed for vibrotactile feedback. A second round of vibrotactile experiments with a smaller angle (3.9°), as well as two original angles for comparison (15.6° and 39°), showed a significant worsening in completion and first grasp time for an angle of 3.9°. It is not immediately clear why vibrotactile feedback yields this lower threshold; however, it is worth noting that auditory and vibrotactile feedback are delivered at different rates. Specifically, auditory feedback is delivered to the visually impaired subject at roughly two words/commands per second, whereas vibrotactile feedback is delivered continuously. This is simply due to the fact that a single utterance such as "left" or "right" takes time for the computer to say. Another difference to note is that for vibrotactile feedback, vibration is turned off while the object is centralized. These two differences, varied feedback rates and different "center" commands, may explain the difference in performance thresholds.
The results reveal areas requiring improvement. As can be observed in Figure 2.5, two closely spaced objects may both fall in the Center region. Thus, the user may perform the task correctly, yet still grasp the wrong object. This situation of correct centralization with incorrect reaching is related to the proprioceptive abilities of the subjects. Because a head mounted system was used, the visually impaired subjects' perceived location of their arm was crucial to the accuracy of their reaching movements. Thus, we looked to the literature to see how effective the blind are with reaching tasks, as well as how they compare to sighted subjects. Subjects commented that a final confirmation step that positively identifies the object would make them more likely to use the system. Indeed, regardless of system settings, subjects were successful in grasping the specified object about 50% of the time, and this is under ideal conditions for object recognition in that the operator selected the object. Present day object recognition engines still have single object localization errors around 25% and object detection precision at around 43% (Russakovsky et al. 2015).
By limiting the objects under consideration, the success rate can be improved, but even under ideal conditions it is clear that positive identification is needed if users are to trust the system to identify objects. Possible solutions for this are scene text recognition or bar code reading (Bissacco et al. 2013; Jaderberg et al. 2016a). In the case where no text or bar code is available, object recognition can be added through optimal viewing of the object. These approaches will work better if the user is holding the object in front of the camera, thereby clearly isolating the image area to be targeted for analysis. It remains to be determined if users would tolerate initial false positive identifications and multiple attempts.
We have presented a wearable, experimental system which allows the visually impaired to reach and grasp for objects. The system is built upon computer vision tracking techniques in conjunction with custom feedback algorithms and modalities. We have presented quantitative measurements of object localization tasks. Our analysis shows that auditory feedback led to worse performance (slower time to completion) at a central visual angle of 7.8°. Additionally, vibrotactile performance worsened at an angle of 3.9°. The results suggest that an additional step, using computer vision techniques to positively identify the object, is needed.
Chapter 3
Real Time Scene Text Recognition and Object Recognition
3.1 Introduction
3.1.1 Problem
Recognition is a task within computer vision that deals with processing, identifying and labeling instances of items or objects within images. Within recognition, object recognition, object detection and optical character recognition are major tasks. In the context of this research, a recognition system is crucial to making the visually impaired person as autonomous as possible. The work from Chapter 2 includes a real time feedback system for guiding a visually impaired person towards an item. Specifically, the computer tracks the "item" in real time, and generates simplistic feedback commands 1-2 times per second. These feedback commands eventually guide the visually impaired person so that they can reach and grasp for that item. For each of these localization experiments, the "item" is chosen by a sighted researcher prior to the start of real time tracking. The item is chosen by manually drawing a bounding box around the item on the computer screen. This sighted-human-based process for selecting an item is essentially an "ideal" object recognition algorithm. The important question to answer now is: can a recognition module be developed to remove the sighted human from the loop?
Figure 3.1: Two examples of grocery items within the same class, "sugar".
To answer the aforementioned question, the problem must first be formalized. For the purposes of this research, the type of items that this system aims to recognize are grocery store items. Items such as these can vary in color and size. There also is not an obvious hierarchical structure to items within the same class. In regards to grocery items, a "class" would be an identification such as water, juice, sugar or rice. For example, a box of Splenda and a box of Kroger cane sugar are both types of sweetener, but they share no real similar characteristics (Figure 3.1). In contrast, entries in a class such as "faces" all have similar structure; they all have noses, eyes, mouths, etc.
Grocery store items, on the other hand, are only similar in the sense that the majority of them contain text labels or brands in highly visible areas. However, this similarity does provide a means to classify objects within this class.
The fact that grocery store items contain unique branding and visible text is important to this problem. It suggests that an algorithm for reading text in an image could be very useful for eventually finding items. In Section 1.1.3, a recognition paradigm called Optical Character Recognition (OCR) was briefly mentioned. OCR is the act of detecting and recognizing text in images. It is a method that is normally applied to scanned documents and to images with non-occluded, well structured computerized text. For this research, the images contain text at different viewing angles, with imperfect lighting and non standard fonts. An even more suitable class of algorithms exists in Scene Text Recognition (STR). Scene text recognition is a field within OCR that deals with finding and reading text in natural images. Natural images are pictures with imperfect resolution, varied lighting and random viewing angles. An algorithm with human level accuracy does not fully exist yet. There are, however, algorithms that can detect and recognize text with reasonable accuracy. A list of prominent algorithms is discussed in Section 3.1.3. The specific algorithm used in this module is also discussed in that section. As mentioned before, scene text recognition algorithms are not 100% accurate at the moment. Using another computer vision methodology in conjunction with the text recognition algorithm may be worthwhile. An object recognition algorithm trained on grocery store item images may be able to bias or boost the accuracy of the text recognition. In Section 3.1.4, different types of features and methodologies for object recognition are discussed. Additionally, the rationale for choosing a recognition algorithm is covered.
3.1.2 Neural Networks
Within object recognition, an interesting task is identifying the types of objects in an image. There are yearly contests within computer vision that allow researchers to apply their algorithms to common datasets. These datasets contain large amounts of images with multiple types of objects in them. The most famous contest (or challenge), the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), gives researchers a set of 1.2 million training images. Within the images there can be any number of items from 200 categories. The categories represent items such as apples, antelopes, bicycles and more. Each image in the provided training set is annotated with the positions and labels of the items in the image. The purpose of this is to allow researchers to build programs and algorithms that learn from the annotated datasets. After an allotted amount of time, the ILSVRC committee sends out an unlabeled image dataset with items randomly located in each of the images. The eventual goal is to determine the best performing algorithm. The most common previous approaches have been to use "hand crafted features". Hand crafted features are features that have been manually developed and refined by computer vision scientists and engineers. To put it bluntly, they are human developed. There are many different hand crafted features that can be employed in a recognition framework.
Popular hand crafted features such as the Scale Invariant Feature Transform (SIFT) (Section 3.1.4.1) or Speeded Up Robust Features (SURF) (Section 3.1.4.2) have been used in many different algorithms.
Each year in the ILSVRC challenge, two important metrics are used to judge the algorithms submitted by thousands of researchers. They are the Top-1 and Top-5 error rates.
Figure 3.2: Example images from the ImageNet challenge. Images can contain one or more items from the categories provided in the challenge. This particular dataset contains 1000 categories (i.e. leopard, dalmatian, grape, etc.) and 1.2 million images. Directly below each image is the correct category/label. Below the correct category are 5 guesses as well as a horizontal bar depicting the confidence score of the algorithm.
For each image, the algorithm should generate N guesses (e.g., apple, bicycle). Each guess should have a confidence score associated with it. If the N guesses do not contain the true label (i.e. the actual item in the image), then this counts as an incorrect classification (Figure 3.2). This process is then repeated for every image in the testing dataset. After all images are processed, the error rate is calculated (Equation 3.1). Lower error rates indicate better performance.

\text{Top-}N\ \text{error} = \frac{T_i}{T}   (3.1)

where T_i is the total number of incorrectly labeled images and T is the total number of images in the test set. In 2012, the best performing hand crafted algorithm yielded Top-1 and Top-5 error rates of 45.7% and 25.7%, respectively, on the ILSVRC dataset. However, the hand crafted algorithm was not the best overall performing system that year. The Hinton group from the University of Toronto built a deep learning neural network for object recognition tasks (Krizhevsky, Sutskever, and Hinton 2012). This network drastically outperformed all other algorithmic approaches with Top-1 and Top-5 error rates of 37.5% and 17%, respectively. The percentage decrease in error rate was extremely significant, and it demonstrated for the first time that neural networks are extremely viable for object recognition tasks. In fact, there are products currently utilizing neural networks for other machine learning tasks because of this realization:
• Google Speech Recognition (Deng, Hinton, and Kingsbury 2013)
• Apple's Siri (Levy 2016)
• Google Photos
Now that the strengths of neural networks have been discussed, it is worthwhile to briefly touch upon their underlying principles.
3.1.2.1 Perceptron
A thorough discussion of neural networks requires an introduction to their individual building block, the Perceptron. The perceptron is a binary classifier. It maps a set of input values x to a 0 or 1 output value (Equation 3.2). Its goal is to create a linear decision boundary between a set of training examples (Figure 3.3).

f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases}   (3.2)

w represents a set of weights of length n. These weights are real valued numbers; they represent the parameters of the network. In Equation 3.2, these weights essentially modulate the importance of each input value x. x is a set of input values to be fed to the perceptron. The perceptron is directly inspired by the neuron. Thus, input values can be thought of as outputs from previous neurons, and weights can be thought of as the strength of the connection between an incoming axon and a synapse (Figure 3.4). In regards to image processing, each input, x_i, could be the intensity value of an image pixel (range from 0 - 255).
The output values y, 0 or 1, would act as two different classes (e.g., 0 = apple and 1 = banana). The goal of using the perceptron would be to learn the weights necessary to output 0 if an image contains an apple. Conversely, the system should output 1 if an image contains a banana. This example is extremely trivial; however, it illustrates the overall premise of a perceptron and, eventually, of a neural network: given two output classes, learn the weights necessary to predict future test inputs.
Figure 3.3: The apple and banana represent two classes within a set of three training examples. The training examples are delineated by a decision boundary (blue line). Two values, size and color, are used as an example of input features. The goal of the perceptron is to find this blue linear boundary. Ultimately, for future predictions of test data, the perceptron will classify test data by seeing which side of the decision boundary it rests on.
Figure 3.4: The Perceptron. Each input value, x_i, is modulated by a weight value, w_i. The sum of these products (i.e. the dot product between w and x) is then thresholded via f(x) (Equation 3.2). The final output y is a value of 0 or 1.
Learning the weights of the perceptron is an iterative method consisting of three steps. Given a set of labeled examples (e.g., image-1 = apple, image-5 = banana), the following three steps are run to learn the weights:
1. Initialize all weights w to small random values or 0.
2. Given a set D of m labeled examples, for each example j, run the inputs of that example through the perceptron (essentially, pass each image through the perceptron).
(a) Calculate the output (i.e. classify), y, with the current weight values for the current example j:
y_j = f[w \cdot x_j]   (3.3)
y_j = f[w_0 x_{j,0} + w_1 x_{j,1} + w_2 x_{j,2} + \cdots + w_n x_{j,n}]   (3.4)
(b) Update the weights. If the perceptron's output for training example j (y_j) equals the training example's desired output (d_j), then no change occurs. The output values are binary (0 or 1). If they are not equal, then the ith weight (w_{i,\text{old}}) is increased/decreased by the ith input value (x_{j,i}) for the current training example:
w_{i,\text{new}} = w_{i,\text{old}} + (d_j - y_j)\, x_{j,i}   (3.5)
3. Repeat step 2 for N iterations (a chosen value) or until the iteration error
\frac{1}{m} \sum_{j=1}^{m} |d_j - y_j|   (3.6)
is below a user specified threshold \epsilon. d_j represents the predefined label output value for the current example j, and y_j represents the calculated output given the inputs of the current example j.
To summarize, the learning algorithm (steps 1-3 above) iteratively shrinks the total error between the perceptron's guesses and the actual values of a set of training examples. The error for a single training example is simply the difference between the actual value and the predicted value. Thus, the error for one example can be -1, 0 or 1. For each training example, its output value y can be 0 or 1. That training example additionally contains a set of input values, x. For each example, the perceptron predicts an output given the input x values (Equation 3.4). If the perceptron guesses incorrectly, then the weight w_i is increased or decreased by the input value x_{j,i}.
3.1.2.2 Artificial Neural Network
The previous section served as a primer to:
• the Perceptron, a linear binary classifier inspired by neurons
• a simple learning algorithm for adjusting the perceptron weights based on training error
An Artificial Neural Network (ANN) is simply a group of connected perceptrons.
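As a concrete illustration of Equations 3.2-3.6, the short loop below implements the update rule on a toy, made-up dataset; it is a sketch of the textbook perceptron, not code from this dissertation's software.

```python
def train_perceptron(examples, n_features, epochs=50):
    """Learn perceptron weights with the rule w_i <- w_i + (d - y) * x_i.

    examples: list of (x, d) pairs, where x is a feature list and d is 0 or 1.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, d in examples:
            # Threshold the weighted sum, as in Equation 3.2.
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if y != d:  # update only on mistakes (Equation 3.5)
                w = [wi + (d - y) * xi for wi, xi in zip(w, x)]
                b += (d - y)
                errors += 1
        if errors == 0:  # the iteration error of Equation 3.6 is zero
            break
    return w, b

# Toy data: feature vectors could be, e.g., [size, redness] for apple vs. banana.
data = [([1.0, 0.9], 0), ([0.9, 0.8], 0), ([2.0, 0.1], 1), ([1.8, 0.2], 1)]
print(train_perceptron(data, n_features=2))
```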
For the remainder of this section, perceptrons will be referred to as nodes. The ANN can have multiple layers, and within each layer there can be many nodes. Each node in layer l sends its output to all nodes in layer l+1 (Figure 3.5). Thus, an ANN is fully connected. In the previous section (3.1.2.1), the goal of the perceptron was to act as a binary classifier: given a set of training examples with 0 or 1 labels that correspond to class names (e.g., apple or banana), train the classifier by increasing/decreasing weights when the classifier is incorrect. In the case of an ANN, the high level logic is the same. There are a set of output classes, weights and input values. The network iteratively learns optimal weights based on the labeled data provided to it.
Figure 3.5: An artificial neural network. The network contains an input layer, two hidden layers and an output layer. Each node, aka perceptron, is fully connected to all nodes in the following layer.
The main differences between the perceptron and the artificial neural network are:
• Learning Algorithms - A method called Back Propagation is used in addition to gradient descent.
• Output Classes - An ANN can have more than two output classes.
• Non Linear Classifier - The weights in a perceptron are linearly combined (weighted sum). In an ANN, the weighted sums pass through a nonlinear activation function (Figure 3.6). This allows the decision boundary of the ANN to be nonlinear.
The nodes used in an ANN vary slightly from the perceptron discussed in Section 3.1.2.1. Specifically, instead of being passed through the binary threshold function (Equation 3.2), nonlinear activation functions are used.
Figure 3.6: An image of the updated neuron used in Artificial Neural Networks (ANN). The difference between the ANN neuron and a perceptron is the activation function (f) used for the output of each neuron.
Nonlinear activation functions such as the Sigmoid, Hyperbolic Tangent and Rectified Linear Unit are used in most applications of artificial neural networks, with Rectified Linear Units (ReLU) being a favorite.
• Logistic Function (Sigmoid): f(x) = \frac{1}{1 + e^{-x}}
• Hyperbolic Tangent (TanH): f(x) = \frac{2}{1 + e^{-2x}} - 1
• Rectified Linear Unit (ReLU): f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}
Figure 3.7: Three popular examples of activation functions used in artificial neural networks (plots of the sigmoid, ReLU and tanh curves).
3.1.2.3 Convolutional Neural Network
A Convolutional Neural Network (CNN) is an extension of an ANN. It utilizes all the features of an ANN:
• Non linear activation, to make the network a non linear classifier
• Multiple layers and nodes
Figure 3.8: A 1 dimensional version of a convolutional neural net. The output nodes (yellow blocks) only connect to nodes in their vicinity.
The key difference between an ANN and a CNN lies in the connectivity of their neurons. Specifically, a CNN reduces the number of connections by only requiring local connectivity. Given two layers of neurons, l and l+1, the nodes in layer l+1 connect only to local neurons in the preceding layer (Figure 3.8). This constraint makes CNNs very useful for finding local features within an image. This is due to the fact that the neurons in later layers only examine smaller portions of the preceding input. Reducing the number of connections reduces the number of parameters to learn. This leads to two performance benefits: (1) network size and (2) network speed.
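To make the size benefit concrete, the parameter counts of a fully connected layer and a small convolutional layer can be compared directly. The layer sizes below are arbitrary, illustrative choices, not the network used in this work.

```python
# Illustrative parameter counts: a fully connected layer on a 50x50 RGB image
# versus a convolutional layer with 16 filters of size 5x5 (shared weights).
in_h, in_w, in_ch = 50, 50, 3
hidden_units = 256
fc_params = (in_h * in_w * in_ch) * hidden_units + hidden_units  # weights + biases

n_filters, k = 16, 5
conv_params = n_filters * (k * k * in_ch) + n_filters            # weights + biases

print(fc_params)    # 1,920,256 parameters
print(conv_params)  # 1,216 parameters
```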
The process of passing an input image through the network is referred to as forward propagation. By decreasing the number of weights necessary, the number of calculations needed to forward propagate the input is decreased.
3.1.3 Scene Text Recognition
Scene text recognition is a field within OCR that deals with finding and reading text in natural images. There are two stages to any scene text recognition pipeline:
• Text Detection - Passing an inexpensive classifier over the image to split it into "text" and "non-text" regions
• Text Recognition - Passing a smarter classifier over the "text" regions in the image. In this stage the algorithm determines what words are in those regions
Figure 3.9: The result of passing a scene text recognition algorithm over an image. The algorithm finds and recognizes the text as "Triple Door".
In the following section, different approaches and algorithms are discussed, as well as the rationale for choosing a scene text algorithm for text recognition.
3.1.3.1 Neural Network Based Approach
Neural networks can be applied to scene text recognition tasks. Two neural network based scene text recognition algorithms were considered for integration into this project's text recognition module (Wang et al. 2012; Jaderberg et al. 2016b). Due to the overall similarity in their approaches, only one of the approaches is discussed. The approach from Wang utilizes one neural network for text detection and a similarly constructed neural network for recognition of text (Wang et al. 2012). The networks were trained on a dataset of greyscale images and synthetic data (Figure 3.10).
Figure 3.10: Examples from the Wang training set for text recognition (Wang et al. 2012). (Left) From the ICDAR 2003 dataset. (Right) Synthetic data.
The detection network (Figure 3.11) takes a 32x32 input image and decides whether or not that region contains a character. In this approach, the detector is slid across the entire image via a sliding window approach. Thus, every portion of the image is deemed a character or not a character. Once character detection is finished, the next step is to merge the character detections into possible word boxes. To do this, rows with possible character regions are merged into horizontal boxes. The edges of the boxes are determined by the spacing between possible character regions. If the horizontal distance between two characters is above a threshold, then the characters are placed into different boxes. Therefore, each row may have multiple horizontal boxes. This process is done at multiple scales, and as a result boxes of different heights and widths may overlap each other within the image. The final step is to remove/merge overlapping boxes. This is done by choosing the boxes that yield the highest probability of being character regions (i.e. the results from the previously mentioned neural network). These final boxes are now possible word regions (or bounding boxes) to be passed to the recognition network. Once the word regions from the detection step are found, the recognition neural network is slid across each of these bounding boxes. This network also takes a 32x32 input image; however, the output is the proposed alphanumeric character (e.g., "A", "1", "k"). The result is a possible word guess for that bounding box.
Figure 3.11: The text detection neural network used in approach 1 (Wang et al. 2012).
3.1.3.2 Class Specific Extremal Regions
Another type of approach is to use hand crafted features for scene text detection.
One hand crafted feature, Extremal Regions, was considered for integration into this project's scene text recognition module. The premise of extremal regions is to detect blobs within an image. A type of extremal region called Class Specific Extremal Regions (CSER) deals with finding blobs of a certain class. In this case, the blob type is characters (or text). A blob is a set of pixels sharing the same intensity value \theta (where 0 < \theta < 255). The determination of characters or non-characters is done via a two stage classifier. In the first stage, a set of quickly computable features is calculated. This results in possible character regions. This first stage reduces the number of regions to process, and therefore increases the efficiency of the algorithm. The second stage uses more computationally expensive features; these features are more robust and more accurate for determining if a region is a character.
CSER - First Stage Classifier
First, an image is thresholded from its minimum intensity to its maximum intensity (i.e. 0 < \theta < 255). At a threshold value of \theta, all pixels below the threshold value are black, and pixels above that value are white. As the image is thresholded, adjacent pixels with the same intensity value are saved as blobs. At this point, the first stage classifier determines if a blob is a character by calculating quickly computable features. These features are area, perimeter, bounding box and Euler number. These features can each be calculated in one operation (e.g., multiplication, addition). The trade off is that many regions can be detected in this stage (on the order of 1,000,000 regions for a 1 megapixel image). For each threshold value, all regions as well as their calculated features are stored. This allows the algorithm to avoid recalculating region scores repetitively. Ultimately, this portion of the algorithm removes regions that share no resemblance to a character.
CSER - Second Stage Classifier
The next step uses more advanced features to classify the detected character regions from the first step. The hole area ratio and convex hull ratio are examples of the advanced features. Any region that scores above a certain response is thus deemed a character. The possible character regions are then grouped into possible word regions (or bounding boxes). The possible word regions can then be passed to any text recognition classifier; one could use this CSER detector in conjunction with a neural network classifier such as the text recognition network mentioned above (Section 3.1.3.1).
3.1.4 Object Detection and Recognition
Object recognition deals with detecting singular instances of items within an image or identifying the categories of all items in an image. Many different types of approaches can be considered for object recognition. The approaches can be divided into hand crafted feature based and learned feature based. In this section, different hand crafted features are discussed, including:
• Scale Invariant Feature Transform - SIFT
• Speeded Up Robust Features - SURF
• Color Cooccurrence Histograms - CCH
3.1.4.1 Scale Invariant Feature Transform
The Scale Invariant Feature Transform (SIFT) is a feature developed for finding key-points in images (Lowe 2004). The goal of SIFT keypoints is to find distinct features or image regions that are invariant to image scale and rotation. These features can then be used to match between different images of the same object. The algorithm can be divided into four main steps:
1. Scale-space extrema detection
2. Keypoint localization
3. Orientation assignment
4. Keypoint descriptor
Scale-space Extrema Detection
In the first step, the algorithm's goal is to detect keypoints that occur at multiple locations and scales in the image. First, two Gaussians of scale \sigma and k\sigma are generated; the two dimensional Gaussian is defined by Equation 3.7. Next, an image I(x, y) is convolved with a Difference of Gaussians (DoG) filter (Equation 3.8). As the name implies, the difference of Gaussians is attained by subtracting two Gaussians from each other. In regards to the values of k and the initial \sigma, empirical values used in practice are k = \sqrt{2} and \sigma = 1.6. This DoG convolution process is done at multiple values of \sigma; the result is a set of multiple DoG images at multiple scales for a specific octave (Figure 3.12). Octaves are changed by multiplying \sigma by 2 (i.e. down sampling).

G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}   (3.7)
0 < x < \text{ImageWidth}, \quad 0 < y < \text{ImageHeight}

D(x, y, \sigma) = \left(G(x, y, k\sigma) - G(x, y, \sigma)\right) * I(x, y)   (3.8)
I(x, y) = \text{Image}

Taking the DoG of an image I(x, y) acts as an edge and blob detection technique; thus, convolving the image with DoG filters of different scales finds edges and blobs at different scales. The next step is to filter out weak keypoints. Pixels within a scale i are compared to 8 neighboring pixels within scale i and 9 neighboring pixels from each of scales i-1 and i+1 (Figure 3.13). If the initial pixel is less than or greater than all of its neighboring pixels, then it is denoted as a minimum or maximum, respectively. If the pixel is a maximum/minimum, then its neighboring pixels are removed from consideration for maxima/minima checking. This is repeated for all candidate pixels. To localize keypoints, keypoints with low contrast and edge keypoints are removed. The remaining keypoints are strong keypoints.
Figure 3.12: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process is repeated.
Figure 3.13: Here the maxima and minima keypoints are detected. This is done by comparing a pixel (marked with X) in a DoG image to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked with circles).
Orientation Assignment
The previous steps of SIFT determined robust keypoints at differing scales and translations. This portion of the algorithm, orientation assignment, makes these keypoints robust to rotation. A 360° region around each keypoint is examined; in this region, a histogram with 36 bins (10° per bin) is generated. Points within the region have a gradient magnitude (Equation 3.9) and an orientation (Equation 3.10). The importance of each orientation is weighted by its gradient magnitude. For each histogram H generated, the highest peak, H_peak, as well as the orientation of that peak, are stored. The peak orientation is assigned to the keypoint of the region being examined. Additionally, orientations with peaks that are within 80% of H_peak are assigned to copies of the same keypoint. The result is a keypoint that can exist with multiple orientations. This process is done for every keypoint at the scale of each keypoint.

m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2}   (3.9)

\theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)   (3.10)

Keypoint Descriptor
The final step of the algorithm is to create a descriptor. The descriptor must be invariant to lighting changes and be extremely distinctive (a brief sketch of the scale-space step from Equations 3.7 and 3.8 is given below, before the descriptor discussion continues).
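As an aside, the scale-space step of Equations 3.7 and 3.8 can be sketched with a standard image processing library. The snippet below only illustrates the Difference-of-Gaussians idea using OpenCV's Gaussian blur; it is not the SIFT implementation referenced here, and the image path argument is hypothetical.

```python
import cv2
import numpy as np

def dog_pyramid(image_path, sigma=1.6, k=2 ** 0.5, levels=4):
    """Build one octave of Difference-of-Gaussians images (Equation 3.8)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    # Gaussian-blurred copies at scales sigma, k*sigma, k^2*sigma, ...
    blurred = [cv2.GaussianBlur(gray, (0, 0), sigma * (k ** i)) for i in range(levels)]
    # Adjacent Gaussian images are subtracted, as in Figure 3.12.
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```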
This invariance will allow future images with similar content to produce matching keypoints. The process here is similar to the previously described orientation assignment step; however, the resulting information is different. Initially, a 16x16 region around the keypoint is examined. Each 1x1 cell in the region is assigned an orientation. After determining orientations in each 1x1 cell, they are accumulated into 4x4 grids. Within each of these 4x4 grids, an 8-orientation-bin histogram is formed from the previous 1x1 cells (Figure 3.14). The result is a 128 (= 4 x 4 x 8) dimension vector. The final vector is then normalized for robustness to illumination changes.
Figure 3.14: This figure shows an 8x8 region around a keypoint (center of left image). Each 1x1 cell in the Image Gradient figure has an orientation. In the right Keypoint Descriptor image, the 16 cells in one corner of the Image Gradients are reduced to 1 quadrant in a 2x2 grid. Each quadrant contains an 8 bin histogram. In this diagram the descriptor generated would be a 32 dimension vector (2 x 2 x 8). Most implementations utilize 4x4 grids with 8 bins each (i.e. 128 dimensions).
3.1.4.2 Speeded Up Robust Features
Speeded Up Robust Features (SURF) are a modified and faster version of SIFT keypoints (Bay et al. 2008). The SURF algorithm uses multiple approximations of the SIFT algorithm to reduce complexity and increase speed. The algorithm removes the Gaussian (Equation 3.7) and DoG convolution operations over multiple image scales and octaves (Equation 3.8). Instead, the algorithm uses an approximated version of the Hessian detector to find salient keypoints. Firstly, the Hessian detector is defined in Equation 3.11 as follows:

\mathcal{H}(x, y, \sigma) = \begin{bmatrix} L_{xx}(x, y, \sigma) & L_{xy}(x, y, \sigma) \\ L_{xy}(x, y, \sigma) & L_{yy}(x, y, \sigma) \end{bmatrix}   (3.11)

L_{xx}(x, y, \sigma) is defined as the convolution of a second order derivative Gaussian with an image I at the point (x, y). By computing the Hessian matrix (Equation 3.11) at a scale \sigma, keypoints are detected. These keypoints are blobs, corners and T-junctions. The SURF algorithm uses a variation of the Hessian matrix for its Fast Hessian detector. Specifically, instead of second order Gaussians, box filters (Figure 3.15) are used as close approximations. Using the box filter approximations yields faster computation times for the SURF algorithm.
Figure 3.15: (a) and (b) are Gaussian second order partial derivatives in the y-direction and xy-direction, respectively. (c) and (d) represent approximations of (a) and (b). Using the box filter approximation improves the overall computational speed of the algorithm.
3.1.4.3 Attention Biased Speeded Up Robust Features (AB-SURF)
Attention Biased Speeded Up Robust Features (AB-SURF) is an algorithm used in the previous object recognition module for this system's first prototype (Thakoor et al. 2013). The algorithm is a two step process. The steps are:
1. Attention biasing detects 3 salient regions of size width x height in the image. The width and height can be varied from 45 - 90 pixels.
2. A SURF based recognition algorithm examines the 3 salient regions and determines what objects are present.
In the first step, the Attention Biased (AB) algorithm uses features based on hue, orientation and intensity. In a preprocessing step, low level features for a set of grocery items are stored in a database. This database is used for comparison against future test images containing grocery items. Specifically, the AB algorithm will compare the hue, orientation and intensity features in the test image to the same features in the database.
The end result is to choose 3 regions in the test image that contain features similar to the item's database features. In the second step, all three regions are processed by the SURF algorithm from Section 3.1.4.2. The result is one set of features per region. The database contains a set of features for each item from a pre-training step. The algorithm then deems the region with the most matching features to be the queried item.
3.1.4.4 Color Cooccurrence Histograms
Color Cooccurrence Histograms (CCH) are a type of feature used in the past for object recognition tasks (Chang and Krumm 1999). The premise of the CCH is to capture not only color information in an image, but also spatial information within the image. The color cooccurrence histogram holds the number of occurrences of pairs of color pixels c_1 = (R_1, G_1, B_1) and c_2 = (R_2, G_2, B_2) separated by an image plane vector (\Delta x, \Delta y) (Figure 3.16).
Figure 3.16: A graphical representation of one pair in a Color Cooccurrence Histogram (CCH). Two pixels with colors C_1 and C_2 are separated by a vector (\Delta x, \Delta y).
3.1.4.5 Support Vector Machines
Support vector machines are a type of classifier that aims to determine decision boundaries between two classes of items. Specifically, examples of each class can be represented as points in an example space. Ultimately, the goal of the support vector machine is to determine a hyperplane that maximizes the margin between examples of the first class and the second class (Figure 3.17). For multiclass problems (i.e. more than two classes), multiple SVMs can be used to create multiple hyperplanes.
Figure 3.17: A support vector machine (SVM) determines the hyperplane between a set of examples. Each example belongs to one of two classes. The SVM maximizes the margin between the nearest examples in each class.
3.2 Methods and Approach
3.2.1 Scene Text Pipeline
3.2.1.1 Text Detection
A scene text algorithm developed for real time text detection (Neumann and Matas 2011) was incorporated into this project's scene text pipeline. The details of the algorithm were discussed in Section 3.1.3.2. To summarize, the algorithm poses the text detection problem as finding blobs in an image. A blob is a set of adjacent pixels with matching intensity values (or the same color). Once blobs are found, they are classified as characters via a two stage classification process. The first stage uses simple features to find a high volume of possible character regions. The second stage uses more complex and robust features to remove any noise from the first stage. The final result is a set of possible word bounding boxes. Ultimately, this CSER based text detection algorithm was chosen for its speed and accuracy.
3.2.1.2 Text Recognition
The text recognition module incorporates a neural network based recognition algorithm (Coates et al. 2011) for recognizing detected text regions. A similar algorithm is discussed in Section 3.1.3.1. Essentially, the word detection box from the previous section is processed by a neural network. The network takes a 32x32 input image and outputs its prediction for the most probable character (e.g., "A", "1", "d"). After sliding the 32x32 window region over the entire word bounding box, a list of characters is output as a possible word.
Lexicon
In addition to the word output from the neural network, a lexicon is used to increase the accuracy of the recognition guesses. A lexicon is simply a list of words (i.e. a dictionary) that can be used to restrict the number of possible guesses for the recognition system.
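Purely as an illustration of how a lexicon can constrain noisy guesses (the word list, threshold and function below are hypothetical, not the module's actual code), a raw guess can be snapped to its closest lexicon entry with a simple string-similarity comparison:

```python
import difflib

# Hypothetical lexicon: the grocery items currently being searched for.
GROCERY_LEXICON = ["splenda", "cheerios", "dasani", "rice roni", "quaker oats"]

def snap_to_lexicon(raw_guess, lexicon=GROCERY_LEXICON, cutoff=0.6):
    """Return the closest lexicon word to a noisy OCR guess, or None."""
    matches = difflib.get_close_matches(raw_guess.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_lexicon("SPLNDA"))  # partially recognized word -> "splenda"
```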
As a result, partially recognized words can be compared to the lexicon. If any partial matches occur, then the final output of the text pipeline will simply be the matched word from the lexicon. In this pipeline, the lexicon was a list of the grocery items being searched for.
Figure 3.18: A high level overview of the text recognition process. Given an original image, the text detection algorithm detects possible word bounding boxes in the image. Once the possible word regions are found, the regions are passed through a recognition algorithm to predict the actual characters inside of the bounding box. Word guesses are compared to the lexicon, and partially matching words are updated to dictionary words.
3.2.2 Object Recognition Pipeline
In this section, the work done to implement a novel object recognition pipeline is discussed. A convolutional neural network (CNN) was the foundation of the object recognition module. To build the module, large amounts of data and a way to construct and train the network were necessary. To acquire large amounts of data, an automated web crawler was developed (Section 3.2.2.1). Next, a CNN was built (Section 3.2.2.2) using Torch7, a MATLAB-like interface for building neural networks (Collobert, Kavukcuoglu, and Farabet 2012). Once constructed, the neural network was trained using Stochastic Gradient Descent (SGD), and tested on both in vitro website images (Section 3.3.1.1) and in vivo web camera video frames (Section 3.3.1.2).
3.2.2.1 Data Collection and Augmentation
In order to build an object recognition pipeline, it is necessary to gather data to train the pipeline. The structure of the data, as well as the sheer amount of labeled data used for training, directly determines the efficacy of the object recognition system. Based on this notion, a feasible approach for gathering a large number of grocery item images was required. There were two main options for gathering these images: (1) automated collection of in vitro images, and (2) manual collection. In this work, an automated method for collecting a large number of training images was implemented. The automated program was built to handle labeling and fetching data from a prominent grocery store's website. These website images are referred to as in vitro data. Additionally, a small number of web camera video images from a set of 17 items were also gathered for testing the network. These web camera video images are referred to as in vivo data. The purpose of collecting a large number of in vitro website images for training was to determine if the object recognition pipeline could find patterns between website images and web camera video images. Furthermore, using the automated fetching methodology would make training future iterations of the system much simpler. Every so often, grocery items' appearances change. Using this methodology would allow future engineers to run this program to fetch newly updated website images and retrain the object recognition algorithm.
Automated Data Collection - In Vitro Data
A custom program built on top of a web crawler was developed to download images from a grocery store website (Figure 3.19). A web crawler is a program or bot that browses websites or computers for data. The manual process for downloading a web image would require multiple steps: (1) querying the website (e.g., "rice"), (2) right clicking the image to save it, (3) placing it in an appropriate folder on the computer, and (4) labeling it with the right class.
The web crawler program automates all of these tasks; the only manual intervention necessary is to create a list of queries/categories (e.g., \rice-rice+roni", \sugar-splenda", \water-dasani") and to inspect the downloaded images for validity. Automated Data Augmentation - After grocery images have been downloaded from the web and inspected, the images are run through another program. This program warps, rotates, scales, translates, brightens and darkens an image (Figure 3.20). Each image in the dataset is run through the program. For each original image, 47 processed images are generated. 70 Figure 3.19: A ow diagram for the automated web crawler. The only manual inter- vention required is a list of queries/categories. The web crawler uses a list of queries to automatically request and download images from a grocery store website. Figure 3.20: An actual original image from the dataset being preprocessed. In this gure, the original image is translated, warped, and rotated. Additional operations done to the original images are brightening, darkening and scaling. 71 Figure 3.21: The manual collection pipeline. For each frame of the web camera video, multiple items within the image were cropped and labeled. These cropped and labeled images were used in subsequent testing of the object recognition pipeline. Manual Data Collection - In Vivo Data In addition to collecting website data, web camera video images were collected as well. The purpose of collecting these images was to actually test the neural network on data it will see in the wild. To do this, web cam videos were collected of a shelf of items. Each region in the image with an item was cropped and labeled appropriately. Multiple crops of dierent sizes, distances and viewing angles were taken for each item (Figure 3.21). 3.2.2.2 Neural Network Formulation Choosing the structure of the neural net is dependent upon the amount of data as well as the structure of the data. The structure of the network is dened by: 1. Layers • Type of layers (e.g., Convolutional, Fully Connected, ReLU) 72 Figure 3.22: One version of the network used in this research. The network takes the Red Green and Blue channels of the image, and ultimately produces a guess for the brand of the item. The outputs are of the format <category>/<brand> (e.g., rice/rice-roni, or sugar/splenda). • Number of layers 2. Nodes - Number of nodes in each layer The nal version of the neural network contains 1 input layer, 3 hidden layers, and one output layer: 1. Input Layer - 3 50 50 input images 2. Spatial Convolutional Layer + ReLU + Max Pooling 3. Spatial Convolutional Layer + ReLU + Max Pooling 4. Fully Connected 5. Fully Connected + Soft Max Classier 3.2.2.3 Stochastic Gradient Descent Learning Algorithm The last part of a neural network based pipeline is the determination of the hyper- parameters as well as the learning algorithm. The learning algorithm determines training time, the weights achieved and as a result, the overall accuracy of the system. 73 A learning algorithm is used to determine how to increase or decrease the weight values of the neural network. The purpose of increasing or decreasing these weight values, , is to eventually minimize the overall cost function, J() of the network (Equation 3.12). Furthermore, the cost function is used as a means to mathematically determine the penalty of the network incorrectly labeling an input image. 
For example, if the network were provided with an image of a water bottle, and the network guessed that this image was a box of cheerios, then there would be a cost penalty for incorrectly guessing cheerios. The Negative Log Likelihood (NLL) (Equation 3.12) is a common cost function used to train neural networks. J() = N X n=1 K X k=1 t kn ln(y k (x n ; k )) (3.12) t kn = 8 > > > > < > > > > : 1 if k ==DesiredLabel 0 otherwise (3.13) y k (x; k ) = exp(a k (x; k )) P j exp(a j (x; j )) (3.14) a k (x;) =ReLU(x k ) =max(0;x k ) (3.15) knew = kold @J @ k (3.16) 74 In the equations above, is a matrix of weights and k is a vector of weights for the kth output node. The weights are real valued numbers. K is the number of classes in the network, and N is the amount of training examples to calculate the cost over. In this network, an example of a class would be \rice/rice-roni". Essentially, the cost J is proportional to the probability of the nth example being labeled correctly. If the network outputs y k = 100% probability for class k, then the natural log of that value will be 0, and as a result the cost penalty for that example will be 0. Thus, if the probability for guessing the class k for an example is any value less than 100% then the natural log will be negative, and the overall cost will be positive (note the negative sign in Equation 3.12). The functiont kn states that when examining a training example n with label k only the weights for the output node y k should be updated. As mentioned before, the weights for node y k are a vector of real numbered values ( k ). In the standard version of gradient descent, N is the size of the entire training set. Cal- culating the cost then becomes a matter of calculating the cost penalty for every example in the training set. Stochastic gradient descent chooses 1 or more random examples to calculate the cost penalty of incorrect labeling. In practice, the SGD technique works well, and the speed up in training is signicant. 3.3 Results 3.3.1 Object Recognition The accuracy of the object recognition neural network on training and test data was analyzed for eectiveness in grocery item recognition. Two types of accuracy results were 75 Figure 3.23: (Left) An example of an in vitro web image downloaded from the internet. (Right) An example of an in vivo image frame taken from a web camera. analyzed. In Vitro and In Vivo data accuracy are the two types of data analyzed. In context to this research, in vitro data are images taken from the web, and in vivo images are taken from web camera video (Figure 3.23). 3.3.1.1 In Vitro Data Accuracy When building a neural net, the data used to build the network can be split into dierent subsets: the Training and Testing datasets. The purpose of the training set is to cause the network to learn weights/parameters that can be used for predicting future images. The test dataset acts as a set of future images. Splitting the two data sets should hopefully force the algorithm to generalize recognition tasks to varying viewpoints and scales of an object up to a certain extent. In this section, the training and testing data was composed of website images. Web- site images were gathered via an automated data collection pipeline (discussed in Sec- tion 3.2.2.1). 
Testing was done in two phases: • Phase 1 The rst version of the neural network was tested on the rst version of the dataset 76 • Phase 2 The second version of the neural network was tested on the second version of the dataset There were many iterations of the neural network, however the two major iterations (version 1 & 2) were tested against the data to determine accuracy. To reiterate, in this section both versions of the dataset (version 1 & 2) were purely website images gathered from the internet. Neural Net v1 & Dataset v1 The rst version of the neural network contained six layers in total: 1 input layer, 3 hidden layers and 1 output layer (Figure 3.24). The input layer was a 35050 input. Specically, the value 3 corresponds to the amount of image channels (red, green, blue) and the value of 50 corresponds to the height and width of the image. Thus, all images were center cropped to 50 50 pixel images, and each image channel was passed to the network. The next layer was composed of 30 46 46 spatial convolution maps, ReLUs and 30 15 15 max pools. Following the max pools was a fully connected layer of 6750 neurons and 2000 output neurons. This layer immediately leads to another fully connected layer with 2000 input neurons and 400 output neurons. Finally, this connected layer led to a fully connected output layer with 400 input neurons and 47 output neurons. Each of the 47 output neurons corresponds to a brand label. The brand classes are shown in Figure 3.25. After formulating the network, it was tested on version 1 of the dataset (v1-dataset). The total v1-dataset was composed of 855 images from 46 brand classes. Each image corresponded to a single item. 70% of the 855 images were used to train the network, and the remaining 30% were used to test for labeling accuracy. In addition to these brand 77 Figure 3.24: The rst version of the object recognition neural network. It consists of 1 input layer, 3 hidden layers and an output layer. classes, one background image class was created. Background images were pictures such as oors, ceilings or shelves in between grocery items. To reiterate, the accuracy is dened as: Accuracy = T c T (3.17) where T c is the number of correctly labeled images and T is the total number of test images. In the v1-dataset there are 257 test images. After testing, it was found that the overall accuracy across all 46 brands/classes was 46.9%. The per brand accuracy has been highlighted in Figure 3.25. Neural Net v2 & Dataset v2 The accuracy from the rst stage showed that for 36 out of the 46 brands, the network was able to recognize web images much better rate than random chance (2.2%). Nonetheless, it was worthwhile to determine how much more accuracy could be increased. To do this, network structure was altered through several iterations. The nal alterations led a network with 1 input layer, 3 hidden layers 78 Figure 3.25: This plot shows the accuracy of the neural network in classifying examples from 46 dierent classes. The classes were represented as brands. These results are from testing the rst version of the neural net with the rst version of the dataset (v1-dataset). The v1-dataset consisted of 855 images (599 training, 256 testing). 79 Figure 3.26: The second version of the object recognition neural network. It consists of 1 input layer, 3 hidden layers and an output layer. In this version, two of the hidden layers convolutional, and 1 of the layers is fully connected. This is in contrast to version 1 (v1 had 1 convolutional and 2 fully connected). 
and 1 output layer (Figure 3.26). In addition to increasing the amount of nodes within the network, the data was augmented by programmatically scaling, rotating, translating, brightening and darkening the initial 855 images (discussed in Section 3.2.2.1). The nal result was a set of 41,660 images. Following the same methodology, 70% of these 41,660 images were used to train the network and the remaining 30% of the images were tested. The overall accuracy following this data augmentation was 76.9% for all 47 classes (46 brands and 1 background class). For the 46 brand classes, the overall accuracy was 73%. These results are highlighted in Figure 3.27. 3.3.1.2 In Vivo Data Accuracy The second stage in testing for this work was to consider web camera images from live video. These images have the same appearance as what is seen by the full system ap- plication (Chapter 4). However, these images vary slightly from the web images from the previous section due to lighting and viewing angle changes. Four brands/items out 80 Figure 3.27: This plot shows the accuracy of version 2 of the neural network in classifying examples from version 2 of the dataset (v2-dataset). The v2-dataset consisted of 41,660 images (29,162 training, 12498 testing) from 46 brands and 1 background class. 81 Figure 3.28: Actual images of four brands/classes taken from live web camera video. of the 46 brand classes were chosen. The brands were pasta/pasta-roni, sugar/splenda, rice/uncle-ben and cereal/kellogg. Actual images of the brands/items used for testing are shown in Figure 3.28. The Top-1 accuracy and Top-5 accuracies were determined for each of the four classes. The Top-N error described in Equation 3.1 is simply 1TopNaccuracy. Only one of the four classes (cereal/kellogg) could be accurately detected, and these rates of detection were below the in vitro results for this class. These results are highlighted in table 3.1. The possible reasons for low accuracies are discussed in Section 3.4.1. Data Type Brand Top-1 Accuracy Top-5 Accuracy In Vivo cereal/kellogg 5.3% 36.84% In Vitro cereal/kellogg 64.71% 89.6% Table 3.1: The Top-1 and Top-5 accuracies for the cereal/kellogg class. Both in vitro and in vivo accuracies are reported. 82 3.4 Summary Two modules, a text recognition module and an object recognition module, were devel- oped. The text recognition module leveraged an algorithm designed for the purpose of real time text recognition (Neumann and Matas 2011). The object recognition module was formulated as a neural network. To train the network, a vast amount of image data was needed. The options to gather the data were either via website images, or manual collection of web camera video. In the interest of building a system that relied only on web images for training, an automated data collection pipeline was built. The website image approach was desired for several reasons. The main benet is the automated nature of network training made possible by this approach. Also, because grocery items appear- ances are sometimes updated; the website approach would allow system administrators to simply run the data processing pipeline to reacquire new images from the web. To be specic, the pipeline generated 41,660 images of 47 brands for training the network with. Testing the accuracy of the neural network on in vitro website images showed promising results for average accuracy (76.9%, Section 3.3.1.1). 
However, the results for testing the trained network on web camera data were not as successful (average Top-1 accuracy = 1.1%, Top-5 = 10.2%). 3.4.1 Neural Network Limitations and Remedies The results strongly suggest that training the current neural network using in vitro website images for the purpose of testing with in vivo web camera video images is not a viable 83 solution. There are two possible remedies to this problem: (1) training network with in vivo web camera video images, (2) new network structure. Retraining with web camera video images This approach will require taking web camera pictures of grocery items from diering viewing angles. In addition to gathering the images, they will have to be labeled for the neural network to learn the patterns within the image data. The results from the in-vitro experiments strongly suggest that the neural network can nd patterns between similar datasets (e.g., in-vitro training and in-vitro testing). Thus, using in vivo web camera data to train the network may be a viable method to boost the object recognition web camera accuracy. The user may be required to actively train the system, which in the case of a blind person, may require a caregiver or low-vision therapist to accompany them on the rst few visits to a grocery store. New Network Structure The network structures used in stages 1 and 2 of in vitro experiments were the result of manually iterating through a vast amount of dierent network combinations. At the moment, this is the only way to design neural networks. Specically, an engineer chooses amount of layers, nodes and the layer types based on intuition and the problem space (e.g., how many classes, what type of data). It would be worthwhile to still keep exploring dierent combinations of network types to increase the accuracy of the system. Given the speed at which the algorithm operates, adding complexity is an option. 84 Chapter 4 Full System Integration 4.1 Introduction 4.1.1 Problem In Chapter 2, the goal of the work done was to determine the ecacy of a system that can provide real time feedback on an object's location within a camera's FOV. The work from Chapter 3 aimed to see if a recognition module could detect and recognize grocery items in an image. In this chapter, the software framework, and integration of these modules for an end to end grocery shopping system is discussed. 4.1.2 Network Programming Network programming is an important concept in any program utilizing a client-server model. A client is an entity requiring an action or service from a server. A server is an entity that provides data, or does a task on behalf of the requesting client. In the context of this research, the client is a smartphone application (Section 4.2.1), and the server is the computer vision backend with all the algorithms (Section 4.2.2). 85 Javascript Object Notation (JSON) Is a compact data interchange format. It is used within client-server applications as a way to send and receive data. The reason for use in this application was two-fold: (1) human readability, (2) ease of machine parsing. In regards to the human readability, it is very ecient to write code that uses JSON because examining the data while implementing the software is simple (listing 4.1). In regards to machine parsing, JSON can be interpreted by any modern programming language (e.g., C++, JavaScript, Objective C, Python). Thus, any machine whether it be Windows, Linux, iOS, Android or Mac OS X can understand this data easily. 
This makes the application data exchange format used in this project compatible with future Operating System (OS) updates. 1 f 2 'command ' : ' r ' , 3 ' items ' : [ ' Pasta Roni Parmesan ' ] 4 g Listing 4.1: JSON Example User Datagram Protocol (UDP) Is a network protocol used for transmitting data over a computer connection. It's used heavily in applications requiring low latency and connectionless communication. This makes it a perfect protocol for applications requiring speed and eciency. Data is sent in the form of packets between machines. Furthermore, the packets encapsulate the JSON data mentioned above. In this project, the iOS appli- cation is structured as a UDP client, and the computer is structured as a UDP server. 86 4.1.3 Asynchronous Programming Executing computationally expensive tasks (e.g., recognition) can cause large perfor- mance hits in the program. Using asynchronous programming techniques allows the application to do large tasks without having to wait for that task to nish. The problem can be posited as a real world example. A synchronous task can be thought of as waiting in one line for a cashier. You must remain in the line until the cashier has checked you out. An asynchronous task can be thought of as ordering food at a restaurant. Once an order is placed by a customer, other restaurant customers can also place their orders; customers don't have to wait until one order of food is complete to make an order. 4.1.4 Object Oriented Programming A modular framework was developed to build a program for nding and detecting objects. This framework was built using a paradigm known as Object Oriented Programming (OOP). OOP is a programming paradigm within software engineering based on the notion of objects. These objects contain data, and functionality for manipulating that data. The OOP paradigm is a broad and wide eld, however there are certain concepts that were crucial for the development of this program: (1) Classes, and (2) Objects. Classes can be thought of as a template for Objects. The class lists and denes the data members and functions; the object is a running instance of the class. Blueprints (the class) and constructed houses (the object) are a real world example of this notion. A blueprint (the class) informs the contractor how many oors or rooms the house contains, as well as the square footage, and other miscellaneous data. The house (the object) is the actual rendering of that blueprint. 87 In the context of this project, the modules/classes dened encapsulated dierent bits of functionality. Modules were made for text recognition, object recognition and feedback. The details of each of these modules will be explained in the Computer Vision Backend section (4.2.2). 4.2 Methods and Approach 4.2.1 System Controller Application The system controller application is the main input interface between the visually im- paired user and the computer vision backend for image processing. The accessibility app leverages VoiceOver, a built in iPhone feature that informs the user what buttons on the phone they're clicking. Specically, the text labels on each button are read aloud by the speech synthesis software on the phone. This allows visually impaired people to use the smartphone as any other user would. The goal of the app is to allow visually impaired people to look for items or remove them from the list. A single click of an item utters the text they've clicked. 
Double tapping an item tells the computer vision webcam to trigger the recognition and tracking algorithms. Double tapping and holding, combined with a left swipe removes the item from the list. 4.2.2 Computer Vision Backend The computer vision backend contains the object recognition and scene text recognition modules from Chapter 3, as well as the tracking and feedback module from Chapter 2. In addition to these modules, the speech module was updated from an earlier prototype to 88 Figure 4.1: (Left) The rst version of the system controller app. This android app (Zhang, Weiland) was developed to integrate with the rst version of this system, the Wearable Visual Aid. (Right) The second version of the app (Mante, Weiland) was implemented to integrate with the new system discussed in this chapter. The new app allows the subject to hear the list of items they have left as well as read text in front of the camera. be more responsive and informative and to use higher quality audio for voice generation. It leverages the OOP approach described in Section 4.1.4. 4.2.2.1 Speech Synthesis Module Providing useful feedback in addition to the simplistic commands for localization (Chapter 2) can change the user experience drastically. The goal of the speech synthesis module was to ll that void. In addition to generating speech, the Speech Synthesis module acted as a Server Class. The speech synthesis module will be referred to as the Synthesis class. The Application Programming Interface (API) for the Synthesis class contained 4 main functions: (1) speak, (2) speakPhrase, (3) startServer, (4) receiveData. 89 Synthesizer::speak and Synthesizer::speakPhrase The speak and speakPhrase functions use the underlying synthesis software of the host machine. In this project, an Apple Macbook was used to house the algorithms. Thus, Apple's speech libraries were used as the means to generate speech. To do this, a software bridge was imple- mented to connect the Synthesis module to Apple's speech code (gure 4.2). Each time the speak and speakPhrase functions were called by other modules in the program, a cascade of events would occur: 1. A phrase for speech is requested 2. The Synthesis module receives the request 3. Synthesis module forwards the phrase to Apple's libraries (on the machine) 4. Apple's code causes the machine to say the phrase. Additionally, a set of dierent phrases were programmed into the Synthesis module. These phrases were categorized by the type of information needed during times of feedback (Table 4.1). For example, when an item such as Rice r Roni was found by the computer vision system, one type of phrase was \Look's like I found Rice r Roni". Code Example Phrase itemFound \I couldn't nd <Pasta Roni>" doneWithItem \Great. You picked up <Pasta Roni>" doneWithList \You nished your list <Nii>" Table 4.1: The left column represents codes used within the software. For example, if an item was found within the camera's eld of view, then the Recognition module would send a itemFound request to the Synthesis module. Then, the synthesis module would generate a command such as \I couldn't nd<Pasta Roni>". The text within the angled brackets <> can be varied depending on what item or name should be said. 90 Figure 4.2: The Synthesis module (within light blue box) communicates with Apple's speech libraries through a custom bridge. The custom Bridge was developed for this the- sis. The purpose of the bridge was to allow code from a dierent programming language to be called within our system. 
The code within this projects system is C++, whereas Apple's libraries are implemented in one of their programming languages, Objective C. Synthesizer::startServer and Synthesizer::receiveData The startServer and re- ceiveData functions were responsible for communication between the iOS application 4.2.1 and the C++ program on the computer. In addition to acting as a speech synthesizer, the Synthesis module received commands over a wireless data connection. WiFi or cellular data can be used for communication; in this project, the cellular phone containing the iOS accessibility app was used as a Personal Hotspot. Communication between the app and computer was then done over the cellular data connection of the phone. The server followed the UDP protocol discussed in the Network Programming section (4.1.2). Calling the startServer function informs the Synthesis module to listen for com- mands from the iOS application. In theory, this code could be placed on a distant server and video could be live-streamed to the server for processing. In the case of this project, the web camera is directly connected to the computer containing the computer vision algorithms. 91 Figure 4.3: The App/Client waits for a subject to click an item (e.g., Pasta). In parallel, the server is started and running. The server then invokes the receiveData function and waits for data in a loop. Once a click on the iOS app occurs, the phone sends JSON data containing the command and the item name to the Computer/Server. Once the computer receives a piece of data, the appropriate modules are invoked. Upon initialization of the server, the receiveData command is called; this function waits until data is received from the smartphone application. Once an item from the iOS application is clicked, the app sends the type of command (e.g., \recognize") and item of interest (table 4.2) as a JSON object. Then the desired functionality is executed in the computer vision modules (e.g., text recognition & object recognition). Finally, the program calls the receiveData function and repeats the loop (gure 4.3). Code Example Item \recognize" \<Pasta Roni Parmesan >" \delete" \<Idaho Spuds>" Table 4.2: The left column represents code used for communication between the app, and the server. The right column represents an actual item that can be sent to the server. 4.2.2.2 Object Recognition Module The object recognition module leverages the neural network developed in Chapter 3. The results from Section 3.3.1.2 showed that the Neural Network was low accuracy in regards 92 to recognizing objects from webcam video. Nonetheless, the module was incorporated into the full system for any future iterations of the system. Specically the neural network can be trained and improved independently, and when necessary the new version can replace the one currently apart of the full system. Multiscale Sliding Window To deal with localization in addition to recognition, a Multi Scale Sliding Window module was developed to work in conjunction with the neural network. In computer vision, the sliding window approach deals with creating a window region and sliding it over an entire window (gure 4.4). The window chosen is varied in size and position to deal with seeing objects at multiple scales and positions within the image. While sliding this window along the image, the neural network \looks" at each region and makes a guess for what object is inside the region. This allows the neural network to process as many dierent regions within the image as possible. 
The implementation examines scales of [0.5, 0.9, 1.0, 1.5, 2.0, 2.5, 3.0] times the size of the image. Additionally, the window size used is 100x100 pixels. Finally, the size of the entire image is 640x480 pixels. Bounding Box Result and Class For each smaller window region, the probability of all 47 classes are generated by the neural net and sorted. In the current system, the bounding boxes with probabilities above a threshold are stored. 4.2.2.3 Text Recognition Module The text recognition module incorporates the Neumann and Matas algorithm discussed in Section 3.2.1. The goal of this module was to nd text in the image that matches text 93 Figure 4.4: Given an image, the sliding window approach creates smaller windows/bounding-boxes (yellow squares) and slides them over the entire window. A processing algorithm can then be applied to any window within the image. from the list of items from the iOS app (Section 4.2.1). The module rst detects possible text regions, then attempts to recognize text within those regions (algorithm 1). To start, (1) the module rst splits the initial camera image (640x480 pixels) into multiple image channels. In this implementation, the red, green, blue and grey image channels are examined. Once the channels have been split, (2) the module searches each image channel for possible character regions. (3) The next step is to merge any character regions into entire word boxes. (4) Any word boxes are then passed through a text recognition algorithm. This projects implementation can toggle between using the Tesseract text recognition system or a Neural Network Based text recognition system (Coates et al. 2011). The neural net based text recognition was used for the blindfolded sighted subject experiments. Parallel Processing To further increase the eciency of the module, computationally expensive tasks were done in parallel. Specically, the processing of each image chan- nel, and recognition of detected word regions were examples of these tasks. Essentially, 94 ALGORITHM 1: The scene text module process. First the module splits the image into red, green, blue, grey channels. Then the module attempts to detect characters within each of the dierent channels. Following this, adjacent character regions are grouped into possible word regions. Any possible word regions are passed through the text recognition algorithm. Then, all recognized text is sanitized; any text with less than 2 characters, or words below a globally set condence score are removed. These text regions are then returned to another module within the system that is requesting text. channels split(image); characterRegions None; possibleTextRegions None; intermediateWordBoxes None; nalWordBoxes None; for channel2channels do characterRegions extractRegions(channels); end possibleWordRegions groupRegions(channels, characterRegions); for possibleTextRegion2possibleTextRegions do intermediateWordBox doTextRecognition(possibleTextRegion); intermediateWordBoxes.add(intermediateWordBox); end for iWordBox2intermediateWordBoxes do if iWordBox notRepetive AND moreThanOneCharacter AND aboveCondenceScore then nalWordBoxes.add(iWordBox); else end Result: nalWordBoxes 95 multiple region detectors and text recognizers were spawned, and for each detector or recognizer a batch of possible word regions were processed (gure 4.5). 4.2.2.4 Master Module The nal master module controls the object recognition, text recognition and speech synthesis modules. It is responsible for running the three aforementioned modules. 
It starts asynchronous tasks when necessary, and it is responsible for passing data between each of the dierent modules. For example, clicking an item within the controller app (gure 4.1) alerts the master module to trigger the recognition task. Flow of Execution First, the master module spawns the Synthesis module. As de- scribed before, the synthesis module also acts as a server (Section 4.2.2.1). Once the synthesis module receives a button press from the iOS app, the master module parses the result. The result causes the master module to either (1) start the recognition (2) remove a grocery item from the list. In the case of (1), the master module grabs a frame from the video camera and starts an asynchronous background thread. This thread starts the object recognition and text recognition threads. The importance is to keep the master module running parallel so that it can continue listening for requests or processing video. When a recognition task completes, it returns a bounding box result to the master mod- ule. At this point, the master module starts the tracking and speech feedback modules on a background task. This ow can be seen in gure 4.6. 96 Figure 4.5: The Text Recognition Module Pipeline. Image channels are processed in parallel by the text detection algorithm. Once possible character regions are detected, they are passed to multiple character merger processes. The result is a list of M possible word regions. The list of word regions is split M/N sized batches where N is the number of Optical Character Recognition processes. The nal result is a list of recognized text (e.g., Pasta, Roni, Splenda, etc.). 97 Figure 4.6: The full system ow chart. (1) The subject chooses an item via smartphone. This tells the computer to ask for one video frame from the (2) camera. 98 Figure 4.7: (Left) A blind folded subject standing in front of a shelf of 14 grocery store items. The subject is wearing a head mounted web camera and holding the smartphone application. (Right) An image showing the blindfolded sighted subjects point of view. The image also shows the bounding box localization of the recognition module (Chapter 3) for Kelloggs Rice Krispies, as well as the subject reaching and grasping for the object. 4.3 Experiments 4.3.1 Blindfolded Sighted Demo The system was tested with two blindfolded sighted subjects to determine system oper- ability. A shelf of 14 items was placed in front of the subject (gure 4.7). Out of the 14 items, four items were placed into the smartphone application. The four items chosen were items that the recognition module could recognize fairly well. The premise being { can the system guide a subject with complete autonomy?. In the demo done, the subject was able to successfully choose an item via smartphone and eventually reach and grasp for the item (gure 4.7). 99 4.3.2 Proposed Experiments In order to determine the ecacy of the system, there are certain metrics that can be measured. These metrics relate to the computer vision algorithm performance and the human computer interface elements of the system. System metrics such as the: • Time based metrics 1. Time to object recognition 2. Time to text recognition 3. Time to merging object recognition • Accuracy based metrics 1. Item Localization accuracy 2. Item recognition accuracy 4.4 Summary The testing done with a blindfolded sighted subject showed that given a list of items, this system allows the subject to choose an item via an application, and reach and grasp for that item successfully. 
The system (gure 4.6) can locate an item within the scene in the range of 1-2 seconds. After the system localizes the item, the subject was able to reach and grasp for an item within 20 seconds. It should also be noted that this subject spent approximately on the 100 order of 30 minutes training and being acclimated with this systems commands. Further training and use of the system could lead to even quicker reaching and grasping tasks. 4.4.1 Previous System Improvements The previous system leveraged the AB-SURF algorithm for object recognition, and e- speak for speech synthesis. The time from clicking an item to the computer vision module recognizing an item within the eld of view would take on the order 9 seconds. Addi- tionally, the previous program didn't contain feedback to inform the subject what item it believes it found. In the new system, multiple improvements were made: • Useful feedback for system events (e.g., \Looks like I found Pasta Roni" or \I didn't see any items") • Faster recognition times - system can recognize items in 1 - 2 seconds Additionally, the current recognition module relies heavily on the text recognition algo- rithm. Thus, it is pretrained to recognize any type of text. As a result, it can determine what text is on any item without necessarily having been trained on it. This makes the current system more exible because being trained on all items is not absolutely neces- sary; because it's trained on letters, and words, it will do it's best to recognize whatever text is present. 101 4.4.2 System Limitations The blindfolded sighted subject demo showed positive results for the grocery shopping assistant. There are however areas of improvement and limitations to the system that can be addressed. Recognition The recognition module from Chapter 3 is adequately fast (1 - 2 seconds for a result), however, it still needs improvement to increase accuracy. At the moment, the algorithm used (Neumann and Matas 2011) has a word recognition f-score of 36.5% on the ICDAR dataset. Essentially, this means that the text recognition accuracy of this system is not completely solved. That shows that text recognition is a feature that can be improved upon for future work. Additionally, the neural network discussed from Chapter 3 showed it's accuracy needs improvement as well. Although the system could recognize in vitro images (website images) at an accuracy of 77%, it did not perform as well in the in vivo case (webcam video). Making object recognition more robust would aid in making the overall recognition system more robust. Voice Input The smartphone application was built simply to allow for choosing pre- lled items from a list. However, it will be necessary to allow subjects to input items they would like to nd in a store. This would require a simplistic way for items to be searched within a database of all items in the grocery store. Visually impaired subjects are adept at using the VoiceOver (iOS) or TalkBack (Android) features to type text; thus, a text input method for searching items would be useful. In addition to a text input method, voice recognition would also be very useful. Allowing the subject to search for items by 102 giving key phrases such as \Find honey nut cheerios" would give another option to the subjects. 
103 Chapter 5 Conclusion 5.1 Conclusion and Future Work 5.1.1 Summary of Contributions The following question was proposed at the beginning of this thesis: can an intelligent sys- tem be implemented to aid the visually impaired in nding items of interest? In Chapter 2 a human computer interface was designed and tested with visually impaired subjects. The results from this chapter informed decisions taken for Chapters 3 and 4. Finally, in those two chapters, algorithms were chosen, developed and bundled into a software framework to demonstrate a working system. The contributions are listed below: • A feedback module that uses object tracking, speech synthesis and a custom microcomputer based vibrotactile system for generating real time feedback (Chapter 2) • A neural network training and testing program built on top of Torch (Col- lobert, Kavukcuoglu, and Farabet 2012) (Chapter 3) 104 { A data scraper for programmatically nding and preprocessing grocery store images { A program for automating the training and testing of a custom Deep Neural Network for object recognition { First version of a real time custom neural network for object recognition • A Real Time Scene Text Recognition module implementation that leverages scene text algorithms (Chapter 3). • A full system demo combining the work recognition, tracking and feedback mod- ules (Chapter 4). The system allows items to be chosen from a smartphone interface, found and tracked by the computer vision algorithms. Intelligent feedback is then generated by the speech synthesis module { More intelligent speech synthesis for informing the subject about the system's ongoing tasks { A recognition module combines scene text recogntion and object recognition to determine desired items within an image In addition to building these modules, human subjects testing was done using the rst prototype. The rst prototype, the OLTS, is highlighted in Chapter 2. The human subjects testing gave insight towards custom metrics for measuring a system such as this one: • Localization, Reaching and Grasping Experiments - These experiments demonstrate a visually impaired persons ability to orient their head to localize 105 an item so that they may eventually reach and grasp for it. The head orientation is guided by real time auditory/vibrotactile feedback (Chapter 2). Results show on average subjects were able to grasp an item from a set of three within 20 seconds within 2 attempted reaches. The results from human subjects testing in Chapter 2 show that subjects can grab items using a localization system. These results coupled with the completion of a full system demo in Chapter 4 suggest that this system, or a future version of it, can be viable in assisting the visually impaired. However, there are shortcomings that exist with the full system prototype that can make the system even more viable. These shortcomings were experienced throughout the visually impaired subjects testing as well as through testing the full system with a blindfolded sighted subject. These shortcomings and possible remedies are discussed in the next section. 5.1.2 Directions for Future Work 5.1.2.1 Scene Detection Within a system such as this, the search-space of the recognition module directly eects the likelihood of confusion. Essentially, as you increase the amount of items on a list or database, the likelihood that a recognition module (Chapter 3) confuses items increases. 
Thus, it is worthwhile to use techniques that can bias the recognition algorithm by re- ducing the amount of categories or items to consider. The notion of scene detection aims to solve that problem. Specically, a computer vision algorithm that determines that a 106 scene appears to be a rice aisle could assist the recognition module by forcing it to only consider that rice items are present. 5.1.2.2 Improvement of Recognition Module The work done in Chapter 3 resulted in a custom neural network based object recognition module as well as a scene text module leveraging previous algorithms. It would be worth- while for a future researcher to examine improving the neural network by experimenting with larger grocery item datasets, and dierent network structures. The dataset problem would be something the researcher would directly work on; since large datasets with large amounts of in vivo (web cam video) and in vitro (web image) aren't readily available, creating a robust dataset for training and testing would be extremely benecial. 5.1.2.3 Port Algorithms to Mobile Devices The real time aspect of this work was strongly considered not only for speed, but also for portability. In order to achieve real time processing, the algorithms used were required to be lightweight and ecient to run on a standard personal computer. These constraints are also useful when considering mobile devices such as android phones, iphones, google glasses or another product known as Osterhout Design Group glasses. Porting the work done from this research to work on a mobile is extremely worthwhile. The software written was deliberately written in C++; thus, it is runnable on all devices. Porting the code would require an understanding of the target device (e.g., iPhone) as well as computer vision and the C++ language. 107 5.1.2.4 Full System Human Subjects Testing Testing the full system with visually impaired subjects is something that can be done to validate the system. The current prototype includes the system controller application and the computer vision backend (Chapter 4). The main obstacle would be conguring a new researchers personal computer with the computer vision software. Work would need to be done to automate the installation of the system code onto any operating system (Linux, OS X, Windows). Upon completion of system setup, testing could immediately be done. This would lead to quantitative data showing how quickly a visually impaired subject can nd an item from the time they clicked an item via the smartphone app. Additionally, the accuracy and speed of the system can be measured and clocked (4.3.2). 
108 Glossary Acronyms AB Attention Biased AB-SURF Attention Biased Speeded Up Robust Fea- tures ANN Articial Neural Network ANPR Automatic Number Plate Recognition API Application Programming Interface CCH Color Cooccurence Histograms CNN Convolutional Neural Network CSER Class Specic Extremal Regions CVA Central Visual Angle DoG Dierence of Gaussians 109 FOV Field of View HMC Head Mounted Camera ICDAR International Conference on Document Anal- ysis and Recognition ILSVRC ImageNet Large Scale Visual Recognition Challenge JSON Javascript Object Notation NLL Negative Log Likelihood OCR Optical Character Recognition OLTS Object Tracking and Localization System OOP Object Oriented Programming OS Operating System ReLU Rectied Linear Units SGD Stochastic Gradient Descent SIFT Scale Invariant Feature Transform 110 STR Scene Text Recognition SURF Speeded Up Robust Features UDP User Datagram Protocol 111 References Adebiyi, A. et al. (2013). \Evaluation of feedback mechanisms for wearable visual aids". In: Electronic Proceedings of the 2013 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2013. Ahmad, A. R. et al. (2004). \Online Handwriting Recognition using Support Vector Ma- chine Keywords :" in: Proceedings of the Second International Conference on Articial Intelligence in Engineering & Technology, pp. 250{256. doi: 10.1109/TENCON.2004. 1414419. Ahuja, A. K. et al. (2011). \Blind subjects implanted with the Argus II retinal prosthesis are able to improve performance in a spatial-motor task." In: The British journal of ophthalmology 95.4, pp. 539{43. issn: 1468-2079. doi: 10.1136/bjo.2010.179622. arXiv: NIHMS150003.url: http://bjo.bmj.com/cgi/content/abstract/95/4/539. Alary, F. et al. (2008). \Tactile acuity in the blind: A psychophysical study using a two-dimensional angle discrimination task". In: Experimental Brain Research 187.4, pp. 587{594. issn: 00144819. doi: 10.1007/s00221-008-1327-7. Bay, H. et al. (2008). \Speeded-Up Robust Features (SURF)". In: Computer Vision and Image Understanding 110.3, pp. 346{359. issn: 10773142. doi: 10.1016/j.cviu. 2007.09.014. Belongie, S. (2012). Grocery Shopping Assistant. Ed. by grozi.calit2.net. [Online; posted 27-August-2012]. url: http://grozi.calit2.net. Bissacco, A. et al. (2013). \PhotoOCR: Reading text in uncontrolled conditions". In: Proceedings of the IEEE International Conference on Computer Vision, pp. 785{792. issn: 1550-5499. doi: 10.1109/ICCV.2013.102. 112 Bo, L., Ren, X., and Fox, D. (2011). \Depth kernel descriptors for object recognition". In: IEEE International Conference on Intelligent Robots and Systems, pp. 821{826. issn: 21530858. doi: 10.1109/IROS.2011.6048717. Bourbakis, N et al. (2008). \Sensing Surrounding 3-D Space for Navigation of the Blind". In: Engineering in Medicine and Biology Magazine, . . . February, pp. 49{55. url: http://ieeexplore.ieee.org/xpls/abs{\_}all.jsp?arnumber=4435653. Brainport (2016). Grocery Shopping Assistant. Ed. by wicab.com. [Online]. url: http: //www.wicab.com/en_us/. Bressan, M., Guillamet, D., and Vitria, J. (2003). \Using an ICA representation of local color histograms for object recognition". In: Pattern Recognition 36.3, pp. 691{701. issn: 00313203. doi: 10.1016/S0031-3203(02)00104-8. Carlson, S, Hyv arinen, L, and Raninen, a (1986). \Persistent behavioural blindness after early visual deprivation and active visual rehabilitation: a case report." In: The British journal of ophthalmology 70, pp. 607{611. issn: 0007-1161. doi: 10.1136/bjo.70.8. 607. Castiello, U., Bennett, K. 
M. B., and Stelmach, G. E. (1993). \The bilateral reach to grasp movement". In: Behavioural Brain Research 56.1, pp. 43{57. issn: 01664328. doi: 10.1016/0166-4328(93)90021-H. Chang, P. C. P. and Krumm, J. (1999). \Object recognition with color cooccurrence histograms". In: Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149) 2, pp. 1{7. issn: 1063-6919. doi: 10.1109/CVPR.1999.784727. Chen, X. et al. (2009). \Rapid and precise object detection based on color histograms and adaptive bandwidth mean shift". In: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, pp. 4281{4286. doi: 10.1109/IROS. 2009.5354739. Coates, A et al. (2011). \Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning". In: Document Analysis and Recognition (ICDAR), 2011 International Conference on IS -, pp. 440{445. issn: 1520-5363. doi: 10.1109/ 113 ICDAR.2011.95. arXiv: fa. url: http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=6065350$\backslash$npapers2://publication/doi/10. 1109/ICDAR.2011.95. Collobert, R., Kavukcuoglu, K., and Farabet, C. (2012). \Implementing neural networks eciently". In: Lecture Notes in Computer Science (including subseries Lecture Notes in Articial Intelligence and Lecture Notes in Bioinformatics) 7700 LECTU, pp. 537{ 557. issn: 03029743. doi: 10.1007/978-3-642-35289-8-28. Danilov, Y. and Tyler, M. (2005). \Brainport: an Alternative Input To the Brain". In: Journal of Integrative Neuroscience 4.4, pp. 537{550. issn: 0219-6352. doi: 10.1142/ S0219635205000914. Daume, H. (2012). A course in machine learning, p. 189. isbn: 1439824142. Del-Blanco, C. R. et al. (2014). \Foreground segmentation in depth imagery using depth and spatial dynamic models for video surveillance applications". In: Sensors (Switzer- land) 14.2, pp. 1961{1987. issn: 14248220. doi: 10.3390/s140201961. Deng, L., Hinton, G. E., and Kingsbury, B. (2013). \New types of deep neural network learning for speech recognition and related applications: An overview". In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8599{8603. isbn: 978-1-4799-0356-6. doi: 10.1109/ICASSP.2013.6639344. arXiv: arXiv:1303. 5778v1. url: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm? arnumber=6639344. Dinh, T. B. and Vo, N. (2011). \Context Tracker : Exploring Supporters and Distracters in Unconstrained Environments". In: Proceedings of the IEEE Computer Society Con- ference on Computer Vision and Pattern Recognition, pp. 1177{1184. Dramas, F. et al. (2008). \Designing an assistive device for the blind based on object localization and augmented auditory reality". In: Proceedings of the 10th international ACM SIGACCESS conference on Computers and accessibility - Assets '08, pp. 263{ 264. doi: 10.1145/1414471.1414529. url: http://portal.acm.org/citation. cfm?id=1414529$\backslash$nhttp://portal.acm.org/ft{\_}gateway.cfm? id=1414529{\&}type=pdf{\&}coll=GUIDE{\&}dl=GUIDE{\&}CFID=55075414{\& }CFTOKEN=82534184$\backslash$nhttp://doi.acm.org/10.1145/1414471. 1414529. 114 Dunai, L. et al. (2015). \Virtual Sound Localization by Blind People". In: Archives of Acoustics 40.4, pp. 561{567. issn: 01375075. doi: 10.1515/aoa-2015-0055. Dutilh, G. et al. (2011). \A Phase Transition Model for the Speed-Accuracy Trade-O in Response Time Experiments". In: Cognitive Science 35.2, pp. 211{250.issn: 03640213. doi: 10.1111/j.1551-6709.2010.01147.x. Eimer, M. (2004). 
Abstract
The role of vision in accomplishing most daily tasks is priceless. For visually impaired people, many tasks that sighted individuals consider simple are difficult or impossible. Solving problems such as navigating through complex environments or finding items can improve the overall autonomy of visually impaired individuals. In this thesis, a grocery item finding assistant has been developed. The system uses a head-mounted camera to see the world, computers to process camera information, and simple feedback for efficient communication between the system and the visually impaired person.

The work in this thesis included developing a real-time tracking and feedback module, which was thoroughly tested with visually impaired people to determine the effectiveness of a head-mounted system for tracking items in real time. The results show that subjects were able to use intuitive feedback commands to guide their center of vision toward a desired object and ultimately reach for and grasp that item.

The work also took a step toward object recognition for grocery items. By building a lightweight, real-time, neural network based object recognition system, and by exploring the use of grocery web images for recognizing images from a web camera, the research identified the pitfalls and likely limitations of a dataset built only from web images.

Additionally, a fully closed-loop, real-time recognition system with contextual feedback is demonstrated. This closed-loop system is a multi-threaded, asynchronous program that links the tracking, feedback, text recognition, and object recognition modules. The early human subjects testing with the feedback module, coupled with the demonstration of the closed-loop system with a blindfolded sighted subject, suggests that this system can improve the lives of visually impaired people.
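To make the feedback behavior described in the abstract concrete, the sketch below shows one plausible way a tracked object's bounding-box position in the head-mounted camera frame could be mapped to directional guidance cues such as "left", "right", "up", "down", and "center". This is a minimal illustrative sketch only: the function name, tolerance value, and command vocabulary are assumptions made here, not the implementation described in the full thesis.

def feedback_command(bbox, frame_width, frame_height, tolerance=0.15):
    """Map a tracker's bounding box (x, y, w, h, in pixels) to a directional
    cue that steers the wearer's center of vision toward the object.
    Hypothetical helper for illustration only."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0                # object center in the frame
    dx = (cx - frame_width / 2.0) / frame_width      # normalized horizontal offset
    dy = (cy - frame_height / 2.0) / frame_height    # normalized vertical offset

    if abs(dx) <= tolerance and abs(dy) <= tolerance:
        return "center"                              # object centered: reach and grasp
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"         # larger horizontal error wins
    return "down" if dy > 0 else "up"

# Example: a 640x480 frame with the object well to the left of center
print(feedback_command((40, 200, 80, 80), 640, 480))  # prints "left"

In a closed-loop system of the kind the abstract describes, a cue like this would presumably be spoken or sonified repeatedly until the object is centered, at which point the reach-and-grasp phase can begin.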
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Robot vision for the visually impaired
RGBD camera based wearable indoor navigation system for the visually impaired
Efficient pipelines for vision-based context sensing
User-interface considerations for mobility feedback in a wearable visual aid
Outdoor visual navigation aid for the blind in dynamic environments
Transfer learning for intelligent systems in the wild
Biologically inspired approaches to computer vision
Towards more occlusion-robust deep visual object tracking
Novel imaging systems for intraocular retinal prostheses and wearable visual aids
Object detection and digitization from aerial imagery using neural networks
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Neural networks for narrative continuation
Magnetic induction-based wireless body area network and its application toward human motion tracking
Towards generalizable expression and emotion recognition
Generating gestures from speech for virtual humans using machine learning approaches
Intraocular and extraocular cameras for retinal prostheses: effects of foveation by means of visual prosthesis simulation
Automatic image and video enhancement with application to visually impaired people
Autonomous mobile robot navigation in urban environment
Ubiquitous computing for human activity analysis with applications in personalized healthcare
Computational foundations for mixed-motive human-machine dialogue
Asset Metadata
Creator
Mante, Nii Tete Q.
(author)
Core Title
Computer vision aided object localization for the visually impaired
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Biomedical Engineering
Publication Date
09/29/2016
Defense Date
08/30/2016
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer vision, deep learning, human computer interaction, human subjects, machine learning, neural networks, OAI-PMH Harvest, object recognition, visually impaired
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Weiland, James D. (committee chair), Humayun, Mark S. (committee member), Tanguay, Armand R., Jr. (committee member)
Creator Email
nii@niimante.com, nmante88@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-310591
Unique identifier
UC11214258
Identifier
etd-ManteNiiTe-4845.pdf (filename), usctheses-c40-310591 (legacy record id)
Legacy Identifier
etd-ManteNiiTe-4845.pdf
Dmrecord
310591
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Mante, Nii Tete Q.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
computer vision
deep learning
human computer interaction
human subjects
machine learning
neural networks
object recognition
visually impaired