Efficient Pipelines for Vision-Based Context Sensing
by
Xiaochen Liu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2020
Copyright © 2020 by Xiaochen Liu
Acknowledgments
I want to first express my greatest appreciation to my advisor, Professor Ramesh Govindan, for his support throughout my pursuit of the degree. His deep insight into computer systems and networking helped me grow quickly as a researcher. The first two years were quite tough for me because I lacked a big picture for the thesis and the coursework was challenging. Prof. Govindan helped me through by encouraging me to explore different ideas and giving me very useful feedback. Later, as things went more smoothly, his high standards in research kept me improving, and the lessons I learned while collaborating with him will benefit my whole career. He is the best advisor I could have imagined for my PhD. None of the projects in this thesis could exist without his generous guidance. I would also like to thank my dissertation committee, Professor Barath Raghavan and Professor Bhaskar Krishnamachari, for their insightful comments and suggestions on improving the quality of the dissertation.
Throughout my dissertation, it has been a great honor to work with many outstanding professors and researchers at top universities and labs: Gnome (Chapter 2) and ALPS (Chapter 3) are joint work with Suman Nath at Microsoft Research. Caesar (Chapter 4) is joint work with Oytun Ulutan and Professor B. S. Manjunath at the University of California, Santa Barbara, and Kevin Chan from ARL. Grab (Chapter 5) and TAR (Chapter 6) are joint work with Yurong Jiang, Puneet Jain, and Kyu-Han Kim during my internship at HP Labs. I would also like to thank my labmates Yitao Hu and Pradipta Ghosh for their contributions to ALPS and Caesar.
Last but not least, I want to thank my family and friends for their support over these years: especially my beloved wife Olivia, my parents Hong and Songyan, and my grandparents Zuoxu, Hanjie, Baoling, and Ensheng.
Abstract
Context awareness is an essential part of mobile and ubiquitous computing. Its goal is to unveil situational information about mobile users, such as their locations and activities. The sensed context can enable many services like navigation, AR, and smart shopping. Such context can be sensed in different ways, including with visual sensors. Vision sources are being deployed worldwide: cameras are installed along roadsides, indoors, and on mobile platforms. This trend provides a huge amount of vision data that could be used for context sensing. However, vision data collection and analytics are still highly manual today. It is hard to deploy cameras at large scale for data collection, and organizing and labeling context from the data is labor intensive. In recent years, advanced vision algorithms and deep neural networks have been used to help analyze vision data, but this approach is limited by data quality, labeling effort, and dependency on hardware resources. In summary, today's vision-based context sensing systems face three major challenges: collecting and labeling data at large scale, processing large data volumes efficiently with limited hardware resources, and extracting accurate context from vision data.
The thesis explores a design space that consists of three dimensions: sensing task, sensor type, and task location. Our prior work explores several points in this design space. Specifically, we develop Gnome [223] for accurate outdoor localization. It leverages 2D and 3D information from Google Street View to mitigate GPS signal error in different cities, and uses offloading and caching to work efficiently on stock phones in real time. For vision-only outdoor sensing, we present ALPS [196], which applies optimized object detection algorithms to detect and localize roadside objects captured in Google Street View, and speeds up processing by adaptively downloading and processing images. For indoor scenarios, we design TAR [221] and Grab [222] for tracking people and detecting their interactions with objects. Both projects fuse camera outputs with those of other sensors to achieve state-of-the-art performance. For outdoor behavior sensing, we develop Caesar [220], which abstracts complex activities as graphs and defines customized activities using an extensible vocabulary. Caesar also uses lazy evaluation optimizations to reduce GPU processing overhead and mobile energy consumption.
The thesis makes contributions by (1) developing efficient and scalable solutions for different points in the design space of vision-based sensing tasks; (2) achieving state-of-the-art accuracy in those applications; and (3) developing guidelines for designing such sensing systems.
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Gnome: A Practical Approach to NLOS Mitigation for GPS Positioning in Smartphones . . . 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Background, Motivation, and Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Gnome Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Data sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Estimating Building Height . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.4 Estimating Path Inflation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.5 Location Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.6 Scaling Gnome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.3 Evaluating Gnome components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 ALPS: Accurate Landmark Positioning at City Scales . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Motivation and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 The Design of ALPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Approach and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Base Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Landmark Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.4 Image Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.5 Adaptive Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.6 Landmark Positioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.7 Putting it All Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.8 Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.2 Coverage and Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Scalability: Bottlenecks and Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.4 Accuracy and Coverage Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 Caesar: Cross-Camera Complex Activity Recognition . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Caesar Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Rule Definition and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.2 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.3 Tracking and Re-Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.4 Action Detection and Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.4 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4.5 Energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Grab: A Cashier-Free Shopping System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2 Grab Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.1 Identity tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Shopper Action Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.3 GPU Multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3.1 Grab Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Methodology, Metrics, and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.3 Accuracy of Grab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.4 The Importance of Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.5 GPU multiplexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6 TAR: Enabling Fine-Grained Targeted Advertising in Retail Stores . . . . . . . . . . . . . . . 92
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3 TAR Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 A Use Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Vision-based Tracking (VT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.4 People Tracking with BLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.5 Real-time Identity Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4.1 Methodology and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.4.2 TAR Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4.3 TAR Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.1 Outdoor Localization and GPS Error Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2 Roadside Landmark Discovery and Localization . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Cross-camera Person Re-identification and Tracking . . . . . . . . . . . . . . . . . . . . . . . . 118
7.4 Targeted Advertising and Cashier-free Shopping . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Complex Activity Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 Wireless Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.7 Scaling DNN Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8 Conclusions and Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.1 Guidelines for Vision-Based Context Sensing Systems . . . . . . . . . . . . . . . . . . . . . . . 124
8.2 Potential Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
List of Figures
1.1 The design space of vision-based context-sensing. . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Illustration of application scenarios of my work. . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 (a) A line-of-sight (LOS) signal path. (b) A non-line-of-sight (NLOS) signal path. (c) Multipath . . . 10
2.2 An example of localization results in urban canyon on today’s smartphone platforms . . . . . . 10
2.3 Gnome workflow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 The skyline and satellites locations seen by a receiver. The receiver is at the center of the circle
and the squares represent satellites. The numbers represent satellite IDs and color signifies signal
strength. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Upper: the original depth information (each colored plane represents one surface of a building),
together with the missing height information in yellow. Lower: the corresponding panoramic
image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Building height estimation. (a) shows how the vector D is estimated from the surface geometry.
(b) shows how the panoramic image can be used for skyline detection and θ estimation. (c) shows
how building height is estimated using D and θ. . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 For robust height estimation, Gnome uses three viewpoints to estimate height, then averages these
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.8 The adjusted depth planes, augmented with the estimated height. . . . . . . . . . . . . . . . . . 17
2.9 An example of ray-tracing and path inflation calculation. . . . . . . . . . . . . . . . . . . . . . 18
2.10 After adjusting pseudoranges, candidate positions nearer the ground truth will have estimates
that converge to the ground truth, while other candidate positions will have random corrections. 20
2.11 Heatmap of the relative distance between candidate positions and the revised candidate positions.
Candidates close to ground truth have small relative distances. . . . . . . . . . . . . . . . . . . 20
2.12 The ground truth traces (yellow), Android output (blue), and Gnome output (red) in the four cities. 23
2.13 Statistics of NLOS signals in the dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.14 Extra NLOS signal path distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.15 Average positioning accuracy in different scenarios. . . . . . . . . . . . . . . . . . . . . . . . 25
2.16 Average accuracy in different cities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.17 Accuracy with different level of height adjustment. . . . . . . . . . . . . . . . . . . . . . . . . 28
2.18 Latency vs Accuracy with different grid sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.19 Accuracy of different candidate ranking approaches. . . . . . . . . . . . . . . . . . . . . . . . 29
2.20 Effectiveness of Kalman filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 ALPS Components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2 Base Image Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 How zooming-in can help eliminate false positives. . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Clustering can help determine which images contain the same landmark. . . . . . . . . . . . . 39
3.5 Clustering by bearing is necessary to distinguish between two nearby landmarks. . . . . . . . . 39
3.6 Adaptive Image Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Landmark Positioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.8 (a) Hydrant occluded by a parked vehicle. (b) Detection failure because hydrant is under the
shade of a tree. (c) False positive detection of bollard as hydrant. . . . . . . . . . . . . . . . . 46
3.9 Distribution of position errors for hydrants in 90004 zip-code . . . . . . . . . . . . . . . . . . 47
3.10 Distribution of errors for Subways in five cities . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.11 Distribution of position errors for ALPS on Subway w/ and w/o ideal detector in Los Angeles . . 47
4.1 The high-level concept of a complex activity detection system: the user defines the rule then the
system monitors incoming videos and outputs the matched frames. . . . . . . . . . . . . . . . . 52
4.2 The high-level design of Caesar. Dots with different colors represent different DNN modules for
specific tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 The output of Caesar with annotations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Examples of the vocabulary elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Two examples of action definition using Caesar syntax. . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Two examples of parsed complex activity graphs. . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.7 Workflow of object detection on mobile device. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8 An example of graph matching logic: the left three graphs are unfinished graphs of each tube,
and the right three graphs are their updated graphs. . . . . . . . . . . . . . . . . . . . . . . . 63
4.9 Camera placement and the content of each camera. . . . . . . . . . . . . . . . . . . . . . . . . 65
4.10 Caesar’s (a) recall rate and (b) precision rate with different action detection and tracker accuracy.
(c) The statistics and sample images of failures in all complex activities. . . . . . . . . . . . . . 66
4.11 (a) Latency of Caesar and the strawman solution with different number of inputs. (b) Throughput
of Caesar and the strawman solution with different number of Inputs. (c) Maximum cache size
needed for Caesar and the strawman solution to reach the best accuracy, with different number of
Inputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.12 The total amount of data to be uploaded from each camera, with different uploading schemes. . 72
4.13 The average energy consumption of cameras in Caesar, with different uploading scheme and
action queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 Grab is a system for cashier-free shopping and has four components: registration, identity
tracking, action recognition, and GPU multiplexing. . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 (a) Sample OpenPose output. (b,c,d,e) Grab’s approach adjusts the face’s bounding box using the
keypoints detected by OpenPose. (The face shown is selected from OpenPose project webpage [45]) 77
5.3 When a shopper is occluded by another, Grab resumes tracking after the shopper re-appears (lazy
tracking). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 OpenPose can (a) miss an assignment between elbow and wrist, or (b) wrongly assign one
person’s joint to another. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Grab recognizes the items a shopper picks up by fusing vision with smart-shelf sensors including
weight and RFID. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6 (a) Vision based item detection does background subtraction and removes the hand outline.
(b) Weight sensor readings are correlated with hand proximity events to assign association
probabilities. (c) Tag RSSI and hand movements are correlated, which helps associate proximity
events to tagged items. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 (a) Left: Weight sensor hardware, Right: RFID hardware; (b) Grab sample output. . . . . . . . 85
5.8 Grab has high precision and recall across our entire trace (a), relative to other alternatives that
only use a subset of sensors (W: Weight; V: Vision; R: RFID), even under adversarial actions
such as (b) Item-switching; (c) Hand-hiding; (d) Sensor-Tampering . . . . . . . . . . . . . . . . 87
5.9 Grab needs a frame rate of at least 10 fps for sufficient accuracy, reducing identity switches and
identification delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.10 GPU multiplexing can support up to 4 multiple cameras at frame rates of 10 fps or more, without
noticeable lack of accuracy in action detection. . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1 System Overview for TAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 A Targeted Ad Working Example in Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3 Relationship between BLE proximity and physical distance . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 (a) Example of a visual trace;(b) Sensed BLE proximity traces;(c) DTW cost matrix for successful matching;
(d) Matching Process Illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5 (a) Camera Topology;(b) One ID cannot show in two cameras;(c) BLE ID must be sensed sequentially in
the network;(d) It takes time to travel between cameras . . . . . . . . . . . . . . . . . . . . . . . . 107
6.6 Experiment Deployment Layout: (a) Office;(b) Retail store. . . . . . . . . . . . . . . . . . . . . . . 108
6.7 Same person’s figures under different camera views (office). . . . . . . . . . . . . . . . . . . . . . . 108
6.8 Screenshots for Cross Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.9 (a) Multi-cam tracking comparison against state-of-the-art solutions (*offline solution);(b) Error statistics
of TAR and MCT+ReID;(c) Error statistics of re-identification in MCT+ReID and example images. . . . 110
6.10 Importance of Tracking Components in TAR (*offline solution). . . . . . . . . . . . . . . . . . . . . . 111
6.11 Recall, precision, and FPS of state-of-the-art people detectors. . . . . . . . . . . . . . . . . . . . . . 111
6.12 Importance of Identification Components in TAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.13 Accuracy of TAR with different ratio of node failure (purple lines show the measured error). . . . . . . . 115
6.14 Relationship between the tracking accuracy and the number of concurrently tracked people. . . . . . . . 115
List of Tables
2.1 Processing latency measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1 Coverage of ALPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Coverage with Seed Locations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Error of ALPS and Google for localizing Subways . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Processing time of each module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Evaluation of different object detection methods . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Input and output content of each module in Caesar. . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 The runtime frame rate of each DNN model used by Caesar (evaluated on a single desktop
GPU [43]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Speed and accuracy of different DNNs on mobile GPU. . . . . . . . . . . . . . . . . . . . . . . 66
4.4 Summary of labeled complex activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Caesar’s recall and precision on the seven complex activities shown in Table 4.4. . . . . . . . . 68
4.6 Activities in a three-camera dataset, and Caesar’s accuracy. . . . . . . . . . . . . . . . . . . . 70
5.1 State-of-the-art DNNs for many of Grab’s tasks either have low frame rates or insufficient
accuracy. (* Average pose precision on MPII Single Person Dataset) . . . . . . . . . . . . . . 90
6.1 ID-matching matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.2 Accuracy (ratio of correct matches) of different trace similarity metrics . . . . . . . . . . . . . . . . . 112
Chapter 1
Introduction
Context awareness is an essential part of mobile and ubiquitous computing. Its goal is to unveil situational information about mobile users, such as their locations and activities. The sensed context can enable many services such as navigation, advertisements, AR/VR, and personal monitoring. Previous work leverages sensor data (e.g., GPS, Bluetooth, gyroscope) to infer context. However, those approaches are limited by hardware cost, sensor type, accuracy, battery, and the computing power available on the device. These limitations prevent the development of easy-to-use and scalable solutions for context sensing.
In recent years, cameras have been deployed worldwide in growing numbers. Governments have installed denser surveillance networks for public monitoring. House and shop owners have set up more indoor cameras for safety and business-insight purposes. Mobile platforms like phones, drones, and cars are also becoming more capable as video sources.
Ubiquitous cameras can be used to sense context, but data collection and analytics are still highly manual. For data collection, it is challenging to retrieve vision data at large scale and to efficiently label its metadata (e.g., camera pose and location). Vision analytics is also labor intensive: for instance, surveillance cameras are usually monitored by humans, which is neither scalable nor sufficiently sensitive. As a result, researchers have to spend a lot of effort hiring volunteers, collecting data, and labeling datasets.
Researchers have started to leverage advanced algorithms and deep neural networks to help analyze vision data. Fast-evolving hardware also allows more complex software to run on the cloud, edge servers, and mobile devices. However, this approach still faces challenges in practice. Vision-only solutions are not always accurate, especially when the image quality or the camera position is suboptimal. Moreover, complex neural-network-based algorithms are compute-intensive and do not scale for tasks with large inputs and limited hardware resources.
                         Location Sensing      Behavior Sensing
Outdoor   w/ sensor      Gnome (Chapter 2)     (not explored)
          w/o sensor     ALPS (Chapter 3)      Caesar (Chapter 4)
Indoor    w/ sensor      TAR (Chapter 6)       Grab (Chapter 5)
          w/o sensor     (not explored)        Caesar (Chapter 4)
Figure 1.1: The design space of vision-based context-sensing.
We summarize the above challenges of today's vision-based context sensing tasks as follows: (1) Scalability: vision data for context sensing can be hard to collect at large scale; (2) Efficiency: with a huge amount of data to process, a sensing system with limited hardware resources faces long end-to-end latency and low processing speed; (3) Accuracy: vision-only and sensor-only solutions sacrifice accuracy due to their own limitations.
The thesis explores the space of designs for vision-based context sensing. The space is defined by three dimensions. The first dimension is the sensing task: we find that most tasks serve at least one of two purposes, localization and behavior sensing. Localization determines the spatio-temporal information of the target, while behavior sensing extracts the target's identity and activity. The second dimension is the sensor type: we consider tasks that use vision sensors exclusively and tasks that fuse vision with other sensors. The third dimension is the location of the task: indoor or outdoor. Indoor environments are usually more controllable in terms of sensor deployment and camera positions, while outdoor scenarios usually involve much more data and less constrained object movement. With these three dimensions, we can classify existing vision-based context sensing tasks into eight categories, as shown in Figure 1.1. Each of our prior projects focuses on a typical application scenario in one of these categories and resolves the related challenges. Figure 1.2 illustrates our work in those context sensing scenarios.
Two slots are marked in gray, indicating parts of the design space we did not explore. The top-right slot (outdoor behavior sensing with sensors and vision) is not explored because the targets in those tasks are usually arbitrary, like pedestrians and cars on the street, which makes it hard to attach extra sensors to the targets to improve sensing accuracy. Moreover, wide monitoring areas discourage deploying new sensors, so reusing existing surveillance cameras is the best option. The bottom-left slot (indoor location sensing with vision only) is skipped because our prior work (TAR) shows that, for indoor people-tracking tasks, vision-only approaches have lower accuracy than approaches that combine vision with other sensors. Therefore, the vision-only approach is not the best option for indoor location sensing.
Figure 1.2: Illustration of application scenarios of my work.
Outdoor person localization with vision and GPS sensors: Gnome [223] GPS signals suffer significant
impairment in urban canyons because of limited line-of-sight to satellites and signal reflections. In Gnome,
we focus on scalable and deployable techniques to reduce the impact of one specific impairment: reflected
GPS signals from non-line-of-sight (NLOS) satellites. Specifically, we show how, using publicly available
street-level imagery and off-the-shelf computer vision techniques, we can estimate the path inflation incurred
by a reflected signal from a satellite. Using these path inflation estimates, we develop techniques to estimate the most likely actual position given a set of satellite readings at some position. Finally, we develop optimizations for fast position estimation on modern smartphones. Using extensive experiments in the downtown areas of several large cities, we find that our techniques can reduce positioning error by up to 55% on average.
Outdoor landmark localization with vision: ALPS [196] Ideally, every stationary object or entity in the built environment should be associated with a position, so that applications can have precise spatial context
about the environment surrounding a human. In ALPS, we take a step towards this ideal: by analyzing images
from Google Street View that cover different perspectives of a given object and triangulating the location of
the object, our system, ALPS, can discover and localize common landmarks at the scale of a city accurately
and with high coverage. ALPS contains several novel techniques that help improve the accuracy, coverage, and
scalability of localization. Evaluations of ALPS in many cities in the United States show that it can localize
storefronts with a coverage higher than 90% and a median error of 5 meters.
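To make the triangulation step concrete, the following sketch (a minimal two-view illustration in a local planar coordinate frame; the function name and inputs are hypothetical, and ALPS itself combines many views with clustering and filtering) intersects two bearing rays toward a detected landmark:

import numpy as np

def triangulate_landmark(cam1, bearing1_deg, cam2, bearing2_deg):
    """Intersect two bearing rays to localize a landmark. cam1/cam2 are (x, y)
    camera positions in a local metric frame; bearings are compass-style angles
    (degrees clockwise from north) toward the detected landmark."""
    def ray(deg):
        rad = np.radians(deg)
        return np.array([np.sin(rad), np.cos(rad)])   # (east, north) direction
    p1, d1 = np.asarray(cam1, float), ray(bearing1_deg)
    p2, d2 = np.asarray(cam2, float), ray(bearing2_deg)
    # Solve p1 + t1*d1 = p2 + t2*d2 for the ray parameters t1, t2.
    t = np.linalg.solve(np.column_stack([d1, -d2]), p2 - p1)
    return p1 + t[0] * d1   # estimated landmark position (fails if rays are parallel)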
Outdoor person behavior recognition with vision: Caesar [220] Caesar is an edge-computing-based system for complex activity detection. It provides an extensible vocabulary of activities that allows users to specify complex actions in terms of spatial and temporal relationships between actors, objects, and activities.
It converts these specifications to graphs, efficiently monitors camera feeds, partitions processing between
cameras and the edge cluster, retrieves minimal information from cameras, carefully schedules neural network
invocation, and efficiently matches specification graphs to the underlying data in order to detect complex
activities. Our evaluations show that Caesar can reduce wireless bandwidth, on-board camera memory, and
detection latency by an order of magnitude while achieving good precision and recall for all complex activities
on a public multi-camera dataset.
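As a rough illustration of the graph abstraction (the data structure and the example activity below are illustrative assumptions, not Caesar's actual rule syntax, which is described in Chapter 4), a complex activity can be represented as labeled nodes connected by spatial and temporal relations:

from dataclasses import dataclass, field

@dataclass
class ActivityGraph:
    """A complex activity as a graph: nodes are elements such as actors, objects,
    and atomic actions; edges carry spatial or temporal relations."""
    nodes: dict = field(default_factory=dict)   # node id -> element label
    edges: list = field(default_factory=list)   # (source id, relation, target id)

    def add(self, node_id, label):
        self.nodes[node_id] = label

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

# Hypothetical rule: a person approaches a car and then opens its trunk.
g = ActivityGraph()
g.add("p1", "person"); g.add("c1", "car"); g.add("a1", "open_trunk")
g.relate("p1", "near", "c1")    # spatial relation
g.relate("p1", "then", "a1")    # temporal ordering of p1's actions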
Indoor shopper behavior recognition with vision and sensors: Grab [222] Grab leverages existing infrastructure and devices to detect shopping behaviors and enable cashier-free shopping, which requires accurately identifying and tracking customers and associating each shopper with the items he or she retrieves from shelves. To do this, Grab uses a keypoint-based pose tracker as a building block for identification and tracking, develops robust feature-based face trackers, and devises algorithms for associating and tracking arm movements. It also uses
a probabilistic framework to fuse readings from camera, weight and RFID sensors in order to accurately
assess which shopper picks up which item. In experiments from a pilot deployment in a retail store, Grab can
achieve over 90% precision and recall even when 40% of shopping actions are designed to confuse the system.
Moreover, Grab has optimizations that help reduce investment in computing infrastructure four-fold.
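The fusion idea can be sketched as follows (a toy example assuming independent per-sensor evidence and made-up item names; Grab's actual probabilistic model and features are described in Chapter 5):

def fuse_item_evidence(vision_p, weight_p, rfid_p):
    """Combine per-sensor association probabilities for 'this shopper picked item
    X'. Each argument maps an item name to a probability from one sensor; the
    sensors are treated as independent evidence and the result is normalized."""
    items = set(vision_p) & set(weight_p) & set(rfid_p)
    scores = {i: vision_p[i] * weight_p[i] * rfid_p[i] for i in items}
    total = sum(scores.values()) or 1.0
    return {i: s / total for i, s in scores.items()}

# Example with three candidate shelf items:
print(fuse_item_evidence({"cola": 0.5, "chips": 0.3, "soap": 0.2},
                         {"cola": 0.7, "chips": 0.2, "soap": 0.1},
                         {"cola": 0.6, "chips": 0.3, "soap": 0.1}))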
Indoor person localization and tracking with vision and sensors: TAR [221] TAR leverages widespread
camera deployment and Bluetooth proximity information to accurately track and identify shoppers in the store.
TAR is composed of four novel design components: (1) deep neural network (DNN) based visual tracking, (2) person trajectory estimation using visual features and BLE proximity traces, (3) identity matching and assignment to recognize each person's identity, and (4) a cross-camera calibration algorithm. TAR carefully
combines these components to track and identify people in real-time across multiple non-overlapping cameras.
It achieves 90% accuracy in two different real-life deployments, which is 20% better than the state-of-the-art
solution.
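TAR's trajectory matching can be illustrated with a standard dynamic time warping (DTW) distance between a camera-derived trace and a BLE proximity trace (a generic DTW sketch; the specific features and costs TAR uses are described in Chapter 6):

import math

def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences, e.g., a per-second visual
    distance trace and a BLE-derived proximity trace. A lower cost suggests the
    visual track and the BLE device more likely belong to the same person."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b (repeat b[j-1])
                                 cost[i][j - 1],      # stretch a (repeat a[i-1])
                                 cost[i - 1][j - 1])  # advance both sequences
    return cost[n][m]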
Our Contributions Overall, the thesis makes these contributions: We develop efficient and scalable solutions
(ALPS, Gnome, TAR, Caesar, and Grab) for different application scenarios in vision-based context sensing. In
those works, we show that pure vision algorithms are not accurate enough in real surveillance scenarios, for
which we can leverage low-cost sensors and cross-camera knowledge as complementary modules to achieve
high accuracy. We also find that worldwide street-level imagery can help vision algorithms scale to city-sized areas. We analyze various trade-offs among accuracy, storage, latency, energy consumption, and cost. We build prototypes, conduct experiments in real scenarios, and show that our solutions achieve state-of-the-art accuracy while being practical and scalable. We have developed the following guidelines for such vision-enabled context sensing systems:
Data Source and Platforms: Leverage existing sensing infrastructure and imagery data sources to scale data collection coverage. Exploit different sensors' characteristics and use other sensors as complements to vision data.
Architecture Design: Identify the bottleneck of the pipeline. Build the system around one high-level abstraction to preserve accuracy and flexibility. Share the workload among all computation modules to improve efficiency.
Pipeline Optimization: Apply cheap vision algorithms to reduce dependence on expensive hardware. Improve efficiency with software optimizations that exploit data and task characteristics.
Dissertation Outline In Chapter 2, we introduce Gnome, a practical approach to NLOS mitigation for GPS
positioning in smartphones. In Chapter 3, we introduce ALPS for accurate landmark positioning at city scales.
In Chapter 4, we introduce Caesar for cross-camera complex activity recognition. In Chapter 5, we introduce
Grab, a cashier-free shopping system. In Chapter 6, we introduce TAR, which enables fine-grained tracking and
targeted advertising. In Chapter 7, we present a comprehensive overview of related work in the literature.
Finally, we summarize our work and discuss the guidelines in Chapter 8.
Chapter 2
Gnome: A Practical Approach to NLOS
Mitigation for GPS Positioning in
Smartphones
2.1 Introduction
Accurate positioning has proven to be an important driver for novel applications, including navigation,
advertisement delivery, ride-sharing, and geolocation apps. While positioning systems generally work well in
many places, positioning in urban canyons remains a significant challenge. Yet it is precisely in urban canyons
in megacities that accurate positioning is most necessary. In these areas, smartphone usage is high, as is the
density of places (storefronts, restaurants, etc.), motivating the need for high positioning accuracy.
Over the last decade, several techniques have been used to improve positioning accuracy, many of which
are applicable to urban canyons. Cell tower and Wi-Fi based localization [271, 270] enable smartphones to
estimate their positions based on signals received from these wireless communication base stations. Map
matching enables positioning systems to filter off-road location estimates of cars [145, 202]. Dead-reckoning
uses inertial sensors (accelerometers and gyroscopes) to estimate travel distance, and thereby correct position
estimates [202]. Crowd-sourcing GPS readings [121] or using differential GPS systems [190, 189] can also
help improve GPS accuracy. Despite these improvements, positioning errors in urban canyons can average
15m.
That is because these techniques do not tackle the fundamental source of positioning error in urban
canyons [98, 206]: non-line-of-sight (NLOS) satellite signals at GPS receivers. GPS receivers use signals from
four or more satellites to triangulate their positions. Specifically, each GPS receiver estimates the distance
traveled by the signal from each visible satellite: this distance is called the satellite’s pseudorange. In an urban
canyon, signals from some satellites can reach the receiver after being reflected from one or more buildings.
This can inflate the pseudorange: a satellite may appear farther from the GPS receiver than it actually is. This
path inflation can be tens or hundreds of meters, and can increase positioning error.
Contributions This chapter describes the design of techniques, and an associated system called Gnome, that
revises GPS position estimates by compensating for path inflation due to NLOS satellite signals. Gnome can
be used in many large cities in the world, and requires a few tens of milliseconds on a modern smartphone to
compute revised position estimates. It does not require specialized hardware, nor does it require a phone to be
rooted. In these senses, it is immediately deployable.
This chapter makes three contributions corresponding to three design challenges: (a) How to compute
satellite path inflation? (b) How to revise position estimates? (c) How to perform these computations fast on a
smartphone?
Gnome estimates path inflation using 3D models of the environment surrounding the GPS receiver’s
position. While prior work on NLOS mitigation [243, 212, 164, 265, 232, 194, 180, 298, 120] has used
proprietary sources of 3D models, we use a little known feature in Google Street View [@streetview] that
provides depth information for planes (intuitively, each street-facing side of a building corresponds to a plane)
surrounding the receiver’s position. This source of data makes Gnome widely usable, since these planes are
available for many cities in North America, Europe and Asia. Unfortunately, these plane descriptions lack a
crucial piece of information necessary for estimating path inflation: the height of building planes. Gnome’s
first contribution is a novel algorithm for estimating building height from panoramic images provided by Street
View. Compared to prior work that determines building height from public data [181, 266], or uses remote
sensing radar data [279, 151, 148, 147], our approach achieves higher coverage by virtue of using Street View
data.
To compute the path inflation correctly, Gnome needs to know the ground truth position. However, GPS
receivers don’t, of course, provide this: they only provide satellite pseudoranges, and an estimated position.
Gnome must therefore infer the position most likely to correspond to the observed satellite pseudoranges.
To do this, Gnome’s second contribution is a technique to search candidate positions near the GPS location
estimate, revise the candidate’s position by compensating for path inflation, and then determine the revised
candidate position likely to be closest to the ground truth. This contribution is inspired by, but different from
the prior work that attempts to infer actual positions by simulating the satellite signal path [265, 232, 194], or
by determining satellite visibility [180, 298, 120].
Gnome’s third contribution is to enable these computations to scale to smartphones, a capability that, to
our knowledge, has not been demonstrated before. To this end, it leverages the observation that 3D models of
an environment are relatively static, so Gnome aggressively pre-computes, in the cloud, path inflation maps at
each candidate position. These maps indicate the path inflation for each possible satellite position, and are
loaded onto a smartphone. At runtime, Gnome simply needs to look up these maps, given the known positions
of each satellite, to perform its pseudorange corrections. Gnome also scopes the search of candidate positions
and hierarchically refines the search to reduce computation overhead.
Gnome differs from [265, 232, 194] in two ways. First, these approaches use proprietary 3D models for
ray-tracing, which are not accessible in many cities. In contrast, Gnome leverages highly available Street
View data for satellite signal tracing. Second, these approaches are offline while Gnome can compute location
estimates in real-time on Android devices.
Our evaluations of Gnome in four major cities (Frankfurt, Hong Kong, Los Angeles and New York) reveal
that Gnome can improve position accuracy in some scenarios by up to 55% on average (or up to 8m on average).
Gnome can process a position estimate on a smartphone in less than 80ms. It uses minimal additional battery
capacity, and has modest storage requirements. Gnome's cloud-based path inflation map pre-computation takes
several hours for the downtown area of a major city, but these maps need only be computed once for areas
with urban canyons in major cities. Finally, Gnome components each contribute significantly to its accuracy:
height adjustment accounts for about 3m in error, and sparser candidate position selections also increase error
by the same amount.
2.2 Background, Motivation, and Approach
How GPS works GPS is an instance of a Global Navigation Satellite System. It consists of 32 medium
earth orbit satellites, and each satellite continuously broadcasts its position information and other metadata
at an orbit of about 2x10^7 meters above the earth. The metadata specify various attributes of the signal such as the satellite position, timestamp, etc. Using these, the receiver computes, for each received signal, its pseudorange, or the signal's travel distance, by multiplying the speed of light by the signal's propagation delay. With these pseudorange estimates, GPS uses three satellites' positions to trilaterate the receiver's position in 3D coordinates. In practice, the receiver's local clock is not accurate compared with the satellites' atomic clocks, so GPS needs another satellite's signal to estimate the receiving time. Thus, a GPS receiver must be able to
receive signals from at least four satellites in order to fix its own position.
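To make the pseudorange-based fix concrete, the sketch below (a simplified illustration, not a production GPS solver; satellite and receiver positions are assumed to be in an Earth-centered Cartesian frame, and atmospheric and relativistic effects are ignored) solves for the receiver position and clock bias from four or more pseudoranges by iterative least squares:

import numpy as np

C = 299_792_458.0  # speed of light, m/s (pseudorange = C * propagation delay)

def solve_position(sat_pos, pseudoranges, x0=None, iters=10):
    """Iterative least-squares GPS fix. sat_pos: (N, 3) satellite positions in an
    Earth-centered Cartesian frame; pseudoranges: (N,) measured ranges in meters
    (N >= 4). Returns the receiver position and its clock-bias distance."""
    x = np.zeros(3) if x0 is None else np.asarray(x0, float)
    b = 0.0                                   # receiver clock bias, in meters
    for _ in range(iters):
        ranges = np.linalg.norm(sat_pos - x, axis=1)
        residual = np.asarray(pseudoranges) - (ranges + b)
        # Jacobian rows: unit vector from satellite toward receiver, plus a clock column.
        H = np.hstack([(x - sat_pos) / ranges[:, None], np.ones((len(ranges), 1))])
        dx, *_ = np.linalg.lstsq(H, residual, rcond=None)
        x, b = x + dx[:3], b + dx[3]
    return x, b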
GPS Signal Impairments GPS signals undergo four¹ different types of impairments [98] that introduce errors
in position estimates. The earth’s rotation between when the signal was transmitted and received can impact
travel time, as can the Doppler effect due to the satellite’s velocity. Ionospheric and tropospheric delays caused
by the earth’s atmosphere can inflate pseudoranges. Multipath transmissions, where the same signal is received
directly from a satellite and via reflection, can cause constructive or destructive interference and introduce
errors in the position fix. Finally, a receiver may receive a signal, via reflection, from an NLOS satellite.
Many modern receivers compensate for, either in hardware or software, the first two classes of errors.
Specifically, GPS receivers can compensate for earth’s rotation and satellite Doppler effects. GPS signals also
contain metadata that specify approximate corrections for atmospheric delays. Higher accuracy applications
that need to eliminate such correlated errors (the atmospheric delay is correlated in the sense that two receivers
within a few kilometers of each other are likely to see the same coordinated delays) can use either Differential
GPS or Real-Time Kinematic GPS. Both of these approaches use base stations whose precise position is known
a priori. Each base station can estimate correlated errors based on the difference between its position calculated
from GPS, and its actual (known) position. It can then broadcast these corrections to nearby receivers, who
can use these to update their position estimates.
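A toy sketch of the differential idea follows (illustrative only: real DGPS/RTK systems broadcast per-satellite pseudorange or carrier-phase corrections rather than a single position offset):

import numpy as np

def base_station_correction(surveyed_pos, gps_computed_pos):
    """The base station's estimate of the locally correlated error: the offset
    between its GPS-computed position and its precisely surveyed position."""
    return np.asarray(surveyed_pos, float) - np.asarray(gps_computed_pos, float)

def apply_correction(receiver_gps_pos, broadcast_correction):
    """A nearby receiver shifts its own fix by the broadcast correction, cancelling
    errors (e.g., atmospheric delay) it shares with the base station."""
    return np.asarray(receiver_gps_pos, float) + broadcast_correction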
Two other sources of error, multipath and NLOS reflections are not correlated, so different techniques must
be used to overcome them. These sources of error are particularly severe in urban canyons [232, 180, 298].
Urban Canyons To understand how NLOS reflections can impact positioning accuracy, consider Figure 2.1(a)
in which a satellite is within line-of-sight (LOS) of a receiver, so the latter receives a direct signal. If a tall
building blocks the LOS path, the satellite signal may still be received after being reflected, causing NLOS reception (Figure 2.1(b)). Thus, depending on the environment and the receiver's position, a satellite's primary
received signal can either be direct or reflected. In addition, the primary signal can itself be reflected (more
than once), resulting in multipath receptions (Figure 2.1(c)).
To mitigate the impact of multipath, GPS receivers use multipath correctors ([157, 321, 317]) that use
signal phase to distinguish (and filter out) the reflected signal from the primary signal. Modern receivers can
reduce the impact of multipath errors to a few meters.
However, when the primary signal is reflected (i.e., the signal is from an NLOS satellite), the additional
distance traveled by the signal due to the reflection can inflate the pseudorange estimate. The yellow lines
in Figure 2.1 represent reflected signal paths, which are longer than the primary paths shown in green. The
difference in path length between these two signals can often be 100s of meters. Unfortunately, GPS receivers
¹ We have simplified this discussion. Additional sources of error can come from clock skews, receiver calibration errors, and so forth [98].
Figure 2.1: (a) A line-of-sight (LOS) signal path. (b) A non-line-of-sight (NLOS) signal path. (c) Multipath
Figure 2.2: An example of localization results in an urban canyon on today's smartphone platforms. Green: ground truth trace; blue: iOS trace (average error 14.4m); red: Android trace (average error 15.1m).
cannot reliably distinguish between reflected and direct signals, and this is the primary cause of positioning
error in urban areas. This chapter focuses on NLOS mitigation for GPS positioning.
Alternative approaches To mitigate NLOS reception errors in urban canyons, smartphones use several
techniques to augment position fixes. First, they use proximity to cellular base stations [163] or Wi-Fi access
points [117] to refine their position estimates. In this approach, smartphones use multilateration of signals
from nearby cell towers or Wi-Fi access points whose position is known a priori in order to estimate their own
position. Despite this advance, positioning errors in urban areas can be upwards of 15m, as our experiments
demonstrate in Figure 2.2.
Second, for positioning vehicles accurately, smartphones use map matching and dead-reckoning [202]
to augment position fixes. Map matching restricts candidate vehicle positions to street surfaces, and dead-
reckoning uses vehicle speed estimates to update positions when GPS is unavailable or erroneous. While these
techniques achieve good accuracy, they are not applicable to localizing pedestrians in urban settings.
Finally, as we have discussed above, approaches like Differential GPS and Real-time Kinematics assume
correlated error within a radius of several hundred meters or several kilometers. Errors due to NLOS receptions
are not correlated over these large spatial scales, so these techniques cannot be applied in urban canyons.
Goal, Approach and Challenges The goal of this chapter is to develop a practical and deployable system
for NLOS mitigation on smartphones. To be practical, such a system must not require proprietary sources of
information. To be deployable, it must, in addition, be capable of correcting GPS readings efficiently on the
smartphone itself.
Our approach is motivated by the following key insight: If we can determine the extra distance traveled
by an NLOS signal, we can compensate for this extra distance, and recalculate the GPS location on the
smartphone.
This insight poses three distinct challenges. The first is how to determine which satellites are NLOS and compute the extra travel distance. While satellite trajectories are known in advance, whether a satellite is within
line-of-sight at a given location L depends upon the portion of the sky visible at L, which in turn depends on
the position and height of buildings around L. This latter information, also called the surface geometry at L
can be used to derive the set of surfaces that can possibly reflect satellite signals so that they are incident at L.
The second challenge is how to compensate for the extra distance traveled by an NLOS signal (we use
the term path inflation to denote this extra distance) incident at a location L. This is a challenge because a
smartphone cannot, in general, know the location L: it only has a potentially inaccurate estimate of L. To
correctly compensate for path inflation, the smartphone has to determine that the location whose predicted
reflected signal best explains the GPS signals observed at L.
The final challenge is to be able to perform these corrections on a smartphone. Determining the visibility
mask and the surface geometry is significantly challenging in terms of both computation and storage, particularly at the scale of a large downtown area of several square kilometers, and especially because these are functions
of L (i.e., each distinct location in an urban area has a distinct visibility mask and surface geometry). These
computing and storage requirements are well beyond the capability of today’s smartphones.
In the next section, we describe the design of a system called Gnome that addresses all of these challenges,
while providing significant performance improvements in positioning accuracy over today’s smartphones.
Prior work in this area has fallen into two categories: those that filter out the NLOS signal [289, 212, 195],
and those that compensate for the extra distance traveled [141, 194, 234, 110]. Gnome falls into the latter
class, but is unique in addressing our deployability goals.
Figure 2.3: Gnome workflow. In the cloud, Gnome processes panoramas and 3D planes through building height adjustment and path inflation estimation to produce an inflation model; on the user's phone, Gnome combines this model with raw GPS measurements for location prediction.
2.3 Gnome Design
In this section, we describe the design of Gnome. We begin by describing how Gnome addresses the challenges
identified, then describe the individual components of Gnome.
2.3.1 Overview
As described above, Gnome detects whether a satellite is within line-of-sight or not, and for NLOS satellites, it
estimates and compensates for the extra travel distance for the NLOS signal. Gnome is designed to perform all
of these calculations entirely on a smartphone.
To determine whether, at a given location L, a satellite is NLOS or not, and to compute the path inflation,
Gnome uses the satellite’s current position in the sky as well as the surface geometry around L. Specifically,
with a 3-D model of the buildings surrounding L, Gnome can determine whether a satellite’s signal might
have been reflected from any building by tracing signal paths from the satellite to L, and use that to compute
the path inflation. There exist public services to precisely determine the position of a satellite at a given time.
Less well known is the fact that there also exist public sources for approximate surface geometry: specifically,
Google Street View [97] provides both 2-D imagery of streets as well as 3-D models (as an undocumented
feature) of streets for the downtown area of most large cities in the US, Europe, and Asia. These 3-D models
are, however, incomplete: they lack building height information, which is crucial to trace reflected signals from
satellites. In the section below, we describe how we use computer vision techniques to estimate a building’s
height. The availability of public datasets with 3-D information makes Gnome widely applicable: prior work
has relied on proprietary datasets, and so has not seen significant adoption.
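As an illustration of how surface geometry yields an NLOS decision (a simplified sketch that assumes each building surface has already been summarized by its azimuth span, distance, and estimated height; this is not the exact Street View plane format):

import math

def skyline_elevation(planes, azimuth_deg):
    """planes: list of (azimuth_from_deg, azimuth_to_deg, distance_m, height_m)
    summarizing the building surfaces around location L (wrap-around at 0/360
    degrees is ignored for brevity). Returns the skyline elevation angle, in
    degrees, along the given azimuth."""
    elev = 0.0
    for az_from, az_to, dist, height in planes:
        if az_from <= azimuth_deg <= az_to:
            elev = max(elev, math.degrees(math.atan2(height, dist)))
    return elev

def is_nlos(planes, sat_azimuth_deg, sat_elevation_deg):
    """A satellite is treated as NLOS when it sits below the skyline, i.e., a
    building blocks the direct path (cf. the skyline in Figure 2.4)."""
    return sat_elevation_deg < skyline_elevation(planes, sat_azimuth_deg)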
To compensate for the NLOS path inflation on a smartphone, Gnome leverages the fact that modern mobile
OSes expose important satellite signal metadata such as what satellite signals were received and their relative
strength [112]. Gnome uses this information. The metadata, however, does not inform Gnome whether the
satellite was LOS or NLOS. To determine whether a satellite is NLOS at L, Gnome can use the derived surface
geometry. Unfortunately, the GPS signal only gives Gnome an estimate of the true location L, so Gnome
cannot know the exact path inflation. To address this challenge, Gnome searches within a neighborhood of the
GPS-provided location estimate L
est
to find a candidate for the ground truth L
c
whose positioning error after
path inflation adjustment is minimized.
Our third challenge is to enable Gnome to run entirely on a smartphone. Clearly, it is unrealistic to load
models of surface geometry for every point in areas with urban canyons. We observe that, while satellite
positions in the sky are time varying, the surface geometry at a given location L is relatively static. So, we
pre-compute the path inflation, on the cloud, of every point on the street or sidewalk from every possible
location in the sky. As we show later, this scales well in downtown areas of large cities in the world because in
those areas tall buildings limit the portion of the sky visible.
Gnome is implemented (Figure 2.3) as a library on smartphones (our current implementation runs on
Android) which, given a GPS estimate and satellite visibility information, outputs a corrected location estimate.
The library includes other optimizations that permit it to process GPS estimates within tens of milliseconds.
2.3.2 Data sources
Gnome uses three distinct sources of data. First, whenever Gnome needs a position fix, it uses the smartphone
GPS API to obtain the following pieces of information:
Latitude, longitude, and error: The latitude and longitude specify the estimated position, and the error
specifies the position uncertainty (the actual position is within a circle centered at the estimated position, and
with radius equal to the error).
Satellite metadata: This information (often called NMEA data [106]) includes each satellite's azimuth and
elevation, as well as the signal strength represented as the carrier-to-noise density, denoted C/N0 [204].
Figure 2.4 shows an example of satellite metadata obtained during our experiments. Each square represents a
satellite, with the number as the satellite ID. The color of the square indicates the satellite's signal strength:
green is very good (C/N0 > 35), yellow is fair (25 < C/N0 < 35), and red is bad (C/N0 < 25). The blue line
denotes the skyline for a particular street in our data. Notice that satellite number 6, which is NLOS with
respect to the center, still has good carrier-to-noise density, so this metric is not a good discriminator for
NLOS satellites.
Propagation delay and pseudorange: This contains, for each satellite, the estimated propagation delay and
pseudorange for the received signal. This data is read from the phone's GPS module and has become accessible
since a recent release of Android. This information is crucial to Gnome, as we shall describe later.
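To make the satellite metadata concrete, the following Python sketch parses one NMEA GSV sentence into per-satellite azimuth, elevation, and C/N0. It is only an illustration of the field layout; Gnome itself reads this metadata through the Android location APIs, and the example sentence values are made up.

```python
def parse_gsv(sentence):
    """Parse one NMEA $GPGSV sentence into per-satellite metadata.

    Returns a list of dicts with satellite ID, elevation (deg),
    azimuth (deg), and carrier-to-noise density C/N0 (dB-Hz).
    """
    body = sentence.split('*')[0]          # drop the trailing checksum
    fields = body.split(',')
    sats = []
    # Fields 4..end come in groups of four: PRN, elevation, azimuth, C/N0.
    for i in range(4, len(fields) - 3, 4):
        prn, elev, azim, cn0 = fields[i:i + 4]
        if prn:
            sats.append({
                'id': int(prn),
                'elevation': float(elev) if elev else None,
                'azimuth': float(azim) if azim else None,
                'cn0': float(cn0) if cn0 else None,   # None when not tracked
            })
    return sats

# Example sentence (field values are illustrative; the checksum is not validated):
print(parse_gsv("$GPGSV,3,1,11,02,40,295,35,06,28,170,33,12,15,060,22,13,74,010,41*70"))
```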
Second, Gnome uses street-level imagery data available through Google Street View. This cloud service,
Figure 2.4: The skyline and satellites locations seen by a receiver. The receiver is at the center of the circle and the
squares represent satellites. The numbers represent satellite IDs and color signifies signal strength.
when provided with a location L, returns a panoramic image around L.
Third, but most important, Gnome uses an approximate 3D model available through a separate Street View
cloud service [96]. This service, given a location L returns a 3-D model of all buildings or other structures
around that point. This model is encoded (Figure 2.5 top) as a collection of planes, together with their depth
(distance from L). Intuitively, each plane represents one surface of a building. The depth information is at the
resolution of 0.7° in both azimuth and elevation, and has a maximum range of about 120m [127].
Effectively, the 3D model describes the surface geometry around L, but it has one important limitation.
The maximum height of a plane is 16m, a limitation arising from the range of the Street View scanning device.
This limitation is critical for Gnome, because many buildings in urban canyons are an order of magnitude or
more taller, and a good estimate of plane height is important for accurately determining satellite visibility.
Google Earth [20] also provides a 3D map with models of buildings. As of this writing, extracting these
3D models is labor-intensive [49] and does not scale to large cities. Moreover, its 3D models do not cover most
countries in Asia, Europe, and South America, whereas Street View coverage is available in these continents.
We leave it to future work to include 3D models extracted from Google Earth.
Figure 2.5: Upper: the original depth information (each colored plane represents one surface of a building), together
with the missing height information in yellow. Lower: the corresponding panoramic image.
Figure 2.6: Building height estimation. (a) shows how the vector D is estimated from the surface geometry. (b) shows
how the panoramic image can be used for skyline detection and θ estimation. (c) shows how building height is estimated
using D and θ.
2.3.3 Estimating Building Height
Figure 2.5 shows how the 3D model’s height differs from the actual height of a building: the yellow line above
the planes is the actual height. To understand why obtaining height information for planes is crucial, consider
a GPS receiver that is 15m away from a 30m building. All satellites behind the building with an elevation less
than 63° will be blocked from the view of the receiver. However, with the planes we have from Street View, all
satellites above 45° will still be LOS. Thus, our model may wrongly estimate an NLOS satellite as LOS, and
may compensate for the path inflation where no such inflation exists.
Gnome leverages Street View’s panoramic images to solve the issue. The basic idea is to estimate the
building’s height by detecting skylines in the Street View images, and then extending each 3D plane to the
estimated height. After extracting the surface geometry for a given location L, Gnome selects all planes whose
Figure 2.7: For robust height estimation, Gnome uses three viewpoints to estimate height, then averages these estimates.
reported height in the surface geometry is more than 13m: with high likelihood, such planes are likely to
correspond to buildings higher than 15m. Next, for each such tall plane, Gnome computes its latitude and
longitude (denoted by L_plane) on the map. It does this using the location L and the relative location of the plane
from L (available as depth information in the 3D model).
Gnome then selects a viewpoint L_v near the plane and computes the vector D from L_v to L_plane, the
centroid of the plane. This vector contains two pieces of information: (a) it specifies the heading of the
plane relative to L_v, and (b) it specifies the distance from L_v to the plane. Gnome then downloads the Street
View image at L_v and identifies the plane in the image using the heading information in D. Figure 2.6 shows
an example of this calculation. Figure 2.6(a) shows the satellite view at a given location L and the heading
vector for the plane (the blue box). The heading vector has azimuth φ. Figure 2.6(b) shows how Gnome uses
φ to find the plane's horizontal location x in the corresponding panoramic image from Street View.
Now, Gnome runs a skyline detection algorithm on the image. Skyline detection demarcates the sky from
other structures in an image. Figure 2.7 shows the output of skyline detection on some images. Intuitively, the
part of the skyline that intersects with the plane delineates the actual height of the plane. Thus, in the three
images on the right of Figure 2.7, the intersection between the blue skyline and the red plane signifies the top
of the building. Unfortunately, this intersection is visible in a two-dimensional image, whereas we need to
augment the height of a plane in a 3D map.
We use simple geometry to solve this (Figure 2.6(b)). Recall that the vector D also encodes the distance d
from L_v to the plane L_plane. To estimate the height, we need to estimate the angle of elevation of the intersection,
θ. We estimate this using a property of Street View's panorama images. Specifically, for a W × H Street View
panorama image, a pixel at (x, y) corresponds to a ray with azimuth (x/W) × 360° and elevation ((H/2 − y)/H) × 180°.
So, to estimate θ, we first find the pixel(s) in the panoramic image corresponding to the intersection between
Figure 2.8: The adjusted depth planes, augmented with the estimated height.
the skyline and the plane. θ is then the elevation of that pixel, and we estimate the height of the building as
d·tan(θ). Figure 2.8 shows the corrected height of all the planes in the surface geometry of one viewpoint.
In practice, because the target plane may be occluded by trees or other obstructions, we run the above
procedure on three viewpoints (Figure 2.7) near L_plane and estimate the height of the building using the average
of these three estimates.
At the end of this procedure, for a given location L, we will have an accurate surface geometry with better
height estimates than those available with Google Street View.
2.3.4 Estimating Path Inflation
To estimate the path inflation, Gnome uses ray tracing [113]. Consider a satellite S_i and a reflection plane P_j.
As Figure 2.9 shows, if the receiver is at position R, Gnome first computes its mirror point with respect to P_j,
denoted by R'. Then, it initializes a ray to S_i starting from R' and intersecting P_j at point N_{i,j}. For this ray to
represent a valid reflection of a signal from S_i to R on plane P_j, three properties must hold. First, N_{i,j} must be
within the convex hull (or the boundary) of the plane P_j. Second, the ray from N_{i,j} to S_i must not intersect any
other plane. Finally, the ray from N_{i,j} to R must not intersect any other plane.
For each ray that represents a valid reflection, Gnome employs geometry to calculate the path inflation
(shown by the solid red arrows in Figure 2.9). For a given plane, a given satellite, and a given receiver position,
there can be at most one reflected ray. For a given receiver R, Gnome repeats this computation for every pair
of S_i and P_j, and obtains a path inflation in each case. It also computes whether S_i is within line-of-sight of R:
this is true if the ray from S_i to R does not intersect any other plane.
These calculations can result in two possibilities for S_i with respect to R. If S_i is within LOS of R,
there may be one or more planes which provide valid reflections of the signal
from S_i to R. In this case, Gnome ignores these reflected signals, since they constitute multipath, and modern
GPS receivers have multipath rejection capabilities. Thus, when there is a LOS path, the path inflation is always
assumed to be zero.
Figure 2.9: An example of ray-tracing and path inflation calculation.
The second possibility is that S_i is not within line-of-sight of R. In this case, R can, in theory, receive
multiple NLOS reflections from different planes. In practice, however, because of the geometry of buildings
and streets, a receiver R will often receive only one reflected signal. This is because building surfaces are either
parallel or perpendicular to the street. If a street runs north-south, and a satellite is on the western sky, then, if
the receiver is on the street, it will receive a reflection only from a plane on the east side of the street. However,
in some cases, it is possible to receive more than one signal. In our example, if there is a gap between two tall
buildings on the west side of the street, then it is possible for that signal to be reflected by a plane on one of
those buildings perpendicular to the street (in addition to the reflection from the east side). In cases like these,
Gnome computes path inflation using the plane nearer to R.
A signal from S_i can, of course, be reflected off multiple planes before reaching R. In our example, the
signal may first be reflected from a building on the east side, and then again from a building on the west side of
the street. If a signal can reach the receiver after a single reflection and after a double reflection, the calculated
pseudorange will be very close to the single-reflection trace because of multipath mitigation. If the signal
can only reach the receiver after two reflections, the signal strength will be low (< 20dB) and will always
be ignored in position computation [144]. For this reason, Gnome only models single reflections to reduce
computational complexity.
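The single-reflection test can be sketched as follows. This is a simplified illustration with our own function names: it mirrors the receiver across a candidate plane, intersects the ray toward the satellite, and returns the extra path length; the convex-hull and occlusion checks against other planes are delegated to a caller-supplied predicate.

```python
import numpy as np

def mirror_point(R, plane_point, plane_normal):
    """Mirror the receiver position R across the reflecting plane."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return R - 2.0 * np.dot(R - plane_point, n) * n

def reflection_inflation(R, S, plane_point, plane_normal, in_plane_bounds):
    """Return the extra path length for a single reflection off one plane,
    or None if the plane cannot produce a valid reflection.

    in_plane_bounds(N) is a caller-supplied test that the intersection lies
    within the plane's convex hull (and, in the full system, that the two
    ray segments are not blocked by other planes).
    """
    n = plane_normal / np.linalg.norm(plane_normal)
    R_mirror = mirror_point(R, plane_point, n)
    d = S - R_mirror
    denom = np.dot(d, n)
    if abs(denom) < 1e-9:
        return None                      # ray parallel to the plane
    t = np.dot(plane_point - R_mirror, n) / denom
    if not (0.0 < t < 1.0):
        return None                      # no crossing between R' and S
    N = R_mirror + t * d                 # reflection point on the plane
    if not in_plane_bounds(N):
        return None
    reflected = np.linalg.norm(S - N) + np.linalg.norm(N - R)
    direct = np.linalg.norm(S - R)
    return reflected - direct            # the path inflation
```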
2.3.5 Location Prediction
In this section, we describe how Gnome uses the path inflation estimates to improve GPS positioning accuracy.
Recall that the GPS receiver computes a position estimate from satellite metadata including pseudoranges for
satellites. At a high level, Gnome subtracts the path inflation from the pseudorange for each satellite, then
computes the new GPS position estimate.
However, the reported satellite pseudoranges correspond to the ground truth position of the receiver, which
is not known! More precisely, let the estimated position be L_e and the ground truth be L_g. The path inflation
of satellite S_i for these two points can be different. If we use the path inflation from L_e, but apply it to
pseudoranges calculated at L_g, we will obtain incorrect position estimates.
2.3.5.1 Searching Ground Truth Candidates
To overcome this, Gnome selects several candidate positions within the vicinity of L_e (we describe below
how these candidate positions are chosen) and effectively tests whether a candidate location could be a viable
candidate for L_g, the ground truth position. At each candidate L_c, it (a) reduces the pseudorange of each
NLOS satellite by the computed path inflation and (b) recomputes the GPS position estimate with the revised
pseudoranges. This gives a new position estimate L'_c. Gnome chooses that L'_c (as the estimate for L_g) whose
distance to its corresponding L_c is least. Figure 2.11 shows the heatmap of the relative distances between L_c
and L'_c for candidates in a downtown area: notice how candidates closest to the ground truth have low
relative distances. In practice, for a reason described below, Gnome actually uses a voting strategy: it picks the
five candidate positions with the lowest relative distance, clusters them, and uses the centroid of the cluster as
the estimated position.
The intuition for this approach is as follows. When a candidate L_c is close to (within a few meters of) the
ground truth, its reflections are most likely to be correlated with the ground truth location L_g. In this case, the
candidate's estimated position L'_c will likely converge to the true ground truth position. Because L_c is close to
L_g, the distance between L_c and L'_c will be small (e.g., candidate 1 in Figure 2.10). However, when L_c is far
away from L_g, Gnome corrects the pseudoranges observed at L_g with corrections appropriate for reflections
observed at L_c. In this case, Gnome is likely to go astray since a LOS signal at L_g may actually be an NLOS
signal at the candidate position, or vice versa. In these cases, Gnome is likely to apply random corrections
at the candidate positions (e.g., candidates 2 and 3 in Figure 2.10). Because these corrections are random,
the relative distances for distant candidate positions are unpredictable: they might range from small to large
values. To filter randomly obtained small relative distances, Gnome uses the voting strategy described above.
For example, in Figure 2.11, the grid point at the bottom right of the heatmap shows up among the top three
candidate positions with the lowest distance, for exactly this reason.
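A minimal sketch of the candidate search and voting step is shown below. The solve_position callback stands in for the GPS least-squares solver (Gnome uses the UBlox library for this), and inflation_map stands in for the precomputed path inflation maps described later; all names and interfaces are illustrative rather than Gnome's actual code.

```python
import numpy as np

def predict_location(candidates, pseudoranges, sat_positions,
                     inflation_map, solve_position, k=5):
    """Pick a revised position estimate by testing candidate ground truths.

    candidates: list of candidate positions near the raw GPS fix.
    pseudoranges: dict satellite_id -> measured pseudorange (m).
    inflation_map: callable (candidate, satellite_id) -> precomputed path
                   inflation in meters (zero for LOS satellites).
    solve_position: caller-supplied least-squares GPS solver mapping the
                    corrected pseudoranges to a position.
    """
    scored = []
    for c in candidates:
        corrected = {sid: pr - inflation_map(c, sid)
                     for sid, pr in pseudoranges.items()}
        revised = solve_position(corrected, sat_positions)
        scored.append((np.linalg.norm(np.asarray(revised) - np.asarray(c)),
                       revised))
    # Voting: keep the k candidates with the smallest relative distance and
    # output the centroid of their revised estimates.
    scored.sort(key=lambda s: s[0])
    top = np.array([rev for _, rev in scored[:k]])
    return top.mean(axis=0)
```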
2.3.5.2 Position Tracking
As described so far, each predicted location is independent of the previous one. However, most smartphone
positioning applications require continuously tracking a user’s location. So, many navigation services, and,
more generally, most localization algorithms for robots, drones or other mobile targets, smooth successive
position estimates by using a Kalman filter. Gnome uses a similar technique to improve accuracy in tracking
Figure 2.10: After adjusting pseudoranges, candidate positions nearer the ground truth will have estimates that
converge to the ground truth, while other candidate positions will have random corrections.
Figure 2.11: Heatmap of the relative distance between candidate positions and the revised candidate positions.
Candidates close to ground truth have small relative distances.
the smartphone user. Specifically, a Kalman filter treats the input state as an inaccurate observation and
produces a statistically optimal estimate of the real system state. When Gnome computes a new location, its
Kalman filter takes that location as input and outputs a revised estimate of the actual position. We refer the
reader to [103] for details on Kalman filtering.
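For concreteness, the following is a minimal 2-D constant-velocity Kalman filter of the kind described above; the noise parameters are placeholders, not Gnome's tuned values.

```python
import numpy as np

class TrackSmoother:
    """A minimal constant-velocity Kalman filter for 2-D position tracking."""

    def __init__(self, dt=1.0, meas_var=25.0, accel_var=1.0):
        self.x = None                                  # state [px, py, vx, vy]
        self.P = np.eye(4) * 100.0
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.R = np.eye(2) * meas_var                  # measurement noise
        self.Q = np.eye(4) * accel_var                 # coarse process noise

    def update(self, z):
        """z: the position produced by the candidate-voting step (meters, local frame)."""
        z = np.asarray(z, dtype=float)
        if self.x is None:
            self.x = np.array([z[0], z[1], 0.0, 0.0])
            return z
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct.
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```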
2.3.6 Scaling Gnome
As described, algorithms in Gnome to estimate building height and path inflation, as well as to predict location
can be both compute and data intensive. 3D models can run into several tens of gigabytes, and ray-tracing is
a computationally expensive operation. In this section, we describe how we architect Gnome to enable it to
run on a smartphone. We use two techniques for this: pre-processing in the cloud, and scoped refinement for
candidate search.
2.3.6.1 Preprocessing in the cloud
Gnome preprocesses surface geometries in the cloud to produce path inflation maps. Because surface
geometries are relatively static, these path-inflation maps can be computed once and reused by all users.
The input to this preprocessing step is a geographic area containing urban canyons (e.g., the downtown
area of a large city). Given this input, Gnome first builds the 3D model of the entire area. To do this, it first
downloads street positions and widths from OpenStreetMaps [108], and then retrieves Street View images
and 3D models every 5m or so along every street in the area, similar to the technique used in [196]. At each
retrieval location, it augments its 3D model with the adjusted building height. At the end of this process,
Gnome has a database of 3D models for each retrieval point.
In the second step of pre-processing, Gnome covers every street and sidewalk with a fine grid of candidate
positions. These candidate positions are used in the location prediction algorithm. Our implementation uses a
grid size of 2m × 2m. For each candidate position, Gnome pre-computes the path inflation for every possible
satellite position. Specifically, it does this by using the pre-computed 3D models in the previous step and does
ray tracing for every point (at a resolution of 1° in azimuth and elevation) in a hemisphere centered at the
candidate position to determine the path inflation.
The output of these two steps is a path inflation map: for each candidate position, this map contains the
path inflation from every possible satellite position. This map is pre-loaded onto the smartphone. The map
captures the static reflective environment around the candidate position. This greatly simplifies the processing
on the smartphone: when Gnome needs to adjust the pseudorange for satellite S_i at candidate position L_c, it
determines S_i's location from the satellite metadata sent as part of the GPS fix, and uses that to determine the
path inflation from L_c's path inflation map.
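The precomputation can be sketched as a per-candidate table build, with a rounding lookup at run time. The inflation_fn callback is assumed to wrap the single-reflection ray-tracing routine sketched earlier (returning zero for LOS directions); the 1° hemisphere resolution matches the text, and all names are illustrative.

```python
import itertools

def build_inflation_map(candidate, planes, inflation_fn, az_step=1, el_step=1):
    """Precompute path inflation for one candidate position.

    Traces every direction in the upper hemisphere at 1-degree resolution so
    the phone only needs a table lookup at run time.
    """
    table = {}
    for az, el in itertools.product(range(0, 360, az_step),
                                    range(0, 90, el_step)):
        table[(az, el)] = inflation_fn(candidate, az, el, planes)
    return table

def lookup_inflation(table, sat_azimuth, sat_elevation):
    """Run-time lookup: round the satellite direction to the table resolution."""
    key = (int(round(sat_azimuth)) % 360, min(int(round(sat_elevation)), 89))
    return table[key]
```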
2.3.6.2 Scoped refinement for candidate search
Even with path inflation maps, Gnome can require significant overhead on a smartphone because it has to
search a potentially large number of candidate positions in its location prediction phase. Gnome first scopes
the candidate positions to be within the error range reported by the GPS device — recall that GPS receivers
report a position estimate and an error radius. In urban canyons, however, the radius can be large and include
hundreds or thousands of candidate positions. To further optimize search efficiency, we use a coarse-to-fine
refinement strategy. We first consider candidate positions at a coarser granularity (e.g., one candidate position
in every 8m × 8m grid) and select the best candidate. We then repeat the search at a finer spatial scale
around the best candidate selected in the previous step. This reduces computational overhead by a factor of 20.
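The coarse-to-fine search can be sketched as two grid passes, where score_fn is the relative distance used for candidate ranking (smaller is better). The helper below is illustrative; Gnome's implementation differs in details such as grid alignment and boundary handling.

```python
def scoped_search(center, error_radius, score_fn, coarse=8.0, fine=2.0):
    """Coarse-to-fine candidate search within the GPS-reported error radius."""

    def grid(origin, radius, step):
        # Square grid of candidate positions around origin, spaced by `step`.
        n = int(radius // step)
        return [(origin[0] + i * step, origin[1] + j * step)
                for i in range(-n, n + 1) for j in range(-n, n + 1)]

    # Pass 1: coarse grid over the full error radius.
    best = min(grid(center, error_radius, coarse), key=score_fn)
    # Pass 2: fine grid only around the best coarse candidate.
    best = min(grid(best, coarse, fine), key=score_fn)
    return best
```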
2.4 Evaluation
Using a full-fledged implementation of Gnome, we evaluate its accuracy improvement in 4 major cities in
North America, Europe and Asia. We also quantify the impact of individual design choices, and the overhead
incurred in estimating building height, path inflation, and in location prediction.
2.4.1 Methodology
2.4.1.1 Implementation
Our implementation of Gnome has two components, one on Android and the other on the cloud and requires
2100 lines of code in total. The cloud-side component performs data retrieval (for both depth information and
Street View images), height adjustment, and path inflation computations. The smartphone component pre-loads
path inflation maps and performs pseudorange adjustments at each candidate position, recomputes the revised
location estimate using the Ublox API, and performs voting to obtain the estimated location. Currently, Gnome
directly outputs its readings to a file. On an unrooted phone, Android does not permit Gnome to be run in the
background as a separate service, so apps would have to incorporate its source code in order to use Gnome.
2.4.1.2 Metrics
We measured several aspects of Gnome including: positioning accuracy measured as the average distance
between estimated position and ground truth; processing latency of various components, both on the cloud and
mobile; power consumption on the smartphone; and storage usage on the smartphone for the path inflation
maps.
2.4.1.3 Scenarios and Ground Truth collection
To evaluate accuracy, we take measurements using Gnome in the downtown areas of four major cities in three
continents: Los Angeles, New York, Frankfurt, and Hong Kong. In most of these cities, we have measurements
from a smartphone carried by a pedestrian. These devices include Huawei Mate10, Google Pixel, Samsung S8,
and Samsung Note 8. In Los Angeles, we also have measurements from a smartphone on a vehicle, and from
a smartphone on a stationary user. In the same city, we have measurements both on an Android device and
an iPhone. Across our four cities, the total walking distance is 4.7km and the total driving distance is 9.3km.
Figure 2.12 shows the pedestrian traces collected in the four cities.
For the stationary user, we placed the phone at ten fixed known locations for 1 minute in each and recorded
the locations output by the Gnome app. For the pedestrian experiment, tester A walks while holding the phone
Figure 2.12: The ground truth traces (yellow), Android output (blue), and Gnome output (red) in the four cities.
City          % of observation points w/ satellites >= 4    % of LOS satellites
Los Angeles   10.1                                          22.8
New York      17.2                                          30.5
Frankfurt     13.0                                          26.8
Hong Kong     15.7                                          29.0
Total         14.4                                          27.5
Figure 2.13: Statistics of NLOS signals in the dataset.
running the Gnome app. Meanwhile, another tester B follows A and takes a video of A. We use the video to
manually determine the ground truth position of A: we pinpoint this location manually on a map to determine
the ground truth. For the driving experiment, we place the phone in car’s cup holder and drive in the target
area. To collect the ground truth location of the car, we use a somewhat unusual technique: we attach a stereo
camera [119] to the car and use the 3D car localization algorithm described in [250]. The algorithm can
estimate ground truth positions with sub-meter accuracy, sufficient for our purposes.
2.4.2 Results
Before we describe our results, we describe some statistics about satellite visibility and path inflation in our
dataset. These results quantitatively motivate the need for Gnome, and also give some context for our results.
In our data, only 14.4% of the observation points can see at least four LOS satellites, and only 27.5%
of all the received satellites are LOS satellites (Figure 2.13). Thus, urban canyons in large cities contain
significant dead spots for GPS signal reception. This also motivates our design decision to not omit NLOS
satellites as other work has proposed: since four satellites are required for position fixes, omitting an NLOS
satellite would render unusable nearly 85% of our readings.
Figure 2.14 shows the CDF of path inflation across the 4 cities in our dataset. Depending on the city,
Figure 2.14: Extra NLOS signal path distribution.
between 10% and 15% of the observations incur a path inflation of more than 50m. For two of the cities, 7-8%
incur a path inflation of over 100m because the signal is reflected from a building plane that is far from the
receiver and therefore leads to a large increase in pseudorange. This suggests that correcting these inflated
paths can improve accuracy significantly, as we describe next.
2.4.2.1 Accuracy
Figure 2.15 compares the average positioning error of Gnome, Android, and iPhone in three settings (stationary,
walking, driving) for Los Angeles. We use the track recording apps for Android [22] and iPhone [36]
as location recorders. In the stationary setting, Gnome incurs an average error of less than 5m while Android
and iOS incur more than twice that error. This setting directly quantifies the benefits of compensating for
path inflation. More important, this shows how much of an improvement is still left on the table after all the
optimizations that smartphones incorporate, including cell tower and Wi-Fi positioning, and map matching [104].
The gains in the pedestrian setting are relatively high: Gnome incurs only a 6.4m error, while the iPhone’s
error is the highest at 14.4m. In this setting, in addition to compensating for path inflation, Gnome also benefits
from trajectory smoothing using Kalman filters.
Gnome's performance while driving in Los Angeles is comparable to that of Android and iOS. We conjecture
that this is because today’s smartphones use dead-reckoning based on inertial sensors [99], and this appears to
largely mask inaccuracy due to NLOS satellites. These benefits aren’t evident in our walking experiments,
where the phone movement likely makes it difficult to dead-reckon using accelerometers and gyroscopes.
Across our other cities also (Figure 2.16), Gnome consistently shows better performance than Android
localization. In New York, Gnome obtains a 30% reduction in error, in Frankfurt, a more than 38% error
reduction, and in Hong Kong, a more than 40% reduction. Equally important, this shows that the methodology
Avg Error (m)   Static   Walking   Driving
Gnome           4.9      6.4       6.2
Android         10.2     13.1      7.0
iOS             9.5      14.4      6.3
Figure 2.15: Average positioning accuracy in different scenarios.
Figure 2.16: Average accuracy in different cities.
of Gnome is generalizable. In obtaining these results, we did not have to modify the Gnome processing
pipeline in any way. As described before, Street View 3D models, the most crucial source of data for Gnome,
are available for many major cities across the world, so Gnome is applicable across a large part of
the globe.
The maximum localization errors for Gnome for the four cities are 37m, 35m, 41m, and 29m respectively.
In comparison, the maximum errors for Android are 51m, 42m, 45m, and 47m. These occur either because (a)
the ground truth location is not within the error radius reported by the GPS receiver, so Gnome cannot generate
candidate points near the ground truth location, or (b) the error radius is too large and includes some
non-ground-truth candidates where the distance from the candidate point to the pseudorange-adjusted position
is short. The latter confuses Gnome’s voting mechanism of candidate selection. In future work, we plan to
explore mitigations for these corner cases.
2.4.2.2 Processing latency
Gnome performs some computationally expensive vision and graphics algorithms. Fortunately, as we now
show below, the latency of these computations is not on the critical path and much of the expensive computation
happens on the cloud. (In this chapter, we have used the term “cloud” as a proxy for server-class computing
resources. In fact, our experiments were carried out on a 12-core 2.4 GHz Xeon desktop with 32 GB of memory
and running Ubuntu 16.04).
Module Latency
Mobile: Client-side positioning 77 ms/estimate
Cloud: Street View data retrieval 1.9 s/viewpoint
Cloud: Height adjustment 2.1 s/viewpoint
Cloud: Path inflation calculation 28 s/candidate
Table 2.1: Processing latency measurement
Table 2.1 summarizes the processing latency of several components of Gnome. The critical path of position
estimation (which involves scoped refinement based candidate search) on the smartphone incurs only 77ms.
Thus, Gnome can support up to a 13Hz GPS sampling rate, which is faster than default location update rates
(10 Hz) on both Android and iOS. Retrieval of depth data (a few KBs) and a panoramic image (about 450KB)
for each viewpoint (recall that Gnome samples these at the granularity of 5m) takes a little under 2s, while
adjusting the height of planes at each viewpoint takes an additional 2s. By far the most expensive operation
(29s) is computing the path inflation for each candidate position: this requires ray tracing for all points in
a hemisphere around the candidate position. This also explains why we only compute single reflections:
considering multiple reflections would significantly increase the computational cost. The candidate positions
are arranged in a 2m × 2m grid, and it takes about 39 minutes to compute the path inflation maps for a 1-km
street, or about 17 hours for the downtown area of Los Angeles which has 26.8km of roads. It is important to
remember that Gnome’s path inflation maps are compute-once-use-often: a path inflation map for an urban
canyon in a major city need only be computed once. We chose the 2m × 2m candidate grid because it achieves
the best balance between accuracy and runtime latency.
2.4.2.3 Power consumption
In Android 7.0, the battery option in “Settings” provides detailed per-hour power reports for the top-5 highest
power usage apps. Gnome is implemented as an app, so we obtain its power consumption by running it for
three hours. While doing so, we run the app in the foreground with screen brightness set to the lowest. We
compare Gnome’s power consumption with that of the default Android location API (implemented as a simple
app), and our results are averaged over multiple runs. During our experiment, the Gnome app consumes 151 mAh,
31% more than the default Android location API, which consumes 112 mAh. Most of the additional
energy usage is attributable to the UBlox library that computes adjusted locations. All four phones we
tested have batteries larger than 2700mAh, so Gnome would deplete the battery by about 5.5% every hour if
used continuously. Our implementation is relatively unoptimized (Gnome is implemented in Python), and we
plan to improve Gnome’s energy efficiency in future work.
2.4.2.4 Storage usage
We have generated path inflation maps for the downtown area of Los Angeles, a 3.9 km² area. The total length
of road is 26.8km and the average road width (road width is used for sampling candidate positions) is 21.3m.
In this area, there are 2531 Street View viewpoints and Gnome generates 16390 candidate positions at a
2m × 2m granularity. The total size of the path inflation maps for Los Angeles is 340 MB in compressed binary
format. While this is significant, smartphone storage has been increasing in recent years, and we envision most
users loading path inflation maps only for downtown areas of the city they live in.
2.4.3 Evaluating Gnome components
Each component of Gnome is crucial to its accuracy. We evaluate several components for Los Angeles.
2.4.3.1 Height adjustment
In Los Angeles, there are nearly 15,000 planes of which about 30% need height adjustment. To understand
how the height adjustment affects the final localization accuracy, we compute the accuracy when Gnome
selects different random subsets of planes for which to perform height adjustment. In Figure 2.17, the x-axis
represents the fraction of planes for which height adjustment is performed. In our experiment, for each data
point, we repeated the random selection five times, and the figure shows the maximum and minimum values
for each data point. Height adjustment is responsible for up to a 3m reduction in error.
Finally, our height estimates themselves can be erroneous. We compared our estimated height with the
actual height for 50 randomly selected buildings in Los Angeles, whose heights are publicly available. Our
height estimates are correct to within 5% at the median and within 14% at the 90th percentile. Our position
estimation accuracy is largely due to the fact that we are able to estimate heights of buildings quite accurately.
There are two causes for the height estimation error: (1) inaccurate sky detection can recognize the
building's top as part of the sky, which makes the estimated height lower than the actual value, and (2) large
obstacles like trees, or taller buildings behind the target plane, can cause the height to be over-estimated. The
largest errors in these two cases are 16% and 23%, respectively.
Figure 2.17: Accuracy with different levels of height adjustment.
Figure 2.18: Latency vs Accuracy with different grid sizes.
28
0 10 20 30 40
Error (m)
0.0
0.2
0.4
0.6
0.8
1.0
CDF
Gnome
Path Similarity
Shadow Matching
Android
Figure 2.19: Accuracy of different candidate ranking approaches.
2.4.3.2 Candidate selection and ranking
Candidate position granularity can also impact the error. Gnome uses a 2m × 2m grid. Using a coarser 8m ×
8m grid would reduce storage requirements by a factor of 16 and could reduce the processing latency on
the smartphone. Figure 2.18 captures the tradeoff between candidate granularity, accuracy, and processing
latency. As the figure shows, candidate selection granularity can significantly impact accuracy: an 8m × 8m
grid would add almost 3m error to Gnome while reducing processing latency by about 50ms. Our choice of
granularity is at the point of diminishing returns: a finer grid of 1m × 1m would more than triple processing
latency while reducing the error by about 10cm.
Position accuracy is also a function of the candidate search strategy. Prior work has considered two
different approaches. The first [265, 232, 194] is based on path similarity. This line of work uses ray-tracing to
simulate the signal path and calculate the difference between the simulated path and the actual travel distance
calculated by the GPS module. The candidate whose path difference is least is selected as the output. The second
approach is called shadow matching [180, 298, 120]. It uses satellite visibility as the ranking indicator and assumes
that NLOS signals always have worse carrier-to-noise density C/N0 than LOS signals. It uses C/N0 to classify
each satellite's visibility at the ground truth point and compares that with simulated (from 3D models) satellite
visibility at each candidate point. The point with the highest similarity is the estimated position. Figure 2.19
shows the error distribution in Los Angeles: we include the Android error distribution for calibration. While
all approaches improve upon Android, Gnome is distributionally better than the other two approaches. The
crucial difference between Gnome and path similarity is that Gnome revises the candidate positions using
the path inflation maps, and this appears to give it some advantage. Gnome is better than shadow matching
because carrier-to-noise-density is not a good predictor of NLOS signals. In addition to better localization,
Figure 2.20: Effectiveness of Kalman filter.
Gnome is more practical than these prior approaches in three ways. First, these approaches use proprietary 3D
models for ray-tracing, which may not be widely available for many cities. In Gnome, we use Google Street
View which is available for most cities (as shown in our evaluation). Second, these approaches work offline
and do not explore efficient online implementations. For instance, path simulation in [265, 232, 194] can take
up to 2 sec on a desktop, rendering them impractical for mobile devices. Finally, these approaches rely on an
external GPS receiver (UBlox) while Gnome is implemented on Android phones.
2.4.3.3 Kalman filter
We also disabled the Kalman filter in Gnome to evaluate its contribution to the final accuracy. As Figure 2.20
shows, the Kalman filter improves accuracy by 1.2m for the stationary case and about 0.5m in the other two
scenarios. The image on the right shows how the Kalman filter results in a smoother trace closer to the ground
truth.
Chapter 3
ALPS: Accurate Landmark Positioning
at City Scales
3.1 Introduction
Context awareness is essential for ubiquitous computing, and prior work [307, 311] has studied automated
methods to detect objects in the environment or determine their precise position. One type of object that has
received relatively limited attention is the common landmark, an easily recognizable outdoor object which
can provide contextual cues. Examples of common landmarks include retail storefronts, signposts (stop signs,
speed limits), and other structures (hydrants, street lights, light poles). These can help improve targeted
advertising, vehicular safety, and the efficiency of city governments.
In this chapter, we explore the following problem: How can we automatically collect an accurate database
of the precise positions of common landmarks, at the scale of a large city or metropolitan area? The context
aware applications described above require an accurate database that also has high coverage: imprecise
locations, or spotty coverage, can diminish the utility of such applications.
In this chapter, we discuss the design of a system called ALPS (Accurate Landmark Positioning at city
Scales), which, given a set of landmark types of interest (e.g., Subway restaurant, stop sign, hydrant), and a
geographic region, can enumerate and find the precise position of all instances of each landmark type within
the geographic region. ALPS uses a novel combination of two key ideas. First, it uses image analysis to
find the position of a landmark, given a small number of images of the landmark from different perspectives.
Second, it leverages recent efforts, like Google Street View [21], that augment maps with visual documentation
of street-side views, to obtain images of such landmarks. At a high-level, ALPS scours Google Street View
for images, applies a state-of-the-art off-the-shelf object detector [252] to detect landmarks in images, then
triangulates the position of the landmarks using a standard least-squares formulation. On top of this approach,
ALPS adds novel techniques that help the system scale and improve its accuracy and coverage.
Contributions Our first contribution is techniques for scaling landmark positioning to large cities. Even a
moderately sized city can have several million Street View images. If ALPS were to retrieve all images, it
would incur two costs, both of which are scaling bottlenecks in ALPS: (1) the latency, network and server
load cost of retrieving the images, and (2) the computational latency of applying object detection to the entire
collection. ALPS optimizes these costs, without sacrificing coverage, using two key ideas. First, we observe
that Street View has a finite resolution of a few meters, so it suffices to sample the geographic region at this
resolution. Second, at each sampling point, we retrieve a small set of images, lazily retrieving additional
images for positioning only when a landmark has been detected in the retrieved set. In addition, the ALPS
system can take location hints to improve scalability: these hints specify where landmarks are likely to be
found (e.g., at street corners), which helps narrow down the search space.
Our second contribution is techniques that improve accuracy and coverage. Object detectors can have
false positives and false negatives. For an object detector, a false positive means that the detector detected a
landmark in an image that doesn’t actually have the landmark. A false negative is when the detector didn’t
detect the landmark in the image that actually does contain the landmark. ALPS can reduce false negatives by
using multiple perspectives: if a landmark is not detected at a sampling point either because it is occluded
or because of poor lighting conditions, ALPS tries to detect it in images retrieved at neighboring sampling
points. To avoid false positives, when ALPS detects a landmark in an image, it retrieves zoomed in versions
of that image and runs the object detector on them, using majority voting to increase detection confidence.
Once ALPS has detected landmarks in images, it must resolve aliases (multiple images containing the same
landmark). Resolving aliases is especially difficult for densely deployed landmarks like fire hydrants, since
images from geographically nearby sampling points might contain different instances of hydrants. ALPS
clusters images by position, then uses the relative bearing to the landmark to refine these clusters. Finally,
ALPS uses least squares regression to estimate the position of the landmark; this enables it to be robust to
position and orientation errors, as well as errors in the position of the landmark within the image as estimated
by the object detector.
Our final contribution is an exploration of ALPS’ performance at the scale of a zip-code, and across several
major cities. ALPS can cover over 92% of Subway restaurants in several large cities and over 95% of hydrants
in one zip-code, and localize 93% of Subways and 87% of hydrants with an error less than 10 meters. Its
localization accuracy is better than Google Places [75] for over 85% of the Subways in large cities. ALPS’s
scaling optimizations can reduce the number of retrieved images by over a factor of 20, while sacrificing
coverage only by 1-2%. Its accuracy improvements are significant: for example, removing the bearing-based
refinement (discussed above) can reduce coverage by half.
3.2 Motivation and Challenges
Positioning Common Landmarks Context awareness [242, 297] is essential for ubiquitous computing
since it can enable computing devices to reason about the built and natural environment surrounding a human,
and provide appropriate services and capabilities. Much research has focused on automatically identifying
various aspects of context [307, 311, 209], such as places and locations where a human is or has been, the
objects or people within the vicinity of the human and so forth.
One form of context that can be useful for several kinds of outdoor ubiquitous computing applications is
the landmark, an easily recognizable feature or object in the built environment. In colloquial usage, a landmark
refers to a famous building or structure which is easily identifiable and can be used to give directions. In this
chapter, we focus on common landmarks, which are objects that frequently occur in the environment, yet
can provide contextual cues for ubiquitous computing applications. Examples of common landmarks include
storefronts (e.g., fast food stores, convenience stores), signposts such as speed limits and stop signs, traffic
lights, fire hydrants, and so forth.
Potential Applications Knowing the type of a common landmark (henceforth, landmark) and its precise
position (GPS coordinates), and augmenting maps with this information, can enable several applications.
Autonomous cars [294] and drones [161] both rely on visual imagery. Using cameras, they can detect
common landmarks in their vicinity, and use the positions of those landmarks to improve estimates of their
own position. Drones can also use the positions of common landmarks, like storefronts, for precise delivery.
Signposts can provide context for vehicular control or driver alerts. For example, using a vehicle’s position
and a database of the position of speed limit signs [233], a car’s control system can either automatically
regulate vehicle speed to within the speed limit, or warn drivers when they exceed the speed limit. Similarly, a
vehicular control system can use a stop sign position database to slow down a vehicle approaching a stop
sign, or to warn drivers in danger of missing the stop sign.
A database of automatically generated landmark positions can be an important component of a smart city
[138]. Firefighters can respond faster to fires using a database of positions of fire hydrants [68]. Cities can
maintain inventories of their assets (street lights, hydrants, trees [300], and signs [134] are examples of city
assets) [65, 66]; today, these inventories are generated and maintained manually. Finally, drivers can use a
database of parking meter positions, or parking sign positions to augment the search for parking spaces [228],
especially in places where in-situ parking place occupancy sensors have not been installed [64].
Landmark locations can also improve context-aware customer behavior analysis [214]. Landmark locations
can augment place determination techniques [166, 136]. Indeed, a database of locations of retail storefronts
can directly associate place names with locations. Furthermore, landmark locations, together with camera
images taken by a user, can be used to localize the user more accurately than is possible with existing
location services. This can be used in several ways. For example, merchants can use more precise position
tracks of users to understand the customer shopping behavior. They can also use this positioning to target users
more accurately with advertisements or coupons, enriching the shopping experience.
Finally, landmark location databases can help provide navigation and context for visually impaired persons
[73, 185]. This pre-computed database can be used by smart devices (e.g. Google Glass) to narrate descriptions
of surroundings (e.g., “You are facing a post office and your destination is on its right, and there is a barbershop
on its left.”) to visually impaired users.
Challenges and Alternative Approaches An accurate database which has high coverage of common
landmark locations can enable these applications. High coverage is important because, for example, a missing
stop sign can result in a missed warning. Moreover, if the database is complete for one part of a city, but
non-existent for another, then it is not useful because applications cannot rely on this information being
available.
To our knowledge, no such comprehensive public database exists today, and existing techniques for
compiling the database can be inaccurate or have low coverage. Online maps (e.g., Google Places [75] or Bing
Maps [71]) contain approximate locations of some retail storefronts (discussed below). Each city, individually,
is likely to have reasonably accurate databases of stores within the city, or city assets. In some cases, this
information is public. For example, the city of Los Angeles has a list of fire hydrant locations [67], but not
many other cities make such information available. Collecting this information from cities can be logistically
difficult. For some common landmarks, like franchise storefronts, their franchiser makes available a list of
franchisee addresses: for example, the list of Subway restaurants in a city can be obtained from ‘subway.com‘.
From this list, we can potentially derive locations through geo-coding, but this approach doesn't
generalize to the other landmarks discussed above. Prior work has explored two other approaches to collecting
this database: crowdsourcing [78, 83], and image analysis [74]. The former approach relies on users to either
explicitly (by uploading stop signs to OpenStreetMaps) or implicitly (by checking in on a social network) tag
landmarks, but can be inaccurate due to user error, or have low coverage because not all common landmarks
may be visited. Image analysis, using geo-tagged images from photo sharing sites, can also result in an
incomplete database.
In this chapter, we ask the following question: is it possible to design a system to automatically compile, at
the scale of a large metropolis, an accurate and high coverage database of landmark positions? Such a system
should, in addition to being accurate and having high coverage, be extensible to different types of landmarks,
and scalable in its use of computing resources. In the rest of the chapter, we describe the design of a system
called ALPS that satisfies these properties.
3.3 The Design of ALPS
3.3.1 Approach and Overview
The input to ALPS is a landmark type (a chain restaurant, a stop sign, etc.) and a geographical region expressed
either using a zip code or a city name. The output of ALPS is a list of GPS coordinates (or positions) at which
the specified type of landmark may be found in the specified region. Users of ALPS can specify other optional
inputs, discussed later.
ALPS localizes landmarks by analyzing images using the following idea. To localize a fire hydrant, for
example, suppose we are given three images of the same fire hydrant, taken from three different perspectives,
and the position and orientation of the camera when each image was taken is also known. Then, if we can
detect the hydrant in each image using an object detector, then we can establish the bearing of the hydrant
relative to each image. From the three bearings, we can triangulate the location of the hydrant. ALPS uses
more complex variants of this idea to achieve accuracy, as discussed below.
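The triangulation step can be written as a small least-squares problem: each image contributes a ray from the camera position along the bearing to the detected landmark, and the landmark estimate minimizes the sum of squared perpendicular distances to these rays. The sketch below works in a local planar frame and is illustrative rather than ALPS's exact positioning code.

```python
import numpy as np

def triangulate(camera_positions, bearings_deg):
    """Least-squares triangulation of a landmark from camera bearings.

    camera_positions: list of (x, y) positions in a local metric frame.
    bearings_deg: bearing from each camera to the detected landmark, in
                  degrees clockwise from north.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for (px, py), brg in zip(camera_positions, bearings_deg):
        t = np.radians(brg)
        u = np.array([np.sin(t), np.cos(t)])        # unit vector along the bearing
        M = np.eye(2) - np.outer(u, u)              # projector orthogonal to u
        A += M
        b += M @ np.array([px, py])
    return np.linalg.solve(A, b)

# Three views of the same landmark from a street along y = 0;
# the rays intersect at roughly (8, 8).
print(triangulate([(0, 0), (8, 0), (16, 0)], [45.0, 0.0, 315.0]))
```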
To obtain such images, ALPS piggybacks on map-based visual documentation of city streets [76, 72].
Specifically, ALPS uses the imagery captured by Google’s Street View. The vehicles that capture Street
View images have positioning and bearing sensors [80], and the Street View API permits a user to request an
image taken at a given position and with a specified bearing. ALPS’s coverage is dictated in part by Street
View’s coverage, and its completeness is a combination of its coverage, and the efficacy of its detection and
localization algorithms.
Street View (and similar efforts) have large databases, and downloading and processing all images in
a specified geographic region can take time, computing power, and network bandwidth. To scale to large
geographic regions (e.g., an entire zipcode or larger), ALPS employs novel techniques that (a) retrieve just
sufficient images to ensure high coverage, (b) robustly detect the likely presence of the specified landmark,
then (c) drill down and retrieve additional images in the vicinity to localize the landmarks.
Finally, users can easily extend ALPS to new landmark types, and specify additional scaling hints.
Figure 3.1: ALPS Components. (The figure shows the two pipelines: Seed Location Generation, comprising Base Image
Retrieval, Landmark Detection, and Image Clustering; and Landmark Localization, comprising Adaptive Image Retrieval,
Landmark Detection, and Landmark Positioning.)
ALPS comprises two high-level capabilities (Figure 3.1): Seed location generation takes a landmark
type specified by user as input, and generates a list of seed locations where the landmarks might be located;
and Landmark localization takes seed locations as input and generates landmark positions in the specified
geographic region as output.
In turn, seed location generation requires three conceptual capabilities: (1) base image retrieval which
downloads a subset of all Street View images; (2) landmark detection that uses the state-of-the-art computer
vision object detection [252] to detect and localize landmarks in the images retrieved by base image retrieval, and applies
additional filters to improve the accuracy of detection; (3) image clustering, which groups detected images that likely
contain the same instance of the landmark. The result of these three steps is a small set of seed locations where
the landmark is likely to be positioned, derived with minimal resources without compromising coverage.
Landmark localization reuses the landmark detection capability, but requires two additional capabilities:
(1) adaptive image retrieval, which drills down at each seed location to retrieve as many images as necessary for
localizing the object; (2) and a landmark positioning capability that uses least squares regression to triangulate
the landmark position.
3.3.2 Base Image Retrieval
ALPS retrieves images from Street View, but does not retrieve all Street View images within the input
geographic region. This brute-force retrieval does not scale, since even a small city like Mountain View can
have more than 10 million images. Moreover, this approach is wasteful, since Street View’s resolution is finite:
in Figure 3.2(a), a Street View query for an image anywhere within the dotted circle will return the image
taken from one of the points within that circle.
ALPS scales better by retrieving as small a set of images as possible, without compromising coverage
(Figure 3.2(b)). It only retrieves two Street View images in two opposing directions perpendicular to the street,
at intervals of 2r meters, where 2r is Street View’s resolution (from experiments, r is around 4 meters). By
Figure 3.2: Base Image Retrieval.
using nominal lane [81] and sidewalk [82] widths, and Street View's default angle of view of 60°, it is easy to
show using geometric calculations that successive 8 meter samples of Street View images have overlapping
views, thereby ensuring visual coverage of the entire geographic region.
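A sketch of base image retrieval is shown below: sample points every 2r ≈ 8 meters along each street polyline, and at each point request the two views perpendicular to the street. The Street View static image endpoint and its parameters are used in their publicly documented form but should be treated as assumptions here, and the street polylines are assumed to come from a map source outside ALPS.

```python
import math
import requests  # third-party HTTP library, assumed available

STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"

def sample_points(polyline, spacing_m=8.0):
    """Walk a street polyline [(lat, lng), ...] and emit points every
    spacing_m meters together with the local street heading (degrees)."""
    out = []
    for (lat1, lng1), (lat2, lng2) in zip(polyline, polyline[1:]):
        # Local equirectangular approximation, good enough at 8 m scales.
        dx = (lng2 - lng1) * 111320.0 * math.cos(math.radians(lat1))
        dy = (lat2 - lat1) * 110540.0
        seg_len = math.hypot(dx, dy)
        heading = math.degrees(math.atan2(dx, dy)) % 360
        n = max(math.ceil(seg_len / spacing_m), 1)
        for k in range(n):
            f = k / n
            out.append((lat1 + f * (lat2 - lat1),
                        lng1 + f * (lng2 - lng1), heading))
    return out

def fetch_base_images(point, api_key, fov=60):
    """Fetch the two views perpendicular to the street at one sampling point."""
    lat, lng, heading = point
    images = []
    for offset in (90, -90):                      # both sides of the street
        params = {"size": "640x640",
                  "location": f"{lat},{lng}",
                  "heading": (heading + offset) % 360,
                  "fov": fov, "pitch": 0, "key": api_key}
        images.append(requests.get(STREETVIEW_URL, params=params).content)
    return images
```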
3.3.3 Landmark Detection
Given an image, this capability detects and localizes the landmark within the image. This is useful both for
seed location generation, as well as for landmark localization, discussed earlier.
Recent advances [210, 177] in deep learning techniques have enabled fast and accurate object detection
and localization. We use a state-of-the-art object detector, called YOLO [252]. YOLO uses a neural network
to determine whether an object is present in an image, and also draws a bounding box around the part of the
image where it believes the object to be (i.e., localizes the object in the image). YOLO needs to be trained,
with a large number of training samples, to detect objects. We have trained YOLO to detect logos of several
chain restaurants or national banks, as well as stop signs and fire hydrants. Users wishing to extend ALPS
functionality to other landmark types can simply provide a neural network trained for that landmark.
Even the best object detection algorithms can have false positives and negatives [70]. False positives
occur when the detector mistakes other objects for the target landmark due to lighting conditions, or other
reasons. False negatives can decrease the coverage and false positives can reduce positioning accuracy. In our
experience, false negatives arise because YOLO cannot detect objects smaller than 50 × 50 pixels or objects
that are blurred, partially obscured or in shadow, or visually indistinguishable from the background.
ALPS reduces false positives by using Street View’s support for retrieving images at different zoom levels.
Recall that base image retrieval downloads two images at each sampling point. ALPS applies the landmark
detector to each image: if the landmark is detected, ALPS retrieves six different versions of the corresponding
Street View image, each at a different zoom level. It determines the tilt and bearing for each of these zoomed
Figure 3.3: How zooming-in can help eliminate false positives.
images based on the detected landmark. ALPS then uses two criteria to mark the detection as a true positive:
that YOLO should detect a landmark in a majority of the zoom levels, and that the size of the bounding
box generated by YOLO is proportional to the zoom level. For example, in Figure 3.3, YOLO incorrectly
detected a residence number as a Subway logo in the first three zoom levels (the first zoom level
corresponds to the base image). After zooming in further, YOLO was unable to detect the Subway logo in the
last 3 zoomed-in images. In this case, ALPS declares that the image does not contain a Subway logo, because
the majority vote failed. We address false negatives in later steps.
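A minimal sketch of these two criteria, assuming a detect_landmark function that returns a bounding box (with width and height attributes) or None for each zoomed image; the names and the size-monotonicity check are illustrative rather than ALPS's exact implementation.

def is_true_positive(zoomed_images, zoom_levels, detect_landmark):
    # Run the landmark detector on every zoomed version of the image.
    boxes = [detect_landmark(img) for img in zoomed_images]
    hits = [(z, b) for z, b in zip(zoom_levels, boxes) if b is not None]
    # Criterion 1: the landmark must be detected in a majority of zoom levels.
    if len(hits) * 2 <= len(zoom_levels):
        return False
    # Criterion 2: the bounding-box size should grow with the zoom level
    # (a simple non-decreasing check stands in for "proportional to zoom").
    hits.sort(key=lambda zb: zb[0])
    sizes = [b.width * b.height for _, b in hits]
    return all(later >= earlier for earlier, later in zip(sizes, sizes[1:]))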
3.3.4 Image Clustering
To generate seed locations, ALPS performs landmark detection on each image obtained by base image retrieval.
However, two different images might contain the same landmark: the clustering step uses image position
and orientation to cluster such images together. In some cases, this clustering can reduce the number of seed
locations dramatically: in Figure 3.4(a), 87 landmarks are detected in the geographic region shown, but a much
smaller fraction of them represent unique landmarks (Figure 3.4(b)).
The input to clustering is the set of images from the base set in which a landmark has been detected. ALPS
clusters this set by using the position and bearing associated with the image, in two steps: first, it clusters by
position, then, within each cluster it distinguishes pairs of images whose bearing is inconsistent.
To cluster by position, we use mean shift clustering [174]: (1) put all images into a candidate pool; (2)
select a random image in the candidate pool as the center of a new cluster; (3) find all images within R meters
(R=50 in our implementation) of the cluster center, put these images into the cluster, and remove them from the candidate pool; (4) calculate the mean shift of the center of all nodes within the cluster, and if the center is not stable, go to step (3), otherwise go to step (2).
Figure 3.4: Clustering can help determine which images contain the same landmark.
Figure 3.5: Clustering by bearing is necessary to distinguish between two nearby landmarks.
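A compact sketch of steps (1)-(4) above; haversine_m (distance in meters between two GPS coordinates) is assumed, and the convergence tolerance is illustrative rather than the exact ALPS implementation.

import random

def cluster_by_position(images, haversine_m, radius_m=50.0, max_iter=20):
    # `images` is a list of objects with .lat and .lon attributes.
    pool, clusters = list(images), []
    while pool:
        seed = random.choice(pool)                       # step (2)
        cx, cy = seed.lat, seed.lon
        members = [seed]
        for _ in range(max_iter):
            members = [im for im in pool
                       if haversine_m(cx, cy, im.lat, im.lon) <= radius_m]  # step (3)
            nx = sum(im.lat for im in members) / len(members)               # step (4)
            ny = sum(im.lon for im in members) / len(members)
            if haversine_m(cx, cy, nx, ny) < 1.0:        # center is stable
                break
            cx, cy = nx, ny
        clusters.append(members)
        pool = [im for im in pool if im not in members]  # remove from candidate pool
    return clusters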
Clustering by position works well for landmarks likely to be geographically separated (e.g., a Subway
restaurant), but not for landmarks (e.g., a fire hydrant) that can be physically close. In the latter case, clustering
by position can reduce accuracy and coverage.
To improve the accuracy of clustering, we use bearing information in the Street View images to refine
clusters generated by position-based clustering. Our algorithm is inspired by the RANSAC [171] algorithm
for outlier detection, and is best explained using an example. Figure 3.5(a) shows an example where ALPS’s
position-based clustering returns a cluster with 9 images A-I. In Figure 3.5(b), images A-E and images F-I
see different landmarks. ALPS randomly picks two images A and D, adds them to a new proto-cluster, and
uses its positioning algorithm (described below) to find the approximate position of the landmark (S1) as
determined from these images. It then determines which other images have a bearing consistent with the
estimated position of S1. H’s bearing is inconsistent with S1, so it doesn’t belong to the new proto-cluster, but
B's bearing is. ALPS computes all possible proto-clusters in the original cluster, then picks the lowest-error large proto-cluster, outputs this as a refinement of the original cluster, removes these images from the original cluster, and repeats the process. In this way, images A-E are first output as one cluster, and images F-I as another.
Figure 3.6: Adaptive Image Retrieval.
Each cluster contains images that, modulo errors in position, bearing and location, contain the same
landmark. ALPS next uses its positioning algorithm (discussed below) to generate a seed location for the
landmark.
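The bearing-consistency test at the heart of the refinement described above can be sketched as follows; bearing_to computes the great-circle heading from a camera to a candidate landmark position, and the 15-degree tolerance is an assumption, not ALPS's tuned value.

import math

def bearing_to(cam_lat, cam_lon, lm_lat, lm_lon):
    # Initial great-circle bearing (degrees) from the camera to the landmark.
    phi1, phi2 = math.radians(cam_lat), math.radians(lm_lat)
    dlon = math.radians(lm_lon - cam_lon)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360

def bearing_consistent(image, landmark_lat, landmark_lon, tol_deg=15.0):
    # An image supports a candidate landmark position if the heading from the
    # camera to that position agrees with the bearing at which it saw the landmark.
    expected = bearing_to(image.lat, image.lon, landmark_lat, landmark_lon)
    diff = abs((image.landmark_bearing - expected + 180) % 360 - 180)
    return diff <= tol_deg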
3.3.5 Adaptive Image Retrieval
A seed location may not be precise, because the images used to compute it are taken perpendicular to the
street (Figure 3.6(a)). If the landmark is offset from the center of the image, errors in bearing and location can
increase the error of the positioning algorithm. Moreover, the landmark detector may not be able to accurately
draw the bounding box around a landmark that is a little off-center. Location accuracy can be improved by
retrieving images whose bearing matches the heading from the sampling point to the seed location (so, the
landmark is likely to be closer to the center of the image, Figure 3.6(b)). (A seed location may not be precise
also because a cluster may have too few images to triangulate the landmark position. We discuss below how
we address this.)
To this end, we use an idea we call adaptive image retrieval: for each image in the cluster, we retrieve one
additional image with the same position, but with a bearing directed towards the seed location ( Figure 3.6(b)).
At this stage, we also deal with false negatives. If a cluster has fewer than k images (k= 4 in our implementa-
tion), we retrieve one image each from neighboring sampling points with a heading towards the seed location,
even if these points are not in the cluster. In these cases, the landmark detector may have missed the landmark
because it was in a corner of the image; retrieving an image with a bearing towards the seed location may
enable the detector to detect the landmark, and enable higher positioning accuracy because we have more
perspectives.
3.3.6 Landmark Positioning
Precisely positioning a landmark from the images obtained using adaptive image retrieval is a central capability
in ALPS. Prior work in robotics reconstructs the 3-D position of an object from multiple locations using three
key steps [291, 165]: (1) camera calibration with intrinsic and extrinsic parameters (camera 3-D coordinates, bearing direction, tilt angle, field of view, focal length, etc.); (2) feature matching with features extracted by algorithms like SIFT [224]; (3) triangulation from multiple images using a method like singular value
decomposition.
In our setting, these approaches don't work well: (1) Street View does not expose all the extrinsic and intrinsic camera parameters; (2) some of the available parameters (GPS as 3-D coordinates, camera bearing) are
noisy and erroneous, which may confound feature matching; (3) Street View images of a landmark are taken
from different directions and may have differing light intensity, which can reduce feature matching accuracy;
(4) panoramic views in Street View can potentially increase accuracy, but there can be distortion at places in
the panoramic views where images have been stitched together [80].
Instead, ALPS (1) projects the landmark (e.g., a logo) onto a 2-dimensional plane to compute the relative
bearing of the landmark and the camera, then (2) uses least squares regression to estimate the landmark
position.
Estimating Relative Bearing. ALPS projects the viewing directions onto a 2-D horizontal plane as shown in Figure 3.7(a). O represents the landmark in 3 dimensions, and O' represents the projected landmark on the 2-D horizontal plane. C_i and its corresponding C'_i represent the camera locations in 3-D and 2-D respectively. Thus, the vector C_iO is the relative bearing from camera i to landmark O, and C'_iO' is its projection.
The landmark detector draws a bounding box around the pixels representing the landmark, and for positioning, we need to be able to estimate the relative bearing of the center of this bounding box relative to the bearing of the camera itself. In Figure 3.7(b), line AB demarcates the (unknown) depth of the image and vector C'H represents the bearing direction of the camera, so O'' is the image of O' on AB. Our goal is to estimate ∠O'C'X, or ∠4, which is the bearing of the landmark relative to the x-axis.
To do this, we need to estimate the following three variables: (1) the camera angle of view ∠AC'B, or ∠1, which is the maximum viewing angle of the camera; (2) the camera bearing direction ∠HC'X, or ∠2, which is the bearing direction of the camera when the image was taken; (3) the relative bearing direction of the landmark ∠O''C'D, or ∠3, which is the angle between the bearing direction of the camera and the bearing direction of the landmark.
Figure 3.7: Landmark Positioning.
∠1 and ∠2 can be directly obtained from image metadata returned by Street View. Figure 3.7(b) illustrates how to calculate ∠3 = arctan(|DO''| / |DC'|). Landmark detection returns the image width in pixels and the pixel coordinates of the landmark bounding box. Thus, |DO''| = |AO''| − (1/2)|AB|. Since tan((1/2)∠1) = |AD| / |DC'|, we can calculate |DC'| = (1/2)|AB| / tan((1/2)∠1). Then we derive ∠3 as arctan(|DO''| / |DC'|). Finally, we can calculate the bearing direction of the landmark: ∠4 = ∠2 − ∠3.
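The derivation above maps directly to a few lines of code. The sketch below assumes the bounding box center's pixel x-coordinate, the image width, and the camera's angle of view and bearing from the Street View metadata; the sign convention for left/right offsets is an assumption, not ALPS's exact code.

import math

def landmark_bearing(image_width_px, bbox_center_x_px, angle_of_view_deg, camera_bearing_deg):
    half_width = image_width_px / 2.0
    # |DO''|: horizontal pixel offset of the landmark from the image center D.
    d_o = bbox_center_x_px - half_width
    # |DC'|: distance from the camera to the image plane, in pixel units,
    # from tan(angle1 / 2) = (|AB| / 2) / |DC'|.
    d_c = half_width / math.tan(math.radians(angle_of_view_deg / 2.0))
    angle3 = math.degrees(math.atan2(d_o, d_c))   # offset from the camera's bearing
    return (camera_bearing_deg - angle3) % 360    # angle4 = angle2 - angle3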
Positioning using Least Squares Regression. For each cluster, using adaptive image retrieval, ALPS retrieves N images for a landmark and can calculate the relative bearing of the landmark to the camera by executing landmark detection on each image. Positioning the landmark then becomes an instance of the landmark localization problem [296, 208], where we have to find the landmark location P = [x_o, y_o] given N distinct viewing locations p_i = [x_i, y_i], i = 1, 2, ..., N, with corresponding bearings θ_i, i = 1, 2, ..., N, where [x_i, y_i] are point p_i's GPS coordinates in the x-y plane. From first principles, we can write θ_i (or ∠4) as follows:

tan(θ_i) = sin(θ_i) / cos(θ_i) = (y_o − y_i) / (x_o − x_i).    (3.1)

Simplifying this equation and combining the equations for all images, we can write the following system of linear equations:

Gβ = h,    (3.2)

where β = [x_o, y_o]^T represents the landmark location, G = [g_1, g_2, ..., g_N]^T with g_i = [sin(θ_i), −cos(θ_i)], and h = [h_1, h_2, ..., h_N]^T with h_i = sin(θ_i) x_i − cos(θ_i) y_i.

In this system of linear equations, there are two unknowns, x_o and y_o, but as many equations as images, resulting in an overdetermined system. However, many of the θ_i may be inaccurate because of errors in camera bearing, location, or landmark detection. To determine the most likely position, ALPS approximates β̂ using least squares regression, which minimizes the squared residuals S(β) = ||Gβ − h||². If G is full rank, the least squares solution of Equation 3.2 is:

β̂ = argmin(S(β)) = (G^T G)^{-1} G^T h.    (3.3)
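As a sketch of Equations (3.1)-(3.3), the least squares step can be written with numpy; numpy.linalg.lstsq is equivalent to the normal-equation form (G^T G)^{-1} G^T h when G is full rank. The function name and the local x-y coordinate assumption are illustrative, not ALPS's actual code.

import numpy as np

def localize_landmark(xs, ys, thetas):
    # xs, ys: camera positions in a local x-y plane; thetas: absolute bearings
    # (radians) from each camera toward the landmark.
    xs, ys, thetas = map(np.asarray, (xs, ys, thetas))
    G = np.column_stack([np.sin(thetas), -np.cos(thetas)])   # rows g_i
    h = np.sin(thetas) * xs - np.cos(thetas) * ys             # entries h_i
    beta, *_ = np.linalg.lstsq(G, h, rcond=None)
    return beta   # [x_o, y_o], the estimated landmark position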
3.3.7 Putting it All Together
Given a landmark type and a geographic region, ALPS first retrieves a base set of images for the complete
region, which ensures coverage. On each image in this set, it applies landmark detection, retrieving zoomed-in
versions of the image if necessary to obtain higher confidence in the detection and reduce false positives. It
applies position and bearing based clustering on the images where a landmark was detected. Each resulting
cluster defines a seed location, where the landmark might be.
At each seed location, ALPS adaptively retrieves additional images, runs landmark detection on each
image again to find the bearing of the landmark relative to the camera for each image, and uses these bearings
to formulate a system of linear equations whose least squares approximation represents the position of the
landmark.
3.3.8 Flexibility
ALPS is flexible enough to support extensions that add to its functionality, or improve its scalability.
New landmark types Users can add to ALPS’s library of landmark types by simply training a neural
network to detect that type of landmark. No other component of the system needs to be modified.
Seed location hints To scale ALPS better, users can specify seed location hints in two forms. ALPS can
take a list of addresses and generate seed locations from this using reverse geo-coding. ALPS also takes spatial
constraints that restrict base image retrieval to sampling points satisfying these constraints. For example,
fire hydrants usually can be seen at or near street corners, or on a street midway between two cross-streets.
Therefore, to specify such constraints, ALPS provides users with a simple language with 4 primitives: at_corner
(only at street corners), midway (at the midpoint between two cross-streets), searching_radius (search within a radius of the points specified by other constraints), and lower_image (the landmark, like a fire hydrant, only appears in the lower part of the image). More spatial constraints may be required for other landmarks; we
have left this to future work.
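As a hypothetical illustration, the four primitives could be combined for fire hydrants along these lines (the dictionary layout and the radius value are assumptions, not ALPS's actual configuration syntax):

hydrant_hints = {
    "at_corner": True,        # sample only at street corners...
    "midway": True,           # ...and midway between two cross-streets
    "searching_radius": 20,   # meters, around each constrained point (assumed value)
    "lower_image": True,      # hydrants appear only in the lower part of the image
}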
3.4 Evaluation
In this section, we evaluate the coverage, accuracy, scalability and flexibility of ALPS on two different types
of landmarks: Subway restaurants, and fire hydrants.
3.4.1 Methodology
Implementation and Experiments We implemented ALPS in C++ and Python and accessed Street View
images using Google's API [76]. Our implementation is 2708 lines of code, and is available at https://github.com/USC-NSL/ALPS. All experiments described in the
chapter are run on a single server with an Intel Xeon CPU at 2.70GHz, 32GB RAM, and one Nvidia GTX
Titan X GPU inside. Below, we discuss the feasibility of parallelizing ALPS’s computations across multiple
servers.
Dataset We evaluate ALPS using images for several geographic regions across five cities of the United
States. In some of our experiments, we use seed location hints to understand the coverage and accuracy at
larger scales.
Ground Truth For both landmark types we evaluate, getting ground truth locations is not easy because
no accurate position databases exist for these. So, we manually collected ground truth locations for these as
follows. For Subway restaurants, we obtained street addresses for each restaurant within the geographic region
from the chain’s website [79]. For fire hydrants, there exists an ArcGIS visualization of the fire hydrants in 2
zipcodes [67] (as an aside, such visualizations are not broadly available for other geographic locations, and
even these do not reveal exact position of the landmark). From these, we obtained the approximate location for
each instance of the landmark. Using this approximate location, we first manually viewed the landmark on
Street View, added a pinpoint on Google Maps at the location where we observed the landmark to be, then
extracted the GPS coordinate of that pinpoint. This GPS coordinate represents the ground truth location for
that instance.
We validated this method by collecting measurements at 30 of these landmarks using a high accuracy
GPS receiver [57]. The 90th percentile error between our manual labeling and the GPS receiver is 6 meters. In
the three cases where the error was high, we noticed that a sunshade obstructed our view of the sky, so the
GPS receiver is likely to have obtained an incorrect position fix.
Metrics To measure the performance of ALPS, we use three metrics. Coverage is measured as the fraction
of landmarks discovered by ALPS from the ground truth (this measures recall of our algorithms). Where
appropriate, we also discuss the false positive rate of ALPS, which can be used to determine ALPS’s precision.
The accuracy of ALPS is measured by its positioning error, the distance between ALPS’s position and ground
truth. For scalability, we quantify the processing speed of each module in ALPS and the number of retrieved
images. (We use the latter as a proxy for the time to retrieve images, which can depend on various factors like
the Street View image download quota (25,000 images per day per user [76]) and access bandwidth that can
vary significantly).
3.4.2 Coverage and Accuracy
To understand ALPS’s coverage and accuracy, we applied ALPS to the zip-code 90004 whose area is 4 sq.
km., to localize both Subway restaurants and fire hydrants. To understand ALPS’s performance at larger scales,
we used seed location hints to run ALPS at the scale of large cities in the US.
Zip-code 90004 Table 3.1 shows the coverage of the two landmark types across the entire zip-code. There
are seven Subways in this region and ALPS discovers all of them, with no false positives. Table 3.3 shows that
ALPS localizes all Subways within 6 meters, with a median error of 4.7 meters. By contrast, the error of the
GPS coordinates obtained from Google Places is over 10 meters for each Subway and nearly 60 meters in
one case. Thus, at the scale of a single zip-code, ALPS can have high coverage and accuracy for this type of
landmark.
Fire hydrants are much harder to cover because they are smaller, positioned lower (so they can be occluded), and can blend into the background or be conflated with other objects. As Table 3.1 shows, ALPS finds 262 out of 330 fire hydrants, for 79.4% coverage. Of the ones that ALPS missed, 16 were not visible
to the naked eye in any Street View image, so no image analysis technique could have detected these. In 12
of these 16, the hydrant was occluded by a parked vehicle (which is illegal; Figure 3.8(a)) in the Street View
image, and the remaining 4 simply did not exist in the locations indicated in [67]. Excluding these, ALPS’s
coverage increases to about 83.4%. ALPS can also position these hydrants accurately. Figure 3.9 shows the
cumulative distribution function (CDF) of errors of ALPS for fire hydrants in 90004. It can localize 87% of
the hydrants within 10 meters, and its median error is 4.96 meters.
We then manually inspected the remaining 52 fire hydrants visible to the human eye but not discovered by
ALPS. In 6 of these cases, the fire hydrant was occluded by a car in the image downloaded by base image
retrieval: a brute-force image retrieval technique would have discovered these (see below). The remaining 46
missed hydrants fell roughly evenly into two categories. First, 24 of them were missed because of shortcomings
of the object detector we use. In these cases, even though the base image retrieval downloaded images with
hydrants in them, the detector did not recognize the hydrant in any of the images either because of lighting
conditions (e.g., hydrant under the shade of a tree, Figure 3.8(b)), or the hydrant was blurred in the image. The
remaining 22 false negatives occurred because of failures in the positioning algorithm. This requires multiple
perspectives (multiple images) to triangulate the hydrant, but in these cases, ALPS couldn’t obtain enough
perspectives either because of detector failures or occlusions. Finally, the 21 false positives were caused by the
object detector misidentifying other objects (such as a bollard, Figure 3.8(c)) as hydrants. Future improvements
to object detection, or better training and parametrization of the object detector, can reduce both false positives
and false negatives. We have left this to future work.
Figure 3.8: (a) Hydrant occluded by a parked vehicle. (b) Detection failure because hydrant is under the shade of a
tree. (c) False positive detection of bollard as hydrant.
Type      # landmarks   # visible   # ALPS   Coverage
Subway    7             7           7        100%
Hydrant   330           314         262      83.4%
Table 3.1: Coverage of ALPS.
Finally, both false positives and negatives in ALPS can be reduced by using competing services like Bing Streetside [72], which may capture images when a landmark is not occluded, under different lighting conditions, or from perspectives that eliminate false positive detections. To evaluate this idea, we downloaded
images from Bing Streetside near fire hydrants that were not detected using Google Street View. By combining
both image sources, ALPS detected 300 out of 314 visible fire hydrants, resulting in 95.5% coverage in this
area.
City-Scale Positioning To understand the efficacy of localizing logos, like that of Subway, over larger
scales, we evaluated ALPS on larger geographic areas on the scale of an entire city. At these scales, ALPS
will work, but we did not wish to abuse the Street View service and download large image sets. So, we
explored city-scale positioning performance by feeding seed-location hints in the form of addresses for Subway
restaurants, obtained from the chain’s web page.
Table 3.2 shows the coverage with seed locations in different areas. Across these five cities, ALPS achieves
more than 92% coverage. With seed location hints, ALPS does not perform base image retrieval, so errors arise
for other reasons. We manually analyzed the causes for errors in these five cities. In all cities, the invisible
Subways were inside a building or plaza, so image analysis could not have located them. The missed Subways
in Los Angeles, Mountain View, San Diego and Redmond were either because: (a) the logo detector failed to
detect the logo in any images (because the image was partly or completely occluded), or (b) the positioning
algorithm did not, in some clusters, have enough perspectives to localize.
ALPS does not exhibit false positives for Subway restaurants. For hydrants, all false positives arise because
the landmark detector mis-detected other objects as hydrants. The Subway sign is distinctive enough that, even though the landmark detector did have some false positives, these were weeded out by the rest of the ALPS pipeline.

City            # Subway   # Visible   # ALPS   Coverage   Median error (m)
Los Angeles     123        118         115      97%        4.8
Mountain View   38         26          24       92%        5.1
San Francisco   49         39          39       100%       4.2
San Diego       57         44          41       93%        5.0
Redmond         31         25          24       96%        4.8
Table 3.2: Coverage with Seed Locations.

Subway #              1       2       3       4       5       6       7
Error of Google (m)   10.06   11.53   14.10   30.38   59.48   16.60   14.90
Error of ALPS (m)     2.03    4.53    6.78    2.93    7.39    5.94    3.33
Table 3.3: Error of ALPS and Google for localizing Subways.
At city scales too, the accuracy of ALPS is high. Figure 3.10 shows the CDF of errors of ALPS and
Google Places locations for all of the Subways (we exclude the Subways that are not visible in any Street View
image). ALPS can localize 93% of the Subways within 10 meters, and its median error is 4.95 meters while
the median error from Google Places is 10.17 meters. Moreover, for 87% of the Subways, Google Places has a
higher error than ALPS in positioning. These differences might be important for high-precision applications
like drone-based delivery.
Figure 3.9: Distribution of position errors for hydrants in the 90004 zip-code.
Figure 3.10: Distribution of errors for Subways in five cities.
Figure 3.11: Distribution of position errors for ALPS on Subway with and without an ideal detector in Los Angeles.
3.4.3 Scalability: Bottlenecks and Optimizations
Processing Time To understand scaling bottlenecks in ALPS, Table 3.4 breaks down the time taken by each
component for the 90004 zip-code experiment (for both Subways and hydrants).

Module     Base Retrieval   Base Detection   Cluster   Adaptive Retrieval   Adaptive Detection   Positioning
Time (s)   3528             8741             0.749     715                  1771                 0.095
Table 3.4: Processing time of each module.

In this experiment, base
image retrieval, which retrieved nearly 150 thousand images, was performed only once (since that component
is agnostic to the type of landmark being detected). Every other component was invoked once for each type of
landmark.
Of the various components, clustering and positioning are extremely cheap. ALPS thus has two bottlenecks.
The first is image retrieval, which justifies our optimization of this component (we discuss this more below).
The second bottleneck is running the landmark detector. On average, it takes 59 milliseconds for the landmark
detector to run detection on an image, regardless of whether the image contains the landmark or not. However,
because we process over 150 thousand images, these times become significant. (Landmark detection is
performed both on the base images to determine clusters, and on adaptively retrieved images for positioning,
hence the two numbers in the table). Faster GPUs can reduce this time.
Fortunately, ALPS can be scaled to larger regions by parallelizing its computations across multiple servers.
Many of its components are trivially parallelizable, including base image retrieval which can be parallelized
by partitioning the geographic region, and adaptive image retrieval and positioning which can be partitioned
across seed location. Only clustering might not be amenable to parallelization, but clustering is very fast. We
have left an implementation of this parallelization to future work.
The Benefit of Adaptive Retrieval Instead of ALPS’s two phase (basic and adaptive) retrieval strategy,
we could have adopted two other strategies: (a) a naive strategy which downloads images at very fine spatial
scales of 1 meter, (b) a one phase strategy which downloads 6 images, each with a 60
viewing angle so ALPS
can have high visual coverage. For the 90004 zip-code experiment, the naive strategy retrieves 24 more
images than ALPS’s two-phase strategy, while one-phase retrieves about 3 as many. The retrieval times are
roughly proportional to the number of images retrieved, so ALPS’s optimizations provide significant gains.
These gains come at a very small loss in coverage: one-phase has 1.91% higher coverage than two-phase for
hydrants mostly because the former has more perspectives: for example, hydrants that were occluded in the
base images can be seen in one-phase images.
Seed Location Hints We have already seen that seed location hints helped us scale ALPS to large cities.
These hints provide similar trade-offs as adaptive retrieval: significantly fewer images to download at the
expense of slightly lower coverage. For hydrants in 90004, using hints that tell ALPS to look at street corners
or mid-way between intersections and in the lower half of the image enabled ALPS to retrieve 3× fewer images, while detecting only 5% fewer hydrants.

                  YOLO    HoG+SVM   SIFT
Precision         85.1%   74.2%     63.7%
Recall            87.4%   80.5%     40.6%
Speed (sec/img)   0.059   0.32      0.65
Table 3.5: Evaluation of different object detection methods.
3.4.4 Accuracy and Coverage Optimizations
Object Detection Techniques The accuracy of the object detector is central to ALPS. We evaluated the recall,
precision, and processing time of several different object detection approaches: YOLO, HoG+SVM, and
keypoint matching [18] with SIFT [224] features. For HoG+SVM, we trained LIBSVM [77] with HoG [160]
features and a linear kernel. Table 3.5 shows that YOLO outperforms the other two approaches in both recall
and precision for recognizing the Subway logo. YOLO also has the fastest processing time due to GPU
acceleration.
Street View Zoom ALPS uses zoomed Street View images to increase detection accuracy. To quantify this: with zoomed-in images, the landmark detector had a precision of 96.2% and a recall of 86.8%; in comparison, using only YOLO without zoomed-in images yielded a precision of 85.1% and a recall of 87.4%.
To understand how the object detector affects the accuracy of ALPS, we manually labeled the position of
the Subway logo in all the images in the dataset of Subways in LA. We thus emulated an object detector with
100% precision and recall. This ideal detector finds the three missing Subways (by design), but with position
accuracy comparable to YOLO (Figure 3.11).
Importance of Bearing-based Clustering We used the fire hydrant dataset to understand the incremental
benefit of bearing-based cluster refinement. Without this refinement, ALPS can only localize 141 fire hydrants
of 314 visible ones, while the refinement increases coverage by nearly 2× to 262 hydrants. Moreover, without
bearing-based refinement, position errors can be large (in one case, as large as 80 meters) because different
hydrants can be grouped into one cluster.
Chapter 4
Caesar: Cross-Camera Complex
Activity Recognition
4.1 Introduction
Being able to automatically detect activities occurring in the view of a single camera is an important challenge
in machine vision. The availability of action data sets [5, 59] has enabled the use of deep learning for this
problem. Deep neural networks (DNNs) can detect what we call atomic actions occurring within a single
camera. Examples of atomic actions include “talking on the phone”, “talking to someone else”, “walking” etc.
Prior to the advent of neural networks, activity detection relied on inferring spatial and temporal relation-
ships between objects. For example, consider the activity “getting into a car”, which involves a person walking
towards the car, then disappearing from the camera view. Rules that specify spatial and temporal relationships
can express this sequence of actions, and a detection system can evaluate these rules to detect such activities.
In this chapter, we consider the next frontier in activity detection research, exploring the near real-time
detection of complex activities potentially occurring across multiple cameras. A complex activity comprises
two or more atomic actions, some of which may play out in one camera and some in another: e.g., a person
gets into a car in one camera, then gets out of the car in another camera and hands off a bag to a person.
We take a pragmatic, systems view of the problem, and ask: given a collection of (possibly wireless)
surveillance cameras, what architecture and algorithms should an end-to-end system incorporate to provide
accurate and scalable complex activity detection?
Future cameras are likely to be wireless and incorporate onboard GPUs. However, activity detection using
DNNs is too resource intensive for embedded GPUs on these cameras. Moreover, because complex activities
may occur across multiple cameras, another device may need to aggregate detections at individual cameras.
An edge cluster at a cable head end or a cellular base station is ideal for our setting: GPUs on this edge cluster
can process videos from multiple cameras with low detection latency because the edge cluster is topologically
close to the cameras (Figure 4.1).
Even with this architecture, complex activity detection poses several challenges: (a) How to specify
complex activities occurring across multiple cameras? (b) How to partition the processing of the videos
between compute resources available on the camera and the edge cluster? (c) How to reduce the wireless
bandwidth requirement between the camera and the edge cluster? (d) How to scale processing on the edge
cluster in order to multiplex multiple cameras on a single cluster while still being able to process cameras in
near real-time?
Contributions. In addressing these challenges, Caesar makes three important contributions.
First, it adopts a hybrid approach to complex activity detection where some parts of the complex activity
use DNNs, while others are rule-based. This architectural choice is unavoidable: in the foreseeable future,
purely DNN-based complex activity detection is unlikely, since training data for such complex activities is
hard to come by. Moreover, a hybrid approach permits evolution of complex activity descriptions: as training
data becomes available over time, it may be possible to train DNNs to detect more atomic actions.
Second, to support this evolution, Caesar defines a language to describe complex activities. In this language,
a complex activity consists of a sequence of clauses linked together by temporal relationships. A clause can
either express a spatial relationship, or an atomic action. Caesar users can express multiple complex activities
of interest, and Caesar can process camera feeds in near real-time to identify these complex activities.
Third, Caesar incorporates a graph matching algorithm that efficiently matches camera feeds to complex
activity descriptions. This algorithm leverages these descriptions to optimize wireless network bandwidth and
edge cluster scaling. To optimize wireless network bandwidth, it performs object detection on the camera,
then, at the edge cluster, lazily retrieves images associated with the detected objects only when needed (e.g., to
identify whether an object has appeared in another camera). To scale the edge cluster computation, it lazily
invokes the action detection DNNs (the computational bottleneck) only when necessary.
Using a publicly available multi-camera data set, and an implementation of Caesar on an edge cluster, we
show that, compared to a strawman approach which does not incorporate our optimizations, Caesar has 1-2
orders of magnitude lower detection latency and requires an order of magnitude less on-board camera memory
(to support lazy retrieval of images). Caesar’s graph matching algorithm works perfectly, and its accuracy is
only limited by the DNNs we use for action detection and re-identification (determining whether two human
images belong to the same person).
While prior work has explored single-camera action detection [293, 277, 323], tracking of people
across multiple overlapping cameras [313, 237, 280] and non-overlapping cameras [262, 290, 153], to our
knowledge, no prior work has explored a near real-time hybrid system for multi-camera complex activity
detection.
4.2 Background and Motivation
Goal and requirements. Caesar detects complex activities across multiple non-overlapping cameras. It must
support accurate, efficient, near real-time detection while permitting hybrid activity specifications. In this
section, we discuss the goal and these requirements in greater detail.
Figure 4.1: The high-level concept of a complex activity detection system: the user defines the rule (e.g., "A person gets on a car then leaves with a bag"), then the system monitors incoming videos and outputs the matched frames and objects.
Atomic and complex activities. An atomic activity is one that can be succinctly described by a single word
label or short phrase, such as “walking”, “talking”, “using a phone”. In this chapter, we assume that atomic
activities can be entirely captured on a single camera.
A complex activity (i) involves multiple atomic activities (ii) related in time (e.g., one occurs before or after
another), space (e.g., two atomic activities occur near each other), or in the set of participants (e.g., the same
person takes part in two atomic activities), and (iii) can span multiple cameras whose views do not overlap. An
example of a complex activity is: “A person walking while talking on the phone in one camera, and the same
person talking to another person at a different camera a short while later”. This statement expresses temporal
relationships between activities occurring in two cameras (“a short while later”) and spatial relationships
between participants (“talking to another person”).
Applications of complex activity detection. Increasingly, cities are installing surveillance cameras on light
poles or mobile platforms like police cars and drones. However, manually monitoring all cameras is labor
intensive given the large number of cameras [61], so today’s surveillance systems can only deter crimes and
enable forensic analysis. They cannot anticipate events as they unfold in near real time. A recent study [183]
shows that such anticipation is possible: many crimes share common signatures such as “a group of people
walking together late at night” or “a person getting out of a car and dropping something”. Automated systems
to identify these signatures will likely increase the effectiveness of surveillance systems.
The retail industry can also use complex activity detection. Today, shop owners install cameras to prevent
theft and to track consumer behavior. A complex activity detection system can track customer purchases and
browsing habits, providing valuable behavioral analytics to improve sales and design theft countermeasures.
Caesar architecture. Figure 4.1 depicts the high-level functional architecture of Caesar. Today, video
processing and activity detection are well beyond the capabilities of mobile devices or embedded processors on
cameras. So Caesar will need to leverage edge computing, in which these devices offload video processing to a
nearby server cluster. This cluster is a convenient rendezvous point for correlating data from non-overlapping
cameras.
Caesar requirements. Caesar should process videos with high throughput and low end-to-end latency.
Throughput, or the rate at which it can process frames, can impact Caesar’s accuracy and can determine if it is
able to keep up with the video source. Typical surveillance applications process 20 frames per second. The
end-to-end latency, which is the time between when a complex activity occurs and when Caesar reports it, must
be low to permit fast near real-time response to developing situations. In some settings, such as large outdoor
events in locations with minimal infrastructure [33], video capture devices might be un-tethered so Caesar
should conserve wireless bandwidth when possible. To do this, Caesar can leverage significant on-board
compute infrastructure: over the past year, companies have announced plans to develop surveillance cameras
with onboard GPUs [12]. Since edge cluster usage is likely to incur cost (in the same way as cloud usage),
Caesar should scale well: it should maximize the number of cameras that can be concurrently processed on a
given set of resources. Finally, Caesar should have high precision and recall detecting complex activities.
The case for hybrid complex activity detection. Early work on activity detection used a rule-based ap-
proach [247, 287]. A rule codifies relationships between actors (people); rule specifications can use ontolo-
gies [287] or And-Or Graphs [247]. Activity detection algorithms match these rule specifications to actors and
objects detected in a video.
More recent approaches are data-driven [293, 277, 323], and train deep neural nets (DNNs) to detect
activities. These approaches extract tubes (sequences of bounding boxes) from video feeds; these tubes contain
the actor performing an activity, as well as the surrounding context. They are then fed into a DNN trained on
one or more action data sets (e.g., AVA [5], UCF101 [58], and VIRAT [59]), which output the label associated
with the activity. Other work [229] has used a slightly different approach. It learns rules as relationships
between actors and objects from training data, then applies these rules to match objects and actors detected in
a video feed.
Figure 4.2: The high-level design of Caesar. Dots with different colors represent different DNN modules for specific tasks.
While data-driven approaches are preferable over rule-based ones because they can generalize better,
complex activity detection cannot use purely data-driven approaches. By definition, a complex activity
comprises individual actions combined together. Because there can be combinatorially many complex
activities from a given set of individual activities, and because data-driven approaches require large amounts of
training data, it will likely be infeasible to train neural networks for all possible complex activities of interest.
Thus, in this chapter, we explore a hybrid approach in which rules, based on an extensible vocabulary,
describe complex activities. The vocabulary can include atomic actions: e.g., “talking on a phone”, or “walking
a dog”. Using this vocabulary, Caesar users can define a rule for “walking a dog while talking on the phone”.
Then, Caesar can detect a more complex activity over this new atomic action: “walking a dog while talking on
the phone, then checking the postbox for mail before entering a doorway”. (For brevity of description, a rule
can, in turn, use other rules in its definition.)
Challenges. Caesar uses hybrid complex activity detection to process feeds in near real-time while satisfying
the requirements described above. To do this, it must determine: (a) How to specify complex activities across
multiple non-overlapping cameras? (b) How to optimize the use of edge compute resources to permit the
system to scale to multiple cameras? (c) How to conserve wireless bandwidth by leveraging on-board GPUs
near the camera?
4.3 Caesar Design
In Caesar, users first specify one or more rules that describe complex activities (Figure 4.5): this rule definition
language includes elements such as objects, actors, and actions, as well as spatial and temporal relationships
between them.
Cameras generate video feeds, and Caesar processes these using a three-stage pipeline (Figure 4.2). In
the object detection stage, Caesar generates bounding boxes of actors and objects seen in each frame. For
wireless cameras, Caesar can leverage on-board mobile GPUs to run object detection on the device; subsequent stages must run on the edge cluster. The input to, and output of, object detection is the same regardless of whether it runs on the mobile device or the edge cluster. A re-identification and tracking module processes these bounding boxes. It (a) extracts tubes for actors and objects by tracking them across multiple frames and (b) determines whether actors in different cameras represent the same person. Finally, a graph matching and lazy action detection module determines: (a) whether the relationships between actor and object tubes match pre-defined rules for complex activities and (b) when and where to invoke DNNs to detect actions to complete rule matches. Table 4.1 shows the three modules' data formats.

Module             Input                            Output
Object Detection   Image                            Object bounding boxes
Track & ReID       Object bounding boxes, image     Object track ID
Action Detection   Object boxes & track ID, image   Actions
Table 4.1: Input and output content of each module in Caesar.
Figure 4.3 shows an example of Caesar’s output for a single camera. It annotates the live camera feed
with detected activities. In this snapshot, two activities are visible: one is a person who was using a phone
in another camera, another is a person exiting a car. Our demonstration video (https://vimeo.com/330176833) shows Caesar's outputs for multiple concurrent camera feeds.
Caesar meets the requirements and challenges described above as follows: it processes streams continuously, so it can detect events in near real time; it incorporates robustness optimizations for tracking, re-identification, and
graph matching to ensure accuracy; it scales by lazily detecting actions, thereby minimizing DNN invocation.
4.3.1 Rule Definition and Parsing
Caesar’s first contribution is an extensible rule definition language. Based on the observation that complex
activity definitions specify relationships in space and time between actors, objects, and/or atomic actions
(henceforth simply actions), the language incorporates three different vocabularies (Figure 4.4).
Vocabularies. An element vocabulary specifies the list of actors or objects (e.g., “person”, “bag”, “bicycle”)
and actions (e.g., “talking on the phone”). As additional detectors for atomic actions become available (e.g.,
“walking a dog”) from new DNNs or new action definition rules, Caesar can incorporate corresponding vocabulary extensions for these.
Figure 4.3: The output of Caesar with annotations (green boxes: vehicles; yellow boxes: people; white boxes: other objects; detected complex activities are labeled).
Figure 4.4: Examples of the vocabulary elements: DNN actions (e.g., Talking, Use-Phone, Sitting, Carry-Bag), spatial actions (e.g., Near, Move, Close, Overlap, ReID), and logical relations (Then, And, Or, Not).
A spatial operator vocabulary defines spatial relationships between actors, objects, and actions. Spatial
relationships use binary operators such as “near” and “approach”. For example, before a person p1 can talk
to p2, p1 must “approach” p2 and then come “near” p2 (or vice versa). Unary operators such as “stop” or
“disappear” specify the dispensation of participants or objects. For example, after approaching p2, p1 must
“stop” before he or she can talk to p2. Another type of spatial operator is for describing which camera an actor
appears in. The operator “re-identified” specifies an actor recognized in a new camera. The binary operator
“same-camera” indicates that two actors are in the same camera.
Finally, a temporal operator vocabulary defines concurrent as well as sequential activities. The binary
operator “then” specifies that one object or action is visible after another, “and” specifies that two objects or
actions are concurrently visible, while “or” specifies that they may be concurrently visible. The unary operator
“not” specifies the absence of a corresponding object or action.
A complex activity definition contains three components (Figure 4.5). The first is a unique name for the
activity, and the second is a set of variable names representing actors or objects. For instance, p1 and p2
might represent two people, and c a car. The third is the definition of the complex activity in terms of these
variables. A complex activity definition is a sequence of clauses, where each clause is either an action (e.g., p1 use-phone) or a unary or binary spatial operator (e.g., (p1 close p2), or (p1 move)). Temporal operators link two clauses, so a complex activity definition is a sequence of clauses separated by temporal operators. Figure 4.5 shows examples of two complex activities, one describing a person getting into a car, and another describing a person who is seen, in two different cameras, talking on the phone while carrying a bag.

Figure 4.5: Two examples of action definition using Caesar syntax.
  Action Name: get_on_car
  Subjects: Person p1, Car c
  Action Definition: (p1 approach c) and (p1 near c) and (c stop) then (p1 close c) and (p1 disappear) and (c stop)

  Action Name: use_phone_and_cross_cam_with_bag
  Subjects: Person p1, Bag b
  Action Definition: (p1 use_phone) and (p1 move) and (p1 overlap b) then (p1 reid) then (p1 move) and (p1 overlap b)

Figure 4.6: Two examples of parsed complex activity graphs.
The rule parser. Caesar parses each rule to extract an intermediate representation suitable for matching. In
Caesar, that representation is a directed acyclic graph (or DAG), in which nodes are clauses and edges represent
temporal relationships. Figure 4.6 shows the parsed graphs of the definition rules. At runtime, Caesar’s graph
matching component attempts to match each complex activity DAG specification to the actors, objects, and
actions detected in the video feeds of multiple cameras.
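A sketch of the intermediate representation the parser could emit for get_on_car (Figure 4.5), with nodes as clauses and edges as temporal relationships; the dataclass layout is an assumption, not Caesar's actual code.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Clause:
    op: str                                     # e.g., "approach", "near", "stop"
    args: Tuple[str, ...]                       # variable names, e.g., ("p1", "c")
    matched_interval: Optional[Tuple[float, float]] = None   # filled in at runtime

@dataclass
class ActivityGraph:
    name: str
    nodes: List[Clause] = field(default_factory=list)
    edges: List[Tuple[int, int, str]] = field(default_factory=list)  # (src, dst, "and"/"or"/"then")

get_on_car = ActivityGraph(
    name="get_on_car",
    nodes=[Clause("approach", ("p1", "c")), Clause("near", ("p1", "c")),
           Clause("stop", ("c",)), Clause("close", ("p1", "c")),
           Clause("disappear", ("p1",))],
    edges=[(0, 1, "and"), (1, 2, "and"), (2, 3, "then"), (3, 4, "and")],
)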
4.3.2 Object Detection
On-camera object detection. The first step in detecting a complex activity is detecting objects in frames.
This component processes each frame, extracts a bounding box for each distinct object within the frame, and emits the box coordinates, the cropped image within the bounding box, and the object label.
Figure 4.7: Workflow of object detection on the mobile device.
Today,
DNNs like YOLO [256] and SSD [219] can quickly and accurately detect objects. These detectors also have
stripped-down versions that permit execution on a mobile device. Caesar allows a camera with on-board GPUs
to run these object detectors locally. When this is not possible, Caesar schedules GPU execution on the edge
cluster. (The next step in our pipeline involves re-identifying actors across multiple cameras, and cannot be
easily executed on the mobile device).
Optimizing wireless bandwidth. When the mobile device runs the object detector, it may still be necessary
to upload the cropped images for each of the bounding boxes (in addition to the bounding box coordinates and
the labels). Surveillance cameras can see tens to hundreds of people or cars per frame, so uploading images
can be bandwidth intensive. In Caesar, the mobile device maintains a cache of recently seen images and the
edge cluster lazily retrieves images from the mobile device to reduce this overhead.
Caesar is able to perform this optimization for two reasons. First, for tracking and re-identification, not all
images might be necessary; for instance, if a person appears in 20 or 30 successive frames, Caesar might need
only the cropped image of the person from one of these frames for re-identification. Second, while all images
might be necessary for action detection, Caesar minimizes invocation of the action detection module, reducing
the need for image transfer.
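A sketch of the on-camera image cache and lazy retrieval described above: the camera uploads only bounding-box coordinates and labels, and the edge cluster later requests a cropped image by (frame_id, box_id) only when re-identification or action detection needs it. The cache API and capacity are assumptions, not Caesar's exact implementation.

from collections import OrderedDict

class ImageCache:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.crops = OrderedDict()                  # (frame_id, box_id) -> JPEG bytes

    def put(self, frame_id, box_id, crop_bytes):
        self.crops[(frame_id, box_id)] = crop_bytes
        if len(self.crops) > self.capacity:
            self.crops.popitem(last=False)          # evict the oldest crop

    def get(self, frame_id, box_id):
        # Called when the edge cluster lazily requests a crop; returns None
        # if the crop has already been evicted.
        return self.crops.get((frame_id, box_id))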
4.3.3 Tracking and Re-Identification
The tube abstraction. Caesar’s expressivity in capturing complex activities comes from the tube abstraction.
A tube is a sequence of bounding boxes over successive frames that represent the same object. As such, a
tube has a distinct start and end time, and a label associated with the object. Caesar’s tracker (Algorithm (1))
takes as input the sequence of bounding boxes from the object detector, and assigns, to each bounding box a
globally unique tube ID. In its rule definitions, Caesar detects spatial and temporal relationships between tubes.
Tubes also permit low overhead re-identification, as we discuss below.
Algorithm 1 Cross-Camera Tracking and Re-Identification
1: INPUT: list of bounding boxes
2: for each person box B in bounding boxes do
3:     ID_B = update_tubes(existing_tubes, B)
4:     if ID_B not in existing_tubes then
5:         frame = get_frame_from_camera()
6:         F_B = get_DNN_feature(frame, B)
7:         for ID_local in local_tubes do
8:             if feature_dist(ID_local, ID_B) < threshold then
9:                 ID_B = ID_local; Update(ID_local); return
10:        for ID_other in other_tubes do
11:            if feature_dist(ID_other, ID_B) < threshold then
12:                ID_B = ID_other; Update(ID_other); return
13:        Update(ID_B)
Tracking. The job of the tracking sub-component is to extract tubes. This sub-component uses a state-of-
the-art tracking algorithm called DeepSORT [302] that runs on the edge server side (line 6, Algorithm (1)).
DeepSORT takes as input bounding box positions and extracts features of the image within the bounding
box. It then tracks the bounding boxes using Kalman filtering, together with image-feature similarity and the intersection-over-union between successive bounding boxes.
Caesar receives bounding boxes from the object detector and passes them to DeepSORT, which either associates
the bounding box with an existing tube or fails to do so. In the latter case, Caesar starts a new tube with this
bounding box. As it runs, Caesar’s tracking component continuously saves bounding boxes and their tube ID
associations to a distributed key-value store within the edge cluster, described below, that enables fast tube
matching in subsequent steps.
Caesar makes one important performance optimization. Normally, DeepSORT needs to run a DNN for
person re-identification features. This is feasible when object detection runs on the edge cluster. However,
when object detection runs on the mobile device, feature extraction can require additional compute and network
resources, so Caesar relies entirely on DeepSORT’s ability to track using bounding box positions alone. This
design choice permits Caesar to conserve wireless network bandwidth by transmitting only bounding box
positions instead of uploading the whole frame.
Robust tracking. When in the view of the camera, an object or actor might be partially obscured. If this
happens, the tracking algorithm detects two distinct tubes. To be robust to partial occlusions, Caesar retrieves
the cropped image corresponding to the first bounding box in the tube (line 5, Algorithm (1)). Then, it applies
a re-identification DNN (described below) to match this tube with existing tubes detected in the local camera
(lines 7-10, Algorithm (1)). If it finds a match, Caesar uses simple geometric checks (e.g., bounding box
continuity) before assigning the same identifier to both tubes.
Cross-camera re-identification. Cross camera re-identification is the ability to re-identify a person or object
between two cameras. Caesar uses an off-the-shelf DNN [301] which, trained on a corpus of images, outputs a
feature vector that uniquely identifies the input image. Two images belong to the same person if the distance
between the feature vectors is within a predefined threshold.
To perform re-identification, Caesar uses the image retrieved for robust tracking, and searches a distributed
key-value store for a matching tube from another camera. Because the edge cluster can have multiple servers,
and different servers can process feeds from different cameras, Caesar uses a fast in-memory distributed
key-value store [47] to save tubes.
Re-identification can incur a high false positive rate. To make it more robust, we encode the camera
topology [249] in the re-identification sub-component. In this topology, nodes are cameras, and an edge exists
between two cameras only if a person or a car can go from one camera to another without entering the field of
view of any other non-overlapping camera. Given this, when Caesar tries to find a match for a tube seen at
camera A, it applies the re-identification DNN only to tubes at neighbors of A in the topology. To scope this
search, Caesar uses travel times between cameras [249].
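A sketch of topology-scoped re-identification: a new tube at camera A is compared only against tubes stored for A's neighbors in the camera topology, and only when the inter-camera travel time is plausible. feature_dist, the tube store accessor, and both thresholds are assumptions rather than Caesar's tuned values.

def reid_match(new_tube, camera_id, topology, tube_store, feature_dist,
               dist_threshold=0.6, max_travel_s=300):
    for neighbor in topology.neighbors(camera_id):
        for tube in tube_store.tubes_at(neighbor):
            travel = new_tube.start_time - tube.end_time
            if 0 < travel < max_travel_s and \
               feature_dist(new_tube.feature, tube.feature) < dist_threshold:
                return tube.global_id     # same person seen earlier at the neighbor
    return None                           # no match: assign a new global identity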
4.3.4 Action Detection and Graph Matching
In a given set of cameras, users may want to detect multiple complex activities, and multiple instances of
each activity can occur. Caesar’s rule parser generates an intermediate graph representation for each complex
activity, and the graph matching component matches tubes to the graph in order to detect when a complex
activity has occurred. For reasons discussed below, graph matching dynamically invokes atomic action
detection, so we describe these two components together in this section.
Node matching. The first step in graph matching is to match tubes to nodes in one or more graphs. Recall
that a node in a graph represents a clause that describes spatial actions or spatial relationships. Nodes consist
of a unary or binary operator, together with the corresponding operands. For each operator, Caesar defines
an algorithm to evaluate the operator.
Matching unary operators. For example, consider the clause stop c, which determines whether the car c is stationary. This is evaluated to true if the bounding box for c has the same position in successive frames. Thus, a tube belonging to a stationary car matches the node stop c in each graph, and Caesar binds c to its tube ID.
Similarly, the unary operator disappear determines if its operand is no longer visible in the camera. The algorithm to evaluate this operator considers two scenarios: an object or person disappearing (visible in one frame and not visible in the next) by (a) entering a vehicle or building, or (b) leaving the camera's field of view. When either of these happens, the object's tube matches the corresponding node in a graph.
Matching binary operators. For binary operators, node matching is a little more involved, and we explain this
using an example. Consider the clause p1 near p2, which asks: is there a person near another person? To
evaluate this, Caesar checks each pair of person tubes to see if there was any instant at which the corresponding
persons were close to each other. For this, it divides up each tube into small chunks of duration t (1 second in
our implementation), and checks for the proximity of all bounding boxes pairwise in each pair of chunks.
To determine proximity, Caesar uses the following metric. Consider two bounding boxes x and y. Let d(x, y) be the smallest pixel distance between the outer edges of the bounding boxes, and let b(x) (respectively b(y)) be the largest dimension of bounding box x (respectively, y). Then, if either d(x, y)/b(x) or d(x, y)/b(y) is less than a fixed threshold δ, we say that the two bounding boxes are proximate to each other. Intuitively, the measure defines
proximity with respect to object dimensions: two large objects can have a larger pixel distance between them
than two small objects, yet Caesar may declare the larger objects close to each other, but not the smaller ones.
Finally, p1 near p2 is true for two people tubes if there is a chunk within those tubes in which a
majority of bounding boxes are proximate to each other. We use the majority test to be robust to errors in the bounding box determinations of the underlying object detector.
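A direct sketch of the proximity test above, approximating d(x, y) by the gap between axis-aligned boxes and b(·) by the larger box dimension; the box fields (x1, y1, x2, y2) and the value of δ are assumptions.

def proximate(box_a, box_b, delta=0.5):
    # Smallest edge-to-edge distance between the two boxes (0 if they overlap).
    dx = max(box_b.x1 - box_a.x2, box_a.x1 - box_b.x2, 0)
    dy = max(box_b.y1 - box_a.y2, box_a.y1 - box_b.y2, 0)
    d = (dx ** 2 + dy ** 2) ** 0.5
    # Largest dimension of each box.
    b_a = max(box_a.x2 - box_a.x1, box_a.y2 - box_a.y1)
    b_b = max(box_b.x2 - box_b.x1, box_b.y2 - box_b.y1)
    return d / b_a < delta or d / b_b < delta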
Caesar includes similar algorithms for other binary spatial operators. For example, the matching algorithm
for p1 approaches p2 is a slight variant of p1 near p2: in addition to the proximity check, Caesar
also detects whether bounding boxes in successive frames decrease in distance just before they come near each
other.
Time of match. In all of these examples, a match occurs within a specific time interval (t1, t2). This time interval is crucial for edge matching, as we discuss below.
Edge matching. In Caesar’s intermediate graph representation, an edge represents a temporal constraint. We
permit two types of temporal relationships: concurrent (represented byand which requires that two nodes
must be concurrent, andor which specifies that two nodes may be concurrent), and sequential (one node
occurs strictly after another).
To illustrate how edge matching works, consider the following example. Suppose there are two matched nodes a and b. Each node has a time interval associated with the match. Then a and b are concurrent if their time intervals overlap. Otherwise, a then b is true if b's time interval is strictly after a's.
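As a rough sketch, assuming each matched node carries a (start, end) time interval, the two temporal checks reduce to:

def concurrent(a, b):
    # intervals overlap: satisfies the "and"/"or" (concurrent) edge types
    return a[0] <= b[1] and b[0] <= a[1]

def strictly_after(a, b):
    # "a then b": b's interval starts strictly after a's interval ends
    return b[0] > a[1]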
Detecting atomic actions. Rule-based activity detection has its limits. Consider the atomic action “talking on the phone”. One could specify this action using the rule p1 near m, where p1 represents a person and m represents a mobile phone. Unfortunately, phones are often too small in surveillance videos to be captured
by object detectors. DNNs, when trained on a large number of samples of people using phones, can more
effectively detect this atomic action.
Action matching. For this reason, Caesar rules can include clauses matched by a DNN. For example, the clause
talking_phone(p1) tries to find a person tube by applying each tube to a DNN. For this, Caesar uses
the DNN described in [293]. We have trained this on Google's AVA [5] dataset which includes 1-second
video segments from movies and action annotations. The training process ensures that the resulting model can
detect atomic actions in surveillance videos without additional fine-tuning; see [293] for additional details.
The model can recognize a handful of actions associated with person tubes, such as: “talking on the phone”,
“sitting”, and “opening a door”. For each person tube, it uses the Inflated 3D features [150] to extract features
which represent the temporal dynamics of actions, and returns a list of possible action labels and associated
confidence levels. Given this, we say that a person tube matches talking_phone(p1) if there is a video chunk in which “talking on the phone” has a higher confidence value than a fixed threshold.
Efficiency considerations. In Caesar’s rule definition language, an action clause is a node. Matching that node
requires running the DNN on every chunk of every tube. This is inefficient for two reasons. The first is GPU
inefficiency: the DNN takes about 40 ms for each chunk, so a person who appears in the video for 10 s would
require 0.4 s to process (each chunk is 1 s) unless Caesar provisions multiple GPUs to evaluate chunks in
parallel. The second is network inefficiency. To feed a person tube to the DNN, Caesar would need to retrieve
all images for that person tube from the mobile device.
Lazy action matching. To address these inefficiencies, Caesar matches actions lazily: it first tries to match all
non-action clauses in the graph, and only then tries to match actions. To understand how this works, consider
the rule definition a then b then c, where a and c are spatial clauses, and b is a DNN-based clause. Now, suppose a occurs at time t1 and c at t2. Caesar executes the DNN on all tubes that start after t1 and end before t2 in order to determine if there is a match. This addresses both the GPU and network inefficiencies discussed above, since the DNN executes on fewer tubes and Caesar retrieves fewer images.
Algorithm 2 shows the complete graph matching algorithm. The input contains the current frame number
as timestamp and a list of positions of active tubes with their tube IDs and locations. If Caesar has not
completed assembling a tube (e.g., the tube’s length is less than 1 second), it appends the tube images to the
temporary tube videos and records the current bounding box locations (lines 2-3). When a tube is available,
Caesar attempts to match the spatial clauses in one or more complex activity definition graphs (line 5). Once it
determines a match, Caesar marks the vertex in that graph as done, and checks all its neighbor nodes in
the graph. If the neighbor is a DNN node, it adds a new entry to the DNN checklist of the graph, and moves
on to its neighbors. The entry contains the tube ID, DNN action label, starting time, and the ending time.
Figure 4.8: An example of graph matching logic: the left three graphs are unfinished graphs of each tube, and the right three graphs are their updated graphs.
DNN                          Speed (FPS)
Object Detection [219, 256]  40-60
Tracking & ReID [302]        30-40
Action Detection [293]       50-60 (per tube)
Table 4.2: The runtime frame rate of each DNN model used by Caesar (evaluated on a single desktop GPU [43]).
The starting time is the time when the node is first visited. The end time is the time when Caesar matches
its next node. In our example above, when a is matched at timestamp T1, Caesar creates an entry for b in this graph, with the starting time as T1. When c is matched at T2, the algorithm adds T2 to b's entry as the ending timestamp.
Algorithm 2 Activity Detection with Selective DNN Activation
1: INPUT: incoming tube
2: if tube_cache not full then
3:     tube_cache.add(tube); return
4: spatial_acts = get_spatial_actions(tube_cache)
5: for sa in spatial_acts do
6:     for g in tube_graph_mapping[sa.tube_id] do
7:         if sa not in g.next_acts then
8:             continue
9:         if g.has_pending_nn_act then
10:            nn_acts = get_nn_actions(g.nn_start, cur_time())
11:            if g.pending_nn_act not in nn_acts then
12:                continue
13:        g.next_acts = sa.neighbors()
14:        tube_graph_mapping.update()
15:        if g.last_node_matched then
16:            add g to output activities
4.4 Evaluation
In this section, we evaluate Caesar’s accuracy and scalability on a publicly available multi-camera data set.
4.4.1 Methodology
Implementation and experiment setup. Our experiments use an implementation of Caesar which contains (a) object detection and image caching on the camera, and (b) tracking, re-identification, action detection, and
graph matching on the edge cluster. Caesar’s demo code is available at https://github.com/USC-NSL/Caesar.
In our experiments, we use multiple cameras equipped with Nvidia’s TX2 GPU boards [44]. Each platform
contains a CPU, a GPU, and 4 GB memory shared between the CPU and the GPU. Caesar runs a DNN-based
object detector, SSD-MobilenetV2 [268], on the camera. As described earlier, Caesar caches the frames on the
camera, as well as cropped images of the detected bounding boxes. It sends box coordinates to the edge cluster
using RPC [23]. The server can subsequently request additional frames from a camera, permitting lazy retrieval.
A desktop with three Nvidia RTX 2080 GPUs [43] runs Caesar on the server side. One of the GPUs
runs the state-of-the-art online re-identification DNN [301] and the other two execute the action detection
DNNs [293]. Each action DNN instance has its own input data queue so Caesar can load-balance action
detection invocations across these two for efficiency. We use Redis [47] as the in-memory key-value storage.
Our implementation also includes a Flask [19] web server that allows users to input complex activity definitions,
and visualize the results.
DNN model selection. The action detector and the ReID DNN require 7.5 GB of memory, far more than the
4 GB available on the camera. This is why, as described earlier, our mobile device can only run DNN-based
object detection. Among the available models for object detection, we have evaluated four that fit within the
camera’s GPU memory. Table 4.3 shows the accuracy and speed (in frames per second) of these models on
our evaluation dataset. Our experiments use SSD-MobileNetv2 because it has good accuracy with a high frame
rate, which is crucial for Caesar because a higher frame rate can lead to higher accuracy in detecting complex
activities.
We use [301] for re-identification because it is lightweight enough to not be the bottleneck. Other ReID
models [284, 326] are more accurate than [301] on our dataset (tested offline), but are too slow (< 10 fps) to
use in Caesar. For atomic actions, other options [176, 282] have slightly higher accuracy than [293] on the AVA dataset, but are not publicly available yet and their runtime performance is not reported.
Dataset. We use DukeMTMC [13] for our evaluations. It has videos recorded by eight non-overlapping
surveillance cameras on campus. The dataset also contains annotations for each person, which gives us the
ground truth of each person tube’s position at any time. We selected 30 minutes of those cameras’ synchronized
videos for testing Caesar. There are 624 unique person IDs in that period, and each person shows up in the
view of 1.7 cameras on average. The average number of people showing up across all cameras is 11.4, and the maximum is 69.
Figure 4.9: Camera placement and the content of each camera.
Atomic action ground truth. DukeMTMC was originally designed for testing cross-camera person tracking
and ReID, so it does not have any action-related ground truth. Therefore, we labeled the atomic actions in
each frame for each person, using our current action vocabulary. Our action ground truth contains the action
label, timestamp, and the actor’s person ID. We labeled a total of 1,289 actions in all eight cameras. Besides
the atomic actions, we also labeled the ground truth traces of cars and bikes in the videos. Figure 4.9 shows
the placement of these cameras and a sample view from each camera.
Complex activity ground truth. We manually identified 149 complex activities. There are seven different
categories of these complex activities, as shown in Table 4.4. This table also lists two other attributes of each complex activity. The third column of the table shows the number of instances in the data
set of the corresponding complex activity, broken down by how many of them are seen on a single camera vs.
multiple cameras. Thus, for the first complex activity, the entry 12/1 means that our ground-truth contains 12
instances that are visible only on a single camera, and one that is visible across multiple cameras.
These complex activities are of three kinds. #1’s clauses, labeled “NN-only”, are all atomic actions
detected using a NN. #2 through #5, labeled “Mixed”, have clauses that are either spatial or are atomic actions.
The last two, #6 and #7, labeled “Spatial-only”, have only spatial clauses.
Metrics. We replay the eight videos at 20 fps to simulate real-time camera input. Caesar's
server takes the input from all mobile nodes, runs the action detection algorithm for a graph, and outputs
the result into logs. The results contain the activity’s name, timestamp, and the actor or object’s bounding
box location when the whole action finishes. Then we compare the log with the annotated ground truth. A
true positive is when the detected activity matches the ground truth’s complex activity label, overlaps with
the ground truth’s tubes for the complex activity, and has timestamp difference within a 2-second threshold.
We report: recall, which is the fraction of the ground truth classified as true positives, and precision, which
is the fraction of true positives among all Caesar-detected complex activities. We also evaluate Caesar's scalability, as well as the impact of its performance optimizations; we describe the corresponding metrics for these later.
DNN                     Speed (FPS)   Accuracy (mAP)
SSD [219]               3.7           91
YOLOv3 [256]            4.1           88
TinyYOLO [253]          8.5           84
SSD-MobileNetv2 [268]   11.2          83
Table 4.3: Speed and accuracy of different DNNs on the mobile GPU.
Figure 4.10: Caesar's (a) recall rate and (b) precision rate with different action detection and tracker accuracy. (c) The statistics and sample images of failures in all complex activities.
4.4.2 Accuracy
Overall. Table 4.5 shows the recall and precision of all complex activities. #1 (using the phone and then
talking to a person) and #4 (walking together then stopping to talk) have the lowest recall at 46.2% and the
lowest precision at 36.4%. At the other end, #5’s two instances achieve 100% recall and precision. Across all
complex activities, Caesar has a recall of 61.0% and a precision of 59.5%.
Understanding the accuracy results. Our results show that most NN-only and Mixed activities have lower precision and recall than those in the Spatial-only category. Recall that Caesar uses off-the-shelf neural networks for action detection and re-identification. This suggests that the action detection DNN, used in the first two categories but not in the third, is a larger source of detection failures than the re-identification DNN. Indeed, the reported mean average precision for these two models is 45% and 65% respectively on our dataset.
We expect the overall accuracy of complex activity detection to increase in the future for two reasons. First, we use off-the-shelf networks that are not customized for these cameras. There is significant evidence that customization can improve the accuracy of neural networks [251], especially for surveillance
ID   Complex Activity                    # of Samples (Single/Multi)   Type
1    Use phone then talk                 12 / 1                        NN-only
2    Stand, use phone then open door     9 / 2                         Mixed
3    Approach and give stuff             10 / 0                        Mixed
4    Walk together then stop and talk    6 / 2                         Mixed
5    Load stuff and get on car           2 / 0                         Mixed
6    Ride with bag in two cams           0 / 8                         Spatial-only
7    Walk together in two cams           0 / 97                        Spatial-only
Table 4.4: Summary of labeled complex activities.
cameras since their viewing angles are often different from the images used for training these networks.
Furthermore, these are highly active research areas, so with time we can expect improvements in accuracy.
Two natural questions arise: (a) as these neural networks improve, how will the overall accuracy increase?
and (b) to what extent does Caesar’s graph matching algorithm contribute to detection error? We address both
of these questions in the following analysis.
Projected accuracy improvements. We evaluate Caesar with the tracker and the action detector at different
accuracy levels. To do this, for each of these components, we vary the accuracy p (expressed as a percentage) as
follows. We re-run our experiment, but instead of running the DNNs when Caesar invokes the re-identification
and action detection components, we return the ground truth for that DNN (which we also have) p% of the
time, else return an incorrect answer. By varying p, we can effectively simulate the behavior of the system as
DNN accuracy improves. When p is 100%, the re-identification DNN works perfectly and Caesar always gets
the correct person ID in the same camera and across cameras, so tracking is 100% accurate. Similarly, when
the action DNN has 100% accuracy, it always captures every atomic action correctly for each person. We then
compare the end-to-end result with the complex activity ground-truth to explore precision and accuracy.
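A minimal sketch of this accuracy-controlled substitution is shown below; the ground_truth and wrong_answer helpers are hypothetical stand-ins for lookups into the annotations.

import random

def simulated_dnn(query, p, ground_truth, wrong_answer):
    # Emulates a DNN component with accuracy p (a percentage): return the
    # annotated ground truth p% of the time, and an incorrect answer otherwise.
    if random.random() < p / 100.0:
        return ground_truth(query)
    return wrong_answer(query)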
Figure 4.10(a) and Figure 4.10(b) visualize the recall and precision of Caesar with different accuracy in
the tracker and the atomic action detector. In both of these figures, the red box represents Caesar’s current
performance (displayed for context).
The first important observation from these graphs is that, when the action detection and re-identification
DNNs are perfect, Caesar achieves 100% precision and recall. This means that the rest of the system, which is
a key contribution of the chapter, works perfectly; this includes matching the spatial clauses, and the temporal
relationships, while lazily retrieving frames and bounding box contents from the camera, and using shared
key-value store to help perform matching across multiple cameras.
Action ID      1     2     3     4     5    6    7
Recall (%)     46.2  54.5  60    50    100  50   65.3
Precision (%)  43.8  41.7  42.8  36.4  100  100  63
Table 4.5: Caesar's recall and precision on the seven complex activities shown in Table 4.4.
The second observation from this graph is that the re-identification DNN’s accuracy affects overall
performance more than that of the action detector. Recall that the re-identification DNN tracks people both
within the same camera and across cameras. There are two reasons why it affects performance more than the action detector. The first is the dependency between re-identification, action detection, and graph matching.
If the re-identification is incorrect, then regardless of whether action detection is correct or not, matching
will fail since those actions do not belong to the same tube. Thus, when re-identification accuracy increases,
correct action detection will boost overall accuracy. This is also why, in Figure 4.10(b), the overall precision is
perfect even when the action detector is less than perfect. Second, 70% (105 samples) of the complex activities in the current dataset are Spatial-only; these rely more heavily on re-identification, making the effectiveness of
that DNN more prominent.
From those two figures, for a complex activity detector with >90% recall and precision, the object detector/tracker must have >90% accuracy and the action detector should have >80% accuracy. We observe
that object detectors (which have been the topic of active research longer than activity detection and re-
identification) have started to approach 90% accuracy in recent years.
Finally, as an aside, note that the precision projections are non-monotonic (Figure 4.10(b)): for a given
accuracy of the Re-ID, precision is non-monotonic with respect to accuracy of atomic action detection. This
results from the dependence observed earlier: if Re-ID is wrong, then even if action detection is accurate,
Caesar will miss the complex activity.
Failure analysis. To get more insight into these macro-level observations, we examined all the failure cases,
including false positives and false negatives; Figure 4.10(c) shows the error breakdown for each activity (#5 is
not listed because it does not have an error case).
The errors fall into three categories. First, the object detection DNN is not perfect and can miss the boxes of
some actors or objects such as bags and bicycles, affecting tube construction and subsequent re-identification.
This performance is worse in videos with rapid changes in the ratio of object size to image scale, large
within-class variations of natural object classes, and background clutter [288]. Figure 4.10(c) shows the object
detector missing a frontal view of a bicycle.
Re-ID either within a single camera, or across multiple cameras, is error-prone. Within a camera, a person
may be temporarily occluded. Tracking occluded people is difficult, especially in a crowded scene.
Figure 4.11: (a) Latency of Caesar and the strawman solution with different number of inputs. (b) Throughput of Caesar and the strawman solution with different number of inputs. (c) Maximum cache size needed for Caesar and the strawman solution to reach the best accuracy, with different number of inputs.
Existing tracking algorithms and Re-ID DNNs have not completely solved the challenge, and result in many ID-switches
within a single camera. Similarly, Re-ID can fail across cameras even with our robustness enhancements which
encode camera topology. This occurs because of different lighting conditions, changes in pose between different
cameras, different degrees of occlusion, and different sets of detectable discriminative features [281, 133].
An incorrect re-identification results in Caesar missing all subsequent atomic actions performed by a person.
Action detection is the third major cause of detection failures. Blurry frames, occlusions, and an insufficient
number of frames in a tube can all cause high error rates for the action DNN, resulting in incorrect action labels [245]. As described earlier, errors in other modules can propagate to action detection: object detection failures
can result in shorter tubes for action detection; the same is true of re-identification failures within a single
camera.
Object detection failure is the least important factor, although it affects #6 because it requires detecting
a bicycle. For the graphs that require DNN-generated atomic actions, the action detection error is more
influential than tracking. In the Spatial-Only cases, the tracking issue is the major cause of errors.
Caesar performance on more complex activities. Since the MTMC dataset has few complex activities, we recorded another dataset to test Caesar on a variety of cross-camera complex activities (Table 4.6). We placed three non-overlapping cameras in a plaza on our campus. Volunteers acted out activities not included in Table 4.4, such as “shake hands”, “eat”, and “take a photo”; these invoke our action DNN. We observe (Table 4.6) that Caesar can capture these complex activities with precision >80% (100% for 5 of the 7 activities) and recall of at least 75% for 5 of the activities. All failures are caused by incorrect re-ID due to varying lighting conditions.
Complex Activity                                                   # of Samples (Single/Multi)   Recall (%) / Precision (%)
Eat then drink                                                     5 / 0                         80 / 80
Shake hands then sit                                               6 / 0                         83.3 / 100
Use laptop then take photo                                         2 / 2                         75 / 100
Carry bag and sit then read                                        2 / 4                         66.7 / 80
Use laptop, read then drink                                        0 / 4                         75 / 100
Read, walk then take photo                                         0 / 4                         75 / 100
Carry bag, sit, then eat, then drink, then read, then take photo   0 / 4                         50 / 100
Table 4.6: Activities in a three-camera dataset, and Caesar's accuracy.
4.4.3 Scalability
We measure the scalability of Caesar by the number of concurrent videos it can support with fixed server-side
hardware, assuming cameras have on-board GPUs for running object detection.
Strawman approach for comparison. Caesar’s lazy action detection is an important scaling optimization.
To demonstrate its effectiveness, we evaluate a strawman solution which disables lazy action detection. The
strawman retrieves images from the camera for every bounding box, and runs action detection on every tube.
Recall that lazy invocation of action detection does not help for NN-only complex activities. Lazy
invocation first matches spatial clauses in the graph, then selectively invokes action detection. But, NN-only
activities do not contain spatial clauses, so Caesar invokes action detection on all tubes, just like the strawman.
However, for Mixed or Spatial-only complex activities lazy invocation conserves the use of GPU resources.
To highlight these differences, we perform this experiment by grouping the complex activities into these
three groups: Strawman (which is the same as NN-only for the purposes of this experiment), Mixed, and
Spatial-Only. Thus, for example, in the Mixed group experiment, Caesar attempts to detect all Mixed complex
activities.
Latency. Figure 4.11(a) shows Caesar's detection latency as a function of the number of cameras for each of these three alternatives. The detection latency is the time between when a complex activity completes in a
camera to when Caesar detects it.
The results demonstrate the impact of the scaling optimization in Caesar. As the number of cameras
increases, detection latency can increase dramatically for the strawman, going up to nearly 1000 seconds with
eight cameras. For Mixed complex activities, the latency is an order of magnitude less at about 100 seconds;
this illustrates the importance of our optimization, without which Caesar’s performance would be similar to
the Strawman, an order of magnitude worse. For Spatial-only activities that do not involve action detection,
the detection latency is a few seconds; Caesar scales extremely well for these types of complex activities.
Recall that in these experiments, we fix the number of GPU resources. In practice, in an edge cluster,
there is likely to be some elasticity in the number of GPUs, so Caesar can target a certain detection latency by
dynamically scaling the number of GPUs assigned for action detection. We have left this to future work.
Finally, we note that up to 2 cameras, all three approaches perform the same; in this case, the bottleneck is
the object detector on the camera with a frame rate of 20 fps.
Frame rate. Figure 4.11(b) shows the average frame rate at which Caesar can process these different
workloads, as a function of the number of cameras. This figure explains why Strawman’s latency is high: its
frame rate drops precipitously down to just two frames per second with eight concurrent cameras. Caesar
scales much more gracefully for other workloads: for both Mixed and Spatial-only workloads, it is able to
maintain over 15 frames per second even up to eight cameras.
These results highlight the importance of hybrid complex activity descriptions. Detecting complex activities
using neural networks can be resource-intensive, so Caesar’s ability to describe actions using spatial and
temporal relationships while using DNNs sparingly is the key to enabling a scalable complex activity detection system.
Cache size. Caesar maintains a cache of image contents at the camera. The longer the detection latency,
the larger the cache size. Thus, another way to examine Caesar’s scalability is to understand the cache size
required for different workloads with different numbers of concurrent cameras. The camera's cache size limit is
4 GB.
Figure 4.11(c) plots the largest cache size observed during an experiment for each workload, as a function
of the number of cameras. Especially for the Strawman, this cache size exceeded the 4 GB limit on the camera,
so we re-did these experiments on a desktop with more memory. The dotted-line segment of the Strawman
curve denotes these desktop experiments. When Caesar exceeds memory on the device, frames can be saved
on persistent storage on the mobile device, or be transmitted to the server for processing, at the expense of
latency and bandwidth.
Strawman has an order of magnitude higher cache size requirements precisely because its latency is an
order of magnitude higher than the other schemes; Caesar needs to maintain images in the cache until detection
completes. In contrast, Caesar requires a modest and fixed 100 MB cache for Spatial-only workloads on
the camera: this supports lazy retrieval of frames or images for re-identification. The cache size for Mixed
workloads increases in proportion to the increasing detection latency for these workloads.
Figure 4.12: The total amount of data to be uploaded from each camera, with different uploading schemes.
Figure 4.13: The average energy consumption of cameras in Caesar, with different uploading schemes and action queries.
4.4.4 Data Transfer
Caesar’s lazy retrieval of images conserves wireless bandwidth, and to quantify its benefits, we measure the
total amount of data uploaded from each camera (the edge cluster sends small control messages to request
image uploads; we ignore these). In Figure 4.12, the strawman solution simply uploads the whole 30-min
video with metadata (more than 1.5 GB for each camera). Mixed transfers over 3x less data, and Spatial-only is an order of magnitude more efficient than Strawman. Caesar's data transfer overhead can be further optimized
by transferring image deltas, leveraging the redundancy in successive frames or images within successive
bounding boxes; we have left this to future work.
4.4.5 Energy Consumption
Even for wireless surveillance cameras with power sources, it may be important to quantify the energy required
to run these workloads on the camera. The TX2 GPUs’ onboard power supply chipset provides instantaneous
power consumption readings at 20 Hz. We plot, in Figure 4.13, the total energy required for each workload by
integrating the power consumption readings across the duration of the experiment. For context, we also plot
the idle energy consumed by the device when run for 30 mins.
Strawman consumes roughly 1800 mAh for processing our dataset, comparable to the battery capacity of modern smart phones. For Mixed and Spatial workloads, energy consumption is, respectively, 6x to 10x lower, for
two reasons: (a) these workloads upload fewer images, reducing the energy consumed by the wireless network
interface; (b) Strawman needs to keep the device on for longer to serve retrieval requests because its detection
latency is high.
Chapter 5
Grab: A Cashier-Free Shopping System
5.1 Introduction
While electronic commerce continues to make great strides, in-store purchases are likely to continue to be
important in the coming years: 91% of purchases are still made in physical stores [116, 115] and 82% of
millennials prefer to shop in these stores [107]. However, a significant pain point for in-store shopping is the
checkout queue: customer satisfaction drops significantly when queuing delays exceed four minutes [8]. To address this, retailers have deployed self-checkout systems (which can increase instances of shoplifting
[37, 48]), and expensive vending machines.
The most recent innovation is cashier-free shopping, in which a networked sensing system automatically
(a) identifies a customer who enters the store, (b) tracks the customer through the store, (c) and recognizes what
they purchase. Customers are then billed automatically for their purchases, and do not need to interact with a
human cashier or a vending machine, or scan items by themselves. Over the past year, several large online
retailers like Amazon and Alibaba [1, 55] have piloted a few stores with this technology, and cashier-free
stores are expected to take off in the coming years [7, 62]. Besides addressing queue wait times, cashier-free
shopping is expected to reduce instances of theft, and provide retailers with rich behavioral analytics.
Not much is publicly known about the technology behind cashier-free shopping, other than that stores
need to be completely redesigned [1, 55] which can require significant capital investment. In this chapter, we
ask: Is cashier-free shopping viable without having to completely redesign stores? To this end, we observe
that many stores already have, or will soon have, the hardware necessary to design a cashier-free shopping
system: cameras deployed for in-store security, sensor-rich smart shelves [63] that are being deployed by large
retailers [32] to simplify asset tracking, and RFID tags being deployed on expensive items to reduce theft.
This chapter explores the design and implementation of a practical cashier-free shopping system called Grab (so named because a shopper only needs to grab items and go) that uses this infrastructure, and quantifies its performance.
Grab needs to accurately identify and track customers, and associate each shopper with items he or she
retrieves from shelves. It must be robust to visual occlusions resulting from multiple concurrent shoppers,
and to concurrent item retrieval from shelves where different types of items might look similar, or weigh the
same. It must also be robust to fraud, specifically to attempts by shoppers to confound identification, tracking,
or association. Finally, it must be cost-effective and have good performance to achieve acceptable accuracy:
specifically, we show that, for vision-based tasks, processing slower than 10 frames/sec can reduce accuracy significantly.
Contributions An obvious way to architect Grab is to use deep neural networks (DNNs) for each individual
task in cashier-free shopping, such as identification, pose tracking, gesture tracking, and action recognition.
However, these DNNs are still relatively slow and many of them cannot process frames at faster than 5-8 fps.
Moreover, even if they have high individual accuracy, their effective accuracy would be much lower if they
were cascaded together.
Grab’s architecture is based on the observation that, for cashier-free shopping, we can use a single vision
capability (pose detection) as a building block to perform all of these tasks. A recently developed DNN library, OpenPose [45], accurately estimates body "skeletons" in videos at high frame rates.
Grab’s first contribution is to develop a suite of lightweight identification and tracking algorithms built
around these skeletons. Grab uses the skeletons to accurately determine the bounding boxes of faces to enable
feature-based face detection. It uses skeletal matching, augmented with color matching, to accurately track
shoppers even when their faces might not be visible, or even when the entire body might not be visible. It
augments OpenPose’s elbow-wrist association algorithm to improve the accuracy of tracking hand movements
which are essential to determining when a shopper may pickup up items from a shelf.
Grab’s second contribution is to develop fast sensor fusion algorithms to associate a shopper’s hand with
the item that he picks up. For this, Grab uses a probabilistic assignment framework: from cameras, weight
sensors and RFID receivers, it determines the likelihood that a given shopper picked up a given item. When
multiple concurrent actions occur, it uses an optimization framework to associate hands with items.
Grab’s third contribution is to improve the cost-effectiveness of the overall system by multiplexing multiple
cameras on a single GPU. It achieves this by avoiding running OpenPose on every frame, and instead using a
lightweight feature tracker to track the joints of the skeleton between successive frames.
Figure 5.1: Grab is a system for cashier-free shopping and has four components: registration, identity tracking, action recognition, and GPU multiplexing.
Using data from a pilot deployment in a retail store, we show that Grab has 93% precision and 91% recall
even when nearly 40% of shopper actions were adversarial. Grab needs to process video data at 10 fps or
faster, below which accuracy drops significantly: a DNN-only design cannot achieve this capability. Grab
needs all three sensing modalities, and all of its optimizations: removing an optimization, or a sensor, can drop
precision and recall by 10% or more. Finally, Grab’s design enables it to multiplex up to 4 cameras per GPU
with negligible loss of precision.
5.2 Grab Design
Grab addresses these challenges by building upon a vision-based, keypoint-based pose tracking DNN for
identification and tracking, together with a probabilistic sensor fusion algorithm for recognizing item pickup
actions. These ensure a completely non-intrusive design where shoppers are not required to scan item codes or
pass through checkout gates while shopping. Grab consists of four major components (Figure 5.1).
Identity tracking recognizes shoppers’ identities and tracks their movements within the store. Action
recognition uses a probabilistic algorithm to fuse vision, weight and RFID inputs to determine item pickup or
dropoff actions by a customer. GPU multiplexing enables processing multiple cameras on a single GPU. Grab
also has a fourth, offline component, registration. Customers must register once online before their first store
visit. Registration involves taking a video of the customer to enable matching the customer subsequently.
5.2.1 Identity tracking
Identity tracking consists of two related sub-components. Shopper identification determines who the shopper
is among registered users. Shopper tracking determines (a) where the shopper is in the store at each instant of
time, and (b) what the shopper is doing at each instant.
Figure 5.2: (a) Sample OpenPose output. (b,c,d,e) Grab's approach adjusts the face's bounding box using the keypoints detected by OpenPose. (The face shown is selected from the OpenPose project webpage [45].)
Requirements and Challenges In designing Grab, we require first that customer registration be fast, even
though it is performed only once: ideally, a customer should be able to register and immediately commence
shopping. Identity tracking requires not just identifying the customer, but also detecting each person’s pose,
such as hand position and head position. For each of these tasks, the computer vision community has developed
DNNs. However, in our setting, computing efficiency is important for cost reasons: dedicating a GPU for each
task can be prohibitively expensive. Running all of these DNNs on a single GPU can compromise accuracy.
Approach In this chapter, we make the following observation: we can build end-to-end identity tracking using
a state-of-the-art fast (15 fps) pose tracker. Specifically, we use, as a building block, a keypoint based body
pose tracker, called OpenPose [45]. Given an image frame, OpenPose detects keypoints for each human in the
image. Keypoints identify distinct anatomical structures in the body (Figure 5.2(a)) such as eyes, ears, nose,
elbows, wrists, knees, hips etc. We can use these skeletons for identification, tracking and gesture recognition.
However, fundamentally, since OpenPose operates only on a single frame, Grab needs to add identification,
tracking and gesture recognition algorithms on top of OpenPose to continuously identify and track shoppers and their gestures. The rest of this section describes these algorithms.
Shopper Identification Grab uses fast feature-based face recognition to identify shoppers. While prior work
has explored other approaches to identification such as body features [132, 154] or clothing color [226], we
use faces because (a) face recognition has been well-studied by vision researchers and we are likely to see
continued improvements, (b) faces are more robust for identification than clothing color, and (c) face features
have the highest accuracy in large datasets.
Feature-based face recognition When a user registers, Grab takes a video of their face, extracts features, and
builds a fast classifier using these features. To identify shoppers, Grab does not directly use a face detector on
the entire image because non-DNN detectors [217] can be inaccurate, and DNN-based face detectors such as
MTCNN [319] can be slow. Instead, Grab identifies a face’s bounding box using keypoints from OpenPose,
specifically, the five keypoints of the face from the nose, eyes, and ears (Figure 5.2(b)). Then, it extracts
features from within the bounding box and applies the trained classifier.
Fast Classification Registration is performed once for each customer, in which Grab extracts features from
the customer’s face. To do this, we evaluated several face feature extractors [135, 15], and finally selected
ResNet-34’s feature extractor [15] which produces a 128-dimension feature vector, performs best in both speed
and accuracy.
With these features, we can identify faces by comparing feature distances, build classifiers, or train a
neural network. After experimenting with these options, we found that a k nearest neighbor (kNN) classifier, in
which each customer is trained as a new class, worked best among these choices. Grab builds one kNN-based
classifier for all customers and uses it across all cameras.
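A sketch of this registration-and-classification step is shown below; it assumes a hypothetical embed_face() wrapper around the ResNet-34 feature extractor and uses scikit-learn's kNN classifier, with an illustrative value of k.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# embed_face(crop) is an assumed helper that returns the 128-dimensional
# feature vector for one face crop.

def train_registration_classifier(registration_faces):
    # registration_faces: {customer_id: [face crops from the registration video]}
    X, y = [], []
    for customer_id, crops in registration_faces.items():
        for crop in crops:
            X.append(embed_face(crop))
            y.append(customer_id)
    clf = KNeighborsClassifier(n_neighbors=5)  # one shared classifier for all customers
    clf.fit(np.array(X), y)
    return clf

def identify(clf, face_crop):
    return clf.predict([embed_face(face_crop)])[0]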
Tightening the face bounding box During normal operation, Grab extracts facial features within a bounding
box (derived from OpenPose keypoints) around each customer’s face. Grab infers the face’s bounding box
width using the distance between two ears, and the height using the distance from nose to neck. This works
well when the face points towards the camera (Figure 5.2(c)), but can have inaccurate bounding box when
customers face slightly away from the camera (Figure 5.2(d)). This inaccuracy can degrade classification
performance.
To obtain a tighter bounding box, we estimate head pitch and yaw using the keypoints. Consider the line
between the nose and neck keypoints: the distance of each eye and ear keypoint to this axis can be used to
estimate head yaw. Similarly, the distance of the nose and neck keypoints to the axis between the ears can be
used to estimate pitch. Using these, we can tighten the bounding box significantly (Figure 5.2(e)). To improve
detection accuracy when a customer’s face is not fully visible in the camera, we also use face alignment [135],
which estimates the frontal view of the face.
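One plausible way to turn these keypoint distances into a yaw estimate is sketched below; the normalization is our own choice for illustration, not necessarily Grab's.

def point_to_line_distance(p, a, b):
    # perpendicular distance from 2-D point p to the line through a and b
    px, py = p[0] - a[0], p[1] - a[1]
    vx, vy = b[0] - a[0], b[1] - a[1]
    norm = (vx ** 2 + vy ** 2) ** 0.5 + 1e-6
    return abs(vx * py - vy * px) / norm

def yaw_estimate(nose, neck, left_eye, right_eye):
    # A frontal face has roughly symmetric eye distances to the nose-neck axis;
    # asymmetry indicates the head is turned (yaw).
    dl = point_to_line_distance(left_eye, nose, neck)
    dr = point_to_line_distance(right_eye, nose, neck)
    return (dl - dr) / (dl + dr + 1e-6)   # in [-1, 1]; near 0 means frontal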
Shopper Tracking A user’s face may not always be visible in every frame, since customers may intentionally
or otherwise turn their back to the camera. However, Grab needs to be able to identify the customer in frames
where the customer’s face is not visible, for which it uses tracking.
Skeleton-based Tracking Existing human trackers use bounding box based approaches [286, 303], which can
perform poorly in in-store settings with partial or complete occlusions.
Instead, we use the skeleton generated by OpenPose to develop a tracker that uses geometric properties of
the body frame. We use the term track to denote the movements of a distinct customer (whose face may or
may not have been identified). Suppose OpenPose identifies a skeleton in a frame: the goal of the tracker is to
associate the skeleton with an existing track if possible. Grab uses the following to track customers. It tries to
78
align each keypoint in the skeleton with the corresponding keypoint in the last seen skeleton in each track, and selects the track whose skeleton is the closest match (the sum of match errors is smallest). Also, as soon as it is able to identify the face, Grab associates the customer's identity with the track (to be robust to noise, Grab requires that the customer's face be identified in 3 successive frames).
Figure 5.3: When a shopper is occluded by another, Grab resumes tracking after the shopper re-appears (lazy tracking).
Dealing with Occlusions In some cases, a shopper may be obscured by another. Grab uses lazy tracking in this
case (Figure 5.3). When an existing track disappears in the current frame, Grab checks if the track was close
to the edge of the image, in which case it assumes the customer has moved out of the camera’s field of view
and deletes the track. Otherwise, it marks the track as blocked. When the customer reappears in a subsequent
frame, it reactivates the blocked track.
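The sketch below captures this matching and lazy-tracking logic under assumed data layouts (skeletons as dicts of keypoint coordinates, tracks as dicts); the distance threshold and the image-edge margin are illustrative values.

def skeleton_distance(skel_a, skel_b):
    # sum of per-keypoint distances over keypoints present in both skeletons
    common = set(skel_a) & set(skel_b)
    if not common:
        return float("inf")
    return sum(((skel_a[k][0] - skel_b[k][0]) ** 2 +
                (skel_a[k][1] - skel_b[k][1]) ** 2) ** 0.5 for k in common)

def match_skeleton(skeleton, tracks, max_cost=200.0):
    # tracks: {track_id: {"last_skeleton": {...}, "blocked": bool}}
    best_id, best_cost = None, max_cost
    for track_id, track in tracks.items():
        cost = skeleton_distance(skeleton, track["last_skeleton"])
        if cost < best_cost:
            best_id, best_cost = track_id, cost
    if best_id is not None:
        tracks[best_id]["last_skeleton"] = skeleton
        tracks[best_id]["blocked"] = False   # lazy tracking: reactivate a blocked track
    return best_id

def handle_disappearance(track, frame_width, edge_margin=30):
    # if the track vanished near the image edge, the shopper likely left the view
    xs = [x for x, _ in track["last_skeleton"].values()]
    if min(xs) < edge_margin or max(xs) > frame_width - edge_margin:
        track["delete"] = True
    else:
        track["blocked"] = True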
Shopper Gesture Tracking Grab must recognize the arms of each shopper in order to determine which item
he or she purchases. OpenPose has a built-in limb association algorithm, which associates shoulder joints to
elbows, and elbows to wrists. We have found that this algorithm is a little brittle in our setting: it can miss an
association (Figure 5.4(a)), or mis-associate part of a limb of one shopper with another (Figure 5.4(b)).
How limb association in OpenPose works OpenPose first uses a DNN to associate with each pixel a confidence value of its being part of an anatomical keypoint (e.g., an elbow, or a wrist). During image analysis, OpenPose
also generates vector fields (called part affinity fields [149]) for upper-arms and forearms whose vectors are
aligned in the direction of the arm. Having generated keypoints, OpenPose then estimates, for each pair of
keypoints, a measure of alignment between an arm’s part affinity field, and the line between the keypoints (e.g.,
elbow and wrist). It then uses a bipartite matching algorithm to associate the keypoints.
Improving limb association robustness One challenge for OpenPose’s limb association is that the pixels for
the wrist keypoint are conflated with pixels in the hand (Figure 5.4(a)). This likely reduces the part affinity
alignment, causing limb association to fail. To address this, for each keypoint, we filtered outlier pixels by
removing pixels whose distance from the medoid [240] was greater than the 85th percentile.
Figure 5.4: OpenPose can (a) miss an assignment between elbow and wrist, or (b) wrongly assign one person's joint to another.
The second source of brittleness is that OpenPose's limb association treats each limb independently, resulting in cases where the keypoint from one person's elbow may get associated with another person's
wrist (Figure 5.4(b)). To avoid this failure mode, we modify OpenPose’s limb association algorithm to treat
one person’s forearms or upper-arms as a pair. To identify forearms (or upper-arms) as belonging to the
same person, we measure the Euclidean distance ED(:) between color histograms F(:) belonging to the two
forearms, and treat them as a pair if the distance is less than an empirically-determined threshold thresh. We
can formulate this as an optimization problem:
\max_{z} \sum_{i \in E} \sum_{j \in W} A_{i,j} z_{i,j}
\text{s.t.} \quad \sum_{j \in W} z_{i,j} \le 1 \quad \forall i \in E,
\sum_{i \in E} z_{i,j} \le 1 \quad \forall j \in W,
ED(F(i,j), F(i',j')) < thresh \quad \forall j, j' \in W,\; i, i' \in E
where E and W are the sets of elbow and wrist joints, A_{i,j} is the alignment measure between the i-th elbow and the j-th wrist, and z_{i,j} is an indicator variable indicating connectivity between the elbow and the wrist. The third constraint models whether two elbows belong to the same body, using the Euclidean distance between the corresponding color histograms. This formulation reduces to a max-weight bipartite matching problem, and we solve it with the Hungarian algorithm [211].
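A sketch of the pairing test used in the third constraint is shown below; the patch extraction, histogram bins, and threshold value are assumptions rather than Grab's exact parameters.

import cv2
import numpy as np

def forearm_histogram(frame, elbow, wrist, pad=10):
    # crop a rough rectangle around the elbow-wrist segment and compute a
    # normalized color histogram for it
    x1 = max(int(min(elbow[0], wrist[0])) - pad, 0)
    x2 = int(max(elbow[0], wrist[0])) + pad
    y1 = max(int(min(elbow[1], wrist[1])) - pad, 0)
    y2 = int(max(elbow[1], wrist[1])) + pad
    patch = frame[y1:y2, x1:x2]
    hist = cv2.calcHist([patch], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def same_person(hist_a, hist_b, thresh=0.5):
    # ED(F(i,j), F(i',j')) < thresh: treat the two forearms as a pair
    return np.linalg.norm(hist_a - hist_b) < thresh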
Figure 5.5: Grab recognizes the items a shopper picks up by fusing vision with smart-shelf sensors including weight and RFID.
5.2.2 Shopper Action Recognition
When a shopper is being continuously tracked, and their hand movements accurately detected, the next
step is to recognize hand actions, specifically to identify item(s) which the shopper picks up from a shelf.
Vision-based hand tracking alone is insufficient for this in the presence of multiple shoppers concurrently
accessing items under variable lighting conditions. Grab leverages the fact that many retailers are installing
smart shelves [63, 29] to deter theft. These shelves have weight sensors and are equipped with RFID readers.
Weight sensors cannot distinguish between items of similar weight, while not all items are likely to have RFID
tags for cost reasons. So, rather than relying on any individual sensor, Grab fuses detections from cameras,
weight sensors, and RFID tags to recognize hand actions.
Modeling the sensor fusion problem In a given camera view, at any instant, multiple shoppers might be
reaching out to pick items from shelves. Our identity tracker tracks hand movement; the goal of the action recognition problem is to associate each shopper's hand with the item he or she picked up from the shelf. We model this association between shopper's hand k and item m as a probability p_{k,m} derived from fusing cameras, weight sensors, and RFID tags (Figure 5.5). p_{k,m} is itself derived from association probabilities for each of the devices, in a manner described below. Given these probabilities, we then solve the association problem using a
maximum weight bipartite matching. In the following paragraphs, we discuss details of each of these steps.
Proximity event detection Before determining association probabilities, we need to determine when a
shopper’s hand approaches a shelf. This proximity event is determined using the identity tracker module’s
gesture tracking. Knowing where the hand is, Grab uses image analysis to determine when a hand is close to a
shelf. For this, Grab requires an initial configuration step, where store administrators specify camera view
parameters (mounting height, field of view, resolution etc.), and which shelf/shelves are where in the camera
view. Grab uses a threshold pixel distance from hand to the shelf to define proximity, and its identity tracker
reports start and finish times for when each hand is within the proximity of a given shelf (a proximity event).
(When the hand is obscured, Grab estimates proximity using the position of other skeletal keypoints, like the
ankle joint).
Association probabilities from the camera When a proximity event starts, Grab starts tracking the hand and
any item in the hand. It uses the color histogram of the item to classify the item. To ensure robust classification,
Grab performs (Figure 5.6(a)) (a) background subtraction to remove other items that may be visible and (b)
eliminates the hand itself from the item by filtering out pixels whose color matches typical skin colors. Grab
extracts a 384 dimension color histogram from the remaining pixels.
During an initial configuration step, Grab requires store administrators to specify which objects are on
which shelves. Grab then builds, for each shelf (a single shelf might contain 10-15 different types of items), a feature-based kNN classifier (chosen both for speed and accuracy). Then, during actual operation,
when an item is detected, Grab runs this classifier on its features. The classifier outputs an ordered list of
matching items, with associated match probabilities. Grab uses these as the association probabilities from the
camera. Thus, for each hand i and each item j, Grab outputs the camera-based association probability.
Figure 5.6: (a) Vision based item detection does background subtraction and removes the hand outline. (b) Weight sensor readings are correlated with hand proximity events to assign association probabilities. (c) Tag RSSI and hand movements are correlated, which helps associate proximity events to tagged items.
Association probabilities from weight sensors In principle, a weight sensor can determine the reduction in
total weight when an item is removed from the shelf. Then, knowing which shopper’s hand was closest to the
shelf, we can associate the shopper with the item. In practice, this association needs to consider real-world
behaviors. First, if two shoppers concurrently remove two items of different weights (say a can of Pepsi and
a peanut butter jar), the algorithm must be able to identify which shopper took which item. Second, if two
shoppers are near the shelf, and two cans of Pepsi were removed, the algorithm must be able to determine if a
single shopper took both, or each shopper took one. To increase robustness to these, Grab breaks this problem
down into two steps: (a) it associates a proximity event to dynamics in scale readings, and (b) then associates
scale dynamics to items by detecting weight changes.
Associating proximity events to scale dynamics Weight scales sample readings at 30 Hz. At these rates, we
have observed that, when a shopper picks up an item or deposits an item on a shelf, there is a distinct "bounce"
(a peak when an item is added, or a trough when removed) because of inertia (Figure 5.6(b)). If d is the
duration of this peak or trough, and d' is the duration of the proximity event, we determine the association
probability between the proximity event and the peak or trough as the ratio of the intersection of the two to
the union of the two. As Figure 5.6(b) shows, if two shoppers pick up items at almost the same time, our
algorithm is able to distinguish between them. Moreover, to prevent shoppers from attempting to confuse Grab
by temporarily activating the weight scale with a finger or hand, Grab filters out scale dynamics where there is
high frequency of weight change.
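This interval-overlap association can be sketched directly (intervals as (start, end) times in seconds); it mirrors the intersection-over-union described above.

def bounce_association_probability(proximity_event, bounce):
    # ratio of the intersection of the two time intervals to their union
    start = max(proximity_event[0], bounce[0])
    end = min(proximity_event[1], bounce[1])
    intersection = max(0.0, end - start)
    union = max(proximity_event[1], bounce[1]) - min(proximity_event[0], bounce[0])
    return intersection / union if union > 0 else 0.0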
Associating scale dynamics to items The next challenge is to measure the weight of the item removed or
deposited. Even when there are multiple concurrent events, the 30 Hz sampling rate ensures that the peaks and
troughs of two concurrent actions are likely distinguishable (as in Figure 5.6(b)). In this case, we can estimate the weight of each item from the sensor reading at the beginning of the peak or trough, w_s, and the reading at the end, w_e. Thus |w_s - w_e| is an estimate of the item weight w. Now, from the configuration phase, we know the weights of each type of item on the shelf. Define d_j as |w - w_j|, where w_j is the known weight of the j-th type of item on the shelf. Then, we say that the probability that the item removed or deposited was the j-th item is given by (1/d_j) / \sum_i (1/d_i). This definition accounts for noise in the scale (the estimates for w might be slightly off) and for the fact that some items may be very similar in weight.
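The weight-to-item probability can be sketched directly from this definition; the small epsilon, which guards against an exact weight match, is an assumption.

def item_type_probabilities(estimated_weight, known_weights, eps=1e-6):
    # estimated_weight: |w_s - w_e| from the scale bounce
    # known_weights: {item_type: weight} for this shelf, from configuration
    inverse = {item: 1.0 / (abs(estimated_weight - w) + eps)
               for item, w in known_weights.items()}
    total = sum(inverse.values())
    return {item: v / total for item, v in inverse.items()}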
Combining these association probabilities From these steps, we get two association probabilities: one
associating a proximity event to a peak or trough, another associating the peak or trough to an item type. Grab
multiplies these two to get the probability, according to the weight sensor, that hand i picked item j.
Association probabilities from RFID tag For items which have an RFID tag, it is trivial to determine which
item was taken (unlike with weight or vision sensors), but it is still challenging to associate proximity events
with the corresponding items. For this, we leverage the fact that the tag’s RSSI becomes weaker as it moves
away from the RFID reader. Figure 5.6(c) illustrates an experiment where we moved an item repeatedly closer
and further away from a reader; notice how the changes in the RSSI closely match the distance to the reader. In
smart shelves, the RFID reader is mounted on the back of the shelf, so that when an object is removed, its tag’s
RSSI decreases. To determine the probability that a given hand caused this decrease, we use probability-based
Dynamic Time Warping [139], which matches the time series of hand movements with the RSSI time series
and assigns a probability which measures the likelihood of association between the two. We use this as the
association probability derived from the RFID tag.
Putting it all together In the last step, Grab formulates an assignment problem to determine which hand to
associate with which item. First, it determines a time window consisting of a set of overlapping proximity
events. Over this window, it first uses the association probabilities from each sensor to define a composite
probability p_{k,m} between the k-th hand and the m-th item: p_{k,m} is a weighted sum of the three probabilities from each sensor (described above), with the weights being empirically determined.
Then, Grab formulates the assignment problem as an optimization problem:
\max_{z} \sum_{k, m} p_{k,m} z_{k,m}
\text{s.t.} \quad \sum_{k \in H} z_{k,m} \le 1 \quad \forall m \in I,
\sum_{l \in I_t} z_{k,l} \le u_l \quad \forall k \in H
where H is the set of hands, I is the set of items, I_t is the set of item types, and z_{k,m} is an indicator variable that determines if hand k picked up item m. The first constraint models the fact that each item can be removed or deposited by one hand, and the second models the fact that sometimes shoppers can pick up more than one item with a single hand: u_l is a statically determined upper bound on the number of items of the l-th item type that a shopper can pick up using a single hand (e.g., it may be physically impossible to pick up more than 3 bottles of a specific type of shampoo). This formulation is a max-weight bipartite matching problem, which we can optimally solve using the Hungarian [211] algorithm.
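As a simplified sketch that ignores the per-item-type upper bound u_l, the one-hand-per-item assignment can be computed with SciPy's Hungarian-method solver on the fused probability matrix.

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_hands_to_items(prob):
    # prob[k][m]: fused probability that hand k picked up item m
    # (weighted sum of the camera, weight-sensor, and RFID probabilities)
    prob = np.asarray(prob, dtype=float)
    hands, items = linear_sum_assignment(-prob)   # negate to maximize total probability
    return [(k, m) for k, m in zip(hands, items) if prob[k, m] > 0]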
5.2.3 GPU Multiplexing
Because retailer margins can be small, Grab needs to minimize overall costs. The computing infrastructure
(specifically, GPUs) is an important component of this cost. In what we have described so far, each camera in
the store needs a GPU.
Grab actually enables multiple cameras to be multiplexed on one GPU. It does this by avoiding running
OpenPose on every frame. Instead, Grab uses a tracker to track joint positions from frame to frame: these
tracking algorithms are fast and do not require the use of the GPU. Specifically, suppose Grab runs OpenPose
on frame i. On that frame, it computes ORB [264] features around every joint. ORB features can be computed
faster than previously proposed features like SIFT and SURF. Then, for each joint, it identifies the position of
the joint in frame i+1 by matching ORB features between the two frames. Using this, it can reconstruct the skeleton in frame i+1 without running OpenPose on that frame.
Grab uses this to multiplex a GPU over N different cameras. It runs OpenPose on a frame from each camera in a round-robin fashion. If a frame has been generated by the k-th camera, but Grab is processing a frame from another (say, the m-th) camera, then Grab runs feature-based tracking on the frame from the k-th camera. Using this technique, we show that Grab is able to scale to using 4 cameras on one GPU without significant loss of accuracy.
5.3 Evaluation
We now evaluate the end-to-end accuracy of Grab and explore the impact of each optimization on overall
performance. (A demo video of Grab is available at https://vimeo.com/245274192.)
Figure 5.7: (a) Left: Weight sensor hardware, Right: RFID hardware; (b) Grab sample output.
5.3.1 Grab Implementation
Weight-sensing Module To mimic weight scales on smart shelves, we built scales costing $6, with fiberglass
boards and 2 kg, 3 kg, 5 kg pressure sensors. The sensor output is converted by the SparkFun HX711 load cell
amplifier [30] to digital serial signals. An Arduino Uno Micro Control Unit (MCU) [4] (Figure 5.7(a)-left)
batches data from the ADCs and sends it to a server. The MCU has nine sets of serial Tx and Rx so it can
collect data from up to nine sensors simultaneously. The sensors have a precision of around 5-10 g and an effective sampling rate of 30 Hz (the HX711 can sample at 80 Hz, but the Arduino MCU, when used with several weight scales, limits the sampling rate to 30 Hz).
RFID-sensing Module For RFID, we use the SparkFun RFID modules with antennas and multiple UHF
passive RFID tags [51] (Figure 5.7(a)-right). The module can read up to 150 tags per second and its maximum
detection range is 4 m with an antenna. The RFID module interfaces with the Arduino MCU to read data
from tags.
Video input We use IP cameras [3] for video recording. In our experiments, the cameras are mounted on
merchandise shelves and they stream 720p video using Ethernet. We also tried webcams and they achieved
similar performance (detection recall and precision) as IP cameras.
Identity tracking and action recognition These modules are built on top of the OpenPose [45] library’s
skeleton detection algorithm (in C++). As discussed earlier, we use a modified limb association algorithm.
Our other algorithms are implemented in Python, and interface with OpenPose using a boost.python wrapper.
Our implementation has over 4K lines of code.
5.3.2 Methodology, Metrics, and Datasets
In-store deployment To evaluate Grab, we collected traces from an actual deployment in a retail store. For
this trace collection, we installed the sensors described above in two shelves in the store. First, we placed two
cameras at the ends of an aisle so that they could capture both the people’s pose and the items on the shelves.
Then, we installed weight scales on each shelf. Each shelf contains multiple types of items, and all instances
of a single item were placed on a single shelf at the beginning of the experiment (during the experiment, we
asked users to move items from one shelf to another to try to confuse the system, see below). In total, our
shelves contained 19 different types of items. Finally, we placed the RFID reader’s antenna behind the shelf,
and we attached RFID tags to all instances of 8 types of items.
Trace collection We then recorded five hours worth of sensor data from 41 users who registered their faces
with Grab. We asked these shoppers to test the system in whatever way they wished to (Figure 5.7(b)). The
shoppers selected from among the 19 different types of items, and interacted with the items (either removing
or depositing them) a total of 307 times. Our cameras saw an average of 2.1 shoppers and a maximum of 8
shoppers in a given frame. In total, we collected over 10GB of video and sensor data, using which we analyze
Grab’ performance.
Adversarial actions During the experiment, we also asked shoppers to perform three kinds of adversarial
actions. (1) Item-switching: The shopper takes two items of similar color or similar weight then puts one back,
or takes one item and puts it on a different scale; (2) Hand-hiding: The shopper hides the hand from the camera
and grabs the item; (3) Sensor-tampering: The shopper presses the weight scale with hand. Nearly 40% of the
307 recorded actions were adversarial: 53 item-switching, 34 hand-hiding, and 31 sensor-tampering actions.
Metrics To evaluate Grab’s accuracy, we use precision and recall. In our context, precision is the ratio of true
positives to the sum of true positives and false positives. Recall is the ratio of true positives to the sum of true
positives and false negatives. For example, suppose a shopper picks items A, B, and C, but Grab shows that
she picks items A, B, D, and E. A and B are correctly detected so the true positives are 2, but C is missing and
is a false negative. The customer is wrongly associated with D and E so there are 2 false positives. In this
example, recall is 2/3 and precision is 2/4.
5.3.3 Accuracy of Grab
Overall precision and recall Figure 5.8(a) shows the precision and recall of Grab, and quantifies the impact
of using different combinations of sensors: using vision only (V Only), weight only (W only), RFID only (R
only) or all possible combinations of two of these sensors. Across our entire trace, Grab achieves a recall of
nearly 94% and a precision of over 91%. This is remarkable, because in our dataset nearly 40% of the actions
are adversarial. We dissect Grab failures below and show how these are within the loss margins that retailers
face today due to theft or faulty equipment.
Figure 5.8: Grab has high precision and recall across our entire trace (a), relative to other alternatives that only use
a subset of sensors (W: Weight; V: Vision; R: RFID), even under adversarial actions such as (b) Item-switching; (c)
Hand-hiding; (d) Sensor-tampering.
Using only a single sensor for computing the association probabilities (cameras are still used for identity tracking and proximity event detection) degrades recall by 12-37% and precision by 16-36% (Figure 5.8(a)). This
illustrates the importance of fusing multiple sensors for associating proximity events with items. The biggest
loss of accuracy comes from using only the vision sensors to detect items. RFID sensors perform the best,
since RFID can accurately determine which item was selected. (Since RFID is expensive, not all objects in a store will have RFID tags; in our deployment, a little less than half of the item types were tagged, and these numbers are calculated only for tagged items.) Even so, an RFID-only deployment has 12%
lower recall and 16% lower precision. Of the sensor combinations, using weight and RFID sensors together
comes closest to the recall performance of the complete system, losing only about 3% in recall, but 10% in
precision.
Adversarial actions Figure 5.8(b) shows precision and recall for only those actions in which users tried to
switch items. In these cases, Grab is able to achieve nearly 90% precision and recall, while the best single
sensor (RFID) has 7% lower recall and 13% lower precision, and the best 2-sensor combination (weight and
RFID) has 5% lower precision and recall. As expected, using a vision sensor or weight sensor alone has
unacceptable performance because the vision sensor cannot distinguish between items that look alike and the
weight sensor cannot distinguish items of similar weight.
Figure 5.8(c) shows precision and recall for only those actions in which users tried to hide the hand from
the camera when picking up items. In these cases, Grab estimates proximity events from the proximity of the
ankle joint to the shelf and achieves a precision of 80% and a recall of 85%. In the future, we hope to explore
cross-camera fusion to be more robust to these kinds of events. Of the single sensors, weight and RFID both
have more than 24% lower recall and precision than Grab. Even the best double sensor combination has 12%
lower recall and 20% lower precision.
Finally, Figure 5.8(d) shows precision and recall only for those actions in which the user tried to tamper with the weight sensors. In these cases, Grab is able to achieve nearly 87% recall and 80% precision. RFID,
the best single sensor, has more than 10% lower precision and recall, while predictably, vision and RFID have
the best double sensor performance with 5% lower recall and comparable precision to Grab.
In summary, although Grab has slightly lower precision and recall for the adversarial cases (which can be improved with algorithmic enhancements), its overall precision and recall on a trace with nearly 40% adversarial actions
is over 91%. When we analyze only the non-adversarial actions, Grab has a precision of 95.8% and a recall of
97.2%.
Taxonomy of Grab failures Grab is unable to recall 19 of the 307 events in our trace. These failures fall into
two categories: those caused by identity tracking, and those by action recognition. Five of the 19 failures are
caused either by wrong face identification (2 in number), false pose detection (2 in number) (Figure 5.7(c)),
or errors in pose tracking (one). The remaining failures are all caused by inaccuracy in action recognition,
and fall into three categories. First, Grab uses color histograms to detect items, but these can be sensitive
to lighting conditions (e.g., a shopper takes an item from one shelf and puts it in another when the lighting
condition is slightly different) and occlusion (e.g., a shopper deposits an item into a group of other items which
partially occlude the items). Incomplete background subtraction can also reduce the accuracy of item detection.
Second, our weight scales were robust to noise but sometimes could not distinguish between items of similar,
but not identical, weight. Third, our RFID-to-proximity event association failed at times when the tag’s RFID
signal disappeared for a short time from the reader, possibly because the tag was temporarily occluded by
other items. Each of these failure types indicates a direction for future work.
Contextualizing the results From the precision/recall results, it is difficult to know if Grab is within the realm
of feasibility for use in today’s retail stores. Grab’s failures fall into two categories: Grab associates the wrong
item with a shopper, or it associates an item with the wrong shopper. The first can result in inventory loss, the
second in overcharging a customer. A survey of retailers [42] estimates the inventory loss ratio (if a store’s
total sales are $100, but $110 worth of goods were taken from the store, the inventory loss rate is 10%) in
today’s stores to be 1.44%. In our experiments, Grab’s failures result in only 0.79% inventory loss. Another
study [54] suggests that faulty scanners can result in up to 3% overcharges on average, per customer. In our
experiments, we see a 2.8% overcharge rate. These results are encouraging and suggest that Grab may be within the realm of feasibility, but larger scale experiments are needed to confirm this. Additional sensors and algorithmic improvements could further improve Grab's accuracy.
5.3.4 The Importance of Efficiency
Grab is designed to process data in near real-time so that customers can be billed automatically as soon as
they leave the store. For this, computational efficiency is important to lower cost, but also to achieve high
processing rates in order to maintain accuracy.
Figure 5.9: Grab needs a frame rate of at least 10 fps for sufficient accuracy, reducing identity switches and identification delay.
Impact of lower frame rates If Grab is unable to achieve a high enough frame rate for processing video
frames, it can have significantly lower accuracy. At lower frame rates, Grab can fail in three ways. First, a
customer’s face may not be visible at the beginning of the track in one camera. It usually takes several seconds
before the camera can capture and identify the face. At lower frame rates, Grab may not capture frames where
the shopper’s face is visible to the camera, so it might take longer for it to identify the shopper. Figure 5.9(a)
shows that this identification delay decreases with increasing frame rate approaching sub-second times at
about 10 fps. Second, at lower frame rates, the shopper moves a greater distance between frames, increasing
the likelihood of identity switches when the tracking algorithm switches the identity of the shopper from one
registered user to another. Figure 5.9(b) shows that the ratio of identity switches approaches negligible values
only after about 8 fps. Finally, at lower frame rates, Grab may not be able to capture the complete movement
of the hand towards the shelf, resulting in incorrect determination of proximity events and therefore reduced
overall accuracy. Figure 5.9(c) shows that precision approaches 90% only above 10 fps. (In this and subsequent sections, we focus on precision, since it is lower than recall and so provides a better bound on Grab's performance.)
Infeasibility of a DNN-only architecture We argued earlier that, for efficiency, Grab could not use separate DNNs for different tasks such as identification, tracking, and action recognition. To validate this argument, we ran
the state-of-the-art open-source DNNs for each of these tasks on our data set. These DNNs were at the top of
the leader-boards for various recent vision challenge competitions [11, 39, 40]. We computed both the average
frame rate and the precision achieved by these DNNs on our data (Table 5.1).
For face detection, our accuracy measures the precision of face identification. The OpenFace [125] DNN
can process 15 fps and achieve the precision of 95%. For people detection, our accuracy measures the recall
of bounding boxes between different frames. Yolo [254] can process at a high frame rate but achieves only
91% precision, while Mask-RCNN [187] achieves 97% precision, but at an unacceptable 5 fps. The DNNs for
people tracking showed much worse behavior than Grab, which can achieve an identity switch rate of about
0.027 at 10 fps, while the best existing system, DeepSORT [303] has a higher frame rate but a much higher
identity switch rate. The fastest gesture recognition DNN is OpenPose [149] (whose body frame capabilities
we use), but its performance is unacceptable, with low (77%) accuracy. The best gesture tracking DNN,
PoseTrack [200], has a very low frame rate.
Thus, today’s DNN technology either has very low frame rates or low accuracy for individual tasks. Of
course, DNNs might improve over time along both of these dimensions. However, even if, for each of the four
tasks, DNNs can achieve, say, 20 fps and 95% accuracy, when we run these on a single GPU, we can at best
achieve 5 fps, and an accuracy of 0.95^4 ≈ 0.81. By contrast, Grab is able to process a single camera on a
single GPU at over 15 fps (Figure 5.10), achieving over 90% precision and recall (Figure 5.8(a)).
Face Detection        FPS    Accuracy
OpenFace [125]        15     95.1
RPN [184]             5.8    95.1

People Detection      FPS    Accuracy
YOLO-9000 [254]       35     91.0
Mask-RCNN [187]       5      97.4

People Tracking       FPS    Avg. ID switch
MDP [192]             1.43   1.3
DeepSORT [303]        17     0.8

Gesture Recognition   FPS    Accuracy*
OpenPose [149]        15.5   77.3
DeeperCut [199]       0.09   88

Gesture Tracking      FPS    Avg. ID switch
PoseTrack [200]       1.6    1.8

Table 5.1: State-of-the-art DNNs for many of Grab's tasks either have low frame rates or insufficient accuracy. (* Average pose precision on MPII Single Person Dataset)
5.3.5 GPU multiplexing
In the results presented so far, Grab processes each camera on a separate GPU. The bottleneck in Grab is pose
detection, which requires about 63 ms per frame: our other components require less than 7 ms each.
In previous sections, we discussed an optimization that uses a fast feature tracker to multiplex multiple
cameras on a single GPU. This technique can sacrifice some accuracy, and we are interested in determining
the sweet spot between multiplexing and accuracy. Figure 5.10 quantifies the performance of our GPU
multiplexing optimization. Figure 5.10(a) shows that Grab can support up to 4 cameras with a frame rate of 10
fps or higher with fast feature tracking; without it, only a single camera can be supported on the GPU (the
horizontal line in the figure represents 10 fps). Up to 4 cameras, Figure 5.10(b) shows that the precision can
be maintained at nearly 90% (i.e., negligible loss of precision). Without fast feature tracking, multiplexing
multiple cameras on a single GPU reduces the effective frame rate at which each camera can be processed,
reducing accuracy for 4 cameras to under 60%. Thus, with GPU multiplexing using fast feature tracking, Grab
can reduce the investment in GPUs by a factor of 4.
Figure 5.10: GPU multiplexing can support up to 4 cameras at frame rates of 10 fps or more, without noticeable loss of accuracy in action detection.
Chapter 6
TAR: Enabling Fine-Grained Targeted
Advertising in Retail Stores
6.1 Introduction
Digital interactions influence 49% of in-store purchases, and over half of them take place on mobile devices [60].
With this growing trend, brick-and-mortar retailers have been evolving their campaigns to effectively reach
people with mobile devices, showcase products, and ultimately, influence their in-store purchase. Among
them, sending targeted advertisements (ads) to user’s mobile devices has emerged as a frontrunner [105].
To send well-targeted information to the shopper, the retailers (and advertisers) should correctly understand
customers’ interest. The key to learning the customer’s interest is to accurately track and recognize the
customer during her stay in the store. Therefore, the retailers need a practical system for shopper tracking and
identification with real-time performance and high accuracy. For example, the retailer’s advertising system
would require aisle-level or meter-level accuracy in tracking a shopper to infer the customer's dwell time at a certain aisle. Moreover, the advertising should reflect changes in the customer's position quickly, because people usually stay at, or walk by, a specific shelf for just a few seconds.
Several sensor-based indoor tracking methods have been proposed, such as Wi-Fi localization [168, 84], Bluetooth localization [285, 198], stereo cameras [91, 118, 100, 276], and thermal sensors [102]. However, such approaches are either expensive in hardware cost or too inaccurate for retail scenarios. Some commercial solutions [16, 69] send customers the entire store's information when they enter the store zone. Such promotions are coarse-grained and can hardly trigger customers' interest.
Recently, live video analytics has become a promising solution for accurate shopper tracking. Companies
like Amazon Go [87] and Standard Cognition [53] use closed-source algorithms to identify customers and track their in-store movement. The open-source community has also proposed many accurate methods for people (customer) identification and tracking.
For people (re)identification, there are two mainstream approaches: face recognition and body feature
classification. Today’s face recognition solutions ([186, 292, 227]) can reach up to 95% of precision on public
datasets, thanks to the advance of deep neural networks (DNN). However, the customer’s face is not always
available in the camera, and the face image may be blurry and dark due to poor lighting and long distance. The
body-feature-based solutions [132, 154, 322] do not deliver high accuracy (< 85%) and also suffer from bad
video quality.
For people tracking, the retailer needs both single-camera tracking and cross-camera tracking to understand
the walking path of each customer. Recent algorithms for single-camera tracking [304, 130, 230, 170] leverage
both the person’s visual feature and past trajectory to track her positions in following frames. However, such
algorithms cannot perform well in challenging environments, e.g., similar clothes, long occlusion, and crowded
scene. Existing cross-camera tracking algorithms [262, 275, 290, 153] use the camera network’s topology to
estimate the cross-camera temporal-spatial similarity and match each customer’s trace across cameras. Such
solutions face challenges like unpredictable people movement (between the surveillance zones).
In this chapter, we propose TAR to overcome the above limitations. As summarized above, existing
indoor localization solutions are not accurate enough in practice and usually require the deployment of
complicated and expensive equipment. Instead, this chapter proposes a practical end-to-end shopper tracking
and identification system. TAR is based on two fundamental ideas: Bluetooth proximity sensing and video
analytics.
To infer a shopper’s identity, TAR looks into Bluetooth Low Energy (BLE) signal broadcasted from
the user’s device. BLE has recently gained popularity with numerous emerging applications in industrial
Internet-of-Thing (IoT) and home automation. Proximity estimation is one of the most common use cases
of BLE beacons [26]. Apple iBeacon [88], Android EddyStone [95], and open-sourced AltBeacon [86] are
available options. Several retail giants (e.g., Target, Macy’s) have already deployed them in stores to create a
more engaging shopping experience by identifying items in proximity to customers [101, 198, 285].
TAR takes a slightly different perspective from the above scenario in that shoppers carry BLE-equipped
devices and TAR utilizes BLE signals to enhance tracking and identify shoppers. At a high level, TAR achieves
identification by attaching the sensed BLE identity to a visually tracked shopper. TAR notices the pattern
similarity between shopper’s BLE proximity trace and her visual movement trajectories. Therefore, the
identification problem converts to a trace matching problem.
In solving this matching problem, TAR encounters four challenges. First, pattern matching in real-time
is challenging due to different coordinate systems and noisy trace data. TAR transforms both traces into
the same coordinates with camera homography projection and BLE signal processing. Then, TAR devises a
probabilistic matching algorithm based on Dynamic Time Warping (DTW) [14] to match the patterns. To enable real-time matching, TAR applies a moving window to match trace segments and uses the cumulative
confidence score to judge the matching result.
Next, assigning the highest-ranked BLE identity to the visual trace is often incorrect. Factors like short
visual traces could significantly increase the assignment uncertainty. To solve this problem, TAR uses a
linear-assignment-based algorithm to correctly determine the BLE identity. Moreover, instead of focusing on a
single trace, TAR looks at all visual-BLE pairs (i.e., a global view) and assigns IDs for all visual traces in a
camera.
Third, a single user’s visual tracking trace can frequently break upon occlusions. To solve this issue, TAR
implements a rule-based scheme to differentiate ambiguous visual tracks during the assignment process and
connects broken tracks with respect to each BLE ID.
Finally, it is non-trivial to track people across cameras with different camera positions and angles. Existing
works [280, 313, 262, 275] either work offline or require overlapping camera coverage to handle a transition
from one camera to the other. However, overlapping coverage is not guaranteed in most shops. To overcome
this issue, TAR proposes an adaptive probabilistic model that tracks and identifies shoppers across cameras
with little constraint.
We have deployed TAR in an office and a retail store environment, and analyzed TAR’s performance with
various settings. Our evaluation results show that the system achieves 90% accuracy, which is 20% better than
the state-of-the-art multi-camera people tracking algorithm. Meanwhile, TAR achieves a mean speed of 11
frame-per-second (FPS) for each camera, which enables the live video analytics in practice.
The main contributions of our work are listed below:
- development of TAR, a system for multi-camera shopper tracking and identification; TAR can be seamlessly integrated with existing surveillance systems, incurring minimal deployment overhead;
- introduction of four key elements in the design of TAR;
- a novel vision and mobile-device association algorithm with multi-camera support; and
- implementation, deployment, and evaluation of TAR, which runs in real-time and achieves over 90% accuracy.
6.2 Motivation
Retail trends: While the popularity of e-commerce continues to surge, offline in-store commerce still
dominates in today’s market. Studies in [116, 94] show that 91% of the purchases are made in physical shops.
In addition, [85] indicates that 82% of Millennials prefer to shop in brick-and-mortar stores. As online shopping evolves rapidly, it is crucial for offline shops to adapt and offer a better shopping experience.
Therefore, it is essential for offline retailers to understand shoppers’ demands for better service given that
today’s customers are more informed about the items they want.
The need for shopper tracking and identification: By observing where the shopper is and how long she
visits each area, retailers can identify the customer's shopping interests, and hence, provide a customized shopping experience for each person. For example, many large retail stores (e.g., Nordstrom [27], Family Dollar,
Mothercare [28]) are already adopting shopper tracking solutions (e.g., Wi-Fi localization). These retailers
then use the gathered data to help implement store layouts, product placements, and product promotions.
Existing solutions: Several companies [50, 276, 259] provide solutions for shopper behavior tracking by
primarily using surveillance camera feeds. The solutions include features like shopper counting, the spatial-
temporal distribution of customers, and shoppers’ aggregated trajectory. However, they are not capable of
understanding per-shopper insight (or identity). Services like Facebook [16, 69] offer targeted advertisement
for retail stores. They leverage coarse-grained location data and the shopper’s online browsing history to
identify store-level information (which store the customer is visiting) and push relevant advertisements.
Therefore, such solutions can hardly recognize the shopper’s in-store behavior.
Camera-based tracking with face recognition can be used to infer the shoppers’ indoor activities, but it also
introduces several concerns: privacy, availability, and accuracy. First, faces are privacy-sensitive information, and collecting them might raise users' privacy concerns (or even violate the law).
Second, the face in the surveillance camera is sometimes unavailable due to various camera angles and body
poses. Moreover, face recognition algorithms are known to be vulnerable to factors like image quality. Finally,
face recognition requires the user’s face image to train the model, which adds overhead to shoppers; asking
them to submit a good face image and verifying the photo authenticity (e.g., offline ID confirmation) are not
easy.
Our proposed approach: TAR adopts a vision-based tracking approach but extends it to enable shopper
identification with BLE. We exploit the availability of BLE signals from the shopper’s smartphone and
combine the signal with vision-based technologies to achieve good tracking accuracy across cameras.
Modern smartphones are equipped with a Bluetooth Low Energy (BLE) chip, and many BLE-based applications and hardware products have been developed. A typical usage of BLE is to act as a beacon, which broadcasts a BLE
signal at a particular frequency. The beacon can serve as a unique identifier for the device and can be used to estimate the proximity to the receiver [246]. Our approach assumes the availability of BLE signals from shoppers, an assumption made increasingly realistic by incentive mechanisms (e.g., mobile apps offering coupons).
Figure 6.1: System Overview for TAR
Therefore, in addition to our customized vision-based detection and tracking algorithms, we carefully
integrate them with BLE proximity information to achieve high accuracy for tracking and identification across
cameras.
In designing the system, we aim to achieve the following goals:
- Accurate: TAR should outperform the accuracy of existing multi-camera tracking systems. It should also be precise in distinguishing people's identities.
- Real-time: TAR should recognize each customer's identity within a few seconds, since a shopper might be highly mobile across multiple cameras. Meanwhile, TAR should detect the appearance of people at a high frame rate (FPS).
- Practical: TAR should not need any expensive hardware or complex deployment. It can leverage existing surveillance camera systems and the user's smartphone.
6.3 TAR Design
This section presents the design of TAR. We begin with an overview of the TAR’s design and the motivating
use cases of the design. Then, we explain the detailed components that address technical challenges specific to
the retail environment.
6.3.1 Design Overview
Figure 6.1 depicts the design of TAR, and it consists of two major parts: 1) mobile Bluetooth Low Energy
(BLE) library that enables smart devices as BLE beacons in the background, and 2) server backend that collects
the real-time BLE signals and video data as well as performs customer tracking and identification. First, we
assume customers usually carry their smartphones with a store application installed [114]. The store app
equips with TAR’s mobile library that broadcasts BLE signal as a background thread. Note that the BLE
protocol is designed with minimum battery overhead [90], and the broadcasting process does not require
customer’s intervention. Next, TAR’s server backend includes several hardware and software components. We
assume each surveillance camera equips with a BLE receiver for BLE sensing. Both the camera feed and the
BLE sensing data are sent to TAR for real-time processing.
TAR is composed of several key components to enable accurate tracking and identification. It has a deep
neural network (DNN) based tracking algorithm to track users with vision trace, and then, incorporates a
BLE proximity algorithm to estimate the user’s movement. In addition, TAR adopts a probabilistic matching
algorithm based on Dynamic Time Warping (DTW) [14] to associate both vision and BLE data and find out
the user’s identity. However, external factors such as people occlusion could harm the accuracy of sensed
data and relying solely on the matching algorithm usually results in the error. To handle this issue, TAR
uses a stepwise matching algorithm based on cumulative confidence score. After that, TAR devises an ID
assignment algorithm to determine the correct identity from the global view. As the vision-based trace might
frequently break, sewing them together is essential to learning user interests. We propose a rule-based scheme
to identify ambiguous user traces and properly connect them. Finally, the start of the probabilistic matching
process will encounter more uncertainty due to the limited trace’s length. Therefore, TAR considers each
user’s cross-camera temporal-spatial relationship and carefully initializes its initial confidence level to improve
the identification accuracy.
6.3.2 A Use Case
Figure 6.2 illustrates an example of how TAR works. A grocery store is equipped with two video cameras that
cover different aisles, as shown in Figure 6.2(a). Assume a customer with her smartphone enters the store and
the app starts broadcasting BLE signal. The customer is looking for some snacks and finally finds the snack
aisle. During her stay, two cameras can capture her trajectory. Briefly, camera-1 (bottom) sees the user at first
and senses her BLE signals. Then TAR starts matching the user's visual trace to the estimated BLE proximity traces.
TAR maintains a confidence score for the tracked customer’s BLE identity. When the user exits the camera-1
zone and enters camera-2 region (top), TAR considers various factors including temporal-spatial relationship
and visual feature similarity, and then, adjusts the initial confidence score for the customer in the new zone.
Figure 6.2: A Targeted Ad Working Example in Store
Then, camera-2 starts its own tracking and identification process and concludes the customer's identity (7FD4 in Figure 6.2(b)). TAR then continuously learns her dwell time and fine-grained trajectory at each shelf.
In the following sections, we detail the core components of TAR that realize the features above and other use cases
of fine-grained tracking and identification.
6.3.3 Vision-based Tracking (VT)
We design a novel vision-based tracking pipeline (VT) that consists of three components: people detection and
visual feature extraction, visual object tracking, and physical trajectory estimation.
6.3.3.1 People Detection and Deep Visual Feature
Recent developments in DNNs provide accurate and fast people detectors, which detect people in each
frame and marks the detected positions with bounding boxes. Among various proposals, we choose Faster-
RCNN [258] as TAR’s people detector because it achieves high accuracy as well as a reasonable speed.
In addition to the detection, TAR extracts and uses the visual feature of the detected bounding box to
improve inter-frame people tracking. Briefly, once a person’s bounding box is detected, TAR extracts its visual
feature using a DNN. The ideal visual feature should accurately distinguish each person under different poses and lighting conditions. Recently, DNN-based feature extractors have been proposed that outperform other features (e.g., color histograms, SIFT [225]) in classification accuracy. We evaluated state-of-the-art feature extractors, including CaffeNet, ResNet, VGG16, and GoogleNet [109], and identified that the convolutional neural network (CNN) version of GoogleNet [327] delivers the best tradeoff between speed and accuracy. We then further trained the model with two large-scale people re-identification datasets (MARS [324] and DukeReID [178]), containing over 1,100,000 images of 1,261
pedestrians.
6.3.3.2 People Tracking in Consecutive Video Frames
The tracking algorithm in TAR is inspired by DeepSort [304], a state-of-the-art online tracker. At a high level,
DeepSort combines each people’s visual feature with a standard Kalman filter, which matches objects based
on squared Mahalanobis distance [162]. DeepSort optimizes the matching process by minimizing the cosine
distance between deep features. However, it often fails when multiple people collide in a video. The cause
is that the size of a detection bounding box, covering colliding people, becomes large, and the deep visual
feature calculated from the bounding box cannot accurately represent the person inside.
To overcome this problem, TAR leverages the geometric relationship between objects. When multiple
people are close to each other and their bounding boxes have a large intersection-over-union (IOU) ratio, TAR
will not differentiate those persons using DNN-generated visual features. Instead, those people’s visual traces
will be regarded as "merged" until some people start leaving the group. When the bounding boxes’ IOU values
become lower than a certain threshold (set to 0.3), they will be regarded as "departed" and TAR will resume
the visual-feature-enabled tracking.
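A minimal sketch of this IOU test follows; the 0.3 threshold is the value quoted above, while the helper names are illustrative.

```python
# Minimal sketch: decide when colliding tracks should be treated as "merged".
def iou(a, b):
    """a, b: bounding boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def merged(a, b, threshold=0.3):
    # Above the threshold, deep visual features become unreliable: mark as merged.
    return iou(a, b) > threshold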
The hybrid approach above also faces some challenges. When two users with similarly colored clothes cross each other, the matching algorithm sometimes fails because the users' IDs (or tracking IDs) are switched. To avoid this error, we propose a kinematic verification component for our matching algorithm. The idea is that people's motion is likely to remain roughly constant over a short period. Therefore, we compute the velocity and
the relative orientation of each detected object in the current frame, and then compare it to existing tracked
objects’ velocity and orientation. This component serves as a verification module that triggers the matching
only for objects whose kinematic conditions are similar. TAR avoids the confusion, as the two users above
show different velocity and orientation.
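The sketch below illustrates one way such a kinematic gate could be implemented; the speed and angle thresholds are illustrative placeholders, not TAR's tuned values.

```python
# Minimal sketch: only allow a detection/track match when their velocities
# and headings are similar.
import numpy as np

def kinematics(positions, fps):
    """positions: the last few (at least two) (x, y) centers of a track."""
    p = np.asarray(positions, dtype=float)
    v = (p[-1] - p[0]) / (len(p) - 1) * fps      # displacement per second
    speed = np.linalg.norm(v)
    heading = np.arctan2(v[1], v[0])
    return speed, heading

def kinematically_similar(track_pos, det_pos, fps,
                          max_speed_diff=1.0, max_angle_diff=np.pi / 4):
    s1, h1 = kinematics(track_pos, fps)
    s2, h2 = kinematics(det_pos, fps)
    dh = abs((h1 - h2 + np.pi) % (2 * np.pi) - np.pi)   # wrapped angle difference
    return abs(s1 - s2) <= max_speed_diff and dh <= max_angle_diff
```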
The people tracking algorithm in TAR synthesizes the temporal-spatial relationship and visual feature
distance to track each person (her ID) accurately. First, it adopts a Kalman filter to predict the moving direction
and speed of each person (called a track), and then predicts each track's position in the next frame. In the next frame,
TAR computes a distance between the predicted position and each detection box’s position. Second, TAR
calculates each bounding box’s intersection area with the last few positions of each track. Larger IOU ratio
means higher matching probability. Third, TAR extracts a deep visual feature of the detected object, and then,
compares the feature with the track’s. Here, TAR can filter out tracks with the kinematic verification, and then
apply all the three matching metrics. Finally, it assigns each detection to a track. If a detection cannot match
any track with enough confidence, TAR will search one frame backward to find any matched track. On the
other hand, if a track is not matched for a long time (a person moves out of a camera’s view), it is regarded as
“missing”, and hence, will be deleted.
6.3.3.3 Physical Trajectory Estimation
Once the visual tracking finishes, TAR converts the results to physical trajectories by applying the
homography transformation [24]. Specifically, TAR infers people’s physical location by using both visual
tracking results and several parameters of the camera. Assuming the surveillance cameras are stationary and
well calibrated, TAR can estimate the height and the facing direction of detected objects in world coordinates.
Moreover, these cameras can provide information about their current resolution and angle-of-view. With these
calibration properties, TAR calculates a projective transformation matrix [24] H that maps each pixel in the
frame to the ground location in the world coordinates. As a person (or track) moves, TAR can associate its
distance change with a timestamp, yielding physical trajectory.
However, the homography mapping process introduces a unique challenge: it would need to project all pixels in a detected bounding box (bbox) to estimate physical distance, but the bbox size may vary frame by frame. For example, a person's bbox may cover her entire body in one frame, and then only include her upper body in the next. Moreover, transforming all the pixels in the bbox imposes an extra burden on
computation. To deal with this challenge, TAR chooses a single reference pixel for each detected person,
while ensuring spatial consistency of the reference pixel even in changing bbox. Specifically, TAR picks a
pixel that is crossing between the bbox and ground, i.e., a person’s feet position. TAR uses this bottom-center
pixel of the bbox to represent its reference pixel. One may argue that the bbox’s bottom may not always be a
foot position (e.g., when the customers’ lower body is blocked). TAR leverages the fact that a person’s width
and height show a ratio around 1:3. With this intuition, TAR checks whether a detected bbox is "too short" –
blocked – and, if so, TAR extends the bottom side of the bbox, based on the ratio. Our evaluation shows that
TAR’s physical trajectory estimation achieves less than 10% of an error, even in a crowded area.
6.3.4 People Tracking with BLE
In addition to VT, TAR relies on BLE proximity to accurately estimate people’s trajectories. We first introduce
BLE beacons and then explain TAR’s proximity estimation algorithm.
BLE background. BLE beacon represents a class of BLE devices. It periodically broadcasts its identifier to
nearby devices. A typical BLE beacon is powered by a coin cell battery and can have a lifetime of 1-3 years.
Today’s smartphones support Bluetooth 4.0 protocol so they can operate as a BLE beacon (transmitter).
Similarly, any device that supports Bluetooth 4.0 can be used as BLE receiver. TAR’s mobile component
enables a customer's smartphone to act as a BLE beacon. This component is designed as a library, and other applications (e.g., the store app) can easily integrate it and run it as a background process.
Figure 6.3: Relationship between BLE proximity and physical distance
Proximity Trace Estimation. The BLE proximity trace is estimated by collecting BLE beacons’ time series
proximity data. Through our extensive evaluation, we select the proximity algorithm in [86] to estimate the
distance from the BLE beacon to the receiver. There are two ways to calculate the proximity using the BLE Received
Signal Strength (RSS): (1) d = exp((E - RSS) / (10n)), where E is the transmission power (dBm) and n is the fading coefficient; (2) the beacon's transmission power ts defines the expected RSS for a receiver that is one meter away from the beacon. We denote the actual RSS as rs and compute the ratio rt = rs / ts. The distance is then estimated as rt^10 if rt < 1.0, and c1 * rt^c2 + c3 otherwise, where c1, c2, and c3 are coefficients obtained from data regression. We
implement both algorithms and compare their performance on the collected data. We find the second option is
more sensitive to movement and therefore reflects the movement pattern in a more timely and accurate manner.
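For concreteness, a minimal sketch of the second option is shown below; the coefficient values are illustrative placeholders, since TAR regresses c1, c2, and c3 per receiver model.

```python
# Minimal sketch: ratio-based BLE proximity estimate (second option above).
def ble_distance(rss, tx_power, c1=0.9, c2=7.7, c3=0.1):
    """rss: measured RSS (dBm); tx_power: expected RSS at 1 m (dBm)."""
    rt = rss / float(tx_power)         # both values are negative dBm readings
    if rt < 1.0:                       # receiver is closer than one meter
        return rt ** 10
    return c1 * (rt ** c2) + c3        # regression curve beyond one meter
```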
In practice, these coefficients depend on the receiver device manufacturers. For example, Nexus 4 and
Nexus 5 use the same Bluetooth chip from LG, so they have the same parameters. In TAR, we have full knowledge
of our receivers, so we regress our coefficients accordingly. Since TAR also controls the beacon side, the
transmission power of each beacon is known to TAR. Notice that the BLE RSS reading is inherently erroneous,
so we apply an RSS noise filtering strategy similar to [152] to the original signal and then calculate the
current rs with the above formula. TAR takes the time series of BLE proximity as the BLE trace for each
device and its owner. We assume each customer has one device with TAR installed; the case where one user carries multiple devices or another person's device is left for future work.
6.3.5 Real-time Identity Matching
The key to learning the user’s interest and pushing ads is accurate user tracking and identification. By tracking
the customer, we know where she visits and what she’s interested in. By identifying the user, we know who
she is and whom to send the promotion to. In practice, it is unnecessary to know the user's real identity. Instead,
recognizing the smart devices carried by users achieves the same goal.
Figure 6.4: (a) Example of a visual trace; (b) Sensed BLE proximity traces; (c) DTW cost matrix for successful matching; (d) Matching Process Illustration.
We find the BLE universally unique
ID (UUID) can serve as the identifier for the device. If we associate the BLE UUID to the visually tracked
user, we will successfully identify her and learn her specific interest by looking back at her trajectories. On the
other hand, we notice that for a particular user, her BLE proximity trace usually correlates with her physical
movement trajectory and her visual movement. Figure 6.3 shows the example traces of a customer and the
illustration of the BLE proximity and the physical distance. Therefore, TAR aims to associate the user’s
visually tracked trace to the sensed BLE proximity trace. Inspired by the observation above, we propose a
similarity-based association algorithm with movement pattern matching for TAR.
6.3.5.1 Stepwise Trace Matching
In the matching step, we first need to decide how the traces should be matched. We notice that the BLE
proximity traces are usually continuous, but the visual tracks could easily break, especially in occlusion. With
this observation, we use visual tracking trace to match BLE proximity traces. The BLE trace continuity, on
the other hand, can help correct the real-time visual tracking. To match the time series data, we devised our
algorithm based on Dynamic Time Warping (DTW). DTW matches each sample in one pattern to another
using dynamic programming. If the two patterns change together, their matched sample points will have
a shorter distance, and vice versa. Therefore, shorter DTW distance means higher similarity between two
traces. Based on the DTW distance, we define a confidence score to quantify the similarity. Mathematically, assume dt_{ij} is the DTW distance between visual track v_i and BLE proximity trace b_j; the confidence score is then f_{ij} = exp(-dt_{ij}/100). There are other ways to compare trace similarity, such as Euclidean distance, cosine
distance, etc.
DTW is a category of algorithms for aligning and measuring the similarity between two time series. There
are three challenges in applying DTW to synchronize the BLE proximity and visual traces. First, DTW normally
processes the two traces offline. However, both traces are extending continuously in real-time in TAR. Second,
DTW relies on computing a two-dimensional warping cost matrix which has size increasing quadratically with
the number of samples. Considering the BLE data’s high frequency and nearly 10 FPS video processing speed,
the computation overhead can increase dramatically over time. Finally, DTW calculates the path with absolute
values in two sequences, but the physical movement estimated from the BLE proximity and the vision-based
tracking trace are inconsistent and inherently erroneous. Computing DTW directly on their absolute values
will cause adverse effects in matching.
First, to deal with the negative effect of absolute value input, TAR adopts the data differential strategy
similar to [152, 278]. We filter out the high-frequency points in the trace and calculate the differential of
the current data point by subtracting the previous one and dividing by the elapsed time. Through this operation, each data sequence becomes independent of absolute values, and the two can be compared directly.
Second, a straightforward way to reduce the computation overhead is to minimize the input data size. TAR
follows this path and designs a moving window algorithm to prepare the input for DTW. More concretely, we
set a sliding window of three seconds and update the windowed data every second. We choose this window
size for the balance between latency and accuracy. If the window is too short, the BLE trace and visual trace
will be too short to be correctly matched. If the window is too long, we may miss some short tracks. As the
window moves, TAR performs the matching process in real-time, thus resolving the DTW offline issue. The computation is triggered each time the current window is updated. Although we obtain a confidence score on a per-window basis, connecting the matching windows for a specific visual track remains an issue. For example, a visual track v_i may have a higher confidence to match BLE ID-1 at window 1, but BLE ID-2 at window 2. To deal with this, TAR uses a cumulative confidence score to connect the windows for the visual track: TAR accumulates the confidence scores over consecutive windows of a visual trace and uses the accumulated value as the current confidence score for ID assignment.
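The sketch below illustrates the windowed matching loop, assuming the fastdtw package used in our implementation; the 3-second window, 1-second update, and the exp(-dt/100) confidence score follow the text, while the helper structure is illustrative.

```python
# Minimal sketch: per-window DTW confidence and its cumulative accumulation.
import numpy as np
from fastdtw import fastdtw

def differentiate(trace):
    # Work on per-sample changes so absolute-value offsets cancel out.
    return np.diff(np.asarray(trace, dtype=float))

def window_confidence(visual_win, ble_win):
    # Assumes each window holds at least two samples.
    dt, _ = fastdtw(differentiate(visual_win), differentiate(ble_win),
                    dist=lambda a, b: abs(a - b))
    return np.exp(-dt / 100.0)           # confidence score f_ij = exp(-dt_ij / 100)

# Cumulative confidence per (visual track, BLE ID) pair, updated every second
# over a 3-second sliding window.
cumulative = {}
def update(track_id, ble_id, visual_win, ble_win):
    key = (track_id, ble_id)
    cumulative[key] = cumulative.get(key, 0.0) + window_confidence(visual_win, ble_win)
    return cumulative[key]
```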
We use Figure 6.4 as one example to demonstrate the algorithm. In this case, a customer’s moving
trace is shown on the top of Figure 6.4(a). Due to the aisle occlusion and pose change, our vision tracking
algorithm obtains two visual tracks for him. Figure 6.4(b) shows the sensed BLE proximity during this period.
TAR tries to match the visual tracked trace to one of those BLE proximity traces. Figure 6.4(c) shows the
calculation process of DTW for a visual track and a BLE proximity trace, where the path goes almost diagonal.
To illustrate our confidence score calculation process, we show the computation process for this example in
Figure 6.4(d). The x-axis shows the time, left y-axis shows the DTW score (solid lines) for each moving
window, while the right y-axis shows the cumulative confidence score (dotted lines). We can see that BLE
trace 2 has a better confidence score at the beginning, but falls behind the correct BLE trace 1 after four seconds.
ID        1      2      ...    n
Track a   p_a1   p_a2   ...    p_an
Track b   p_b1   p_b2   ...    p_bn
Track c   p_c1   p_c2   ...    p_cn
Table 6.1: ID-matching matrix
6.3.5.2 Identity Assignment
To identify the user, TAR needs to match the BLE proximity trace to the correct visual track. Ideally, for each
trace, the best cumulative confidence score decides the correct matching. However, there are two problems.
First, as stated earlier, BLE proximity estimation is not accurate enough to differentiate some users. In practice,
we sometimes see two BLE proximity traces are too similar to assign them to one user confidently. Second,
visual tracks break easily in challenging scenarios, which often results in short tracks. For example, the visual
track of the user in Figure 6.4(a) breaks in the middle, leading to two separate track traces. Although the deep
feature similarity can help in some scenarios, it fails when the view angle or body pose changes. As TAR
intends to learn the user's interest, there needs to be a way to connect these intermittent visual tracks.
ID Assignment. To tackle the first challenge, TAR proposes a global ID assignment algorithm based on linear
assignment [203]. TAR computes the confidence score for every track-BLE pair. At any time for one camera,
all the visible tracks and their candidate BLE IDs will form a matrix called ID-matching matrix, where row i
stands for track i and column j is for BLE ID j. The element(i; j) of the matrix is Prob(BLE
i j
). Table 6.1
shows the matrix structure. Note that each candidate ID only belongs to some of the tracks, so its matching
probability is zero with other tracks.
When the matrix is ready, TAR will assign one BLE ID for the track in each row. The goal of the assignment
is to maximize the total sum of confidence score. We use Hungarian algorithm [211] to solve the assignment
problem in polynomial time. The assigned ID will be treated as the track’s identity in the current time slot. As
visual tracks and BLE proximity traces change with the time window, TAR will update the assignment with
the updated matrix accordingly. If a track is not updated in the current window, it and its candidates will be temporarily removed from the matrix. When a track stops updating for a long time (> 20 sec), the system will treat the track as "terminated" and archive the last BLE ID assigned to the track.
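A minimal sketch of this global assignment step is shown below; it reuses a Hungarian-style solver via SciPy, and the helper names are illustrative.

```python
# Minimal sketch: build the ID-matching matrix from cumulative confidence
# scores and assign one BLE ID per visible track, maximizing total confidence.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(tracks, ble_ids, confidence):
    """confidence: dict mapping (track, ble_id) -> cumulative confidence score."""
    M = np.zeros((len(tracks), len(ble_ids)))
    for i, t in enumerate(tracks):
        for j, b in enumerate(ble_ids):
            M[i, j] = confidence.get((t, b), 0.0)   # non-candidates stay at zero
    rows, cols = linear_sum_assignment(-M)          # negate to maximize confidence
    return {tracks[i]: ble_ids[j] for i, j in zip(rows, cols) if M[i, j] > 0}
```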
6.3.5.3 Visual Track Sewing
The identity matching process is still insufficient for identity tracking in practice: the vision-based tracking
technique is so vulnerable that one person’s vision track may break into multiple segments. For example,
upon a long period of occlusion, one person’s trajectory in the camera may be split into several short tracks
(see Figure 6.4(a)). Another case is that the customer may appear for a very short time in the camera (enters
the view and quickly leaves). These short traces make the ID assignment result ambiguous as the physical
distance pattern can be similar to many BLE proximity traces in that short period.
TAR proposes a two-way strategy to handle this. First, TAR tries to recognize the “ambiguous” visual
track in real-time. In our design, a track will be considered as “ambiguous” when it meets either of the two
rules: 1) its duration has not reached three seconds; 2) its confidence score distinction among candidate BLE
IDs is vague; specifically, two candidates are considered similar when the rank-2 score is more than 80% of the rank-1 score.
When there is an ambiguous track in assignment, TAR will first consider if the track belongs to an inactive
track due to the occlusion. To verify this, TAR will search the inactive local tracks (not matched in the current
window but is active within 20 seconds) and check if their assigned IDs are also top-ranked candidate IDs of
the ambiguous track. If TAR cannot find such inactive tracks, that means the current track has no connection
with previous tracks, so the current one will be treated as a regular track to be identified with the ID assignment
process.
When a qualified inactive track is found, TAR will check if the two tracks have spatial conflict. The spatial
conflict means that the two temporally-neighboring segments are located far from each other. For example, with the same assigned BLE ID, one track v_1 ends at position P_1 and the next track v_2 starts at position P_2. Suppose the gap time between the two tracks is t, and the average moving speed of v_1 is v. In TAR, v_1 and v_2 will have a spatial conflict if |P_1 - P_2| > 5vt. The intuition behind this is that a person cannot travel too fast from one place
to another.
With the conflict check finished, TAR connects the inactive track with the ambiguous track. The trace
during the gap time between the two tracks is automatically filled in using the average speed. The system assumes that the person moves from the first track's endpoint to the second track's starting point with constant velocity
during the occlusion. Then the combined track will replace the ambiguous track in the assignment matrix.
After linear assignment, TAR will check if the combined track receives the same ID that is previously assigned
to the inactive track. If yes, this means the track combination is successful and the ambiguous track is the
extension of the inactive track. Otherwise, TAR will try to combine the ambiguous track with other qualified
inactive tracks until successful ID assignment. If no combination wins, the ambiguous track will be treated as
a regular track for the ID assignment process.
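The sketch below illustrates the spatial-conflict test and the constant-velocity gap filling described above; the 5vt bound follows the text, while the helper names and the interpolation step size are illustrative.

```python
# Minimal sketch: conflict check and gap interpolation for track sewing.
import numpy as np

def spatial_conflict(end_pos, start_pos, gap_time, avg_speed):
    # Two temporally-neighboring segments conflict if |P1 - P2| > 5 * v * t.
    gap = np.linalg.norm(np.asarray(end_pos, float) - np.asarray(start_pos, float))
    return gap > 5 * avg_speed * gap_time

def fill_gap(end_pos, start_pos, gap_time, step=1.0):
    """Interpolate the missing trajectory at constant velocity during occlusion."""
    n = max(int(gap_time / step), 1)
    p0, p1 = np.asarray(end_pos, float), np.asarray(start_pos, float)
    return [tuple(p0 + (p1 - p0) * k / n) for k in range(1, n)]
```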
6.3.5.4 Multi-camera Calibration
In the discussion above, one problem with the matching process for the single camera is that the confidence
score could be inaccurate when the tracks are short. This is due to the limited amount of visual track data and the large set of candidate BLE IDs. For each visual track, we should therefore try to minimize the number of candidate BLE IDs. This is necessary because more candidates not only increase the processing time but also decrease the
ID assignment accuracy. Therefore, TAR proposes Cross-camera ID Selection (CIS) to prepare the list of valid
BLE IDs for each camera.
The task of CIS is to determine which BLE ID is currently visible in each camera. First, we observe that
15 meters is usually the maximum distance from the camera to a detectable device. Therefore, TAR ignores beacons with BLE proximity larger than 15 meters. However, the 15-meter scope can still cover more than 20 IDs in real scenarios. The reason is that the BLE receiver senses devices in all directions while the camera has a fixed view angle. Therefore, some non-line-of-sight beacon IDs can pass the proximity filter. For example,
two cameras are mounted on the two sides of a shelf (which is common in real shops). They will sense very
similar BLE proximity to nearby customers while a customer can only show up in one of them.
To solve the problem, TAR leverages the positions of the camera and the shop’s floorplan to abstract the
camera connectivity into an undirected graph. In the graph, a vertex represents a camera, and an edge means
customers can travel from one camera to another. Figure 6.5(a) shows a sample topology where there are four
cameras covering all possible paths within the area. A customer ID must be sensed hop-by-hop. With this
knowledge, TAR filters ID candidates with the following rules: 1) At any time, the same person’s track cannot
show up in different cameras if the cameras do not have overlapping views. In this case, if an ID is already associated with a track in one camera with high confidence, it cannot be used as a candidate in other cameras (Figure 6.5(b)). 2) A customer's graph trajectory cannot "skip" a node. For example, an unknown customer sensed by cam-2 must have shown up in cam-1 or cam-3, because cam-2 lies in the middle of the path from
cam-1 to cam-3, and there’s no other available path (Figure 6.5(c)). 3) The travel time between two neighbor
cameras cannot be too short. We set the lower bound of travel time as 1 second (Figure 6.5(d)).
CIS runs as a separate thread on the TAR server. In every moving window, it collects all cameras’ BLE
proximity traces and visual tracks. CIS checks each BLE ID in the camera’s candidate list and removes the
ID if it violates any of the rules above. The filtered ID list will be sent back to each camera module for ID
assignment.
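A minimal sketch of this filtering step is shown below; the camera topology is represented as an adjacency dictionary, the 15 m and 1 s thresholds follow the text, and the remaining data structures are illustrative assumptions.

```python
# Minimal sketch: cross-camera ID selection (CIS) candidate filtering.
import time

MAX_RANGE_M = 15          # ignore beacons farther than 15 m from the camera
MIN_TRAVEL_S = 1.0        # rule 3: minimum travel time between neighbor cameras

def filter_candidates(cam, candidates, proximity, assigned, last_seen, topology):
    """candidates: BLE IDs sensed at `cam`; proximity: {ble_id: meters};
    assigned: {ble_id: camera currently holding it with high confidence};
    last_seen: {ble_id: (camera, timestamp)}; topology: {camera: set(neighbors)}."""
    now, kept = time.time(), []
    for ble in candidates:
        if proximity.get(ble, MAX_RANGE_M + 1) > MAX_RANGE_M:
            continue                                   # out of the camera's reach
        if assigned.get(ble) not in (None, cam):
            continue                                   # rule 1: already owned elsewhere
        prev = last_seen.get(ble)
        if prev is not None:
            prev_cam, t = prev
            if prev_cam != cam and prev_cam not in topology[cam]:
                continue                               # rule 2: cannot skip a camera node
            if prev_cam != cam and now - t < MIN_TRAVEL_S:
                continue                               # rule 3: travel takes time
        kept.append(ble)
    return kept
```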
Figure 6.5: (a) Camera Topology;(b) One ID cannot show in two cameras;(c) BLE ID must be sensed sequentially in
the network;(d) It takes time to travel between cameras
6.4 Evaluation
In this section, we describe how TAR works in the real scenario and evaluate each of its components.
6.4.1 Methodology and Metrics
TAR Implementation. Our implementation of TAR contains three parts: BLE broadcasting and sensing, live video detection and tracking, and identity matching.
BLE broadcasting is designed as a mobile library supporting both iOS and Android. TAR implements the
broadcasting library with the CoreBluetooth framework on iOS [34] and AltBeacon [86] on Android. The
BLE RSS sensing module sits in the backend. In our experiments, we use Nexus 6 and iPhone 6 for BLE
signal receiving. The Bluetooth data is pushed from the devices to the server through a TCP socket. In TAR, we set both the broadcasting and sensing frequency to 10 Hz.
The visual tracking module (VT) consists of a DNN-based people detector and a DNN feature extractor.
One VT processes the video from one camera. TAR uses the TensorFlow version of Faster-RCNN [155, 258] as the people detector and our modified GoogleNet [109] as the deep feature extractor. We train the Faster-RCNN model with the VOC image dataset [169] and the GoogleNet with two pedestrian datasets: Market-1501 [325] and DukeMTMC [328]. The detector returns a list of bounding boxes (bboxes), which are fed to the feature extractor. The extractor outputs a 512-dimensional feature vector for each bbox. We choose FastDTW [267] as the DTW implementation; its code can be downloaded from [17].
Since each VT needs to run two DNNs simultaneously, we cannot support multiple VTs on a single GPU. To ensure performance, we dedicate one GPU to each VT instance in TAR, leaving further scalability optimization to future work. The tracking and identity matching algorithms are implemented in Python and C++. To ensure real-time processing, all modules run in parallel through multi-threading.
Our server is equipped with an Intel Xeon E5-2610 v2 CPU and an Nvidia Titan Xp GPU. At runtime, TAR
Figure 6.6: Experiment deployment layout: (a) office; (b) retail store.
Figure 6.7: The same person's appearance under different camera views (office).
occupies 5.3GB of GPU memory and processes video at around 11 FPS. Running two VT instances on one GPU does not overflow the memory but reduces the FPS by around half.
As cross-camera tracking and identification require collaboration among different cameras, TAR shares data by designating one machine as the master server and running a Redis cache on it. Each VT machine accesses the cache to upload its local BLE proximity and tracking data. The server runs cross-camera ID selection over the cached data and writes the filtered ID list back to each VT's Redis space.
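A minimal sketch of this data sharing with the redis-py client is shown below, assuming JSON-serialized payloads; the hostname and key naming scheme are illustrative, not TAR's actual layout.

import json
import redis

r = redis.Redis(host="master-server", port=6379)     # master machine running the Redis cache

def upload_local_state(camera_id, ble_proximity, visual_tracks):
    # Each VT machine pushes its BLE proximity trace and visual tracks for the current window
    r.set(f"ble:{camera_id}", json.dumps(ble_proximity))
    r.set(f"tracks:{camera_id}", json.dumps(visual_tracks))

def publish_filtered_ids(camera_id, valid_ble_ids):
    # The server's CIS thread writes the filtered candidate list back for each camera
    r.set(f"candidates:{camera_id}", json.dumps(valid_ble_ids))

def read_candidates(camera_id):
    # Each VT reads its filtered candidate list before running ID assignment
    raw = r.get(f"candidates:{camera_id}")
    return json.loads(raw) if raw else []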
TAR Experiment Setup. We evaluate TAR’s performance by deploying the system in two different envi-
ronments: an office building (Figure 6.6(a)) and a retail store (Figure 6.6(b)). We use Reolink IP cameras (RLC-410S) in our setup. The test area for the office deployment is 50m × 30m with an average path width of 3.5m, while the retail store is 20m × 30m.
We deploy six cameras in the office building as shown in the layout, and three cameras in the retail store.
All the cameras are mounted at a height of about 3m, pointing 20°–30° down toward the sidewalk. There are 20 different participants involved in the test, 12 in the office deployment and 8 in the retail store deployment. Besides the recruited volunteers, TAR also records other pedestrians, capturing up to 27 people simultaneously across the cameras. Each participant has TAR installed on their device and walks around freely based on their interests. To quantify TAR's performance, we record all the trace data in both deployment scenarios for later comparison. We collected around one hour of data for each deployment, including 30GB of video and 10MB of BLE RSS logs. Figure 6.7 shows the same person's appearance in different cameras. Some snapshots are dark and blurry, which makes it hard to identify people with a vision-only approach.
Figure 6.8: Screenshots for cross-camera calibration.
For cross-camera tracking and identification, we mainly use the IDF1 score [261], a standard metric for evaluating the performance of multi-camera tracking systems. IDF1 is the ratio of correctly identified detections over the average number of ground-truth detections, i.e., (correctly identified people in all frames) / (all identified people in all frames). For example, suppose one camera records three people A, B, and C, and an algorithm returns two traces: one on A with ID=A and another on C with ID=B. In this case only one person is correctly identified, so IDF1 = 33%.
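As a small illustration, the following snippet reproduces the example above using the simplified per-person definition given here (not the full frame-level IDF1 computation); the data is made up.

def idf1_simplified(people, predictions):
    # people: ground-truth identities; predictions: {person: ID assigned by the tracker}
    correct = sum(1 for p in people if predictions.get(p) == p)
    return correct / len(people)

# A is labeled correctly, C is labeled as B, and B is missed entirely
print(idf1_simplified(["A", "B", "C"], {"A": "A", "C": "B"}))   # 0.333... (33%)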
6.4.2 TAR Runtime
Before discussing our trace-based evaluation, we show the benefits of TAR’s matching algorithm and opti-
mization at runtime.
We first show TAR's ID assignment process (demo video: https://vimeo.com/246368580). In the beginning, with only detection and bbox tracking,
we cannot tell the user identity. We consider the user movement estimated from the visual track and the
BLE proximity traces and then apply our stepwise matching algorithm. After that, we use our ID assignment
algorithm to report the user's possible identity. Although the reported identity is not correct at first, the real identity emerges as the time window moves. This demonstrates the effectiveness of our identity matching algorithm.
We also demonstrate how TAR's track sewing works at runtime (demo video: https://vimeo.com/246388147). As the first part of the video shows, in
the case of broken visual tracks, the user may not get correctly identified after the break. By applying our
track sewing algorithm, the user’s tracks get correctly recognized much faster. Therefore, TAR’s track sewing
algorithm benefits those scenarios.
Figure 6.8 shows two cameras' screenshots in the office setting at different times. In this trace, one user (orange bbox) walks from camera 2 to camera 4. Meanwhile, around 7 BLE IDs are sensed. As the user enters camera 4, TAR uses the temporal-spatial relationship and deep feature distance to filter out unqualified BLE IDs, and then assigns the highest-ranked identity to the user. As shown in camera 4's screenshot, the user
Figure 6.9: (a) Multi-camera tracking comparison against state-of-the-art solutions (*offline solution); (b) error statistics of TAR and MCT+ReID; (c) error statistics of re-identification in MCT+ReID and example images.
is correctly identified.
6.4.3 TAR Performance
6.4.3.1 Comparing with Existing Multi-cam Tracking Strategies
Figure 6.9(a) shows the accuracy of TAR. The y-axis represents IDF1 accuracy. As a comparison, we also
evaluate the IDF1 of existing state-of-the-art algorithms from the vision community:
(1) MCT+ReID: We use the work from DeepCC [262], an open-sourced algorithm that reaches top accuracy in the MOT Multi-Camera Tracking Challenge [41]. The solution uses DNN-generated visual features for people re-identification (ReID), and uses single-camera tracking and cross-camera association algorithms for identity tracking. The single-camera part of DeepCC runs a multi-cut algorithm over detections in recent frames and calculates the best traces that minimize the assignment cost. For cross-camera identification, it not only considers visual feature similarity but also estimates each person's movement trajectory in the camera topology to associate two tracks, which is similar in spirit to TAR's cross-camera ID selection.
(2) MCT-Only: We also test MCMT [261], the predecessor of DeepCC [262], which shares similar tracking logic with DeepCC (both single-camera and multi-camera) but does not use a DNN for people re-identification.
(3) ReID-Only: We directly run DeepCC's DNN to extract each person's visual feature in each frame and classify each person as one of the registered users. This shows the accuracy of tracking with re-identification only.
Analysis: We can see that TAR outperforms the best existing offline algorithm (MCT+ReID) by 20%. We therefore analyze the failures in both TAR and MCT+ReID to understand why TAR achieves much higher accuracy.
There are two types of failures: erroneous single-camera tracking and wrong re-identification. Note that the
re-identification is BLE-vision matching in TAR’s case.
Figure 6.10: Importance of tracking components in TAR (*offline solution).
Figure 6.11: Recall, precision, and FPS of state-of-the-art people detectors.
As Figure 6.9(b) shows, the two failure types contribute similarly in TAR. In the vision-only scenario, most errors come from the re-identification process. We further break down the re-identification failures for MCT+ReID into three types: (1) multi-camera error: a person is consistently recognized as someone else in the cameras after his first appearance; (2) single-camera error: a customer is falsely identified in one camera; (3) part-of-track error: a person is wrongly recognized for part of her track in one camera. From Figure 6.9(b), we can see that more than half of the ReID problems are of the cross-camera type, which is due to the MCT module that optimizes identity assignment across cameras: if a person is assigned an ID, she has a higher probability of getting the same ID in subsequent traces.
The root cause of the vision-based identification failures is the imperfect visual feature, which cannot accurately distinguish one person from another in some scenarios. From our observation, there are three cases in which the feature extractor can easily fail: (1) blurry images; (2) partial occlusion; (3) similar appearance. Figure 6.9(c) demonstrates each failure case in which two persons are recognized as the same customer by TAR. The figure also shows the percentage of each failure case in the test results. We can see that blurry and low-contrast figures lead to nearly half of the errors, and the other two types account for about 40% of the failed cases.
Figure 6.12: Importance of Identification Components in TAR
Similarity Metric        Accuracy (%)
DTW (used in TAR) 95.7
Euclidean Distance 88.0
Cosine Distance 84.9
Pearson Correlation 66.4
Spearman Correlation 72.5
Table 6.2: Accuracy (ratio of correct matches) of different trace similarity metrics
6.4.3.2 Importance of Different Components in TAR
Next, we analyze each component used in TAR.
People Detection. The people detector may fail in two ways: false positives, which recognize a non-person object as a person, and false negatives, which fail to recognize a real person. TAR can filter out false positives in the vision-BLE matching process. As for false negatives, people with more than 80% of their bodies occluded are rarely detected by the detection model; such cases are handled by TAR's tracking algorithm and track sewing mechanism, which we also evaluate. We evaluate the performance of current state-of-the-art open-sourced people detectors using our dataset and show the results in Figure 6.11. Besides Faster-RCNN (used by TAR), we also test Mask-RCNN [187], YOLO-9000 [255], and OpenPose [149]. We can see that YOLO and OpenPose have lower accuracy although they are fast. In contrast, Mask-RCNN is very accurate but too slow to meet TAR's requirements.
Trace matching. DTW plays the key role in matching BLE traces to vision traces. Therefore, we should
understand its effectiveness in TAR’s scenario. In the experiment, we compute the similarity between one
person’s walking trace and all nearby BLE traces to find the one with the highest similarity. The association
process succeeds if the ground-truth trace is matched; otherwise it fails. We calculate the number of correct matchings across the whole dataset and compute the successful linking ratio. Besides DTW, we also test other metrics including Euclidean distance, cosine distance, Pearson correlation coefficient [46], and Spearman's rank correlation [52]. The average matching ratio of each method is shown in Table 6.2, where DTW achieves the best accuracy.
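The matching experiment can be sketched as follows, assuming the visual movement trace and each BLE proximity trace have been resampled into numeric sequences over the same time window; the function names and resampling are illustrative, not TAR's code.

from fastdtw import fastdtw
from scipy.spatial.distance import euclidean, cosine
from scipy.stats import pearsonr, spearmanr

def dtw_similarity(visual, ble):
    # fastdtw returns (distance, warp path); a smaller distance means more similar
    distance, _ = fastdtw(visual, ble, dist=lambda a, b: abs(a - b))
    return -distance

SIMILARITY = {
    "DTW": dtw_similarity,
    "Euclidean": lambda v, b: -euclidean(v, b),
    "Cosine": lambda v, b: -cosine(v, b),
    "Pearson": lambda v, b: pearsonr(v, b)[0],
    "Spearman": lambda v, b: spearmanr(v, b)[0],
}

def best_match(visual_trace, ble_traces, metric="DTW"):
    # Pick the BLE ID whose proximity trace is most similar to the visual trace
    score = SIMILARITY[metric]
    return max(ble_traces, key=lambda ble_id: score(visual_trace, ble_traces[ble_id]))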
Visual Tracking. Visual tracking is crucial for estimating visual traces. As TAR develops its visual-tracking
algorithm based on DeepSORT [304], we want to see TAR’s performance improvement compared with
existing state-of-the-art tracking algorithms. Towards this end, we replace our visual tracking algorithm with
DeepSORT and LMP [286], which achieves the best tracking accuracy in the MOT16 challenge. Like DeepSORT, LMP uses a DNN for people re-identification, but it works offline, so it can leverage a posteriori knowledge of people's movement and use a lifted multi-cut algorithm to assign traces globally.
We calculate the IDF1 percentage of each choice in Figure 6.10. We can see from the first group of
bars that TAR’s visual tracking algorithm clearly outperforms DeepSORT by 10%. This is because TAR’s
visual tracking algorithm incorporates several optimizations, such as kinematic verification, that reduce ID switches. Moreover, TAR performs similarly to the variant with LMP as the tracker, which shows that our online tracking approach is comparable to the current state-of-the-art offline solution. LMP itself is not feasible for TAR since it works offline and slowly (0.5 FPS), while our usage scenario needs real-time processing.
We evaluate the contribution of the following components by removing each of them from TAR and showing the resulting accuracy change in Figure 6.12.
ID Assignment. An alternative solution for our ID assignment algorithm is to always choose the best (top-1)
confident candidate for each track. Thus, we compare our ID assignment to the top-1 scheme and show the
result in the second group of Figure 6.12. We can see that the top-1 scheme is almost 20% worse than TAR.
The reason is that the top-1 assignment often suffers from conflict errors, where different visual tracks are assigned the same ID. TAR, on the other hand, enforces one-to-one matching, which reduces such conflicts.
Track Sewing. If we remove the track sewing optimization, a person's fragmented tracks take much longer to be recognized, and some of them may be matched to wrong BLE IDs. The third group in Figure 6.12 proves this point: removing track sewing drops the accuracy by nearly 25% on the retail store dataset, which has frequent occlusions. In the evaluation, we find that the average number of distinct tracks of the same person is 1.8, and the maximum is 5.
BLE Proximity. Incorporating BLE proximity is fundamental to tracking and identifying users. To quantify its effectiveness, we calculate the accuracy with the BLE matching components removed, so that TAR relies only on cross-camera association and deep visual features to identify and track each user. The fourth group in Figure 6.12 shows that the accuracy drops by up to 35% without BLE's help.
Deep Feature. The deep feature is one of the core improvements in the visual tracking algorithm. The fifth group in Figure 6.12 shows that the accuracy drops by nearly 30% because removing the deep feature causes frequent ID switches in tracking. In this case, it is hard to compensate for the errors even with our other optimizations.
Cross-Camera Calibration. Our cross-camera calibration combines temporal-spatial relationships and deep feature similarity across cameras. To understand the impact of this optimization, we remove the component and evaluate TAR on the same dataset. The rightmost group in Figure 6.12 shows a 10% accuracy drop. Without cross-camera calibration, we find that the matching algorithm struggles to differentiate BLE proximity traces, because in some cases these traces exhibit similar patterns as users move around. For example, in the retail scenario, TAR tries to recognize a user seen in camera-1 as she leaves the store; meanwhile, another user seen in camera-2 is also moving out, but in a different direction. In this case, their BLE proximity traces are hard to distinguish with camera-1's information alone.
6.4.3.3 Robustness
Robustness is essential for any surveillance or tracking system because parts of the system might fail, e.g., one or more cameras or BLE receivers may stop working due to battery outage, camera damage, or bad lighting conditions. How do such failures affect overall performance? We focus on the system accuracy under node failures. Note that either a BLE failure or a video failure results in a node failure, because TAR needs both kinds of information for customer tracking. We therefore remove affected nodes randomly from TAR's network to simulate runtime failures. Figure 6.13 shows how performance degrades with the fraction of failed nodes. We can see that TAR still keeps more than 80% accuracy with half of the nodes down. The system is robust because each healthy node can identify and track customers by itself; the only loss from a failed node is the cross-camera part, which uses the temporal-spatial relationship to filter out invalid BLE IDs.
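The node-failure simulation can be sketched as follows; the evaluate() helper, which replays the recorded traces with only the surviving nodes and returns IDF1, is a placeholder.

import random

def simulate_failures(cameras, failure_ratio, evaluate, trials=10):
    # Average accuracy after randomly disabling a fraction of camera/BLE nodes
    scores = []
    for _ in range(trials):
        n_failed = int(round(len(cameras) * failure_ratio))
        failed = set(random.sample(cameras, n_failed))
        alive = [c for c in cameras if c not in failed]
        scores.append(evaluate(alive))    # replay the traces using only the healthy nodes
    return sum(scores) / len(scores)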
We also evaluate the relationship between the number of concurrently tracked people and the tracking accuracy (Figure 6.14). As the result shows, TAR's accuracy drops as more people are tracked, and stabilizes around 85% with 20 or more people. This is because there are no "new" trace patterns once all possible paths in each camera view are fully occupied, so adding more people does not introduce more uncertainty in trace matching.
Figure 6.13: Accuracy of TAR with different ratio of node failure (purple lines show the measured error).
Figure 6.14: Relationship between the tracking accuracy and the number of concurrently tracked people.
Chapter 7
Related Work
Our prior work spans several topics in the area of vision-based context sensing. In this chapter, we review related work in each field.
7.1 Outdoor Localization and GPS Error Mitigation
NLOS Mitigation Prior work has explored NLOS mitigation. These all differ from Gnome along one or
more of these dimensions: they either require specialized hardware, use simplistic or proprietary 3D models,
or have not demonstrated scalability to smartphones. NLOS signals are known to be the major cause of GPS
errors in urban canyons [98, 206]. Other work [167] has shown that ray-tracing a single reflection generally works as well as or better than ray-tracing multiple reflections. Early work [141] uses the width of a street and
the height of its buildings to build a simple model of an entire street as consisting of two reflective surfaces.
This model deviates from reality in most modern downtown areas, where building heights vary significantly.
To overcome this drawback, other work proposes specialized hardware to build models of reflective surfaces,
including stereo fish-eye cameras [234], LiDAR [110], or panoramic cameras [289].
A long line of work has explored using proprietary 3D models to compute the path inflation, or simply
to determine whether a satellite is within line of sight or not. One branch of this research uses 3D models to
determine and filter out NLOS satellites [243, 212, 164]. However, as we show, in our dataset, nearly 90%
of the GPS readings see fewer than 4 satellites, and removing NLOS satellites would render those readings
unusable. A complementary approach has explored, as Gnome does, correcting the NLOS path inflation.
Closest to our work is the line of work [265, 232, 194] that uses candidate positions like Gnome does, but
estimates the ground truth position using path similarity. As we have shown earlier, this approach performs
worse than ours. A second line of work [180, 298, 120] assumes that NLOS satellites generally have lower
carrier-to-noise density than LOS satellites. However, this may not hold in general and Gnome’s evaluation
shows that this approach also does not perform well.
Building height computation Several pieces of work have used techniques to build 3D models of buildings
from a series of 2D images, using a technique called structure-from-motion (SfM [239]). Our approach uses 3D
models made available from LiDAR devices mounted on Street View scanning vehicles. Other work has used
complementary methods to obtain the height of buildings. Building height information is publicly available
from government websites [92, 93] or 3rd party services [108, 111], and some work has explored building
3D models using images and building height information [181, 266]. However, these datasets have spotty
coverage across the globe. Recognizing this, other work [279, 151, 148, 147] estimates building heights using
Synthetic Aperture Radar (SAR) data generated with remote sensing technologies. This data also has uneven
coverage. Finally, one line of work [274, 273] analyzes the shape and size of building shadows to estimate building height. This approach needs the entire shadow to be visible on the ground with few obstructions. In downtown areas, the shadows of tall buildings fall on other buildings, so this approach cannot be used.
Complementary approaches to improving localization accuracy In recent years, the mobile computing
community has explored several complementary ways to improve location accuracy: using the phone’s internal
sensors to track the trajectory of a user [263]; using cameras [38] or fusing GPS readings with sensors,
dead-reckoning, map matching, and landmarks to position vehicles [145, 202]; using WiFi access point based
localization [271, 270] as well as camera-based localization [309]; and crowdsourcing GPS readings [121]
to estimate the position of a target. Other work [190, 189] has explored accurate differential GPS systems
which require satellite signal correlation across large areas and don’t work well in downtown areas. GPS
signals have also been used for indoor localization [236], and other work has explored improving trajectory
estimation [305, 306] by using map-matching to correct GPS readings. While map matching works well for
streets, it is harder to use for pedestrians. In contrast to this body of work, Gnome attacks the fundamental
problem in urban canyons: GPS error due to satellite signal reflections.
7.2 Roadside Landmark Discovery and Localization
Prior work in ubiquitous and pervasive computing deals with localizing objects or humans within the environment. Some of this work has explored localizing users using low-cost, energy-efficient techniques on mobile
devices: magnetic sensors [308], inertial and light sensors [311], inertial sensors together with wireless
fingerprints [188, 193], RF signals [238, 143, 241], mobility traces [209] and other activity fingerprints [215].
Other work has explored localizing a network of devices using low-cost RF powered cameras [235]. Many of
these techniques are largely complementary to ALPS, which relies on Street View images to localize common
landmarks. Perhaps closest to our work is Argus [310], which complements WiFi fingerprints with visual cues
from crowd-sourced photos to improve indoor localization. This work builds a 3-D model of an indoor setting
using advanced computer vision techniques, and uses this to derive geometric constraints. By contrast, ALPS
derives geometric constraints by detecting common landmarks using object detection techniques. Finally,
several pieces of work have explored augmenting maps with place names and semantic meaning associated
with places [166, 136]. ALPS derives place name to location mappings for common places with recognizable
logos.
Computer vision research has considered variants of the following problem: given a GPS-tagged database
of images and a query image, how to estimate the GPS position of the query image. This requires matching the image to the image(s) in the database, then deriving the position from the geo-tags of the matched images. Work in this area has used Street View images [316, 295], GIS databases [128], or images from Flickr [158].
The general approach is to match features, such as SIFT in the query image with features in the database of
images. ALPS is complementary to this line of work, since it focuses on enumerating common landmarks of a
given type. Because these landmarks have distinctive structure, we are able to use object detectors, rather than
feature matching.
Research has also considered another problem variant: given a set of images taken by a camera, finding
the position of the camera itself. This line of work [122, 218] attempts to match features in the images to a
database of geo-tagged images, or to a 3-D model derived from the image database. This is the inverse of our problem: given a set of geo-tagged images, find the position of an object in those images. Finally,
Baro et al. [137] propose efficient retrieval of images from an image database matching a given object. ALPS
goes one step further and actually positions the common landmark.
7.3 Cross-camera Person Re-identification and Tracking
Tracking with other sensors Multiple tracking technologies are available for person tracking. Stereo video systems [91, 118, 100, 276] use camera pairs to sense 3D information about the surroundings; however, the equipment is usually very expensive and hard to deploy. Thermal sensors [102] can sense the existence and position of people, but their tracking accuracy is also affected by occlusion, which makes it hard to count people accurately. Laser and structured-light systems [89] accurately infer the shape of people (usually at centimeter level), which makes them the most accurate solution for people counting; however, their short scanning range prevents continuous people tracking, so other supporting solutions like cameras are needed. On the other hand, Euclid Analytics [168] and Cisco Meraki [84] rely on WiFi MAC addresses to track customers entering and exiting stores, but this technology requires customers to enable WiFi and suffers from poor location accuracy. Swirl [285] and InMarket [198] use pure BLE beacons to count customers, but the proximity-based approach falls far short of the accuracy required to track people. Different from those approaches, TAR combines vision and BLE proximity to not only track people at scale but also identify them.
Single camera tracking. Tracking by detection has emerged as the leading strategy for multiple object
tracking. Prior work uses global optimization that processes entire video batches to find object trajectories; popular frameworks include flow network formulations [320, 244, 140], probabilistic graphical models [315, 314, 126, 231], and large temporal windows (e.g., [172, 213]). However, due to the nature of batch processing, they are not applicable to real-time processing, where no future information is available.
SORT [142] significantly improves the tracking speed by applying Kalman filter based tracking and
achieves fairly good accuracy. However, it performs poorly during occlusions. DeepSORT [304] improves on SORT by introducing deep neural network features of people. The algorithm works well for high-quality video, where deep features are more distinguishable; for low-light video, the features become hard to distinguish and the performance degrades significantly. Different from SORT and DeepSORT, TAR uses the full scope of information, including motion and appearance, and trains a more general convolutional neural network for feature extraction. These improvements increase robustness against detection false negatives and occlusions.
Cross-camera people tracking. Traditional multicamera tracking algorithms like POM [172] and KSP [140]
rely on homography transformations among different cameras, which require accurate camera angle and position measurements and overlapping views. However, for most cameras the exact angle and position are not known, and the scenes often do not overlap. Moreover, those algorithms need global trajectories for tracking, which is not suitable for online tracking. In contrast, TAR has none of these requirements: it utilizes various context information as well as Bluetooth signals to re-identify people across cameras.
Several approaches [313, 237, 280] track multiple people across cameras but require overlapping cameras, which may be too restrictive for most surveillance systems. In non-overlapping scenarios, other approaches [262, 290, 153] leverage the visual similarity between people's traces in different cameras to match them, and also run a bipartite matching algorithm globally to minimize the ID assignment error. However, these approaches achieve their best performance only offline and are unsuitable for online processing.
7.4 Targeted Advertising and Cashier-free Shopping
Commercial cashier-free systems We are not aware of published work on end-to-end design and evaluation
of cashier-free shopping. Amazon Go was the first commercial solution for cashier-free shopping. Several
other companies have deployed demo stores like Standard Cognition [53], Taobao [55], and Bingobox [6].
Amazon Go and Standard Cognition use computer vision to determine shopper-to-item association ([25, 2, 35]).
Amazon Go does not use RFID [2] but needs many ceiling-mounted cameras. Imagr [31] uses a camera-equipped cart to recognize the items the user puts into the cart. Alibaba and Bingobox use RFID readers to scan all items held by the customer at a "checkout gate" ([9, 10]). Grab incorporates many of these elements in
its design, but uses a judicious combination of complementary sensors (vision, RFID, weight scales). Action
detection is an alternative approach to identifying shopping actions. Existing state-of-the-art DNN-based
solutions [205, 283] have not yet been trained for shopping actions, so their precision and recall in our setting are low.
Item detection and tracking Prior work has explored item identification using Google Glass [182] but such
devices are not widely deployed. RFID tag localization can be used for item tracking [272, 201] but that line
of work does not consider frequent tag movements, tag occlusion, or other adversarial actions. Vision-based
object detectors [187, 156, 257] can be used to detect items, but need to be trained for shopping items and can
be ineffective under occlusions and poor lighting. Single-instance object detection scales better for training
items but has low accuracy [207, 191].
7.5 Complex Activity Detection
Existing pipelines detect the person’s location (bounding box) in the video, extract representative features
for the person’s tube, and output the probability of each action label. Recent pipelines [293, 277, 323]
leverage DNNs to extract features from person tubes and predict actions. Other work [179, 221] estimates
human behavior by detecting the head and hand positions and analyzing their relative movement. Yet others
analyze the moving trajectories of objects near a person to predict the interaction between the person and the
object [229, 124]. By doing this, the action detector can describe more complex actions. The above approaches
achieve high accuracy only with sufficient training samples, which limits their applicability to more complex activities that involve multiple subjects and long durations.
Rather than analyzing a single actor’s frames, other complementary approaches [131, 123] present their
solutions to detect group behaviors such as "walk in group", "stand in queue", and "talk together". The approach is to build a monolithic model that takes as input both the behavior feature of each actor and the spatial-temporal relation (e.g., distance change), and outputs the action label. The model could be a recurrent neural network ([131]) or a handcrafted linear-programming model ([123]). However, both models require training videos to work properly, rendering these approaches unsuitable for Caesar. A more recent approach [173] leverages face and object detection to extract meaningful objects in video clips, and then associates the objects with the video's annotations using spatiotemporal relationships. The user can search for clips that contain specific object-annotation combinations. This system works offline and does not support mobile devices. It also lacks Caesar's extensible vocabulary, which includes object movement and atomic activities.
Zero-shot action detection is closely related to Caesar, and targets near real-time detection even when
there are very few samples for training. Some approaches [312, 248] train a DNN-based feature extractor
with videos and labels. The feature extractor can generate similar outputs for the actions that share similar
attributes. When an unknown action tube arrives, these approaches cluster it with existing labels, and evaluate
its similarity with the few positive samples. Another approach [175] further decomposes an action query
sentence into meaningful keywords that have corresponding feature clusters, and waits for those clusters to
be matched together at runtime. However, these zero-shot detection approaches suffer from limited vocabulary
and low accuracy (<40%).
7.6 Wireless Camera Networks
Wang et al. [299] discuss an edge-computing based approach in which a group of camera-equipped drones
efficiently livestream a sporting event. However, they focus on controlling drone movements and efficiently
transmitting video frames over a shared wireless medium in order to maintain good-quality live video streaming with low end-to-end latency. Other work [129] presents a new FPGA architecture and a communication protocol for that architecture to efficiently transmit images in a wireless camera network. San Miguel et al. [269] present a vision of a smart multi-camera network and the required optimizations and properties,
but discuss no specific detection techniques. A related work [260] proposes a method for re-configuring the
camera network over time based on the description of the camera nodes, specifications of the area of interest
and monitoring activities, and a description of the analysis tasks. Finally, MeerKats [146] uses different image
acquisition policies with resource management and adaptive communication strategies. No other prior work
has focused on cross-camera complex activity detection, as Caesar has.
7.7 Scaling DNN Pipelines
More and more applications rely on a chain of DNNs running on edge clusters. This raises challenges for
scaling well with a fixed amount of computational resources. Recent work [318] addresses the problem by tuning different performance settings (frame rate and resolution) for task queries to maximize server utilization while maintaining quality of service. Downgrading frame rates and DNNs is not a good choice for Caesar because both options adversely impact accuracy. Other work [197] proposes a scheduler on top of TensorFlow Serving [56] to improve GPU utilization across different DNNs. Caesar could leverage such a model serving system, but is complementary to it. Recent approaches [159, 216] cache intermediate results to save GPU cycles; Caesar goes one step further by lazily activating the action DNN. [159] also batches the
input for higher per-image processing speed on GPU, which Caesar also adopts to perform object detection on
the mobile device.
Chapter 8
Conclusions and Takeaways
In Chapter 2, we discuss Gnome as a practical and deployable method for correcting GPS errors resulting
from non-line-of-sight satellites. Our approach uses publicly available 3D models, but augments them with
the height of buildings estimated from panoramic images. We also develop a robust method to estimate
the ground truth location from candidate positions, and an aggressive precomputation strategy and efficient
search methods to enable our system to run efficiently entirely on a smartphone. Results from cities in North
America, Europe, and Asia show 6-8m positioning error reductions over today’s highly optimized smartphone
positioning systems.
In Chapter 3, we discuss ALPS, which achieves accurate, high coverage positioning of common landmarks
at city-scales. It uses novel techniques for scaling (adaptive image retrieval) and accuracy (increasing
confidence using zooming, disambiguating landmarks using clustering, and least-squares regression to deal
with sensor error). ALPS discovers over 92% of Subway restaurants in several large cities and over 95% of
hydrants in a single zip-code, while localizing 93% of Subways and 87% of hydrants with an error less than
10 meters.
In Chapter 4, we discuss Caesar, a hybrid multi-camera complex activity detection system that combines
traditional rule based activity detection with DNN-based activity detection. Caesar supports an extensible
vocabulary of actions and spatiotemporal relationships and users can specify complex activities using this
vocabulary. To satisfy the network bandwidth and low latency requirements for near real-time activity detection
with a set of non-overlapping (possibly wireless) cameras, Caesar partitions activity detection between a
camera and a nearby edge cluster that lazily retrieves images and lazily invokes DNNs. Through extensive
evaluations on a public multi-camera dataset, we show that Caesar achieves high precision and recall with accurate DNN models, while keeping the bandwidth and GPU usage an order of magnitude lower than a strawman solution that does not incorporate its performance optimizations. Caesar also reduces the energy
consumption on the mobile nodes by 7×.
In Chapter 5, we discuss Grab, a cashier-free shopping system that uses a skeleton-based pose tracking
DNN as a building block, but develops lightweight vision processing algorithms for shopper identification
and tracking, and uses a probabilistic matching technique for associating shoppers with items they purchase.
Grab achieves over 90% precision and recall in a data set with up to 40% adversarial actions, and its efficiency
optimizations can reduce investment in computing infrastructure by up to 4×.
In Chapter 6, we discuss TAR, a system that utilizes existing surveillance cameras and ubiquitous BLE
signals to precisely identify and track people in retail stores. In TAR, we have first designed a single-camera
tracking algorithm that accurately tracks people, and then extended it to the multi-camera scenario to recognize
people across distributed cameras. TAR leverages BLE proximity information, cross-camera movement
patterns, and single-camera tracking algorithm to achieve high accuracy of multi-camera multi-people tracking
and identification. We have implemented and deployed TAR in two realistic retail-shop settings, and then conducted extensive experiments with more than 20 people. Our evaluation results demonstrate that TAR
delivers high accuracy (90%) and serves as a practical solution for people tracking and identification.
8.1 Guidelines for Vision-Based Context Sensing Systems
In this section, we will summarize all the lessons learned from prior work in designing efficient vision-based
context sensing systems. At high level, a general process of designing such a system contains three steps: (1)
Decide the data source and the hardware platforms used for the task; (2) Design the system architecture and
major logic; (3) Design each algorithm and implement the system.
Data Source and Platforms It is costly to set up new sensors for collecting vision data at large scale with
organized context. Our prior work shows that leveraging existing sensing infrastructure and imagery data sources is a good way to scale the data collection process. Specifically, one can use widely-deployed surveillance cameras for video collection, and street-level imagery like Google Street View and Bing Streetside to collect images across the world. The surveillance cameras and street-level imagery usually contain location information, and the images also contain accurate camera pose information such as bearing and tilt, which can help many sensing tasks when analyzed together with map information. For sensing platforms, we suggest that each available sensor be evaluated in terms of its cost, sampling rate, data type, and data quality, and be selected to complement cameras in sensing tasks. Cameras, as the major source of vision data, should be carefully analyzed and deployed, because camera placement, environmental lighting, and image quality strongly affect the end-to-end accuracy.
Architecture Design Researchers should try to use a single context-extraction model in a system, instead of developing separate solutions for different parts of a task. For example, Grab builds all person-behavior detectors around the keypoint extraction model, and Caesar uses a high-level action graph abstraction supported by various elements in the vocabulary. For the system's performance bottleneck, our prior work shows that spreading the bottleneck workload among different hardware components is effective: redundant CPU cycles can be used to handle tasks that originally require a GPU, and mobile devices and edge servers can split the computation according to their local hardware limitations.
Algorithm Choices Complex neural-net-based vision algorithms are commonly applied to vision sensing tasks. However, they are heavy in processing time and their accuracy requires sufficient training data. Our experience shows that, in many cases, researchers can apply much cheaper algorithms while maintaining good accuracy (e.g., joint tracking between key frames in Grab). Researchers can also use task- and data-specific intuition to optimize the usage of expensive neural nets. For instance, Caesar lazily matches action graphs to minimize GPU usage, and ALPS saves processing time with a two-stage candidate filtering scheme.
8.2 Potential Extensions
For outdoor pedestrian localization, further effort could be used to optimize the energy usage and storage
requirements of Gnome on Android phones, and test it more extensively in urban canyons in other major cities
of the world.
For outdoor object positioning, potential future work includes documenting large cities with ALPS, and
extending it to common landmarks that may be set back from the street yet visible in Street View, such as
transmission or radio towers, and integrating Bing Streetside to increase coverage and accuracy.
For vision-only behavior sensing, future work includes extending Caesar to re-identify more targets such as cars and bags. Moreover, the activity graph could support an extended vocabulary that includes time restrictions, locations, etc.
For shopping behavior sensing (vision + sensor), much future work remains including obtaining results
from longer-term deployments, improvements in robust sensing in the face of adversarial behavior, and
exploration of cross-camera fusion to improve the accuracy of TAR and Grab even further.
References
[1] Amazon Go. https://www.amazon.com/b?node=16008589011.
[2] Amazon Go is Finally a Go. https://bit.ly/2OEypnc.
[3] Amcrest 720P Security Wireless IP Camera. https://amcrest.com/
amcrest-720p-wifi-video-security-ip-camera-pt.html.
[4] Arduino Uno MCU. https://store.arduino.cc/usa/arduino-uno-rev3.
[5] AVA Actions Dataset. https://research.google.com/ava/.
[6] Bingobox. https://bit.ly/34esmMT.
[7] Cashier Free Shopping is the Trend. https://www.thirdchannel.com/mind-the-store/
adopting-new-technologies-in-store.
[8] Checkout Time Limit Around Four Minutes. http://www.retailwire.com/discussion/
checkout-time-limit-around-four-minutes/.
[9] China’s Brick-and-Mortar Stores Fight Back with Tech. https://bit.ly/2O93C2W.
[10] China’s Convenience Stores No Longer Need People. http://technode.com/2017/07/04/
chinas-convenience-stores-no-longer-need-people-bingobox/.
[11] COCO Challenge. http://cocodataset.org/.
[12] Dawn of the Smart Surveillance Camera. https://www.zdnet.com/article/
dawn-of-the-smart-surveillance-camera/.
[13] Duke Multi-Target, Multi-Camera Tracking Project. http://vision.cs.duke.edu/DukeMTMC/.
[14] Dynamic time warping. https://en.wikipedia.org/wiki/Dynamic_time_warping.
[15] Face Feature with Deep Neural Network. https://github.com/ageitgey/face_recognition/.
[16] Facebook Location Targeting. https://www.facebook.com/business/a/location-targeting.
[17] FastDTW. https://pypi.python.org/pypi/fastdtw.
[18] Feature Points Matching.http://opencv-python-tutroals.readthedocs.org/en/latest/py_
tutorials/py_feature2d/py_matcher/py_matcher.html.
[19] Flask Framework. http://flask.pocoo.org/.
[20] Google Earth Pro. https://www.google.com/earth/desktop/.
[21] Google Street View. https://www.google.com/maps/streetview/.
[22] Gpx Logger for Android. https://play.google.com/store/apps/details?id=com.
eartoearoak.gpxlogger&hl=en.
[23] GRPC: A High-performance, Open-source Universal RPC Framework. https://grpc.io/.
[24] Homography. https://ags.cs.uni-kl.de/fileadmin/inf_ags/3dcv-ws11-12/3DCV_
WS11-12_lec04.pdf.
[25] How ’Amazon Go’ works. http://bit.ly/2Rud9Dv.
[26] How beacons will influence billions in us retail sales. http://www.businessinsider.com/beacons-impact-billions-in-
reail-sales-2015-2.
[27] How Nordstrom Uses Wifi To Spy On Shoppers. https://www.forbes.com/sites/petercohan/2013/05/09/how-
nordstrom-and-home-depot-use-wifi-to-spy-on-shoppers.
[28] How Retail Stores Track You Using Your Smartphone. https://lifehacker.com/how-retail-stores-track-you-using-
your-smartphone-and-827512308.
[29] How Smart Shelf Technology is Reshaping the Retail Industry. https://www.datexcorp.com/
smart-shelf-technology-reshaping-retail-industry/.
[30] HX711 ADC chip. https://www.sparkfun.com/products/13879.
[31] Imagr. http://www.imagr.co/.
[32] In Retail Small is Big. https://www.rangeme.com/blog/in-retail-small-is-big/.
[33] Interior Wants Wi-Fi At Burning Man. https://www.nextgov.com/cio-briefing/2018/04/
interior-wants-wi-fi-burning-man/147852/.
[34] iOS Core Bluetooth. https://developer.apple.com/documentation/corebluetooth.
[35] Is The End Of The Checkout Near? https://www.forbes.com/sites/neilstern/2017/08/25/
is-the-end-of-the-checkout-near/#59e87b145d21.
[36] Kinetic Lite Gps for Ios. https://itunes.apple.com/us/app/kinetic-lite-gps/
id390946616?mt=8.
[37] Making Self-Checkout Work: Learning From Albertsons. https://www.forbes.com/sites/
bryanpearson/2016/12/06/making-self-checkout-work-learning-from-albertsons/
#5324dc8c7c11.
[38] Mobileye Auto Mapping Technology. http://gpsworld.com/gm-volkswagen-to-use-mobileye-auto-mapping-technology/.
[39] MOT: Multiple Object Tracking Benchmark. https://motchallenge.net/.
[40] MPII Human Pose Dataset. http://human-pose.mpi-inf.mpg.de/.
[41] MTMCT on MOT Challenge. https://motchallenge.net/data/DukeMTMCT/.
[42] NATIONAL RETAIL SECURITY SURVEY 2018. https://nrf.com/resources/retail-library/
national-retail-security-survey-2018.
[43] Nvidia GeForce RTX 2080. https://www.nvidia.com/en-us/geforce/graphics-cards/
rtx-2080/.
[44] Nvidia Jetson TX2. https://www.nvidia.com/en-us/autonomous-machines/
embedded-systems/jetson-tx2/.
[45] OpenPose. https://github.com/CMU-Perceptual-Computing-Lab/openpose.
[46] Pearson Correlation Coefficient. https://en.wikipedia.org/wiki/Pearson_correlation_
coefficient.
[47] Redis. https://redis.io/.
[48] Self-Checkout at Walmart Results in More Theft Charges. http://www.amacdonaldlaw.com/blog/
2014/03/self-checkout-at-wal-mart-results-in-more-theft-charges.shtml.
[49] Site Modeling in Sketchup From Google Earth. https://youtu.be/nVhM3IYMF8o.
[50] SkyRec. http://www.skyrec.cc/eng.html.
[51] Sparkfun RFID Module and Antenna. https://www.sparkfun.com/products/14066,https://
www.sparkfun.com/products/14131.
[52] Spearman’s Rank Correlation Coefficient. https://en.wikipedia.org/wiki/Spearman%27s_rank_
correlation_coefficient.
[53] Standard Cognition. https://www.standardcognition.com/.
[54] Supermarket Scanner Errors. https://abcnews.go.com/GMA/ConsumerNews/
avoid-costly-supermarket-grocery-store-scanner-mistakes/story?id=11619157.
[55] Taobao Cashier-Free Demo Store. http://mashable.com/2017/07/12/taobao-china-store/
#57OXUGTQVaqg.
[56] TensorFlow Serving. https://github.com/tensorflow/serving.
[57] u-blox GPS module. https://www.u-blox.com/en/product/c94-m8p.
[58] UCF 101 Action Dataset. https://www.crcv.ucf.edu/data/UCF101.php.
[59] VIRAT Action Dataset. http://www.viratdata.org/.
[60] What’s Mobile’s Influence In-Store. https://www.marketingcharts.com/industries/
retail-and-e-commerce-65972.
[61] WHAT’S WRONG WITH PUBLIC VIDEO SURVEILLANCE? https://www.aclu.org/other/
whats-wrong-public-video-surveillance.
[62] Why Retailers Must Adopt a Technology Mind-set. http://www.mckinsey.com/~/media/mckinsey/
dotcom/client_service/bto/pdf/mobt32_10-13_niemeierinterview_r5.ashx.
[63] Will Smart Shelves Ever be Smart Enough for Kroger and Other Retailers? http://www.retailwire.
com/discussion/will-smart-shelves-ever-be-smart-enough-for-kroger-and/
-other-retailers/.
[64] Find a Parking Space Online. https://www.technologyreview.com/s/410505/
find-a-parking-space-online/, 2008.
[65] High-Tech Web Mapping Helps City of New York’s Fire Department Before Emergencies. http://www.esri.
com/news/arcnews/fall10articles/new-york-fire-dept.html, 2010.
[66] Fire Mapping: Building and Maintaining Datasets in ArcGIS. http://www.esri.com/library/ebooks/
fire-mapping.pdf, 2012.
[67] Los Angeles County Fire Hydrant Layer. http://egis3.lacounty.gov/dataportal/2012/05/23/
los-angeles-county-fire-hydrant-layer/, 2012.
[68] Firefighters Searching for Hydrants. http://patch.com/connecticut/danbury/
firefighters-searching-for-hydrants, 2013.
[69] Twitter Mobile Ads. https://business.twitter.com/en/advertising/mobile-ads-companion.html, 2013.
[70] ILSVRC2015 Results. http://image-net.org/challenges/LSVRC/2015/results, 2015.
[71] Bing Maps API. https://msdn.microsoft.com/en-us/library/dd877180.aspx, 2016.
[72] Bing Streetside. https://www.bing.com/mapspreview, 2016.
[73] Cities Unlocked. http://www.citiesunlocked.org.uk/, 2016.
[74] Flickr API. https://www.flickr.com/services/api/, 2016.
[75] Google Place API. https://developers.google.com/places/web-service/search, 2016.
[76] Google Street View Image API. https://developers.google.com/maps/documentation/
streetview/, 2016.
[77] LIBSVM. https://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2016.
[78] OpenStreetMap. http://www.openstreetmap.org/, 2016.
[79] Subway. http://www.subway.com/, 2016.
[80] Understand Street View. https://www.google.com/maps/streetview/understand/, 2016.
[81] US Lane Width. http://safety.fhwa.dot.gov/geometric/pubs/mitigationstrategies/
chapter3/3_lanewidth.cfm, 2016.
[82] US Sideway Guideline. http://www.fhwa.dot.gov/environment/bicycle_pedestrian/
publications/sidewalks/chap4a.cfm, 2016.
[83] Yelp API. https://www.yelp.com/developers/documentation/v2/search_api, 2016.
[84] Cisco Meraki . https://meraki.cisco.com/, 2017.
[85] Accenture. https://www.accenture.com, 2017.
[86] Altbeacon. http://altbeacon.org/, 2017.
[87] Amazon Go. http://euclidanalytics.com/, 2017.
[88] Apple iBeacon. https://developer.apple.com/ibeacon/, 2017.
[89] Bea inc. https://www.beainc.com/en/technologies/, 2017.
[90] Bluetooth Le: Broadcast. https://www.bluetooth.com/what-is-bluetooth-technology/how-it-works/le-broadcast ,
2017.
[91] brickstream. http://www.brickstream.com/, 2017.
[92] Building Information of Los Angeles. http://geohub.lacity.org/datasets/, 2017.
[93] Building Information of New York. http://www1.nyc.gov/nyc-resources/service/2266/
property-deeds-and-other-documents, 2017.
[94] Deloitte. https://www2.deloitte.com, 2017.
[95] Eddystone Beacon. https://developers.google.com/beacons/, 2017.
[96] Extract Depth Maps From Google Street View. https://0xef.wordpress.com/2013/05/01/
extract-depth-maps-from-google-street-view/, 2017.
[97] Google Streetview. https://www.google.com/streetview/, 2017.
[98] Gps and Gnss for Geospatial Professionals. https://www.e-education.psu.edu/geog862/, 2017.
[99] Gps Module with Dead Reckoning. http://bit.ly/2uxV1zH, 2017.
[100] hella. http://www.hella.com/microsite-electronics/en/Sensors-94.html, 2017.
[101] How beacons can reshape retail marketing. https://www.thinkwithgoogle.com/articles/
retail-marketing-beacon-technology.html, 2017.
[102] irisys. http://www.irisys.net/, 2017.
[103] Kalman Filter. https://en.wikipedia.org/wiki/Kalman_filter, 2017.
[104] Map Matching.https://blog.mapbox.com/matching-gps-traces-to-a-map-73730197d0e2,
2017.
[105] Mobile Ads. https://www.technologyreview.com/s/538731/how-ads-follow-you-from-phone-to-desktop-to-tablet/,
2017.
[106] Nmea Protocol. https://en.wikipedia.org/wiki/NMEA_0183, 2017.
[107] Online and In-store Shopping: Competition or Complements? http://genyu.net/2014/12/09/
online-and-in-store-shopping-competition-or-complements/, 2017.
[108] Open Street Map. https://www.openstreetmap.org/, 2017.
[109] Person re-identification. https://github.com/D-X-Y/caffe-reid, 2017.
[110] Position Estimation using Non-line-of-sight Gps Signals. http://bit.ly/38DEUix, 2017.
[111] Propertyshark: Real-estate Data Source. https://www.propertyshark.com/mason/, 2017.
[112] Raw Gnss Measurements on Android.https://developer.android.com/guide/topics/sensors/
gnss.html, 2017.
[113] Ray Tracing. https://www.cs.unc.edu/~rademach/xroads-RT/RTarticle.html, 2017.
[114] Shopping Easier with Store App. https://corporate.target.com/article/2017/06/sean-murphy-target-app , 2017.
[115] Study Finds Shoppers Prefer Brick-And-Mortar Stores to Amazon and
EBay. https://www.forbes.com/sites/barbarathau/2014/07/25/
report-amazons-got-nothing-on-brick-and-mortar-stores/#7557597262d0, 2017.
[116] The Retailer’s Dilemma: a Brick-and-Mortar or Brand Problem. https://go.forrester.com/
ep06-the-retailers-dilemma/, 2017.
[117] Wifi Based Localization on Android. https://gigaom.com/2012/11/29/
android-app-toggles-wi-fi-based-on-location-no-gps-needed, 2017.
[118] xovis. https://www.xovis.com/en/xovis/, 2017.
[119] Zed Stereo Camera. https://www.stereolabs.com/zed/, 2017.
[120] Mounir Adjrad and Paul D Groves. Enhancing Conventional Gnss Positioning with 3d Mapping Without Accurate
Prior Knowledge. The Institute of Navigation, 2015.
[121] Ioannis Agadakos, Jason Polakis, and Georgios Portokalidis. Techu: Open and Privacy-preserving Crowdsourced
Gps for the Masses. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications,
and Services, pages 475–487. ACM, 2017.
[122] Pratik Agarwal, Wolfram Burgard, and Luciano Spinello. Metric localization using google street view. In 2015
IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September
28 - October 2, 2015, pages 3111–3118, 2015.
[123] Mohamed Rabie Amer, Peng Lei, and Sinisa Todorovic. Hirf: Hierarchical Random Field for Collective Activity
Recognition in Videos. In European Conference on Computer Vision, pages 572–585. Springer, 2014.
[124] Boulbaba Ben Amor, Jingyong Su, and Anuj Srivastava. Action Recognition using Rate-invariant Analysis of
Skeletal Shape Trajectories. IEEE transactions on pattern analysis and machine intelligence, 38(1):1–13, 2016.
[125] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. Openface: A General-purpose Face Recognition
Library with Mobile applications. CMU School of Computer Science, 2016.
[126] Anton Andriyenko, Konrad Schindler, and Stefan Roth. Discrete-continuous optimization for multi-target tracking.
In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1926–1933. IEEE, 2012.
[127] Dragomir Anguelov, Carole Dulong, Daniel Filip, Christian Frueh, Stéphane Lafon, Richard Lyon, Abhijit Ogale,
Luc Vincent, and Josh Weaver. Google Street View: Capturing the World At Street Level. Computer, 43(6):32–38,
2010.
[128] Shervin Ardeshir, Amir Roshan Zamir, Alejandro Torroella, and Mubarak Shah. GIS-assisted Object Detection and
Geospatial Localization. In Computer Vision–ECCV 2014, pages 602–617. Springer, 2014.
[129] Syed Mahfuzul Aziz and Duc Minh Pham. Energy Efficient Image Transmission in Wireless Multimedia Sensor
Networks. IEEE communications letters, 17(6):1084–1087, 2013.
[130] Seung-Hwan Bae and Kuk-Jin Yoon. Confidence-based Data Association and Discriminative Deep Appearance
Learning for Robust Online Multi-object Tracking. IEEE transactions on pattern analysis and machine intelligence,
40(3):595–610, 2018.
[131] Timur Bagautdinov, Alexandre Alahi, François Fleuret, Pascal Fua, and Silvio Savarese. Social Scene Understanding:
End-to-end Multi-person Action Localization and Collective Activity Recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4315–4324, 2017.
[132] Song Bai, Xiang Bai, and Qi Tian. Scalable person re-identification on supervised smoothed manifold. In CVPR,
2017.
[133] Slawomir Bak. Human re-identification through a video camera network. PhD thesis, Université Nice Sophia
Antipolis, 2012.
[134] Vahid Balali, Elizabeth Depwe, and Mani Golparvar-Fard. Multi-class traffic sign detection and classification using
google street view images. In Transportation Research Board 94th Annual Meeting, Transportation Research Board,
Washington, DC, 2015.
[135] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Openface: an Open Source Facial Behavior
Analysis toolkit. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on, 2016.
[136] Xuan Bao, Bin Liu, Bo Tang, Bing Hu, Deguang Kong, and Hongxia Jin. PinPlace: Associate Semantic Meanings
with Indoor Locations Without Active Fingerprinting. In Proceedings of the 2015 ACM International Joint
Conference on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 921–925, New York, NY, USA, 2015.
ACM.
[137] Xavier Baró, Sergio Escalera, Petia Radeva, and Jordi Vitrià. Generic object recognition in urban image databases.
In Artificial Intelligence Research and Development, Proceedings of the 12th International Conference of the
Catalan Association for Artificial Intelligence, CCIA 2009, October 21-23, 2009, Vilar Rural de Cardona (El Bages),
Cardona, Spain, pages 27–34, 2009.
[138] Michael Batty, Kay W Axhausen, Fosca Giannotti, Alexei Pozdnoukhov, Armando Bazzani, Monica Wachowicz,
Georgios Ouzounis, and Yuval Portugali. Smart cities of the future. The European Physical Journal Special Topics,
214(1):481–518, 2012.
[139] Miguel Angel Bautista, Antonio Hernández-Vela, Victor Ponce, Xavier Perez-Sala, Xavier Baró, Oriol Pujol,
Cecilio Angulo, and Sergio Escalera. Probability-based Dynamic Time Warping for Gesture Recognition on Rgb-d
data. In Advances in Depth Image Analysis and Applications. Springer, 2013.
[140] Jerome Berclaz, Francois Fleuret, Engin Turetken, and Pascal Fua. Multiple object tracking using k-shortest paths
optimization. IEEE transactions on pattern analysis and machine intelligence, 33(9):1806–1819, 2011.
[141] David Bétaille, François Peyret, and Miguel Ortiz. How to Enhance Accuracy and Integrity of Satellite Positioning
for Mobility Pricing in Cities: The Urban Trench Method. In Transport Research Arena TRA 2014, page 8p, 2014.
[142] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In
Image Processing (ICIP), 2016 IEEE International Conference on, pages 3464–3468. IEEE, 2016.
[143] Jacob T. Biehl, Matthew Cooper, Gerry Filby, and Sven Kratz. LoCo: A Ready-to-deploy Framework for Efficient
Room Localization Using Wi-Fi. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and
Ubiquitous Computing, UbiComp ’14, pages 183–187, New York, NY, USA, 2014. ACM.
[144] Andria Bilich and Kristine M Larson. Mapping the Gps Multipath Environment using the Signal-to-noise Ratio
(snr). Radio Science, 42(6), 2007.
[145] Cheng Bo, Xiang-Yang Li, Taeho Jung, Xufei Mao, Yue Tao, and Lan Yao. Smartloc: Push the Limit of the
Inertial Sensor Based Metropolitan Localization using Smartphone. In Proceedings of the 19th annual international
conference on Mobile computing & networking, pages 195–198. ACM, 2013.
[146] J Boice, X Lu, C Margi, G Stanek, G Zhang, R Manduchi, and K Obraczka. Meerkats: A Power-aware, Self-
managing Wireless Camera Network for Wide Area Monitoring. In Proc. Workshop on Distributed Smart Cameras,
pages 393–422. Citeseer, 2006.
[147] Dominik Brunner, Guido Lemoine, and Lorenzo Bruzzone. Extraction of Building Heights From Vhr Sar Imagery
using an Iterative Simulation and Match Procedure. In Geoscience and Remote Sensing Symposium, 2008. IGARSS
2008. IEEE International, volume 4, pages IV–141. IEEE, 2008.
[148] Dominik Brunner, Guido Lemoine, and Lorenzo Bruzzone. Building Height Retrieval From Airborne Vhr Sar
Imagery Based on an Iterative Simulation and Matching Procedure. In Proc. of SPIE Vol, volume 7110, pages
71100F–1, 2008.
[149] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime Multi-person 2d Pose Estimation using Part
Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
7291–7299, 2017.
[150] Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? a New Model and the Kinetics Dataset. In
proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[151] François Cellier and Elise Colin. Building Height Estimation using Fine Analysis of Altimetric Mixtures in Layover
Areas on Polarimetric Interferometric X-band Sar Images. In Geoscience and Remote Sensing Symposium, 2006.
IGARSS 2006. IEEE International Conference on, pages 4004–4007. IEEE, 2006.
[152] Dongyao Chen, Kang G Shin, Yurong Jiang, and Kyu-Han Kim. Locating and Tracking Ble Beacons with
Smartphones. 2017.
[153] Weihua Chen, Lijun Cao, Xiaotang Chen, and Kaiqi Huang. An Equalized Global Graph Model-based Approach for
Multicamera Object Tracking. IEEE Transactions on Circuits and Systems for Video Technology, 27(11):2367–2381,
2017.
[154] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for
person re-identification. In Proc. CVPR, 2017.
[155] Xinlei Chen and Abhinav Gupta. An implementation of faster rcnn with study for region sampling. arXiv preprint
arXiv:1702.02138, 2017.
[156] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain Adaptive Faster R-cnn for
Object Detection in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3339–3348, 2018.
[157] Christopher J Comp and Penina Axelrad. Adaptive Snr-based Carrier Phase Multipath Mitigation Technique. IEEE
Transactions on Aerospace and Electronic Systems, 34(1):264–276, 1998.
[158] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the World’s Photos. In
Proceedings of the 18th International Conference on World Wide Web, WWW ’09, pages 761–770, New York, NY,
USA, 2009. ACM.
[159] Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. Clipper: A
Low-latency Online Prediction Serving System. In 14th USENIX Symposium on Networked Systems Design and
Implementation (NSDI 17), pages 613–627, 2017.
[160] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA,
USA, pages 886–893, 2005.
[161] Pasquale Daponte, Luca De Vito, Gianluca Mazzilli, Francesco Picariello, Sergio Rapuano, and Maria Riccio.
Metrology for drone and drone for metrology: measurement systems on small civilian drones. In Proc. of 2nd Int.
Workshop on Metrology for Aerospace, pages 316–321, June 2015.
[162] Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L Massart. The mahalanobis distance. Chemometrics
and intelligent laboratory systems, 50(1):1–18, 2000.
[163] Christopher Drane, Malcolm Macnaughtan, and Craig Scott. Positioning Gsm Telephones. IEEE Communications
magazine, 36(4):46–54, 1998.
[164] Vincent Drevelle and Philippe Bonnifait. Igps: Global Positioning in Urban Canyons with Road Surface Maps.
IEEE Intelligent Transportation Systems Magazine, 4(3):6–18, 2012.
[165] Hugh F. Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part I. IEEE Robot. Automat.
Mag., 13(2):99–110, 2006.
[166] Moustafa Elhamshary and Moustafa Youssef. CheckInside: A Fine-grained Indoor Location-based Social Network.
In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp
’14, pages 607–618, New York, NY, USA, 2014. ACM.
[167] Rudy Ercek, Philippe De Doncker, and Francis Grenez. Nlos-multipath Effects on Pseudo-range Estimation in Urban
Canyons for Gnss Applications. In Antennas and Propagation, 2006. EuCAP 2006. First European Conference on,
pages 1–6. IEEE, 2006.
[168] Euclid Analytics. http://amazongo.com/, 2017.
[169] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes
Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[170] Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, and Frédéric Lerasle. Improving Multi-frame Data
Association with Sparse Representations for Robust Near-online Multi-object Tracking. In European Conference
on Computer Vision, pages 774–790. Springer, 2016.
[171] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications
to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
[172] Francois Fleuret, Jerome Berclaz, Richard Lengagne, and Pascal Fua. Multicamera people tracking with a
probabilistic occupancy map. IEEE transactions on pattern analysis and machine intelligence, 30(2):267–282,
2008.
[173] Daniel Y Fu, Will Crichton, James Hong, Xinwei Yao, Haotian Zhang, Anh Truong, Avanika Narayan, Maneesh
Agrawala, Christopher Ré, and Kayvon Fatahalian. Rekall: Specifying video events using compositions of
spatiotemporal labels. arXiv preprint arXiv:1910.02993, 2019.
[174] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with applications
in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, 1975.
[175] Chuang Gan, Yi Yang, Linchao Zhu, Deli Zhao, and Yueting Zhuang. Recognizing an Action using Its Name: A
Knowledge-based Approach. International Journal of Computer Vision, 120(1):61–77, 2016.
[176] Rohit Girdhar, João Carreira, Carl Doersch, and Andrew Zisserman. Video Action Transformer Network. CoRR,
abs/1812.02707, 2018.
[177] Ross B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago,
Chile, December 7-13, 2015, pages 1440–1448, 2015.
[178] Mengran Gou, Srikrishna Karanam, Wenqian Liu, Octavia Camps, and Richard J. Radke. Dukemtmc4reid: A
large-scale multi-camera person re-identification dataset. In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) Workshops, July 2017.
[179] D Gowsikhaa, S Abirami, et al. Suspicious Human Activity Detection From Surveillance Videos. International
Journal on Internet & Distributed Computing Systems, 2(2), 2012.
[180] Paul D Groves. Shadow Matching: A New Gnss Positioning Technique for Urban Canyons. The journal of
Navigation, 64(3):417–430, 2011.
[181] Tao Guo and Yoshifumi Yasuoka. Snake-based Approach for Building Extraction From High-resolution Satellite
Images and Height Data in Urban Areas. In Proceedings of the 23rd Asian Conference on Remote Sensing, pages
25–29, 2002.
[182] Kiryong Ha, Zhuo Chen, Wenlu Hu, Wolfgang Richter, Padmanabhan Pillai, and Mahadev Satyanarayanan. Towards
Wearable Cognitive assistance. In Proceedings of the 12th annual international conference on Mobile systems,
applications, and services, 2014.
[183] Harald Haelterman. Crime Script Analysis: Preventing Crimes Against Business. Springer, 2016.
[184] Zekun Hao, Yu Liu, Hongwei Qin, Junjie Yan, Xiu Li, and Xiaolin Hu. Scale-aware Face detection. In The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[185] Kotaro Hara, Victoria Le, Jin Sun, David Jacobs, and J Froehlich. Exploring early solutions for automatically
identifying inaccessible sidewalks in the physical world using google street view. Human Computer Interaction
Consortium, 2013.
[186] Munawar Hayat, Salman H Khan, Naoufel Werghi, and Roland Goecke. Joint Registration and Representation
Learning for Unconstrained Face identification. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017.
[187] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-cnn. In Computer Vision (ICCV), 2017
IEEE International Conference on, 2017.
[188] Suining He, S.-H. Gary Chan, Lei Yu, and Ning Liu. Calibration-free Fusion of Step Counter and Wireless
Fingerprints for Indoor Localization. In Proceedings of the 2015 ACM International Joint Conference on Pervasive
and Ubiquitous Computing, UbiComp ’15, pages 897–908, New York, NY, USA, 2015. ACM.
[189] Will Hedgecock, Miklos Maroti, Akos Ledeczi, Peter Volgyesi, and Rueben Banalagay. Accurate Real-time Relative
Localization using Single-frequency Gps. In Proceedings of the 12th ACM Conference on Embedded Network
Sensor Systems, pages 206–220. ACM, 2014.
[190] Will Hedgecock, Miklos Maroti, Janos Sallai, Peter Volgyesi, and Akos Ledeczi. High-accuracy Differential
Tracking of Low-cost Gps Receivers. In Proceeding of the 11th annual international conference on Mobile systems,
applications, and services, pages 221–234. ACM, 2013.
[191] David Held, Sebastian Thrun, and Silvio Savarese. Robust Single-view Instance recognition. In Robotics and
Automation (ICRA), 2016 IEEE International Conference on, 2016.
[192] Roberto Henschel, Laura Leal-Taixé, Daniel Cremers, and Bodo Rosenhahn. Improvements to Frank-wolfe
Optimization for Multi-detector Multi-object tracking. CoRR, 2017.
[193] Sebastian Hilsenbeck, Dmytro Bobkov, Georg Schroth, Robert Huitl, and Eckehard Steinbach. Graph-based Data
Fusion of Pedometer and WiFi Measurements for Mobile Indoor Positioning. In Proceedings of the 2014 ACM
International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’14, pages 147–158, New York,
NY, USA, 2014. ACM.
[194] Li-Ta Hsu, Yanlei Gu, and Shunsuke Kamijo. 3d Building Model-based Pedestrian Positioning Method using
Gps/glonass/qzss and Its Reliability Calculation. GPS solutions, 20(3):413–428, 2016.
[195] LT Hsu and S Kamijo. Nlos Exclusion using Consistency Check and City Building Model in Deep Urban Canyons.
In ION GNSS, pages 2390–2396, 2015.
[196] Yitao Hu, Xiaochen Liu, Suman Nath, and Ramesh Govindan. Alps: accurate landmark positioning at city scales.
In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages
1147–1158. ACM, 2016.
[197] Yitao Hu, Swati Rallapalli, Bongjun Ko, and Ramesh Govindan. Olympian: Scheduling Gpu Usage in a Deep
Neural Network Model Serving System. In Proceedings of the 19th International Middleware Conference, pages
53–65. ACM, 2018.
[198] inmarket. https://inmarket.com/, 2017.
[199] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A
deeper, stronger, and Faster Multi-person Pose Estimation model. In European Conference on Computer Vision,
2016.
[200] Umar Iqbal, Anton Milan, and Juergen Gall. Posetrack: Joint Multi-person Pose Estimation and tracking. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[201] Chengkun Jiang, Yuan He, Xiaolong Zheng, and Yunhao Liu. Orientation-aware Rfid Tracking with Centimeter-
level Accuracy. In Proceedings of the 17th ACM/IEEE International Conference on Information Processing in
Sensor Networks, pages 290–301. IEEE Press, 2018.
[202] Yurong Jiang, Hang Qiu, Matthew McCartney, Gaurav Sukhatme, Marco Gruteser, Fan Bai, Donald Grimm, and
Ramesh Govindan. Carloc: Precise Positioning of Automobiles. In Proceedings of the 13th ACM Conference on
Embedded Networked Sensor Systems, pages 253–265. ACM, 2015.
[203] Roy Jonker and Anton Volgenant. A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment
Problems. Computing, 38(4):325–340, 1987.
[204] Angelo Joseph. Measuring Gnss Signal Strength. Inside GNSS, 5(8):20–25, 2010.
[205] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for
spatio-temporal action localization. In ICCV, 2017.
[206] Elliott Kaplan and Christopher Hegarty. Understanding GPS: Principles and Applications. Artech house, 2005.
[207] Leonid Karlinsky, Joseph Shtok, Yochay Tzur, and Asaf Tzadok. Fine-grained Recognition of Thousands of Object
Categories with Single-example training. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017.
[208] Steven M Kay. Fundamentals of Statistical Signal Processing: Practical Algorithm Development, volume 3.
Pearson Education, 2013.
[209] Christian Koehler, Nikola Banovic, Ian Oakley, Jennifer Mankoff, and Anind K. Dey. Indoor-ALPS: An Adaptive
Indoor Location Prediction System. In Proceedings of the 2014 ACM International Joint Conference on Pervasive
and Ubiquitous Computing, UbiComp ’14, pages 171–181, New York, NY, USA, 2014. ACM.
[210] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[211] Harold W Kuhn. The Hungarian Method for the Assignment problem. Naval Research Logistics (NRL), 1955.
[212] Ashwani Kumar, Yoshihiro Sato, Takeshi Oishi, and Katsushi Ikeuchi. Identifying Reflected Gps Signals and
Improving Position Estimation using 3d Map Simultaneously Built with Laser Range Scanner. Technical report,
Computer Vision Laboratory, Institute of Industrial Science, The University of Tokyo, 2014.
[213] Cheng-Hao Kuo and Ram Nevatia. How does person identity recognition help multi-person tracking? In Computer
Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1217–1224. IEEE, 2011.
[214] SangJeong Lee, Chulhong Min, Chungkuk Yoo, and Junehwa Song. Understanding customer malling behavior in an
urban shopping mall using smartphones. In Proceedings of the 2013 ACM conference on Pervasive and ubiquitous
computing adjunct publication, pages 901–910. ACM, 2013.
[215] Seungwoo Lee, Yungeun Kim, Daye Ahn, Rhan Ha, Kyoungwoo Lee, and Hojung Cha. Non-obstructive Room-level
Locating System in Home Environments Using Activity Fingerprints from Smartwatch. In Proceedings of the 2015
ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 939–950, New
York, NY, USA, 2015. ACM.
[216] Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, and Matteo
Interlandi. PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. In 13th USENIX
Symposium on Operating Systems Design and Implementation (OSDI 18), pages 611–626, 2018.
[217] Rainer Lienhart and Jochen Maydt. An Extended Set of Haar-like Features for Rapid Object detection. In Image
Processing. 2002. Proceedings. 2002 International Conference on, 2002.
[218] Heng Liu, Tao Mei, Jiebo Luo, Houqiang Li, and Shipeng Li. Finding perfect rendezvous on the go: accurate
mobile visual localization and its applications to routing. In Proceedings of the 20th ACM Multimedia Conference,
MM ’12, Nara, Japan, October 29 - November 02, 2012, pages 9–18, 2012.
[219] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C
Berg. Ssd: Single Shot Multibox Detector. In European conference on computer vision, pages 21–37. Springer,
2016.
[220] Xiaochen Liu, Pradipta Ghosh, Oytun Ulutan, BS Manjunath, Kevin Chan, and Ramesh Govindan. Caesar: cross-
camera complex activity recognition. In Proceedings of the 17th Conference on Embedded Networked Sensor
Systems, pages 232–244. ACM, 2019.
[221] Xiaochen Liu, Yurong Jiang, Puneet Jain, and Kyu-Han Kim. Tar: Enabling fine-grained targeted advertising in
retail stores. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and
Services, pages 323–336. ACM, 2018.
[222] Xiaochen Liu, Yurong Jiang, Kyu-Han Kim, and Ramesh Govindan. Grab: Fast and accurate sensor processing for
cashier-free shopping. arXiv preprint arXiv:2001.01033, 2020.
[223] Xiaochen Liu, Suman Nath, and Ramesh Govindan. Gnome: A practical approach to nlos mitigation for gps
positioning in smartphones. In Proceedings of the 16th Annual International Conference on Mobile Systems,
Applications, and Services, pages 163–177. ACM, 2018.
[224] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer
Vision, 60(2):91–110, 2004.
[225] David G Lowe. Distinctive Image Features From Scale-invariant Keypoints. International journal of computer
vision, 60(2):91–110, 2004.
[226] Wenmiao Lu and Yap-Peng Tan. A Color Histogram Based People Tracking system. In Circuits and Systems, 2001.
ISCAS 2001. The 2001 IEEE International Symposium on, 2001.
[227] Iacopo Masi, Stephen Rawls, Gérard Medioni, and Prem Natarajan. Pose-aware Face Recognition in the wild. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[228] Suhas Mathur, Tong Jin, Nikhil Kasturirangan, Janani Chandrasekaran, Wenzhi Xue, Marco Gruteser, and Wade
Trappe. Parknet: drive-by sensing of road-side parking statistics. In Proceedings of the 8th international conference
on Mobile systems, applications, and services, pages 123–136. ACM, 2010.
[229] Pascal Mettes and Cees GM Snoek. Spatial-aware Object Embeddings for Zero-shot Localization and Classification
of Actions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4443–4452, 2017.
[230] Anton Milan, Seyed Hamid Rezatofighi, Anthony R Dick, Ian D Reid, and Konrad Schindler. Online Multi-target
Tracking Using Recurrent Neural Networks. In AAAI, pages 4225–4232, 2017.
[231] Anton Milan, Konrad Schindler, and Stefan Roth. Detection-and trajectory-level exclusion in multiple object
tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3682–3689,
2013.
[232] Shunsuke Miura, Li-Ta Hsu, Feiyu Chen, and Shunsuke Kamijo. Gps Error Correction with Pseudorange Evaluation
using Three-dimensional Maps. IEEE Transactions on Intelligent Transportation Systems, 16(6):3104–3115, 2015.
[233] Andreas Mogelmose, Mohan Manubhai Trivedi, and Thomas B Moeslund. Vision-based traffic sign detection
and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Transactions on Intelligent
Transportation Systems, 13(4):1484–1497, 2012.
[234] Julien Moreau, Sébastien Ambellouis, and Yassine Ruichek. Fisheye-based Method for Gps Localization Improve-
ment in Unknown Semi-obstructed Areas. Sensors, 17(1):119, 2017.
[235] Saman Naderiparizi, Yi Zhao, James Youngquist, Alanson P. Sample, and Joshua R. Smith. Self-localizing Battery-
free Cameras. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous
Computing, UbiComp ’15, pages 445–449, New York, NY, USA, 2015. ACM.
[236] Shahriar Nirjon, Jie Liu, Gerald DeJean, Bodhi Priyantha, Yuzhe Jin, and Ted Hart. Coin-gps: Indoor Localization
From Direct Gps Receiving. In Proceedings of the 12th annual international conference on Mobile systems,
applications, and services, pages 301–314. ACM, 2014.
[237] Kanishka Nithin and François Brémond. Globality–locality-based Consistent Discriminant Feature Ensemble for
Multicamera Tracking. IEEE Transactions on Circuits and Systems for Video Technology, 27(3):431–440, 2017.
[238] Kazuya Ohara, Takuya Maekawa, Yasue Kishino, Yoshinari Shirai, and Futoshi Naya. Transferring Positioning
Model for Device-free Passive Indoor Localization. In Proceedings of the 2015 ACM International Joint Conference
on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 885–896, New York, NY, USA, 2015. ACM.
[239] Onur Özyeşil, Vladislav Voroninski, Ronen Basri, and Amit Singer. A Survey of Structure From Motion. Acta
Numerica, 26:305–364, 2017.
[240] Hae-Sang Park and Chi-Hyuck Jun. A Simple and Fast Algorithm for K-medoids clustering. Expert systems with
applications, 2009.
[241] Anindya S. Paul, Eric A. Wan, Fatema Adenwala, Erich Schafermeyer, Nick Preiser, Jeffrey Kaye, and Peter G.
Jacobs. MobileRF: A Robust Device-free Tracking System Based on a Hybrid Neural Network HMM Classifier. In
Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp
’14, pages 159–170, New York, NY, USA, 2014. ACM.
[242] Charith Perera, Arkady B. Zaslavsky, Peter Christen, and Dimitrios Georgakopoulos. Context aware computing for
the internet of things: A survey. IEEE Communications Surveys and Tutorials, 16(1):414–454, 2014.
[243] Sébastien Peyraud, David Bétaille, Stéphane Renault, Miguel Ortiz, Florian Mougel, Dominique Meizel, and
François Peyret. About Non-line-of-sight Satellite Detection and Exclusion in a 3d Map-aided Localization
Algorithm. Sensors, 13(1):829–847, 2013.
[244] Hamed Pirsiavash, Deva Ramanan, and Charless C Fowlkes. Globally-optimal greedy algorithms for tracking a
variable number of objects. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages
1201–1208. IEEE, 2011.
[245] Ronald Poppe. A survey on vision-based human action recognition. Image and vision computing, 28(6):976–990,
2010.
[246] BLE Proximity technologies. http://bit.ly/2sZoP7V, 2017.
[247] Siyuan Qi, Siyuan Huang, Ping Wei, and Song-Chun Zhu. Predicting human activities using stochastic grammar. In
International Conference on Computer Vision (ICCV), IEEE, 2017.
[248] Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Jiaxin Chen, and Yunhong Wang. Zero-shot Action
Recognition with Error-correcting Output Codes. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2833–2842, 2017.
[249] H Qiu, X Liu, S Rallapalli, A J Bency, K Chan, R Urgaonkar, B S Manjunath, and R Govindan. Kestrel: Video
Analytics for Augmented Multi-camera Vehicle Tracking. In 2018 IEEE/ACM Third International Conference on
Internet-of-Things Design and Implementation (IoTDI), pages 48–59, 2018.
[250] Hang Qiu, Fawad Ahmad, Ramesh Govindan, Marco Gruteser, Fan Bai, and Gorkem Kar. Augmented Vehicular
Reality: Enabling Extended Vision for Future Vehicles. In Proceedings of the 18th International Workshop on
Mobile Computing Systems and Applications, pages 67–72. ACM, 2017.
[251] Hang Qiu, Krishna Chintalapudi, and Ramesh Govindan. Satyam: Democratizing Groundtruth for Machine Vision.
CoRR, abs/1811.03621, 2018.
[252] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time
object detection. CoRR, abs/1506.02640, 2015.
[253] Joseph Redmon and Ali Farhadi. Yolo9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242, 2016.
[254] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. CoRR, 2016.
[255] Joseph Redmon and Ali Farhadi. Yolo9000: Better, Faster, Stronger. arXiv preprint, 2017.
[256] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[257] Joseph Redmon and Ali Farhadi. Yolov3: An Incremental Improvement. arXiv preprint arXiv:1804.02767, 2018.
[258] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-cnn: Towards Real-time Object Detection with
Region Proposal networks. In Advances in neural information processing systems, 2015.
[259] Retail Next. https://retailnext.net/en/home/, 2017.
[260] Bernhard Rinner, Bernhard Dieber, Lukas Esterle, Peter R Lewis, and Xin Yao. Resource-aware Configuration in
Smart Camera Networks. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, pages 58–65. IEEE, 2012.
[261] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data
set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer,
2016.
[262] Ergys Ristani and Carlo Tomasi. Features for multi-target multi-camera tracking and re-identification. arXiv
preprint arXiv:1803.10859, 2018.
[263] Nirupam Roy, He Wang, and Romit Roy Choudhury. I Am a Smartphone and I Can Tell My User’s Walking
Direction. In Proceedings of the 12th annual international conference on Mobile systems, applications, and services,
pages 329–342. ACM, 2014.
[264] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An Efficient Alternative to Sift or surf. In
Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, 2011.
[265] Mohamed Sahmoudi, Aude Bourdeau, and Jean-Yves Tourneret. Deep Fusion of Vector Tracking Gnss Receivers
and a 3d City Model for Robust Positioning in Urban Canyons with Nlos Signals. In Satellite Navigation
Technologies and European Workshop on GNSS Signals and Signal Processing (NAVITEC), 2014 7th ESA Workshop
on, pages 1–7. IEEE, 2014.
[266] Shunta Saito, Takayoshi Yamashita, and Yoshimitsu Aoki. Multiple Object Extraction From Aerial Imagery with
Convolutional Neural Networks. volume 2016, pages 1–9. Society for Imaging Science and Technology, 2016.
[267] Stan Salvador and Philip Chan. Toward Accurate Dynamic Time Warping in Linear Time and Space. Intelligent
Data Analysis, 11(5):561–580, 2007.
[268] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4510–4520, 2018.
[269] Juan C SanMiguel, Christian Micheloni, Karen Shoop, Gian Luca Foresti, and Andrea Cavallaro. Self-reconfigurable
Smart Camera Networks. Computer, 47(5):67–73, 2014.
[270] Souvik Sen, Jeongkeun Lee, Kyu-Han Kim, and Paul Congdon. Avoiding Multipath to Revive Inbuilding Wifi
Localization. In Proceeding of the 11th annual international conference on Mobile systems, applications, and
services, pages 249–262. ACM, 2013.
[271] Souvik Sen, Božidar Radunovic, Romit Roy Choudhury, and Tom Minka. You Are Facing the Mona Lisa: Spot
Localization using Phy Layer Information. In Proceedings of the 10th international conference on Mobile systems,
applications, and services, pages 183–196. ACM, 2012.
[272] Longfei Shangguan, Zimu Zhou, and Kyle Jamieson. Enabling Gesture-based Interactions with objects. In
Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys
’17, 2017.
[273] Yang Shao, Gregory N Taff, and Stephen J Walsh. Shadow Detection and Building-height Estimation using Ikonos
Data. International journal of remote sensing, 32(22):6929–6944, 2011.
[274] VK Shettigara and GM Sumerling. Height Determination of Extended Objects using Shadows in Spot Images.
Photogrammetric Engineering and Remote Sensing, 64(1):35–43, 1998.
[275] KA Shiva Kumar, KR Ramakrishnan, and GN Rathna. Inter-camera Person Tracking in Non-overlapping Networks:
Re-identification Protocol and On-line Update. In Proceedings of the 11th International Conference on Distributed
Smart Cameras, pages 55–62. ACM, 2017.
[276] shoppertrak. https://www.shoppertrak.com, 2017.
[277] Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i-Nieto,
and Shih-Fu Chang. Online detection of action start in untrimmed, streaming videos. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 534–551, 2018.
[278] Yuanchao Shu, Kang G Shin, Tian He, and Jiming Chen. Last-mile Navigation using Smartphones. In Proceedings
of the 21st Annual International Conference on Mobile Computing and Networking, pages 512–524. ACM, 2015.
[279] Uwe Soergel, Eckart Michaelsen, Antje Thiele, Erich Cadario, and Ulrich Thoennessen. Stereo Analysis of
High-resolution Sar Images for Building Height Estimation in Cases of Orthogonal Aspect Directions. ISPRS
Journal of Photogrammetry and Remote Sensing, 64(5):490–500, 2009.
[280] Francesco Solera, Simone Calderara, Ergys Ristani, Carlo Tomasi, and Rita Cucchiara. Tracking Social Groups
Within and Across Cameras. IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[281] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes driven multi-camera person
re-identification. In European conference on computer vision, pages 475–491. Springer, 2016.
[282] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, and Cordelia Schmid. Actor-
centric Relation Network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 318–334,
2018.
[283] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang, and Wei Zhang. Optical Flow Guided Feature: A Fast
and Robust Motion Representation for Video Action Recognition. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR)(June 2018), 2018.
[284] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond Part Models: Person Retrieval with Refined
Part Pooling (and a Strong Convolutional Baseline). In Proceedings of the European Conference on Computer
Vision (ECCV), pages 480–496, 2018.
[285] Swirl. http://www.swirl.com/, 2017.
[286] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple People Tracking by Lifted Multicut
and Person Re-identification. 2017.
[287] Mohammed Yassine Kazi Tani, Adel Lablack, Abdelghani Ghomari, and Ioan Marius Bilasco. Events detection
using a video-surveillance ontology and a rule-based approach. In European Conference on Computer Vision, pages
299–308. Springer, 2014.
[288] Paresh M Tank and Hitesh A Patel. Survey on Human Detection Techniques in Real Time Video. International
Journal of Innovative Research in Science, Engineering and Technology, 7(5):5852–5858, 2018.
[289] Sarab Tay and Juliette Marais. Weighting Models for Gps Pseudorange Observations for Land Transportation in
Urban Canyons. In 6th European Workshop on GNSS Signals and Signal Processing, page 4p, 2013.
[290] Yonatan Tariku Tesfaye, Eyasu Zemene, Andrea Prati, Marcello Pelillo, and Mubarak Shah. Multi-target Tracking
in Multiple Non-overlapping Cameras using Constrained Dominant Sets. arXiv preprint arXiv:1706.06196, 2017.
[291] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: a factorization method.
International Journal of Computer Vision, 9(2):137–154, 1992.
[292] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled Representation Learning Gan for Pose-invariant Face
recognition. In CVPR, number 6, 2017.
[293] Oytun Ulutan, Swati Rallapalli, Carlos Torres, Mudhakar Srivatsa, and BS Manjunath. Actor Conditioned Attention
Maps for Video Action Detection. In the IEEE Winter Conference on Applications of Computer Vision (WACV).
IEEE, 2020.
[294] Christopher Urmson, Joshua Anhalt, Hong Bae, J. Andrew (Drew) Bagnell, Christopher R. Baker, Robert E Bittner,
Thomas Brown, M. N. Clark, Michael Darms, Daniel Demitrish, John M Dolan, David Duggins, David Ferguson,
Tugrul Galatali, Christopher M Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas Howard, Sascha
Kolski, Maxim Likhachev, Bakhtiar Litkouhi, Alonzo Kelly, Matthew McNaughton, Nick Miller, Jim Nickolaou,
Kevin Peterson, Brian Pilnick, Ragunathan Rajkumar, Paul Rybski, Varsha Sadekar, Bryan Salesky, Young-Woo Seo,
Sanjiv Singh, Jarrod M Snider, Joshua C Struble, Anthony (Tony) Stentz, Michael Taylor, William (Red) L. Whittaker,
Ziv Wolkowicki, Wende Zhang, and Jason Ziglar. Autonomous driving in urban environments: Boss and the Urban
Challenge. Journal of Field Robotics Special Issue on the 2007 DARPA Urban Challenge, Part I, 25(8):425–466,
June 2008.
[295] Gonzalo Vaca-Castano, Amir Roshan Zamir, and Mubarak Shah. City scale geo-spatial trajectory estimation of a
moving camera. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1186–1193.
IEEE, 2012.
[296] Reza M Vaghefi, Mohammad Reza Gholami, and Erik G Ström. Bearing-only target localization with uncertainties
in observer position. In Personal, Indoor and Mobile Radio Communications Workshops (PIMRC Workshops),
2010 IEEE 21st International Symposium on, pages 238–242. IEEE, 2010.
[297] Katrien Verbert, Nikos Manouselis, Xavier Ochoa, Martin Wolpers, Hendrik Drachsler, Ivana Bosnic, and Erik
Duval. Context-aware recommender systems for learning: A survey and future challenges. IEEE Trans. Learn.
Technol., 5(4):318–335, January 2012.
[298] Lei Wang, Paul D Groves, and Marek K Ziebart. Smartphone Shadow Matching for Better Cross-street Gnss
Positioning in Urban Environments. The Journal of Navigation, 68(3):411–433, 2015.
[299] Xiaoli Wang, Aakanksha Chowdhery, and Mung Chiang. Networked Drone Cameras for Sports Streaming. In 2017
IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 308–318. IEEE, 2017.
[300] Jan D. Wegner, Steven Branson, David Hall, Konrad Schindler, and Pietro Perona. Cataloging public objects using
aerial and street-level images - urban trees. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2016.
[301] Nicolai Wojke and Alex Bewley. Deep Cosine Metric Learning for Person Re-identification. CoRR, abs/1812.00442,
2018.
[302] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple Online and Realtime Tracking with a Deep Association
Metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
[303] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple Online and Realtime Tracking with a Deep Association
metric. arXiv preprint arXiv:1703.07402, 2017.
[304] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association
metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
[305] Hao Wu, Weiwei Sun, and Baihua Zheng. Is Only One Gps Position Sufficient to Locate You to the Road Network
Accurately? In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous
Computing, pages 740–751. ACM, 2016.
[306] Hao Wu, Weiwei Sun, Baihua Zheng, Li Yang, and Wei Zhou. Clsters: A General System for Reducing Errors of
Trajectories Under Challenging Localization Situations. Proceedings of the ACM on Interactive, Mobile, Wearable
and Ubiquitous Technologies, 1(3):115, 2017.
[307] Muchen Wu, Parth H. Pathak, and Prasant Mohapatra. Monitoring Building Door Events Using Barometer Sensor
in Smartphones. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous
Computing, UbiComp ’15, pages 319–323, New York, NY, USA, 2015. ACM.
[308] Hongwei Xie, Tao Gu, Xianping Tao, Haibo Ye, and Jian Lv. MaLoc: A Practical Magnetic Fingerprinting Approach
to Indoor Localization Using Smartphones. In Proceedings of the 2014 ACM International Joint Conference on
Pervasive and Ubiquitous Computing, UbiComp ’14, pages 243–253, New York, NY, USA, 2014. ACM.
[309] Danfei Xu, Hernán Badino, and Daniel Huber. Topometric Localization on a Road Network. In Intelligent Robots
and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 3448–3455. IEEE, 2014.
[310] Han Xu, Zheng Yang, Zimu Zhou, Longfei Shangguan, Ke Yi, and Yunhao Liu. Enhancing Wifi-based Localization
with Visual Clues. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous
Computing, UbiComp ’15, pages 963–974, New York, NY, USA, 2015. ACM.
[311] Qiang Xu, Rong Zheng, and Steve Hranilovic. IDyLL: Indoor Localization Using Inertial and Light Sensors
on Smartphones. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous
Computing, UbiComp ’15, pages 307–318, New York, NY, USA, 2015. ACM.
[312] Xun Xu, Timothy M Hospedales, and Shaogang Gong. Multi-task Zero-shot Action Recognition with Prioritised
Data Augmentation. In European Conference on Computer Vision, pages 343–359. Springer, 2016.
[313] Yuanlu Xu, Xiaobai Liu, Lei Qin, and Song-Chun Zhu. Cross-view People Tracking by Scene-centered Spatio-
temporal Parsing. In AAAI, pages 4299–4305, 2017.
[314] Bo Yang and Ram Nevatia. Multi-target tracking by online learning of non-linear motion patterns and robust
appearance models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages
1918–1925. IEEE, 2012.
[315] Bo Yang and Ram Nevatia. An online learned crf model for multi-target tracking. In Computer Vision and Pattern
Recognition (CVPR), 2012 IEEE Conference on, pages 2034–2041. IEEE, 2012.
[316] Amir Roshan Zamir and Mubarak Shah. Accurate Image Localization Based on Google Maps Street View. In
Computer Vision–ECCV 2010, pages 255–268. Springer, 2010.
[317] Cheng Zhang, Wangdong Qi, Li Wei, Jiang Chang, and Yuexin Zhao. Multipath Error Correction in Radio
Interferometric Positioning Systems. arXiv preprint arXiv:1702.07624, 2017.
[318] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J Freedman.
Live Video Analytics At Scale with Approximation and Delay-tolerance. In 14th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 17), pages 377–392, 2017.
[319] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint Face Detection and Alignment Using Multitask
Cascaded Convolutional networks. IEEE Signal Processing Letters, 2016.
[320] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network flows.
In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[321] Yujie Zhang and Chris Bartone. Multipath Mitigation in the Frequency Domain. In Position Location and Navigation
Symposium, 2004. PLANS 2004, pages 486–495. IEEE, 2004.
[322] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang.
Spindle net: Person Re-identification with Human Body Region Guided Feature Decomposition and fusion. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[323] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection
with structured segment networks. In ICCV, 2017.
[324] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. Mars: A Video Benchmark
for Large-scale Person Re-identification. In European Conference on Computer Vision, pages 868–884. Springer,
2016.
[325] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable Person Re-identification:
A Benchmark. In Computer Vision, IEEE International Conference on, 2015.
[326] Zhedong Zheng, Xiaodong Yang, Zhiding Yu, Liang Zheng, Yi Yang, and Jan Kautz. Joint Discriminative and
Generative Learning for Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2138–2147, 2019.
[327] Zhedong Zheng, Liang Zheng, and Yi Yang. A Discriminatively Learned Cnn Embedding for Person Re-
identification. arXiv preprint arXiv:1611.05666, 2016.
[328] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled Samples Generated by Gan Improve the Person Re-
identification Baseline in Vitro. In Proceedings of the IEEE International Conference on Computer Vision, 2017.