USC Computer Science Technical Reports, no. 740 (2001)
2D TSA-tree: A Wavelet-Based Approach to Improve the Efficiency of
Multi-Level Spatial Data Mining
Cyrus Shahabi, Seokkyung Chung
Integrated Media Systems Center
Department of Computer Science
University of Southern California
Los Angeles, California 90089–0781
[shahabi, seokkyuc]@usc.edu
Maytham Safar
Computer Engineering Department
Kuwait University
maytham@eng.kuniv.edu.kw
George Hajj
JPL M/S 238-600
4800 Oak Grove Dr.
Pasadena, California 91109
hajj@cobra.jpl.nasa.gov
Abstract
Due to the large amount of the collected scientific data,
it is becoming increasingly difficult for scientists to com-
prehend and interpret the available data. Moreover, typical
queries on these data sets are in the nature of identifying (or
visualizing) trends and surprises at a selected sub-region in
multiple levels of abstraction rather than identifying infor-
mation about a specific data point.
In this paper, we propose a versatile wavelet-based data
structure, 2D TSA-tree (stands for Trend and Surprise Abstractions
Tree), to enable efficient multi-level trend and surprise
detection on spatio-temporal data. We show how 2D
TSA-tree can be utilized efficiently for sub-region selections
by either restricting users in selecting pre-defined cells in the
space or computing a customized subtree, that corresponds
to the user’s selected area on-the-fly. Moreover, 2D TSA-
tree can be utilized to pre-compute the reconstruction error
and retrieval time of a data subset in advance in order to al-
low the user to trade off accuracy for response time (or vice
versa) at the query time. Finally, when the storage space is
limited, our 2D Optimal TSA-tree saves on storage by stor-
ing only a specific optimal subset of the tree.
To demonstrate the effectiveness of our proposed meth-
ods, we evaluated our 2D TSA-tree using real and synthetic
data. Our results show that our method outperformed other
methods (DFT and SVD) in terms of accuracy, complexity
and scalability.
This research has been funded in part by NSF grants EEC-9529152
(IMSC ERC) and ITR-0082826, NASA/JPL contract nr. 961518, DARPA
and USAF under agreement nr. F30602-99-1-0524, and unrestricted
cash/equipment gifts from NCR, IBM, Intel and SUN.
1 Introduction
Rapid growth in remote sensing systems has made it possible
to obtain data about nearly every part of our larger
world, including the solid earth, ocean, atmosphere and the
surrounding space environment. However, it is becoming
increasingly difficult for scientists to comprehend and interpret
the available data. The discovery of new cross-disciplinary
physical relationships (e.g., ocean-climate interaction)
is hampered by the sheer quantity of data to be digested.
The pursuit of new physical understanding can be
aided immeasurably by automated tools for data interpreta-
tion and model construction.
To illustrate a sample application, consider a joint
project that we defined with Jet Propulsion Laboratory
(JPL) for NASA. The project is entitled GENESIS: GPS
environmental and earth science information system
(see http://genesis.jpl.nasa.gov/html/index.shtml). In this
project, signals from GPS satellites are processed and
analyzed to extract global atmospheric and ionospheric
data. After three levels of off-line pre-processing, the tem-
perature, water vapor, and refractivity of certain coordinates
on earth can be extracted for every half-hour at different
heights. To give a sense of the volume of the recorded data,
consider the small subset of sea surface temperature [11]
recorded twice a day. That is, restricting the data in four
dimensions: height (sea surface vs. multiple heights),
area (oceans vs. the globe), type (temperature vs. other
measures), frequency (twice a day vs. every half-hour). In
this case, there are 4096 × 2048 sampling points on the
globe-wide sea surface, where for each sampling point the daily
average (day and night) temperature is stored as a 10-byte
floating-point number. Assuming we store both the ascending
pass (daytime) and descending pass (nighttime) daily for
the last ten years, the volume of this database will be at
least 600 GBytes.
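The estimate above can be checked with a few lines of arithmetic. The sketch below only restates the figures quoted in the text; it assumes a 365-day year and decimal gigabytes (10^9 bytes), and the constant names are illustrative.

```python
# Back-of-the-envelope check of the storage estimate quoted in the text.
GRID_POINTS = 4096 * 2048   # sampling points on the sea surface
BYTES_PER_VALUE = 10        # one stored temperature value
PASSES_PER_DAY = 2          # ascending (daytime) and descending (nighttime) pass
DAYS = 10 * 365             # ten years, ignoring leap days (assumption)

total_bytes = GRID_POINTS * BYTES_PER_VALUE * PASSES_PER_DAY * DAYS
total_gb = total_bytes / 1e9  # decimal gigabytes (assumption)

print(f"{total_gb:.0f} GB")
```

Under these assumptions the total comes to roughly 612 GB, which agrees with the "at least 600 GBytes" figure.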
Figure 1. A sample GUI
These data can be stored in database server(s) and accessed
by users via the Internet. Methods for storing, retrieving
and analyzing these data efficiently are central challenges
for the database community. Typical queries on these
data sets are, however, different from conventional point
queries. For example, a query to acquire the temperature of a
specific location at a specific time and date is rare. The more
frequent queries are in the nature of identifying (or visualiz-
ing) trends and surprises of a selected sub-region at multiple
levels of abstraction.
To illustrate, consider the GUI depicted in Fig. 1. The
user selects a region on the map of the globe and asks for the
trend of water vapor data for the selected region. In this case,
one option is to transmit the entire data set we have available
for the selected sub-region to the user in order for the user to
visualize water vapor trend, color-coded on his/her display
screen. Trivially, due to the I/O and network bottlenecks,
this solution might result in a very long latency observed by
the user. However, the amount of retrieved and transmitted
data can be reduced significantly depending on: 1) the user’s
tolerance for error (e.g., based on the user’s display reso-
lution), and/or 2) the user’s expected response time (e.g.,
based on the size of the retrieved/transmitted data). There-
fore, we need a technique to condense the entire data set
in advance such that: 1) given a particular sub-region, con-
densed data corresponding to only that area can be extracted
quickly, 2) data can be condensed in multiple levels of ab-
stractions, and 3) the error and response time of the query re-
sult can be determined in advance and can be traded off for
each other. The above requirements motivate us to exploit
wavelet transform because wavelet analyzes data in a lo-
calized and multi-resolution manner. In this paper, we pro-
pose a versatile wavelet-based data structure, 2D TSA-tree
(stands for Trend and Surprise Abstractions Tree), to enable
efficient multi-level trend and surprise detection on spatio-
temporal data over all time and space scales. For example,
Fig. 15 demonstrates the nodes of 2D TSA-tree to capture
the spatial trend of water vapor data using db6 wavelet. As
shown, level 2 might be visually adequate while its size is
1/16th of the original data. Hence, if (say) the client’s display
resolution is low, one can get by with sending much less
data while not sacrificing visual impact.
We will extend our work with multi-level trend and sur-
prise queries on temporal data [17] to support spatial mining
queries. In [17], we proposed a novel tree-like data struc-
ture termed TSA-tree. The root of this tree is a time series
while each internal node (or leaf) is constructed by applying
wavelet transform to its parent. We proved that by utilizing
the wavelet transform, we can naturally split a time-series
sequence into two nodes where one node captures the trends
and the other the surprises within the original sequence.
Here we extend TSA-tree for mining trends on 2D spa-
tial data. Note that the 2D extension can be used for spatio-
temporal mining as well. In this case the space should
be conceptualized through a single dimension, e.g., ground
instrument locations, or fixed latitude/longitude grids, and
time as the second dimension. To extend TSA-tree to 2D,
we construct a wavelet-based tree structure that applies two
separate 1D wavelet transforms along the X-axis and Y-axis,
resulting in one averaged signal and three detailed signals.
Consequently, 2D TSA-tree is constructed by recursively
performing this procedure to the averaged signal. The nodes
can be immediately used to visualize trends at different lev-
els.
Another contribution of this paper is to show how the
2D TSA-tree can be utilized efficiently for both sub-region
selection and accuracy/response-time trade off. For sub-
region selection, we first assume that the space is partitioned
into pre-defined cells and a user can only select a single cell.
In Sec. 6.1, we relax this assumption by computing a cus-
tomized subtree of 2D TSA-tree, which corresponds to the
user’s selected area, on-the-fly. For accuracy/response-time
trade off, we pre-compute the reconstruction error and the
retrieval time for each level of 2D TSA-tree and store these
values within each node. At the query time, this stored infor-
mation can be used to trade accuracy for response time (or
vice versa) depending on the user’s requirements. Hence, if
a user enters a desired accuracy for a submitted query, then the
system can respond by an estimated “response time” value
before query execution. If the response time is acceptable
to the user, the query can be performed. The reverse sce-
nario where the user provides a tolerable response time and
the system replies back with an accuracy estimation is also
feasible.
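The accuracy/response-time trade-off described above can be sketched as a lookup over per-level statistics that were computed in advance. This is a minimal illustration, not the paper's implementation: the `LevelStats` record, the two helper functions, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LevelStats:
    level: int             # 0 = original data, larger = coarser abstraction
    error: float           # precomputed reconstruction error (hypothetical units)
    retrieval_time: float  # precomputed retrieval time in seconds (hypothetical)


def time_for_accuracy(stats: List[LevelStats], max_error: float) -> Optional[LevelStats]:
    """Fastest level whose stored error is within the user's tolerance."""
    candidates = [s for s in stats if s.error <= max_error]
    return min(candidates, key=lambda s: s.retrieval_time) if candidates else None


def accuracy_for_time(stats: List[LevelStats], max_time: float) -> Optional[LevelStats]:
    """Most accurate level that can be retrieved within the time budget."""
    candidates = [s for s in stats if s.retrieval_time <= max_time]
    return min(candidates, key=lambda s: s.error) if candidates else None


# Hypothetical statistics stored with the tree nodes.
stats = [
    LevelStats(0, 0.00, 8.0),
    LevelStats(1, 0.02, 2.0),
    LevelStats(2, 0.08, 0.5),
    LevelStats(3, 0.25, 0.125),
]

by_accuracy = time_for_accuracy(stats, max_error=0.10)
by_time = accuracy_for_time(stats, max_time=1.0)
print(by_accuracy.level, by_accuracy.retrieval_time)
print(by_time.level, by_time.error)
```

Because both quantities are stored per level, either question is answered before any data is retrieved.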
Finally, in [17], we considered different scenarios where
TSA-tree cannot be stored on magnetic disk(s) in its en-
tirety due to space limitations. We proved a specific sub-
set of the tree (specifically, all its leaf nodes) is the opti-
mal subset to be kept on disk, termed OTSA-tree (Optimal
TSA-tree). For the cases where the storage space is even
more limited, we proposed alternative techniques to reduce
the size of OTSA-tree further by dropping tree nodes and/or
wavelet coefficients with less energy. In this paper, we ex-
tend the OTSA model to our 2D data to reduce the size of
2D TSA-tree. While other single-level compression tech-
niques such as Singular Value Decomposition (SVD) and
Discrete Fourier Transform (DFT) lack the characteristics
to support sub-region and multi-resolution selections, we
compared their compression performance with that of 2D
OTSA-tree on our data sets and the results demonstrate the
superiority of 2D OTSA-tree.
The remainder of this paper is organized as follows.
Sec. 2 distinguishes our work from the other related works
in this area. In Sec. 3, we provide the background on
wavelet and 1D TSA-tree. 2D wavelet and TSA-tree
are explained in Sec. 4. In Sec. 5, we describe the user
interactions with 2D TSA-tree for sub-region selection and
accuracy/response-time trade off. A more flexible approach
to sub-region selection (customized 2D TSA-tree) and the
storage friendly version of 2D TSA-tree (2D OTSA-tree)
are discussed in Sec. 6. Sec. 7 provides an analysis of our
techniques and compares them with other related methods.
Finally, in Sec. 8, we conclude the paper and provide our
future plans.
2 Related Work
In time-series databases, trend analysis studies changes
in temporal patterns. Similarly, we can extend the analysis from
time to space when dealing with spatio-temporal data. In
spatio-temporal trend analysis, patterns change with both space and
time. For example, weather or highway traffic patterns are
related to both space and time. Hence, it is essential to provide
a uniform model for finding spatial trends in 2D spatial
or 2D spatio-temporal data.
Spatial trend detection is a young area in which little
research has been conducted. Ester et al. [9] define a
spatial trend as a regular change of a non-spatial attribute with
respect to the distance to a given fixed object. They use the
distance from the object as the independent variable and the
difference of the attribute values as the dependent variable, and
employ linear regression for the analysis. However, we define
spatial trend differently. That is, spatial trend is defined
as changes in the mean value at multiple levels of abstraction.
Thus, we can see the trend at both finer and coarser levels
simultaneously. We can also see the trend over the entire region as
well as the trend in a specific region. The multi-resolution analysis
capability of wavelets can address these challenges naturally,
and these characteristics distinguish our work from [9].

                    DFT+quad-tree      DWT
Time Complexity     $O(n \log^2 n)$    $O(n)$
Space Complexity    $O(n \log n)$      $O(n)$
Table 1. Comparison of DFT+quad-tree and DWT
WaveCluster, a multi-resolution clustering algorithm
based on wavelets, classifies points by transforming the original
space and finding dense regions in the transformed
space [16]. Their approach can be considered similar to ours
in that they solve the problem in multi-scale aspects. The
distinction between [16] and our work lies in the goal of the
method. The purpose of WaveCluster is to group the points
at multi-scales while the aim of our research is to discover
spatial trends at multi-level abstractions.
Chan et al. [4] use wavelet transforms and keep the
first few coefficients for similarity searching in time-series
databases. However, they decide the number of coefficients
to be kept by experiments while our method provides algo-
rithms for deciding which coefficients should be kept when
the available space is restricted.
Wu et al. [19] employ SVD (Singular Value Decomposition)
and DFT (Discrete Fourier Transform) for reducing
the dimension of feature vectors in the problem of search-
ing images in large image databases. SVD and DFT have
been widely used in time-series databases as well [12, 1].
One of the main challenges with our application is that the
amount of required space to store scientific observation data
is large. Thus, we also need some compression scheme to
reduce the size of the required disk space. SVD and DFT
could be useful in this case. However, SVD and DFT lack
the characteristics to condense data for different sub-regions
and at multi-resolutions. On the other hand, since wavelet
transform analyzes data in a localized manner, 2D TSA-tree
has the mechanism to visualize spatial trends for a particular
region of interest through sub-region selection capability. In
addition, 2D TSA-tree supports multi-resolution trend mining
through a multi-level abstraction mechanism that selects
appropriate levels of the tree depending on the query restric-
tions (such as running time or accuracy). To illustrate, con-
sider the following argument.
The time complexity to compute the wavelet coefficients
of a selected area is $O(n)$, where $n$ is the size of the selected
region. However, we can reduce the complexity by performing
some tasks off-line. That is, we precompute the information
(wavelet coefficients) for the whole region and extract the
coefficients associated with the selected region. If we employ
DFT to condense this data, we cannot extract the DFT of
the selected region directly from the DFT of the entire region.
This is because DFT cannot capture local features since it is
based on different harmonics of cos/sin functions along the
time axis, while wavelet can extract features around a particular
time frame because its basis functions are located at various
positions of the time axis. The only possible method
(while keeping $O(n \log n)$ time complexity for the off-line task)
is to perform inverse FFT on the entire region and extract the
appropriate coefficients, which requires $O(n \log n)$ time
complexity. Another restriction in applying SVD or DFT to our
application is related to our major goal, which is providing
analysis tools for spatial trends at different levels of
abstraction. Single-level techniques such as
SVD and DFT have to reconstruct the entire data to support
multi-level queries.
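The locality argument above can be illustrated in one dimension. The sketch below assumes the Haar filter and a sub-region aligned with the pairing of the transform; filters with longer support would also involve neighbouring samples at the region boundary. The function name and the sample values are illustrative only.

```python
import math


def haar_level(signal):
    """One Haar step: scaled pairwise averages and differences."""
    r = math.sqrt(2.0)
    averages = [(signal[i] + signal[i + 1]) / r for i in range(0, len(signal), 2)]
    details = [(signal[i + 1] - signal[i]) / r for i in range(0, len(signal), 2)]
    return averages, details


full = [2.0, 4.0, 6.0, 8.0, 4.0, 14.0, 4.0, 6.0]
start, end = 4, 8  # selected sub-region, aligned to pair boundaries

# Off-line: coefficients for the whole region.
avg_full, det_full = haar_level(full)

# On-line alternative: recompute from the raw data of the sub-region.
avg_sub, det_sub = haar_level(full[start:end])

# The sub-region's coefficients are just a slice of the precomputed ones.
print(avg_sub == avg_full[start // 2:end // 2])
print(det_sub == det_full[start // 2:end // 2])
```

No comparable slicing exists for DFT coefficients, since every Fourier coefficient depends on all samples of the region.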
Another possible approach is to decompose the whole
space into a quad-tree and associate each node with the DFT
for that region, which requires additional CPU time for
precomputing and increases the space requirements. Tab. 1
compares DWT (Discrete Wavelet Transform) and DFT in
terms of time and space complexity when we use the quad-tree
approach. Note that DFT takes $O(n \log n)$ time to compute
Fourier Transforms [7]. As shown in Tab. 1, the time
complexity of DFT increases by a factor of $\log n$ while that of
DWT does not change. Therefore, we can argue that the
wavelet approach is the most appropriate and natural for our
application.
3 Background
In Sec. 3.1, we explain the basic concepts of 1D wavelet.
Sec. 3.2 shows how TSA-tree utilizes 1D wavelet for effi-
cient management of time-series data. Later in Sec. 4, we
explain 2D wavelet and extend TSA-tree to support two di-
mensional data. Throughout this paper we will use Haar
wavelet filter for our discussions.
3.1 1D Wavelet Transforms
Wavelet theory involves representing general functions
in terms of simpler, fixed building blocks at different scales
and positions. This has been found to be very useful in several
areas, such as sub-band filtering, quadrature mirror filters,
and pyramid schemes in the area of signal and image
processing. For collections of references, see [3, 5, 6, 8, 14].
For a given wavelet transform, two pairs of sequences
are needed. The first pair is called the wavelet analysis filter
and the other pair the wavelet synthesis filter, where the former
is for the decomposition of a signal and the latter is
for the reconstruction of the signal. They are uniquely
determined by the wavelet transform. In this paper we employ
the Haar wavelet, which is the simplest and most popular
wavelet, given by Haar [14]. Equations (1) and (2) show the
analysis and synthesis filters for the Haar wavelet, respectively.
$$H_a = \{1/\sqrt{2},\ 1/\sqrt{2}\}, \qquad G_a = \{-1/\sqrt{2},\ 1/\sqrt{2}\} \qquad (1)$$
$$H_s = \{1/\sqrt{2},\ 1/\sqrt{2}\}, \qquad G_s = \{1/\sqrt{2},\ -1/\sqrt{2}\} \qquad (2)$$
Basically, the Haar wavelet decomposes the signal by replacing
each adjacent pair of data in a discrete interval with the
average and difference of the pair. By recursively repeating
the decomposition process on the averaged sequence,
we get a multi-resolution decomposition. Note that in Equations
(1) and (2) we use $\sqrt{2}$ instead of 2 as a scaling factor
since just averaging cannot preserve Euclidean distance
in the transformed signals. Wavelet coefficients can be defined
as detailed coefficients (which are differences of the
pairs) or the average in the lowest resolution. Several of the
computed wavelet coefficients have very small magnitudes.
Thus, keeping only the most significant coefficients enables
us to represent the signal in a lower dimension.
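The role of the $\sqrt{2}$ scaling factor can be seen in a short numerical sketch. It applies one Haar step to an arbitrary sample sequence (the values and helper names are illustrative) and compares the energy, i.e. the squared Euclidean norm, before and after the step for both scalings.

```python
import math


def haar_step(signal, scale):
    """Replace each adjacent pair by (sum / scale, difference / scale)."""
    trend = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i + 1] - signal[i]) / scale for i in range(0, len(signal), 2)]
    return trend, detail


def energy(seq):
    """Squared Euclidean norm of a sequence."""
    return sum(v * v for v in seq)


x = [2.0, 4.0, 4.0, 6.0, 6.0, 8.0, 2.0, 0.0]

# Scaling by sqrt(2): the transform is orthonormal, so energy is preserved.
trend, detail = haar_step(x, math.sqrt(2.0))
energy_sqrt2 = energy(trend) + energy(detail)

# Scaling by 2 (plain averaging): the energy is halved at every step.
plain_trend, plain_detail = haar_step(x, 2.0)
energy_plain = energy(plain_trend) + energy(plain_detail)

print(energy(x), energy_sqrt2, energy_plain)
```

Since Euclidean distances between signals are preserved only in the first case, distance-based comparisons can be carried out directly on the coefficients.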
3.2 1D TSA-tree
In time-series databases, queries are submitted to iden-
tify trends or surprises within different levels of abstractions
such as within a week, a month or a year. For example, a
trend query could be “Find the cities where temperature has
been increasing during the last month or/and decade”, and a
surprise query could be “Find the cities where temperature
has sudden changes (monthly) during the last year/decade”.
As a consequence, a huge subset of raw time-series data
is required to be retrieved and processed to support multi-
level trend and surprise queries. In [17], we proposed a
novel tree-like data structure termed TSA-tree (stands for
Trend and Surprise Abstractions) for efficient management
of time-series databases. In order to support the multi-level
queries effectively, TSA-tree precomputes trends and surprises
at different levels and stores them in a tree. The root
node of this tree contains the original time-series while each
internal node (or leaf) is constructed by applying wavelet
transform to its parent. In [17], we proved that by utiliz-
ing the wavelet transform, we can naturally split a time-
series sequence into two nodes where one node captures
the trends and the other the surprises within the original se-
quence. Hence, by traversing down the tree, and applying
wavelet recursively to trend sequences, we increase the level
of abstraction on trends and surprises. Meanwhile, as we
traverse down the tree, the size of each node decreases by half.
Therefore, the higher the level of abstraction required by the
trend and surprise queries, the better the performance of the
system to support these queries (the rate of improvement is
exponential). In sum, the nodes of TSA-trees can immediately
be used to visualize trends and surprises at different
levels. They not only need little post-processing, but also
are much smaller in size as compared to the original time series.
Hence, the performance is improved due to both eliminating
the CPU-bound processing and significantly reducing
the I/O cost for data retrieval.

Figure 2. Illustration of 2D wavelet transforms on 4 data points
Original values (2 x 2 block):
  a b
  c d
Average value: (a+b+c+d)/4
Detailed value 1 (D-horizontal): ((b+d)/2-(a+c)/2)/2
Detailed value 2 (D-vertical): ((a+b)/2-(c+d)/2)/2
Detailed value 3 (D-diagonal): ((a+d)/2-(b+c)/2)/2
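The construction of a 1D TSA-tree described in this section can be sketched as follows. This is an illustrative reading of the description, assuming the Haar filter; the node names `AX1`, `DX1`, ... and the function names are placeholders, not taken from [17].

```python
import math


def split_1d(seq):
    """Split a sequence into a trend node and a surprise node (Haar)."""
    r = math.sqrt(2.0)
    trend = [(seq[i] + seq[i + 1]) / r for i in range(0, len(seq), 2)]
    surprise = [(seq[i + 1] - seq[i]) / r for i in range(0, len(seq), 2)]
    return trend, surprise


def build_tsa_tree(series, levels):
    """Return a dict of nodes: root 'X', trends 'AXi', surprises 'DXi'."""
    nodes = {"X": list(series)}
    trend = list(series)
    for i in range(1, levels + 1):
        # Only the trend sequence is split again at the next level.
        trend, surprise = split_1d(trend)
        nodes[f"AX{i}"] = trend
        nodes[f"DX{i}"] = surprise
    return nodes


series = [float(v) for v in range(16)]
tree = build_tsa_tree(series, levels=3)
sizes = {name: len(values) for name, values in tree.items()}
print(sizes)
```

Each level halves the node size, so a query answered at level 3 touches one eighth of the values in the root.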
4 2D TSA-tree
Wavelet transform analyzes data in a localized man-
ner. Hence, it can provide the mechanism to support trend
queries at different sub-regions and at multi-resolutions.
Therefore, in this study, to support mining of spatial trends
for two-dimensional data, we extend the TSA-tree model to
2D TSA-tree using 2D wavelet transforms. That is, we ap-
ply 1D wavelet transform on the 2D data set in different di-
mensions/directions to obtain average and detailed values
(e.g., difference of values in the direction of X-axis) from
the original data set. Subsequently, a subset of the obtained
data is used and stored in a 2D TSA-tree as a representa-
tive for the original data. This section describes the basics of
2D TSA-tree. In Sec. 4.1, we discuss how to obtain wavelet
transforms of a 2D data set by giving some intuitive examples.
Next, in Sec. 4.2, we formally define the split and
merge operations, which are the required operations to create
2D TSA-tree.
4.1 2D Wavelet Transforms
With 1D wavelet transforms using the Haar wavelet
filter, each adjacent pair of data in a discrete interval is replaced
with its average and difference. A similar concept
can be applied to obtain a 2D wavelet transform of data in
a discrete plane. With 2D, each adjacent four points in a
discrete plane can be replaced by their averaged value and
three detailed values (see Fig. 2). The detailed values (D-horizontal,
D-vertical, and D-diagonal) correspond to the
average of the difference of: 1) the summation of the rows,
2) the summation of the columns, and 3) the summation of
the diagonals. In general, to obtain wavelet coefficients for
2D data, we apply 1D wavelet transform to the data along
X-axis first, resulting in lowpass and highpass signals (average
and difference). Next, we apply 1D wavelet transforms
to both signals along Y-axis generating one averaged
and three detailed signals. Consequently, a 2D wavelet decomposition
is obtained by recursively repeating this procedure
on the averaged signal. Fig. 3 illustrates the procedure.

Figure 3. Illustration of 2D wavelet transforms on 16 data points
  Original data (row-major): 2 4 4 6 | 6 8 2 0 | 4 14 2 0 | 4 6 8 10
  1D wavelet along x-axis: lowpass signal 3 5 7 1 9 1 5 9; highpass signal 1 1 1 -1 5 -1 1 1
  1D wavelet along y-axis: average 5 3 7 5; detailed signals (D-horizontal, D-vertical, D-diagonal) 2 -2 -2 4; 1 0 3 0; 0 -1 -2 1
  Second level (applied to the average): x-axis 4 6 and -1 -1; y-axis average 5, detailed values 1, -1, 0
The root node of the tree contains the original data (in
row-major order) of the mesh of values (for example, temperatures).
First, we apply 1D wavelet transforms along X-axis,
i.e., for each two points along X-axis we compute the average
and difference, so we obtain (3 5 7 1 9 1 5 9) and
(1 1 1 -1 5 -1 1 1). Next, we apply 1D wavelet transforms along
Y-axis, i.e., for each two points along Y-axis we compute the
average and difference. We perform this process recursively until
the number of elements of the averaged signal becomes 1 or
a threshold is met.
For the purpose of illustration, we use scaling factor 2 instead of $\sqrt{2}$.
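The worked example of Fig. 3 can be reproduced with a short sketch. It uses the scaling factor 2 of the illustration (not $\sqrt{2}$) and takes each difference as the second value minus the first, which is the convention that matches the numbers in the figure. The function and variable names are illustrative.

```python
def pairwise(values):
    """Average and difference of each adjacent pair (scaling factor 2)."""
    avg = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    diff = [(values[i + 1] - values[i]) / 2 for i in range(0, len(values), 2)]
    return avg, diff


def along_x(grid):
    """Apply the 1D step to every row."""
    low, high = [], []
    for row in grid:
        avg, diff = pairwise(row)
        low.append(avg)
        high.append(diff)
    return low, high


def along_y(grid):
    """Apply the 1D step to every pair of consecutive rows."""
    cols = range(len(grid[0]))
    avg = [[(grid[r][c] + grid[r + 1][c]) / 2 for c in cols]
           for r in range(0, len(grid), 2)]
    diff = [[(grid[r + 1][c] - grid[r][c]) / 2 for c in cols]
            for r in range(0, len(grid), 2)]
    return avg, diff


data = [[2, 4, 4, 6],
        [6, 8, 2, 0],
        [4, 14, 2, 0],
        [4, 6, 8, 10]]

low, high = along_x(data)            # lowpass and highpass signals
average, detail_1 = along_y(low)     # from the lowpass signal
detail_2, detail_3 = along_y(high)   # from the highpass signal

print(low, high)
print(average, detail_1, detail_2, detail_3)
```

Applying the same two steps to `average` yields the next level of the decomposition.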
4.2 Split and Merge Operations
In [17], we introduced two operations termed split and
merge in order to construct 1D TSA-tree. Split is the operation
that generates a multi-level tree, where each node
contains the wavelet coefficients of the corresponding multi-level
trends and surprises, while merge is the inverse operation
of split. In this study, we extend these two operations
to construct a 2D TSA-tree. For the following discussions,
we use an $n \times m$ matrix to represent a rectangular area whose
size is $n \times m$ in 2D Euclidean space. Without loss of generality,
we also assume that each row and column of the matrix has
starting index 0, and row and column correspond to X-axis
and Y-axis of the 2D space, respectively.
Definition 4.1: Convolution along X-axis is an operation
between an $n \times m$ matrix $W = (w_{ij})$ and $H = \{h_0, h_1, \ldots, h_{l-1}\}$,
where $l \le m$; the result is an $n \times m$ matrix $Z$, where
$$z_{ij} = \sum_{k=0}^{l-1} h_k \, w_{i,\,j+k}.$$
When indices are out of range (i.e., $j + k \ge m$), we append
zero values to the sequence. We denote Convolution along
X-axis as $Z = Conv_x(W, H)$.
Example 4.2: If $W$ = […] and $H$ = […], then $Conv_x(W, H)$ = […].
Definition 4.3: Convolution along Y-axis is an operation
between an $n \times m$ matrix $W = (w_{ij})$ and $H = \{h_0, h_1, \ldots, h_{l-1}\}$,
where $l \le n$; the result is an $n \times m$ matrix $Z$, where
$$z_{ij} = \sum_{k=0}^{l-1} h_k \, w_{i+k,\,j}.$$
When indices are out of range (i.e., $i + k \ge n$), we append
zero values to the sequence. We denote Convolution along
Y-axis as $Z = Conv_y(W, H)$.
Example 4.4: If $W$ and $H$ are as in Example 4.2, $Conv_y(W, H)$ = […].
Definition 4.5: Down Sampling (by 2) along X-axis is an
operation which takes an $n \times m$ matrix $W = (w_{ij})$ as input and
produces an $n \times m/2$ matrix $Z$ as output, where $z_{ij} = w_{i,\,2j}$
when $i$ and $j$ are integer numbers such that $0 \le i < n$ and
$0 \le j < m/2$. We denote Down Sampling along
X-axis as $Z = DownSample_x(W)$.
Example 4.6: If $W$ is as in Example 4.2, $DownSample_x(W)$ = […].
Definition 4.7: Down Sampling (by 2) along Y-axis is
an operation which takes an $n \times m$ matrix $W = (w_{ij})$ as input
and produces an $n/2 \times m$ matrix $Z$ as output, where $z_{ij} = w_{2i,\,j}$
when $i$ and $j$ are integer numbers such that $0 \le i < n/2$ and
$0 \le j < m$. We denote Down Sampling
along Y-axis as $Z = DownSample_y(W)$.
Example 4.8: If $W$ is as in Example 4.2, $DownSample_y(W)$ = […].
As shown in Fig. 4(a), the split operation can be defined using the
Down Sampling and Convolution operations along X-axis
and Y-axis. Note that the split operation is equivalent to
the wavelet decomposition operation for 2D data discussed
in Sec. 4.1. A 2D TSA-tree is constructed by applying the split
operation on the $AX_i$’s repeatedly. That is, we start by applying
split on $X$ to obtain $AX_1$, $D^1X_1$, $D^2X_1$, and $D^3X_1$.
Subsequently, we split $AX_1$ into $AX_2$, $D^1X_2$, $D^2X_2$, and
$D^3X_2$. This procedure repeats $k$ times in order to construct
a 2D TSA-tree with $k$ levels. Fig. 5 shows the structure
of a general 2D TSA-tree. Original data is contained in
the root node. An $AX_i$ node (averaged values) contains
information about trends, while $D^1X_i$ (D-horizontal), $D^2X_i$
(D-vertical) and $D^3X_i$ (D-diagonal) are the detailed values
that contain information about surprises.
Conversely, for four equi-sized data sets $AX$, $D^1X$, $D^2X$ and
$D^3X$, the merge operation can be applied to obtain the original
data $X$ from the averaged and detailed values. The
merge operation can be defined using the Up Sampling
and Convolution operations along X-axis and Y-axis as in
Fig. 4(b).
Definition 4.9: Up Sampling along X-axis is an operation
which takes an $n \times m$ matrix $Z$ as input and produces
an $n \times 2m$ output matrix $W$ with property: $w_{i,\,2j+1} = z_{ij}$ and
$w_{i,\,2j} = 0$. We denote Up Sampling along X-axis as
$W = UpSample_x(Z)$.
Example 4.10: If $W$ is as in Example 4.2, $UpSample_x(W)$ = […].
Definition 4.11: Up Sampling along Y-axis is an operation
which takes an $n \times m$ matrix $Z$ as input and produces
a $2n \times m$ output matrix $W$ with property: $w_{2i+1,\,j} = z_{ij}$ and
$w_{2i,\,j} = 0$. We denote Up Sampling along Y-axis as
$W = UpSample_y(Z)$.
Example 4.12: If $W$ is as in Example 4.2, $UpSample_y(W)$ = […].
The algorithm for the merge operation is illustrated in Fig. 4(b).
Note that the merge operation is equivalent to the wavelet
reconstruction operation for 2D data.
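The split and merge operations can be sketched directly in terms of convolution, down-sampling and up-sampling. The index and sign conventions below are assumptions chosen so that merge exactly inverts split: convolution sums $h_k w_{i,j+k}$ with zero padding, down-sampling keeps even indices, up-sampling places values at odd indices, and the Haar detail filters are taken as $\{-1/\sqrt{2}, 1/\sqrt{2}\}$ for analysis and $\{1/\sqrt{2}, -1/\sqrt{2}\}$ for synthesis. All function names are illustrative.

```python
import math

R = 1.0 / math.sqrt(2.0)
H_A, G_A = [R, R], [-R, R]   # analysis filters (sign convention assumed)
H_S, G_S = [R, R], [R, -R]   # synthesis filters (sign convention assumed)


def transpose(w):
    return [list(col) for col in zip(*w)]


def conv_x(w, h):
    """z[i][j] = sum_k h[k] * w[i][j + k], with zeros beyond the last column."""
    m = len(w[0])
    return [[sum(h[k] * row[j + k] for k in range(len(h)) if j + k < m)
             for j in range(m)] for row in w]


def conv_y(w, h):
    return transpose(conv_x(transpose(w), h))


def down_x(w):
    return [row[0::2] for row in w]          # keep even columns


def down_y(w):
    return [list(row) for row in w[0::2]]    # keep even rows


def up_x(z):
    """Zeros at even columns, values at odd columns (assumed placement)."""
    return [[v for value in row for v in (0.0, value)] for row in z]


def up_y(z):
    return transpose(up_x(transpose(z)))


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(ax):
    """One level of decomposition: returns (AX, D1, D2, D3)."""
    t1 = down_x(conv_x(ax, H_A))
    t2 = down_x(conv_x(ax, G_A))
    return (down_y(conv_y(t1, H_A)), down_y(conv_y(t1, G_A)),
            down_y(conv_y(t2, H_A)), down_y(conv_y(t2, G_A)))


def merge(ax, d1, d2, d3):
    """Inverse of split: undo the Y-axis step, then the X-axis step."""
    t1 = add(conv_y(up_y(ax), H_S), conv_y(up_y(d1), G_S))
    t2 = add(conv_y(up_y(d2), H_S), conv_y(up_y(d3), G_S))
    return add(conv_x(up_x(t1), H_S), conv_x(up_x(t2), G_S))


X = [[2.0, 4.0, 4.0, 6.0],
     [6.0, 8.0, 2.0, 0.0],
     [4.0, 14.0, 2.0, 0.0],
     [4.0, 6.0, 8.0, 10.0]]

parts = split(X)
restored = merge(*parts)
max_error = max(abs(p - q) for rp, rq in zip(restored, X) for p, q in zip(rp, rq))
print(max_error < 1e-9)
```

With the $\sqrt{2}$ scaling, each entry of the averaged node is half the sum of a 2 x 2 block, so the first entry here is (2+4+6+8)/2 = 10 up to rounding.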
Split operation transfers the input data from one domain
to another. Hence, to avoid false dismissals while searching
for trends, it is important that such transformation preserves
the energy of the original data (i.e., the sibling nodes of 2D
TSA-tree should preserve the energy of their parent’s spatial
data). Furthermore, each node of 2D TSA-tree should
be reconstructed by “merging” its children without losing
any information. Hence, merge operation should also preserve
the energy of the data. In the following discussion
we prove the energy preserving theorem for split assuming the
Haar wavelet. A similar argument for the merge operation is
straightforward, thus the proof will be skipped.

(a) Algorithm for split operation
{AX_{i+1}, D^1X_{i+1}, D^2X_{i+1}, D^3X_{i+1}} = split(AX_i)
begin
  Temp1 = Conv_x(AX_i, H_a)
  Temp2 = Conv_x(AX_i, G_a)
  T1X = Down-Sample_x(Temp1)
  T2X = Down-Sample_x(Temp2)
  Temp-AX_{i+1} = Conv_y(T1X, H_a)
  Temp-D^1X_{i+1} = Conv_y(T1X, G_a)
  Temp-D^2X_{i+1} = Conv_y(T2X, H_a)
  Temp-D^3X_{i+1} = Conv_y(T2X, G_a)
  AX_{i+1} = Down-Sample_y(Temp-AX_{i+1})
  D^1X_{i+1} = Down-Sample_y(Temp-D^1X_{i+1})
  D^2X_{i+1} = Down-Sample_y(Temp-D^2X_{i+1})
  D^3X_{i+1} = Down-Sample_y(Temp-D^3X_{i+1})
end

(b) Algorithm for merge operation
AX_i = merge(AX_{i+1}, D^1X_{i+1}, D^2X_{i+1}, D^3X_{i+1})
begin
  Temp-AX = Up-Sample_y(AX_{i+1})
  Temp-D^1X = Up-Sample_y(D^1X_{i+1})
  Temp-D^2X = Up-Sample_y(D^2X_{i+1})
  Temp-D^3X = Up-Sample_y(D^3X_{i+1})
  T1X = Conv_y(Temp-AX, H_s) + Conv_y(Temp-D^1X, G_s)
  T2X = Conv_y(Temp-D^2X, H_s) + Conv_y(Temp-D^3X, G_s)
  Temp1 = Up-Sample_x(T1X)
  Temp2 = Up-Sample_x(T2X)
  AX_i = Conv_x(Temp1, H_s) + Conv_x(Temp2, G_s)
end
Figure 4. Split and merge operations

Figure 5. 2D TSA-tree (root X; each split of AX_i yields the children AX_{i+1}, D^1X_{i+1}, D^2X_{i+1} and D^3X_{i+1}; three levels shown)
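Lemma 4.13: Split operation preserves the energy of the
original data set at the first level of a 2D TSA-tree, i.e.,
$$\|X\|^2 = \|AX_1\|^2 + \sum_{i=1}^{3} \|D^iX_1\|^2$$
Proof: For simplicity, we assume the size of X is equal to
4. The extension to larger sizes is straightforward. For
$X = (x_1\ x_2\ x_3\ x_4)$, define $T^1X$ and $T^2X$ as follows:
$$T^1X = (a_1\ a_2) = (x_1\ x_2\ x_3\ x_4)\,P \qquad (3)$$
$$T^2X = (d_1\ d_2) = (x_1\ x_2\ x_3\ x_4)\,Q \qquad (4)$$
where P and Q are determined by the Haar wavelet analysis
filter as follows:
$$P = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
Q = \frac{1}{\sqrt{2}}\begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{pmatrix}$$
Equations (3) and (4) correspond to the process of 1D
wavelet application to X along X-axis. Now, we define $C_1$
and $C_2$ as follows:
$$C_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & 1 \end{pmatrix}, \qquad
C_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$
We can then combine equations (3) and (4) into one using $C_1$:
$$(a_1\ d_1\ a_2\ d_2) = (x_1\ x_2\ x_3\ x_4)\,C_1$$
Now, consider the following two equations that correspond
to the wavelet application along Y-axis:
$$(a\ d^1) = (a_1\ a_2)\,C_2$$
$$(d^2\ d^3) = (d_1\ d_2)\,C_2$$
Since $C_1$ and $C_2$ are orthonormal matrices, the following
equation holds.
$$\|AX_1\|^2 + \sum_{i=1}^{3} \|D^iX_1\|^2 = a^2 + (d^1)^2 + (d^2)^2 + (d^3)^2 = a_1^2 + d_1^2 + a_2^2 + d_2^2 = X C_1 C_1^T X^T = \|X\|^2 \qquad (5)$$
Lemma 4.14: Split operation preserves the energy of the
original data set at any level of a 2D TSA-tree.
$$\|X\|^2 = \|AX_k\|^2 + \sum_{i=1}^{k} \sum_{j=1}^{3} \|D^jX_i\|^2 \qquad (6)$$
Proof: By Lemma 4.13,
$$\|X\|^2 = \|AX_1\|^2 + \sum_{j=1}^{3} \|D^jX_1\|^2
= \|AX_2\|^2 + \sum_{j=1}^{3} \|D^jX_2\|^2 + \sum_{j=1}^{3} \|D^jX_1\|^2
= \cdots = \|AX_k\|^2 + \sum_{i=1}^{k} \sum_{j=1}^{3} \|D^jX_i\|^2 \qquad (7)$$

Figure 6. Illustration of 2D cells (the level-1 cells $1^1$, $2^1$, $3^1$ and $4^1$, and a group of points a that is not a cell)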
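Equation (6) can be checked numerically. The sketch below is an illustration rather than part of the proof: it applies an orthonormal 2D Haar split (scaling factor $\sqrt{2}$ along each axis, so each coefficient is a signed sum of a 2 x 2 block divided by 2) to a sample 4 x 4 grid for k = 2 levels and compares both sides of the equation. Function names and the sample data are illustrative.

```python
import math


def split2d(grid):
    """One orthonormal 2D Haar split of a grid with even side lengths."""
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    ax = [[0.0] * cols for _ in range(rows)]
    d1 = [[0.0] * cols for _ in range(rows)]
    d2 = [[0.0] * cols for _ in range(rows)]
    d3 = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            a, b = grid[2 * r][2 * c], grid[2 * r][2 * c + 1]
            e, f = grid[2 * r + 1][2 * c], grid[2 * r + 1][2 * c + 1]
            ax[r][c] = (a + b + e + f) / 2   # lowpass in x, lowpass in y
            d1[r][c] = (e + f - a - b) / 2   # lowpass in x, highpass in y
            d2[r][c] = (b + f - a - e) / 2   # highpass in x, lowpass in y
            d3[r][c] = (a + f - b - e) / 2   # highpass in x, highpass in y
    return ax, [d1, d2, d3]


def energy(grid):
    """Squared norm: sum of the squares of all entries."""
    return sum(v * v for row in grid for v in row)


def energy_sides(x, k):
    """Left and right side of Equation (6) after k splits."""
    trend, detail_energy = x, 0.0
    for _ in range(k):
        trend, details = split2d(trend)
        detail_energy += sum(energy(d) for d in details)
    return energy(x), energy(trend) + detail_energy


X = [[2.0, 4.0, 4.0, 6.0],
     [6.0, 8.0, 2.0, 0.0],
     [4.0, 14.0, 2.0, 0.0],
     [4.0, 6.0, 8.0, 10.0]]

lhs, rhs = energy_sides(X, k=2)
print(lhs, rhs)
```

Both sides evaluate to the same value for this grid, as the lemma predicts.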
5 Basic Approaches to User Interaction
This section defines user interactions for spatial trend
mining and shows how 2D TSA-tree is used to answer
the queries for specified regions of interest at the required
resolution or within a tolerable response time. In Sec. 5.1,
we show the user interaction through sub-region selection,
where levels of 2D TSA-tree are extracted based on the
selected regions by utilizing the localized analysis
feature of wavelet transforms. Then, in Sec. 5.2, we further
enhance the user interaction by allowing the user to determine
the accuracy and/or the latency (in addition to regions) of the
query response. This utilizes the multi-level abstraction
mechanism of the 2D TSA-tree in which the different levels store
different sizes of data that lead to different resolutions and
response times.
5.1 Support for Sub-region Selection
This section defines user interactions for spatial trend
mining through sub-region selection. It shows how 2D
TSA-tree is used to answer the sub-region selection queries
by extracting levels based on the selected regions. The ex-
traction uses the localized analysis feature of the wavelet
transforms. For the following discussion, assume that the
spatial data set is defined as points in a 2D mesh.
Definition 5.1: [Cell (C)] Each point (P ) in the mesh is a
cell by itself. A set of points is a cell only if they are used
together by 2D wavelet transforms to compute wavelet co-
efficients in some resolution.
Thus, a cell can be viewed as a set of points in a 2D plane.
We denote by $num_i$ the num-th cell at the i-th level (left to
right, bottom to top). Each cell contains four floating-point values,
one for the averaged signal and three for the detailed signals. In
Fig. 6, we show an example of points that are grouped as a
cell. For example, a is not a cell since the points that are
grouped are not used together to generate any wavelet coef-
ficients. On the other hand, , , and are considered
as cells at level 1.
In this section, we assume that users can only select the
predefined cells for mining purposes. For example, the area
shown through a GUI can be pre-partitioned, and the users
select a predefined area by clicking within it. Later, in
Sec. 6.1, we will relax this assumption with another variation
of 2D TSA-tree (Customized 2D TSA-tree). In Fig. 7,
we define a simple index structure for 2D TSA-tree, which
groups together those cells that are at the same level, and
in Fig. 8, we show the algorithm used to fetch the appropriate
cells efficiently depending on the user selection. For
simplicity, we assume that the sizes of the entire area and of
the selected area are both integral powers of 4, and that both
areas are square. When a user wants
to submit a trend query, he/she selects an area that corresponds
to a cell at a certain level. This provides the
system with the lower-left and upper-right coordinates, $(x, y)$
and $(x + k, y + k)$, respectively. Consequently, using the
algorithm in Fig. 8, the system traverses up the 2D TSA-tree
to retrieve the corresponding wavelet coefficients in two
major steps. First, using the selected area/cell, we find the
level that we should visit. For example, if a cell at level 2 is
selected, then we should visit level 2 of the 2D TSA-tree and fetch
the appropriate cells while we traverse up the tree. Note that
for this example, we do not need to consider the 3rd level of the
2D TSA-tree, since the cell at that level contains more information
than the selected area. Second, we determine the first location
of the selected cell and fetch the cells. As we traverse up the
2D TSA-tree, we increase the number of cells we should
obtain and repeat this process until we reach the root of the
tree.
Figure 7. An index structure of 2D TSA-tree (cells at Level 3, Level 2, and Level 1; a level table whose ptr entries point to cells 1, 2, ..., 16 of a level)
Note that the algorithm in Fig. 8 can be viewed as constructing
a 2D TSA-subtree for the area when a precomputed
2D TSA-tree is given. The time complexity of this algorithm
is $O(\log k)$, where k is the length of the side of the
selected square.
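The two-step traversal described above can be sketched in a few lines of Python. This is an illustrative sketch under an assumption that the text does not state explicitly: cells are numbered so that the four children of cell m at level j are cells 4m-3 through 4m at level j-1 (a quadtree-style numbering), which makes each fetch a contiguous run. The function name `fetch_plan` and its return format are hypothetical.

```python
def fetch_plan(level, cell):
    """List the (level, first_cell, count) runs to fetch for a selected cell.

    `level` is the level of the selected cell and `cell` its 1-based number.
    Assumes the children of cell m at level j are cells 4m-3 .. 4m at level
    j-1; this numbering is an assumption made for this sketch.
    """
    plan = []
    first, count = cell, 1
    while level >= 1:
        plan.append((level, first, count))
        first = 4 * (first - 1) + 1  # first child of the current first cell
        count *= 4                   # four times as many cells one level finer
        level -= 1
    return plan
```

The loop runs once per level between the selected level and level 1, so for a selected square of side k (a level-log2(k) cell) it performs a number of steps logarithmic in k, in line with the stated complexity.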
l: the length of the side of the entire area
k: the length of the side of the selected area
i <- 1;  j <- log k;  m <- (y/k)(l/k) + x/k
while (j > 0) begin
    go to Level j, ptr m, and fetch i cells
    j <- j - 1;  i <- 4i
    m <- first location of the selected area at Level j
end
Figure 8. An algorithm for fetching the appropriate cells
5.2 Support for Multi-level Abstractions
In this section, we further enhance the user interaction
through multi-level abstraction support, where the user
can determine the accuracy and/or the latency (in addition
to the regions) for better utilization of the 2D TSA-tree. Users
usually do not have prior knowledge of the structure of the
database or the index structures implemented on top of it.
Hence, when they submit trend queries, they cannot specify
the level of the 2D TSA-tree from which to get the trends. However,
they expect their query to be answered by the database system
with an acceptable running time or accuracy. Therefore,
we designed our system to utilize two different parameters
(i.e., tolerable accuracy and processing time) that can be
provided by the user to answer the trend query while meeting
the user's restrictions/requirements. Towards this end,
we store extra information within each node of the 2D TSA-tree
(at all the different levels) that can be utilized by the system
to compute the estimated running time of the query or its
accuracy.
Using Haar wavelet filter, the shape of every cell in our
2D TSA-tree model is represented by a rectangle. Hence, by
assuming the shape of the scale provided is a rectangle, we
can find the best-matched 2D TSA-tree level for the query
region using the size of the scale and the size of cells at each
level of the 2D TSA-tree (that is, by comparing the num-
ber of points in the cell and the scale). Consequently, we
can use the corresponding level of 2D TSA-tree for the trend
mining. The selected region can also be represented by a
set of cells at a higher level of the 2D TSA-tree. For ex-
ample, A
=f g and A
=f g in Fig. 7 occupy
same region, however we cannot compare A
with A
di-
rectly (e.g., distance) since their sizes are not equal. In [17],
we proposed an algorithm for transforming $AX_i$ of a TSA-tree
into a higher level for the purpose of trend and surprise
mining. We employ a similar approach using up-sampling
and convolution along the X-axis and Y-axis (see Fig. 9 for
details). Using this algorithm, we transform a set of cells from
level i to the set at level j, which makes the size of the
transformed set equal to that of the set of cells at level j. For
example, the size of $A_j$ is equal to that of $Transform(A_i)$.
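The up-sampling and convolution steps can be illustrated as follows. This is a minimal Python sketch, not the algorithm of [17]: it assumes the Haar synthesis low-pass filter (1/sqrt(2), 1/sqrt(2)), operates only on the averaged values of the cells, and uses hypothetical names (`upsample`, `convolve`, `transform_one_level`, `transform`).

```python
import math

# Haar synthesis low-pass filter (assumed for this sketch).
H = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))


def upsample(seq):
    """Insert a zero after every sample."""
    out = []
    for value in seq:
        out.extend((value, 0.0))
    return out


def convolve(seq, filt):
    """Causal convolution, truncated to the length of seq."""
    return [
        sum(filt[k] * seq[n - k] for k in range(len(filt)) if n - k >= 0)
        for n in range(len(seq))
    ]


def transform_one_level(grid):
    """Up-sample and convolve along X (rows), then along Y (columns)."""
    rows = [convolve(upsample(row), H) for row in grid]
    cols = [convolve(upsample(list(col)), H) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def transform(grid, levels):
    """Bring a grid of averaged values `levels` levels closer to level 1."""
    for _ in range(levels):
        grid = transform_one_level(grid)
    return grid
```

With the Haar filter, each averaged value a becomes a 2x2 block of a/2, so the transformed grid has the same number of entries as the set of cells it is compared with, and its energy is unchanged.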
Definition 5.2: [Accuracy]: When a user's submitted scale
is translated into level i of the 2D TSA-tree and the selected
region is matched with a set of cells $A_i = \{n_i : n_i$ is the n-th
cell at level $i\}$, $A_i$ is said to have 100% accuracy. A set of
cells at a lower level j ($A_j$), which occupies the same selected
area as $A_i$, has s% accuracy, where s is determined by the
error between $Transform(A_j)$ and $A_i$.
In Tab. 2, we show five different error measures that can be
used as possible measures for the accuracy [18]. Here, $t$ and
$\hat{t}$ correspond to the trend at the best-matched level for the
query q and the trend at the lower level, respectively.
TX_i <- Transform(C_i)
begin
    C_i <- a set of cells at level i
    A_i <- average values of each cell in C_i
    TX_i <- A_i
    s <- log(sizeof(TX_i));
    for level <- s downto i
        Temp <- Up-sample_x(TX_i)
        Temp_x <- Conv_x(Temp, H_s)
        Temp <- Up-sample_y(Temp_x)
        TX_i <- Conv_y(Temp, H_s)
end
Figure 9. A transformation algorithm
Error measure             Notation     Definition
absolute error            $e_{abs}$    $|t - \hat{t}|$
relative error            $e_{rel}$    $|t - \hat{t}| / \max\{t, 1\}$
modified relative error   $e_{mrel}$   $|t - \hat{t}| / \max\{1, \min\{t, \hat{t}\}\}$
combined error            $e_{c}$      $\min\{e_{abs}, e_{rel}\}$
modified combined error   $e_{mc}$     $\min\{e_{abs}, e_{mrel}\}$
Table 2. Different error measures
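The five measures of Tab. 2 are simple to compute once the two trend values are available. The sketch below is illustrative only; the function name `error_measures` and the dictionary keys are hypothetical, and `t_hat` stands for the trend value obtained at the lower level.

```python
def error_measures(t, t_hat):
    """Return the five error measures of Tab. 2 for trend values t and t_hat."""
    e_abs = abs(t - t_hat)
    e_rel = e_abs / max(t, 1.0)
    e_mrel = e_abs / max(1.0, min(t, t_hat))
    return {
        "abs": e_abs,
        "rel": e_rel,
        "mrel": e_mrel,
        "c": min(e_abs, e_rel),    # combined error
        "mc": min(e_abs, e_mrel),  # modified combined error
    }
```

For instance, with t = 10 and t_hat = 8 the absolute error is 2, the relative error is 0.2, and the modified relative error is 0.25.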
Definition 5.3: [Latency]: Latency is the time needed to
fetch the set of cells from the 2D TSA-tree.
Trivially, the latency of fetching lower level cells for some
area is smaller than that of cells at higher levels for that same
area. For every cell at some level, we precompute the error
(see Tab. 2) between the cell and the sets of cells at higher
levels which occupy the same region. For example, for in
Fig. 7, we transform it into level 1 and precompute the error
between f g and the T r ansf or m f g . In ad-
dition, for every set of cells that occupy the same region, we
store the latency (e.g., latency for f g and f g).
When a user submits a trend query with a specified scale,
using the stored information (latency) inside the cells, the
system can provide an estimated processing time for fetch-
ing the trend at the matched level. If the time is acceptable
to the user, the system fetches the cells at the current level
for the trend mining. Otherwise, the user can provide a tol-
erable processing time and the system would fetch the cells
at the appropriate level (lower level) using the stored latency
information. It can also estimate the accuracy of the result
using the stored accuracy information inside the cells. Users
can hence provide the tolerable accuracy, and the system can
show the trend by fetching the cells at the lowest level that
satisfies the accuracy requirement.
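The level selection described above can be sketched as a lookup over the precomputed per-level information. This Python sketch is an assumption-laden illustration: the tuple layout `(level, error, latency)`, the ordering of the list from the best-matched level to coarser levels, and the function names `level_for_latency` and `level_for_accuracy` are all hypothetical.

```python
def level_for_latency(stats, max_latency):
    """Pick the most accurate level whose stored latency fits the budget.

    stats: list of (level, error, latency), ordered from the best-matched
    (most accurate) level to coarser levels. Returns None if nothing fits.
    """
    for level, _error, latency in stats:
        if latency <= max_latency:
            return level
    return None


def level_for_accuracy(stats, max_error):
    """Pick the coarsest level whose stored error is within the tolerance."""
    chosen = None
    for level, error, _latency in stats:
        if error <= max_error:
            chosen = level  # later entries are coarser, hence cheaper to fetch
    return chosen
```

Because both quantities are stored in advance, either lookup touches only the per-level metadata and not the wavelet coefficients themselves.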
6 2D TSA-tree Variations
In this section, we provide two variations of 2D TSA-tree,
named Customized 2D TSA-tree and Optimal 2D TSA-tree.
With Customized 2D TSA-tree (Sec. 6.1), we further
enhance the user interaction through sub-region selection by
relaxing the assumption of Sec. 5 that a user can only select a
predefined cell. Here, we assume that a user can select areas
that are not defined as cells by themselves but are covered by
other predefined cells. Moreover, we show how to construct
2D TSA-subtrees for such areas on-the-fly. The size of the
entire 2D TSA-tree is usually larger than the size of the original
spatial data. Moreover, due to disk space limitations,
sometimes the 2D TSA-tree cannot be stored on magnetic disk(s)
in its entirety. Therefore, Optimal 2D TSA-tree is introduced
in Sec. 6.2 to further enhance user interaction through
multi-level abstractions by saving on storage requirements
while at the same time efficiently supporting the mining of trends.
Optimal 2D TSA-tree strives to find an optimal subset
of the 2D TSA-tree to store on disk without introducing (or
by reducing) error.
6.1 Customized 2D TSA-subtree
In Sec. 5, we restricted the users to selecting only predefined
cells for mining purposes. However, Customized 2D TSA-subtree
provides users with more flexibility in the area of selection.
That is, a user can select areas that are not defined as
cells but are covered by other predefined cells (or select
more than one cell). First, we define the concept of a Cover
as follows:
Definition 6.1: [Cover (CV)] Let CV be a set of cells at
level i, SA be a selected area, and X be the set of all points
inside SA. Then CV is a cover for SA at level
i if and only if the following relation holds:
($\forall c \in CV$, if $p \in c$ then $p \in X$) and ($\forall p$, if $p \in X$ then $\exists c \in CV$, $p \in c$);
where p is a point in a 2D mesh.
Figure 10. Illustration of flexible area selection (cells 6, 7, 10, and 11 at level 1)
For example, in Fig. 10, the set of level-1 cells {6, 7, 10, 11} is a
cover for the solid-line rectangle in the center, while there is no
cover for the broken-line rectangle (note that the numbers inside the
ellipses represent the cell numbers). Now, we make the more
relaxed assumption that users can select a rectangular area
if there exists a cover CV for that area. Hence, the solid-line
rectangle is a valid selection while the broken-line rectangle is
not.
When the user selects a CV, the original 2D TSA-tree
does not have the wavelet coefficients for the CV. However,
it stores the wavelet coefficients for all the cells in CV.
Therefore, to find the trends for such an area at different
levels, the system has to compute new wavelet coefficients in
real time (i.e., create a customized 2D TSA-subtree). Using the
algorithm in Fig. 8, we can construct a TSA-subtree for each
cell in a cover (CV). Then we merge those subtrees into
a single customized TSA-subtree using 2D wavelet transforms.
For example, in Fig. 11, CV contains cells 6, 7, 10, and 11
at level 1. Hence, to create the customized 2D TSA-subtree,
we merge the averaged value from each cell in CV
into one list. Then, we apply the 2D wavelet transform to the
list to obtain a new cell (at the selection level) in the desired
customized 2D TSA-subtree. The time complexity of creating
a customized 2D TSA-subtree depends on the size of the
cells of the cover (i.e., the number of points in each cell). If
the size of a cell in the cover is k, then it takes
$O\left(\frac{n \log k}{k}\right)$ to
construct a customized 2D TSA-subtree, where n is the size
of the selected area.
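The merge step for a cover of four cells can be sketched as follows. This is an illustrative Python sketch assuming the orthonormal Haar filter; the function name `merge_cover_averages`, the argument order (lower-left, lower-right, upper-left, upper-right), and the assignment of the h, v, d labels to the three detail coefficients are assumptions made here.

```python
def merge_cover_averages(a_ll, a_lr, a_ul, a_ur):
    """Combine the averaged values of four covering cells into one new cell.

    Arguments are the averaged values of the lower-left, lower-right,
    upper-left, and upper-right cells of the cover. Returns (a, h, v, d)
    for the newly created cell, using the orthonormal 2D Haar transform.
    """
    a = (a_ll + a_lr + a_ul + a_ur) / 2.0
    h = (a_ll - a_lr + a_ul - a_ur) / 2.0
    v = (a_ll + a_lr - a_ul - a_ur) / 2.0
    d = (a_ll - a_lr - a_ul + a_ur) / 2.0
    return a, h, v, d
```

Only the four averaged values are read from the stored cells; their detail coefficients are reused unchanged in the sub-trees below the new cell.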
Figure 11. Computation of coefficients for a selected sub-region (the averaged values a of cells 6, 7, 10, and 11 at level 1 are combined by applying 2D wavelet transforms into a newly created cell with coefficients a, h, v, d)
6.2 Optimal 2D TSA-tree
The size of the entire 2D TSA-tree is larger than the size
of the original spatial data. For example, when the size of the
original 2D data is n, the following equation shows the size
of our 2D TSA-tree, which is much larger than that of the
original data.
n
n
n
n
Sometimes, due to disk space limitations, the 2D TSA-tree
cannot be stored on magnetic disk(s) in its entirety.
Therefore, to save on storage requirements while at the same
time efficiently supporting the mining of trends, we need to find an
optimal subset of the 2D TSA-tree to store on disk without
introducing any error. To this end, we use the property of the 2D
TSA-tree that all the internal nodes can be reconstructed
from leaf nodes (i.e., we can reconstruct $AX_i$ by merging its
children).
Wavelet coefficients are the part of the 2D TSA-tree that is
stored in the leaf nodes, and leaf nodes contain only wavelet
coefficients. Thus, we only need to store the leaf nodes, since
the other nodes (nodes containing trend information) can be
reconstructed from the leaf nodes. Hence, the maximum space
needed to store a 2D TSA-tree without introducing any error is n.
In some situations, the size of the available space, AS, is less
than n. In such situations, we need to find the optimal
set of nodes (or set of coefficients) that can fit into
AS. Furthermore, the data reconstructed ($\hat{X}$) from the
optimal set of nodes (condensed set) should have a minimum
distance to the original input data X (i.e., minimizing the
error). The following lemmas provide us with the relevant
argument.
Lemma 6.2: Suppose we reconstruct $\hat{X}$ by dropping some
leaf nodes of the 2D TSA-tree. Let the set of dropped nodes be
S. Then, the following equation holds.
$$\|X - \hat{X}\|^2 = \sum_{node \in S} \|node\|^2 \quad (8)$$
Lemma 6.3: Suppose we reconstruct $\hat{X}$ by dropping some
wavelet coefficients which are contained in the leaf nodes of the
2D TSA-tree. Let the set of dropped coefficients be S. Then,
the following equation holds.
$$\|X - \hat{X}\|^2 = \sum_{c \in S} c^2 \quad (9)$$
Figure 12. Method to obtain synthetic data
Lemmas 6.2 and 6.3 state that the amount of error depends
on the number and magnitude of the dropped nodes or coefficients
(the proofs of these lemmas are based on Lemma 4.14 and
are skipped since they are straightforward). In [17],
we introduced several algorithms for finding an optimal set of
nodes or coefficients for the 1D TSA-tree. Using the results
of Lemmas 6.2 and 6.3, we conclude that we can employ
the same algorithms for the 2D TSA-tree. In this paper, we
extend the OTSA-w/tcd (Optimal TSA with tail coefficients
dropping) and OTSA-w/scd (Optimal TSA with selective
coefficients dropping) algorithms of [17] to our 2D TSA-tree;
the extensions are referred to as 2D OTSA-w/tcd and 2D
OTSA-w/scd. 2D OTSA-w/tcd keeps the first few wavelet
coefficients, while 2D OTSA-w/scd keeps the coefficients with
maximum energy. We will compare these methods with
SVD and DFT in Sec. 7.2.2.
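The difference between the two dropping policies, and the error predicted by Lemma 6.3, can be illustrated with a short Python sketch. It is not the algorithm of [17]: the coefficients are assumed to be given as a flat list in their storage order, and the names `kept_tcd`, `kept_scd`, and `squared_error_of_drop` are hypothetical.

```python
def kept_tcd(coeffs, m):
    """Tail-coefficient dropping: keep the first m coefficients."""
    return set(range(min(m, len(coeffs))))


def kept_scd(coeffs, m):
    """Selective dropping: keep the m coefficients with maximum energy."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return set(order[:m])


def squared_error_of_drop(coeffs, kept):
    """Squared reconstruction error per Lemma 6.3: sum of dropped c^2."""
    return sum(c * c for i, c in enumerate(coeffs) if i not in kept)
```

Since selective dropping discards the smallest-magnitude coefficients, its squared error for a given budget m can never exceed that of tail dropping, at the cost of storing the positions of the kept coefficients.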
7 Performance Analysis
Our application can be defined as visualizing spatial
trends for selected areas of interest to users at different
resolutions when scientific observation data for the entire
region is available. Hence, the important criteria for evaluating
the performance of our techniques are: 1) how effectively we can
visualize trends, 2) how much we can optimize the
data when the available storage is limited, and 3) how
scalable our method is. We conducted experiments to
evaluate the above, and the results are shown in Sec. 7.2.
7.1 Experimental Setup
For all the experiments, we used both real data obtained
from our NASA-sponsored GENESIS (GPS ENvironment
and Earth Science Information System) project and synthetically
generated data. Real data files are composed of 2D
grids of water vapor measurements on certain areas of the
earth. Each file consists of 181 rows corresponding to latitudes
between -90 and 90 (in degrees), and 362 columns.
The first column indicates the longitude, and the next 361
columns correspond to water vapor measurements at longitudes
between -180 and 180 (in degrees). Synthetically generated
data is used for conducting scalability experiments
since we could not obtain real data larger than these files.
In order to generate synthetic data, we use the fact that
atmospheric measurements such as temperature or water vapor
pressure are spatially correlated. That is, we can make
the assumption that two closely located areas have similar
values of (say) temperature. Thus, as shown in Fig. 12, in
order to obtain $T_p$, which corresponds to the temperature value
at p, we draw a circle $C_p$ centered at p and then compute the
weighted average of the measurements $T_{q_j}$ at the points $q_j$
inside the circle as the value for $T_p$. If there exists no point
inside the circle, we enlarge the radius of $C_p$. We now define the
weight for $T_{q_j}$ such that the weights of the measured values
are inversely proportional to the distance between p and
$q_j$. Denoting the Euclidean distance between p and $q_j$ as
$d(p, q_j)$, then
$$w_j = \frac{1}{d(p, q_j)} \quad (10)$$
$$T_p = \frac{\sum_{q_i \in C_p} w_i T_{q_i}}{\sum_{q_j \in C_p} w_j} \quad (11)$$
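Equations (10) and (11) amount to inverse-distance weighting inside a growing circle. The following Python sketch illustrates this under a few assumptions that the text does not specify: the radius is doubled when the circle is empty (`growth=2.0`), a sample that coincides with p is returned directly to avoid division by zero, and the function name `synthetic_value` is hypothetical.

```python
import math


def synthetic_value(p, samples, radius, growth=2.0):
    """Inverse-distance-weighted value at point p.

    samples: non-empty list of ((x, y), value) measurements.
    radius: initial radius of the circle around p (must be positive).
    """
    if not samples or radius <= 0:
        raise ValueError("need at least one sample and a positive radius")
    while True:
        inside = [
            (math.dist(p, q), value)
            for q, value in samples
            if math.dist(p, q) <= radius
        ]
        if inside:
            break
        radius *= growth  # no point inside the circle: enlarge it
    for distance, value in inside:
        if distance == 0.0:
            return value  # p coincides with a measured point
    numerator = sum(value / distance for distance, value in inside)
    denominator = sum(1.0 / distance for distance, _value in inside)
    return numerator / denominator
```

For example, with measurements 10 at distance 1 and 20 at distance 2, the weights are 1 and 0.5, giving (10 + 10) / 1.5 for the synthetic value.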
In addition, throughout this section, we will use the standard
distortion measure between two 2D data sets $X$ and $\hat{X}$, which
is computed as $\|X - \hat{X}\| / \|X\|$, as our metric for the
percentage of error for reconstruction precision.
7.1.1 Implementation Scheme for 2D TSA-tree
In this section, we describe our implementation scheme for
2D TSA-tree. When a trend query is submitted, we find
the best matched level in 2D TSA-tree, and fetch the corre-
sponding wavelet coefficients to be displayed for the user.
The information to be displayed may not be stored adja-
cently, hence, resulting in large number of I/O operations.
Therefore, it is necessary to store wavelet coefficients in lo-
calized manner to reduce the number of I/O operations.
We use external hashing (hashing for disk file) for the
implementation of 2D TSA-tree. Suppose the block size is
1K bytes. Since each cell has 4 floating points, one bucket
can contain up to 64 cells. Our basic strategy is that we
place the cells for the same area into the same block. We
use 1 bucket if the total number of cells does not exceed
64. In other cases, we use the level number as the bucket
address, and solve the collision problem through chaining.
Figure 13. Implementation of 2D TSA-tree (Bucket 4: cells at levels 4, 5, and 6; Bucket 3: cells at level 3; Buckets 1 and 2: chained blocks of cells)
Figure 14. Spatial trend of water vapor data using Haar wavelet filter
Figure 15. Spatial trend of water vapor data using db6 wavelet filter
Figure 16. Space versus accuracy: (a) Water vapor data I, (b) Water vapor data II (x-axis: Percent Space; y-axis: Percent Error; curves: SVD, FFT, 2-D OTSA-scd, 2-D OTSA-tcd)
Fig. 13 illustrates this scenario. The size of the original data
in Fig. 13 is $64 \times 64$. Bucket 4 contains all cells at levels 4,
5, and 6. In bucket 3, cells at level 3 are stored. In bucket
2, there are 3 collisions. The numbers in the boxes represent the
num-th cell at level 2. The case for bucket 1 is the same as for
bucket 2.
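The bucket layout can be checked with a little arithmetic. The sketch below is illustrative: it assumes 4-byte floating-point values (which yields the 64 cells per 1K block stated above) and a hypothetical 64 x 64 input, which has six levels; the names `cells_at_level` and `blocks_for_level` are not from the paper.

```python
import math

BLOCK_BYTES = 1024      # block size of 1K bytes
FLOATS_PER_CELL = 4     # one averaged value and three detail values
FLOAT_BYTES = 4         # assumption: single-precision floats
CELLS_PER_BLOCK = BLOCK_BYTES // (FLOATS_PER_CELL * FLOAT_BYTES)


def cells_at_level(side, level):
    """Number of cells at a level for a side x side input (side a power of 2)."""
    return (side // (2 ** level)) ** 2


def blocks_for_level(side, level):
    """Number of blocks needed to hold all cells of one level."""
    return math.ceil(cells_at_level(side, level) / CELLS_PER_BLOCK)
```

Under these assumptions, levels 4, 5, and 6 together hold 16 + 4 + 1 = 21 cells and fit in one bucket, level 3 holds exactly 64 cells, and level 2 holds 256 cells in 4 blocks, i.e., one bucket block plus 3 chained blocks.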
7.2 Experimental Results
To demonstrate the effectiveness of our proposed meth-
ods, we evaluate our 2D TSA-tree for various scenarios.
First, we show that our proposed method can be used
to visualize spatial trends at different scales. Second,
we compare our 2D OTSA-w/tcd and 2D OTSA-w/scd
methods with SVD and FFT on the accuracy of recon-
structing original spatial data from the compressed one
when the available storage is limited. Finally, by increasing
the size of data, we investigate the scalability of our method.
7.2.1 Visual Verification of Spatial Trends
Fig. 14 depicts the multi-level trends of water vapor data.
As demonstrated in Fig. 14, more details are captured by the
higher levels (e.g., level 1), while lower levels are abstract
(e.g., level 5). Similar results can be observed in Fig. 15,
which shows the trends of water vapor using db6 wavelet.
As shown, the trends in Fig. 15 are smoother than those in Fig. 14.
Different wavelet filters have different degrees of smoothness
and complexity. Thus, users can choose different wavelet
filters for mining purposes depending on how much smoothness
they desire. If the display resolution of the client's
monitor is low, one can send much less data without
sacrificing the user's visual quality. For example, as
shown in Fig. 15, even though the user requests the trend at level
1, the trend at level 3 might be visually good enough while its
size is 1/16th of the requested data. Thus, it is necessary to
enable users to provide the size of the scale that they can
tolerate.
7.2.2 Space versus Accuracy
When the storage space is limited, we cannot store the en-
tire 2D TSA-tree on disks. Thus, we should find 2D OTSA
which can fit into available space with minimum error. In
this section, we compare SVD, FFT, 2D OTSA-w/scd and
2D OTSA-w/tcd in terms of space vs. accuracy. That is, for
a given storage capacity, we compared the accuracy of re-
constructing original data from the condensed one. The re-
sults are shown in Fig. 16. Two data sets are the water vapor
measurement over the entire globe at two different times.
The X-axis represents the available space as a percentage
of the space needed to store the entire data. For example,
"25" means that the available disk space is enough to store
25% of the entire data set. The Y-axis represents the relative
error, which is computed as $\|X - \hat{X}\| / \|X\|$, where $X$
is the original spatial data and $\hat{X}$ is the data reconstructed
from the compressed one. We conducted experiments on different
real data sets for our application domain in order to show
that the results are independent of the specific data set.
The results indicate that 2D OTSA-w/scd outperforms other
methods. For DFT and SVD, we kept first few coefficients
with maximum energy. Since our data sets are highly spa-
tially correlated, we can take advantage of inherent locality
of wavelet transform. Thus, our methods (2D OTSA-w/scd
and 2D OTSA-w/tcd) have better performance than that of
DFT and SVD. When the available space is very limited,
SVD performs very poorly due to the fact that the number
of coefficients it can keep is very small. However, it works
well as the available space increases.
7.2.3 Scalability Test
Finally, we study the scalability of our method by varying
the size of the data set and fixing the available space. Dif-
ferent sizes of 2D data ( to ) are gen-
erated synthetically. As shown in Fig. 17, 2D OTSA-w/tcd
has better scalability than that of SVD and DFT. In case of
SVD, it works very poorly since the number of coefficients
it can keep is very small in comparison with the size of data.
Even though DFT has bad performance, it has a better scala-
bility than that of SVD since DFT can concentrate its energy
on the first few coefficients.
Figure 17. Scalability test (x-axis: Size; y-axis: Percent Error; curves: SVD, FFT, 2-D OTSA-tcd)
8 Conclusion and Future Work
This paper presented techniques and data structures vital
to applications that can benefit from visualizing spatial
trends at different resolutions for selected areas of interest
to users, drawn from scientific observation data
for an entire region.
Due to the I/O and network bottlenecks, transmitting
the entire data set for the selected sub-region to the
user to visualize trends might result in a very long latency
observed by the user. Therefore, in this paper,
to significantly reduce the amount of retrieved and transmitted
data, we developed a new data structure named 2D TSA-tree.
It condenses the entire data set in advance
and at the same time supports sub-region queries and provides
multiple levels of abstraction. Furthermore, by storing
some precomputed information (such as the reconstruction
error and the retrieval time for each level of the 2D TSA-tree)
within each node in the tree, the 2D TSA-tree can determine
in advance the error and response time of the query result
and trade them for each other. Finally, in order to resolve
space limitations, we identified a specific subset of the
tree that can be considered as the optimal subset to be kept
on disk, termed 2D OTSA-tree. We conducted many experiments
to demonstrate the effectiveness of our proposed
methods. First, our results show that spatial trends at a low
resolution might be visually good enough to visualize the trends
in a region, while the size of the data used is smaller than that of
the original data (at a higher resolution). Second, we compared
our 2D OTSA methods with SVD and FFT on
accuracy and scalability, and our results show that they
outperformed the other methods. Since our data sets are highly
spatially correlated, the 2D TSA-tree takes advantage of the
inherent locality of the wavelet transform.
We intend to extend this work in three directions. First,
we want to implement 2D TSA-tree and its operations as a
datablade for Informix Universal Server 9.21. This way, we
can automatically convert the GENESIS Level-2 data into
2D TSA-tree to facilitate spatial trend visualization. Next,
we plan to exploit the data contained in the detailed nodes
(i.e., DX
i
’s) of 2D TSA-tree for surprise mining on 2D
spatio-temporal data. Finally, we plan to study the applicability
of 2D TSA-tree in the field of On-Line Analytical Processing
(OLAP). In particular, our Customized 2D TSA-subtree
deals with challenges similar to those of the efficient
processing of OLAP range-sum queries.
References
[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity
search in sequence databases. Fourth International Conference
on Foundations of Data Organization and Algorithms, 1993.
[2] R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast Similarity
Search in the Presence of Noise, Scaling, and Translation
in Time-Series Databases. VLDB, 1995.
[3] C. S. Burrus, R. A. Gopinath, and H. Guo. Introduction to
wavelets and wavelet transforms : a primer. Prentice Hall,
1998.
[4] K. Chan and A. W. Fu. Efficient time series matching by
wavelets. ICDE, 1999.
[5] C. K. Chui. Wavelets : a tutorial in theory and applications.
Academic Press, 1992.
[6] C. K. Chui. An overview of wavelets. In Approximation The-
ory and Functional Analysis. Academic Press, 1993.
[7] T. H Cormen, C. E. Leiserson, and R. L. Rivest. Introduction
to algorithms. The MIT Press, 1989.
[8] I. Daubechies. Orthonormal bases of compactly supported
wavelets. Communications on Pure and Applied Mathemat-
ics, 41(7):909–996, October 1988.
[9] M. Ester, A. Frommelt, H. P. Kriegel and J. Sander. Al-
gorithms for Characterization and Trend Detection in Spatial
Databases, KDD, 1998.
[10] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast
Subsequence Matching in Time-Series Databases. SIGMOD,
Proc. of Annual Conference, Minneapolis, 1994.
[11] JPL. Sea Surface Temperature.
http://podaac.jpl.nasa.gov/db/podaac
[12] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting
ad hoc queries in large datasets of time sequences.
Proceedings of the ACM SIGMOD International Conference on
Management of Data, 26(2):289–300, 1997.
[13] C. Li, P. S. Yu, and V. Castelli. MALM: A framework for mining
sequence database at multiple abstraction levels. In
Proceedings of the 1998 ACM 7th International Conference on
Information and Knowledge Management, pages 267–272, 1998.
[14] S. Mallat. A theory for multiresolution signal decomposition:
The wavelet representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 11(7):674–693, July 1989.
[15] D. Rafiei and A. Mendelzon. Similarity-based queries for
time series data. SIGMOD, pages 13–24, 1997.
[16] G. Sheikholeslami, S. Chatterjee and A. Zhang. WaveClus-
ter: A Multi-Resolution Clustering Approach for Very Large
Spatial Databases. VLDB, 1998.
[17] C. Shahabi, X. Tian, and W. Zhao. TSA-tree: A Wavelet-Based
Approach to Improve the Efficiency of Multi-Level Surprise
and Trend Queries. SSDBM, 2000.
[18] J. S. Vitter and M. Wang. Approximate Computation of
Multidimensional Aggregates of Sparse Data using Wavelets.
SIGMOD, 1999.
[19] D. Wu, D. Agrawal, A. E. Abbadi, A. K. Singh, and
T. R. Smith. Efficient Retrieval for Browsing Large Image
Databases. CIKM, 1996.
Description
Cyrus Shahabi, Seokkyung Chung, Maytham Safar and George Hajj. "2D TSA-tree: A wavelet-based approach to improve the efficiency of multi-level spatial data mining." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 740 (2001).
Rights
Department of Computer Science (University of Southern California) and the author(s).
Internet Media Type
application/pdf
Copyright
In copyright - Non-commercial use permitted (https://rightsstatements.org/vocab/InC-NC/1.0/
Source
20180426-rozan-cstechreports-shoaf
(batch),
Computer Science Technical Report Archive
(collection),
University of Southern California. Department of Computer Science. Technical Reports
(series)
Access Conditions
The author(s) retain rights to their work according to U.S. copyright law. Electronic access is being provided by the USC Libraries, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
USC Viterbi School of Engineering Department of Computer Science
Repository Location
Department of Computer Science. USC Viterbi School of Engineering. Los Angeles\, CA\, 90089
Repository Email
csdept@usc.edu