Data-Driven Image Analysis, Modeling, Synthesis and Anomaly Localization Techniques

by

Kaitai Zhang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

August 2021

Copyright 2021 Kaitai Zhang
Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
    1.1 Significance of the Research
        1.1.1 Texture Analysis, Modeling and Synthesis
        1.1.2 Image Anomaly Detection and Localization
    1.2 Background of Research Methodologies
        1.2.1 Convolutional Neural Networks
        1.2.2 Successive Subspace Learning
    1.3 Background of Research Topics
        1.3.1 Texture Representation and Segmentation
        1.3.2 Texture Analysis and Synthesis
        1.3.3 Dynamic Texture Synthesis
        1.3.4 Image Anomaly Detection and Localization: PEDENet
        1.3.5 Image Anomaly Localization: AnomalyHop
    1.4 Contributions of the Research
        1.4.1 Texture Analysis and Synthesis
        1.4.2 Unsupervised Texture Segmentation
        1.4.3 Dynamic Texture Synthesis
        1.4.4 Image Anomaly Detection and Localization: PEDENet
        1.4.5 Image Anomaly Localization: AnomalyHop
    1.5 Organization of the Dissertation

Chapter 2: A Data-centric Approach to Unsupervised Texture Segmentation
    2.1 Introduction
    2.2 Textural Feature Based on Principle Representative Patterns
        2.2.1 Pixel-wise Texture Representation
        2.2.2 Principle Representative Pattern (PRP)
        2.2.3 PRP Features and Segmentation
    2.3 Experimental Results
        2.3.1 Qualitative Segmentation Results
        2.3.2 Comparison Results on Texture Mosaics
    2.4 Conclusion

Chapter 3: Texture Analysis via Hierarchical Spatial-Spectral Correlation (HSSC)
    3.1 Introduction
    3.2 Review of Previous Work
    3.3 Proposed HSSC Method
        3.3.1 Correlation of PCA Coefficients
        3.3.2 Correlation of Saak Coefficients
        3.3.3 Classification with Shared Transform Kernels
    3.4 Experimental Results
    3.5 Conclusion

Chapter 4: Dynamic Texture Synthesis via Long-Range Spatial and Temporal Correlation
    4.1 Introduction
    4.2 Related Work
    4.3 Proposed Method
        4.3.1 Two-Stream Convolutional Network
        4.3.2 Global-aware Gram Loss
        4.3.3 Temporal-aware Dynamic Loss
    4.4 Experiments
        4.4.1 Experiment Setting
        4.4.2 Experimental Results
    4.5 Conclusion

Chapter 5: PEDENet: Image Anomaly Localization via Patch Embedding and Density Estimation
    5.1 Introduction
    5.2 Related Work
        5.2.1 Reconstruction-based Approach
        5.2.2 Pretrained Network-based Approach
        5.2.3 One-class Classification-based Approach
        5.2.4 Non-image Data
    5.3 Proposed Method
        5.3.1 PEDENet
        5.3.2 Loss Function
        5.3.3 Anomaly Localization
    5.4 Experiments
        5.4.1 Experimental Setup
        5.4.2 Performance Evaluation
    5.5 Conclusion and Future Work

Chapter 6: AnomalyHop: An SSL-based Image Anomaly Localization Method
    6.1 Introduction
    6.2 Related Work
    6.3 AnomalyHop Method
        6.3.1 SSL-based Feature Extraction
        6.3.2 Modeling of Normality Feature Distributions
        6.3.3 Anomaly Map Generation and Fusion
    6.4 Experiments
    6.5 Conclusion and Future Work

Chapter 7: Conclusions and Future Work
    7.1 Summary of the Research
    7.2 Future Research Directions
        7.2.1 Texture Synthesis
        7.2.2 Image Anomaly Detection and Localization

Bibliography
List of Tables

2.1 Comparison results of the proposed approach with various segmentation methods on the Prague Unsupervised Texture Segmentation Benchmark (Part I). Up arrows indicate that larger values are better, and down arrows the opposite. Boldface highlights the best, and a star denotes the second-best value in each column.

2.2 Comparison results of the proposed approach with various segmentation methods on the Prague Unsupervised Texture Segmentation Benchmark (Part II). Up arrows indicate that larger values are better, and down arrows the opposite. Boldface highlights the best, and a star denotes the second-best value in each column.

3.1 Averaged classification accuracies for the Brodatz and the VisTex datasets, which improve as the number of stages increases. The best performance number is in bold.

3.2 Performance comparison of the proposed HSSC method and other state-of-the-art methods for the CUReT texture dataset.

3.3 Classification accuracy as the number of training images increases with respect to the CUReT dataset.

5.1 Comparison of image anomaly localization performance, where the evaluation metric is pixel-wise AUC-ROC.

5.2 Image anomaly detection performance, where the evaluation metric is the image-level AUC-ROC.

5.3 Model size comparison.

5.4 Ablation study.

6.1 The hyper-parameters of spatial sizes and numbers of filters at each hop for the leather class.

6.2 Performance comparison of image anomaly localization methods in terms of AUC-ROC scores for the MVTec AD dataset, where the best results in each category are marked in bold.

6.3 Average inference time (in sec.) per image with an Intel i7-5930K @ 3.5GHz CPU.
List of Figures

1.1 Various kinds of textures existing in the real world.

1.2 Several examples of texture-dominant images.

1.3 Example of the Successive Subspace Learning framework. This figure is taken from [24] with the authors' permission.

2.1 Segmentation results step by step. (a) A texture mosaic with 5 different components from the Brodatz texture dataset. (b) Raw segmentation results with the proposed PRP features. (c) Final segmentation mask after post-processing.

2.2 Spatial distributions of four typical Principal Representation Patterns.

2.3 Visualization of features from different texture components and boundaries.

2.4 Segmentation results of natural scene and animal images in the real world.

2.5 Segmentation results of ground terrain texture images in the real world.

2.6 Segmentation results of histology images in the real world.

2.7 Example results on the Prague dataset (Part I). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.

2.8 Example results on the Prague dataset (Part II). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.

2.9 Example results on the Prague dataset (Part III). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.

3.1 Illustration of correlation matrices of PCA coefficients of a target texture class, its matched and mismatched ones.

3.2 Visualization of two-stage correlation matrices: (a) Herringbone weave (texture T), and visualizations of its correlation matrices using (b) first-stage and (c) second-stage Saak coefficients; (d) Woolen cloth (texture T'), and visualizations of its correlation matrices using (e) first-stage and (f) second-stage Saak coefficients based on T's Saak transform kernels.

3.3 Per-class accuracy results for the Brodatz and the VisTex datasets, respectively.

4.1 Example of dynamic texture synthesis.

4.2 Three representative dynamic texture examples (one row per texture), where the textures are homogeneous textures. The first column shows one frame of the reference dynamic texture video, the second column shows synthesized results obtained by the baseline model, and the third column shows results obtained using our model.

4.3 Three representative dynamic texture examples (one row per texture), where the textures are structured textures. The first column shows one frame of the reference dynamic texture video, the second column shows synthesized results obtained by the baseline model, and the third column shows results obtained using our model.

4.4 The importance of local feature correlation in structural texture. The red boxes show a pair of texture patches in the same location. They can also be viewed as visualizations of the receptive field in the network.

4.5 The importance of middle- and long-range motion. Using a flag sequence as an example, we can find that the most meaningful motion actually crosses multiple frames. Such long-range motion, combined with long-range spatial structure, gives a vivid visual effect.

4.6 Examples of homogeneous dynamic texture.

4.7 Dynamic textures with long-range correlations (Part I).

4.8 Dynamic textures with long-range correlations (Part II).

4.9 Dynamic textures with long-range correlations (Part III).

4.10 Dynamic textures with middle-range correlations (Part I).

4.11 Dynamic textures with middle-range correlations (Part II).

4.12 Dynamic textures with middle-range correlations (Part III).

5.1 Image anomaly localization examples (from left to right): normal images, anomalous images, ground truth of the anomalous region, and the anomalous region predicted by the proposed PEDENet, where the red region indicates the detected anomalous region. These examples are taken from the MVTec AD dataset.

5.2 An overview of the proposed PEDENet. Images are first divided into patches and then fed into the Patch Embedding (PE) network to compute their patch embeddings, while the Density Estimation (DE) network guides an implicit Gaussian Mixture Model (GMM)-inspired clustering in the embedding space. After training, normal patches are clustered, and outliers can be treated as abnormal patches at inference time. Three anomaly localization results from the Hazelnut class are shown as examples.

5.3 Overview of the location prediction (LP) network.

5.4 Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 object classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.

5.5 Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 object classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.

5.6 Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 texture classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.

6.1 Image anomaly localization examples taken from the MVTec AD dataset (from left to right): normal images, anomalous images, the ground truth, and the anomalous region predicted by AnomalyHop, where the red region indicates the detected anomalous region.

6.2 The system diagram of the proposed AnomalyHop method.

6.3 Two anomalous grid images (from left to right): input images, ground truth labels, predicted heat map, predicted and segmented anomaly regions.

6.4 Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 object classes in the MVTec AD dataset.

6.5 Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 object classes in the MVTec AD dataset.

6.6 Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 texture classes in the MVTec AD dataset.
Abstract

Images are a high-dimensional and complicated data source carrying rich information. Image analysis and modeling is one of the most fundamental topics in computer vision and pattern recognition, and it has attracted extensive research attention over the last several decades. This thesis investigates and proposes methods in several important aspects of image analysis, including texture representation, unsupervised texture segmentation, dynamic texture synthesis, and image anomaly detection and localization.

For texture representation, a hierarchical spatial-spectral correlation (HSSC) method is proposed for texture analysis. The HSSC method first applies a multi-stage spatial-spectral transform, known as the Saak transform, to input texture patches. Then, it conducts a correlation analysis on Saak transform coefficients to obtain texture features of high discriminant power. To demonstrate the effectiveness of the HSSC method, we conduct extensive experiments on texture classification and show that it offers very competitive results compared with state-of-the-art methods.

For unsupervised texture segmentation, we introduce a data-centric approach to efficiently extract and represent textural information, which adapts to a wide variety of textures. Based on the strong self-similarity and quasi-periodicity of texture images, the proposed method first constructs a representative texture pattern set for the given image by leveraging a patch clustering strategy. Then, pixel-wise texture features are designed according to the similarities between local patches and the representative textural patterns. Moreover, the proposed feature is generic and flexible, and can perform the segmentation task by being integrated into various segmentation approaches easily. Extensive experimental results on both textural and natural image segmentation show that the segmentation method using the proposed features achieves very competitive or even better performance compared with state-of-the-art methods.

For dynamic texture synthesis, the main challenge lies in how to maintain spatial and temporal consistency in the synthesized video. The major drawback of existing dynamic texture synthesis models comes from poor treatment of long-range texture correlation and motion information. To address this problem, we incorporate a new loss term, called the Shifted Gram loss, to capture the structural and long-range correlation of the reference texture video. Furthermore, we introduce a frame sampling strategy to exploit long-period motion across multiple frames. With these two new techniques, the application scope of existing texture synthesis models can be extended. That is, they are able to synthesize not only homogeneous but also structured dynamic texture patterns. Thorough experimental results are provided to demonstrate that our proposed dynamic texture synthesis model offers state-of-the-art visual performance.

For image anomaly detection and localization, a neural network targeting unsupervised image anomaly localization, called PEDENet, is proposed. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network, and an auxiliary network called the location prediction (LP) network. The PE network takes local image patches as input and performs dimension reduction to get low-dimensional patch embeddings via a deep encoder structure. Inspired by the Gaussian Mixture Model (GMM), the DE network takes those patch embeddings and then predicts the cluster membership of an embedded patch. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a multi-layer perceptron (MLP), which takes embeddings from two neighboring patches as input and predicts their relative location. The performance of the proposed PEDENet is evaluated and compared with state-of-the-art benchmarking methods by extensive experiments.

For image anomaly localization, a brand new method based on the successive subspace learning (SSL) framework, called AnomalyHop, is proposed. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distribution modeling via Gaussian models, and 3) anomaly map generation and fusion. Compared with state-of-the-art image anomaly localization methods based on deep neural networks (DNNs), AnomalyHop is mathematically transparent, easy to train, and fast in its inference speed. Besides, its area under the ROC curve (ROC-AUC) performance on the MVTec AD dataset is 95.9%, which is among the best of several benchmarking methods.
Chapter 1
Introduction

1.1 Significance of the Research
In this thesis, we focus on data-driven image analysis, modeling, synthesis and anomaly localization techniques, which include two major research topics. The first is texture analysis, modeling and synthesis, and the second is image anomaly detection and localization. We introduce their research significance from both theoretical and practical perspectives.
1.1.1 Texture Analysis, Modeling and Synthesis
Texture is one of the most fundamental characteristics of images, and texture analysis and modeling is an essential and challenging problem in computer vision and pattern recognition, which has attracted extensive research attention [1, 3, 4, 18, 20, 28, 29, 31, 71, 76] over the last several decades. As shown in Fig. 1.1, various kinds of textures exist in the real world [8, 9, 27, 64, 87], and each of these textures can have a completely different appearance while all of them follow some kind of regular pattern. As a powerful visual cue, texture also plays an important role in human perception and provides useful information for identifying objects or regions in images, ranging from multi-spectral satellite data to microscopic images of tissue samples. Besides, understanding texture is also a key component in many other computer vision topics, including image de-noising [43, 145, 146], image super-resolution [88, 102, 125] and image generation [109].

Figure 1.1: Various kinds of textures existing in the real world.

Figure 1.2: Several examples of texture-dominant images.
Based on the important role of texture, much research has been done on texture analysis [17], classification [32, 50], segmentation [81, 131], texture synthesis [89] and even dynamic texture synthesis [37, 49, 55, 84, 127] in the last five decades. Still, a few important yet challenging problems remain in the texture research field. Here are several examples.

1. Based on the self-similarity and quasi-periodicity of texture, can we find a powerful texture representation which can be used for multiple tasks and is also mathematically explainable?

2. For texture segmentation, can we obtain good segmentation results under a totally unsupervised setting? Such a method would be highly appreciated and would benefit a wide range of industries such as autonomous driving, remote sensing and medical image diagnosis.

3. For dynamic texture, can we successfully understand its fascinating temporal motion and also synthesize it? And can we even synthesize dynamic textures with complicated non-local spatial correlation?

In this thesis, we focus on these fundamental yet important problems in the texture research field. With the help of recent advances in machine learning and deep learning techniques, we show that several improvements can be made and new state-of-the-art performance can be achieved.
1.1.2 Image Anomaly Detection and Localization
Anomaly detection is a fundamental yet long-standing research topic in machine learning and pattern recognition, and it widely exists in multiple data modalities, including text, speech, language, image and video. For image data, it is called image anomaly detection, which is a binary classification problem that decides whether an input image contains an anomaly or not. Furthermore, image anomaly localization aims to localize the anomalous region at the pixel level. Due to recent advances in deep learning and the availability of new datasets, recent research shows a significant interest in both image-level anomaly detection and pixel-level localization of anomalous regions.

Image anomaly detection and localization have important and unique theoretical value. Image anomaly detection is typically performed in an unsupervised setting, which means that we only have anomaly-free images during training. As a result, it can be viewed as a one-class classification problem on image data. One-class classification tries to identify objects of a specific class amongst all objects, primarily by learning from a training set containing only the objects of that class. One-class classification is considered to be different from, and much more difficult than, the traditional classification problem, which tries to distinguish between two or more classes with a training set containing objects from all the classes. The high-dimensional and complicated nature of image data makes it even more challenging yet interesting.

This becomes even more interesting for image anomaly localization. Most well-studied pixel-level localization solutions (e.g., semantic segmentation) rely on heavy supervision, where a large number of pixel-level labels and many labeled images are needed. However, in the context of image anomaly localization, a typical assumption is that only normal (i.e., anomaly-free) images are available in the training stage. This is because anomalous samples are few in general. Besides, they are often hard to collect and expensive to label. To this end, image anomaly localization is usually done in an unsupervised manner, and traditional supervised solutions are not applicable.

Beyond this theoretical significance, image anomaly detection and localization find real-world applications such as manufacturing process monitoring [107], medical image analysis [105, 106], and video surveillance analysis [104, 142]. These numerous applications make this topic one of the most promising computer vision tasks, with the potential to make a real impact on human society.
1.2 Background of Research Methodologies

In this thesis, all research works utilize machine learning techniques. The related machine learning techniques can be classified into two categories: convolutional neural networks (CNNs) and successive subspace learning (SSL).
1.2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are a type of feed-forward neural network commonly used in vision-related tasks. A CNN usually contains convolutional layers, non-linear activation layers, pooling layers and fully connected layers [46].
• Convolutional layer. Convolutional layers contain a group of neurons that are connected to a local region (usually square) of the input. The output of each neuron is the inner product of the neuron's weights and the input local region. If the neuron contains a bias term, the final output is the inner product plus the bias. The weights and bias of this neuron group are shared for all regions.
• Non-linear activation layer. This layer is essentially a non-linear activation function, which is typically applied to the output of a neuron. The most commonly used non-linear activation function is the Rectified Linear Unit (ReLU), defined as

f(x) = max(0, x).    (1.1)
• Pooling layer. Pooling layers downsample an output matrix or tensor along some spatial dimensions. The most common downsampling computations take the maximum or the average of a local region, known as the max pooling layer and the average pooling layer, respectively. Pooling layers reduce the number of parameters as well as the computation, so as to control overfitting.
• Fully connected layer. Fully connected (FC) layers are a group of neurons connected to the whole input region. The resulting output can be viewed as a vector that describes the whole input. Unlike convolutional layers, which preserve spatial information, the output of an FC layer is a global feature with no spatial information. A minimal code sketch of these layers in combination is given below.
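For illustration, the following is a minimal PyTorch sketch of how these layers combine into the conv-pool-fc pattern discussed in this section; the layer sizes and input resolution are illustrative choices, not taken from any network studied in this thesis.

```python
import torch
import torch.nn as nn

# A minimal sketch of the conv-pool-...-fc pattern described above.
# All layer sizes here are illustrative, not taken from the thesis.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.ReLU(),                                    # non-linear activation
            nn.MaxPool2d(2),                              # pooling layer
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64),                    # fully connected layer
            nn.ReLU(),
            nn.Linear(64, num_classes),                   # one output per category
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))             # shape: (1, 10)
```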
Applications in image classification. A typical architecture of an image classification network is a linear connection:

conv-pool-conv-pool-conv-pool-fc-fc,

where the non-linear activation layers that follow every convolutional layer and FC layer are omitted. The neurons of the last FC layer correspond to the image categories, with an output range of (−∞, +∞). To convert the output to a probability distribution, representing the probability of the input image belonging to each category, a softmax operation is applied:
p(c) = exp(f(c)) / Σ_i exp(f(i)),    (1.2)

where f(c) represents the output from the neuron corresponding to category c in the last FC layer.
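As a concrete illustration, the softmax of Eq. (1.2) can be computed as follows. This is a standard NumPy sketch; the max-subtraction is a common numerical-stability trick and not part of Eq. (1.2) itself.

```python
import numpy as np

def softmax(f: np.ndarray) -> np.ndarray:
    """Convert raw FC-layer outputs f(c) into class probabilities p(c), Eq. (1.2)."""
    e = np.exp(f - f.max())          # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)      # a valid probability distribution
```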
Some well-known image classification CNNs include AlexNet [59], VGG [111] and ResNet [52]. AlexNet is composed of five conv-pool modules followed by three FC layers. VGG deepens the network by stacking the convolutional layers in the conv-pool module, forming a connection of conv-conv-conv-pool. All convolutional layers in the same module have identical kernel sizes and neuron numbers. Compared with AlexNet, the kernel size of the convolutional layers of VGG is much smaller, in order to control the number of parameters while deepening the network. ResNet is no longer a linear architecture. For an ultra-deep neural network, the vanishing gradient problem [56] makes a linear architecture extremely hard to train via the backpropagation algorithm [46]. As an ultra-deep CNN, ResNet [52] contains skip connections which allow the gradient to be efficiently backpropagated to shallow layers during training. With the help of these advanced CNN architectures, great progress has been achieved in numerous other computer vision tasks, including semantic segmentation [21, 69, 78, 95, 141], object detection [44, 45, 51, 72, 91, 93] and image generation [6, 36, 47, 90].
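The skip-connection idea can be illustrated with a minimal sketch; the sizes below are illustrative, not those of the actual ResNet architecture [52].

```python
import torch
import torch.nn as nn

# Minimal sketch of a ResNet-style residual block: the skip connection adds
# the input back onto the convolutional output, giving gradients a direct
# path to shallow layers during backpropagation.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + y)                   # the skip connection

out = ResidualBlock()(torch.randn(1, 16, 8, 8))   # same shape as the input
```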
1.2.2 Successive Subspace Learning

Inspired by recent advances in convolutional neural networks, successive subspace learning (SSL) is an emerging machine learning technique developed by Kuo et al. in recent years [23, 24, 60, 63, 96]. It offers several advantages. First, different from Fourier and wavelet transforms, the Saak transform is a data-driven transform that learns transform kernels from training data samples. Second, all learnable parameters are determined by the second-order statistics of input images in an unsupervised manner. Neither data labels nor back-propagation is needed in transform kernel computation. SSL has been applied to quite a few applications with impressive performance. Examples include image classification [23, 24], image enhancement [7], image compression [118], deepfake image/video detection [19], point cloud classification, segmentation and registration [58, 137-139], face biometrics [97, 98], texture analysis and synthesis [67, 134], 3D medical image analysis [74], etc.

Figure 1.3: Example of the Successive Subspace Learning framework. This figure is taken from [24] with the authors' permission.
Deep-learning methods learn image features indirectly. Given a network architecture, the network first learns the filter parameters by minimizing a cost function end-to-end. Then, the network can be used to generate filter responses, and patch features are extracted as the filter responses at a certain layer. In contrast, the SSL framework extracts features of image patches directly using a data-driven approach. The basic idea is to study pixel correlations in a neighborhood (say, a patch) and use principal component analysis (PCA) to define an orthogonal transform, also known as the Karhunen-Loève transform. However, a single-stage PCA transform is not sufficient to obtain powerful features. A sequence of modifications has been proposed in [23, 24, 60, 63] to make the SSL framework complete.
The first modification is to build a sequence of PCA transforms in cascade, with max pooling inserted between two consecutive stages. The output of the previous stage serves as the input to the current stage. The cascaded transforms are used to capture short-, mid- and long-range correlations of pixels in an image. Since the neighborhood of a graph is called a hop (e.g., 1-hop neighbors, 2-hop neighbors, etc.), each transform stage is called a hop [23]. However, a straightforward cascade of multi-hop PCAs does not work properly due to the sign-confusion problem, which was first pointed out in [60]. The second modification is to replace the linear PCA with an affine transform that adds a constant bias vector to the PCA response vector [63]. The bias vector is added to ensure that all input elements to the next hop are positive, avoiding sign confusion. This modified transform is called the Saab (Subspace approximation with adjusted bias) transform. The input and the output of the Saab transform are 3D tensors (including 2D spatial components and 1D spectral components). Recognizing that the 1D spectral components are uncorrelated, the third modification was proposed in [24] to replace one 3D tensor input with multiple 2D tensor inputs. This is named the channel-wise Saab (c/w Saab) transform. The c/w Saab transform greatly reduces the model size of the standard Saab transform.
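To make the cascade concrete, the following NumPy sketch implements one highly simplified Saab-style stage under the assumptions stated in the comments. The actual Saab and c/w Saab transforms [63, 24] include further details (e.g., a separate DC kernel and energy-based channel selection) that are omitted here.

```python
import numpy as np

# Schematic sketch of one Saab-style hop: PCA kernels learned from patch
# statistics, plus a constant shift large enough to keep all outputs
# non-negative (avoiding the sign-confusion problem described above).
# Simplified for illustration; not the full Saab transform of [63].
def saab_stage(patches: np.ndarray, num_kernels: int):
    # patches: (n_samples, patch_dim), e.g. flattened local neighborhoods
    mean = patches.mean(axis=0)
    centered = patches - mean
    # PCA kernels = top right-singular vectors (second-order statistics only;
    # no labels, no back-propagation)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    kernels = vt[:num_kernels]                 # (num_kernels, patch_dim)
    responses = centered @ kernels.T           # PCA coefficients
    bias = -responses.min()                    # constant shift -> outputs >= 0
    return responses + bias, kernels, mean, bias

rng = np.random.default_rng(0)
feats, k, m, b = saab_stage(rng.normal(size=(1000, 48)), num_kernels=8)
assert (feats >= 0).all()                      # ready to feed the next hop
```

In a multi-hop cascade, the non-negative outputs would be spatially pooled and fed as patches into the next `saab_stage` call, mirroring the hop structure described above.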
1.3 Background of Research Topics

1.3.1 Texture Representation and Segmentation
In the past decades, texture analysis and segmentation [17, 57, 79, 92, 110, 124] have been well studied and numerous methods have been proposed. For these methods, local features representing image structural information play a crucial role in the segmentation task when integrated into machine learning frameworks. In particular, various handcrafted features have been designed to characterize textural patterns. Filter banks are very popular feature extraction schemes, such as Gabor filters [57], gradient filters, Laplacian filters, and Gaussian filters. The responses to such filter banks, as well as statistics based on them, are used to obtain local texture features [75]. Other types of textural features include Local Binary Patterns [85], co-occurrence matrices [22], and wavelet transforms [5, 17].
The first limitation of most existing local feature extractors, e.g., filter banks, is their inefficient adaptivity to various texture contents due to pre-defined filter parameters. As a consequence, the effectiveness of these local features varies significantly across different classes of texture, or even depends on a single texture image. To overcome this problem, a straightforward method [81, 131] is to manually select a subset of filters from the aforementioned large set of filter banks, with the belief that there are filters suitable for the textural patterns in the processed images. However, manual selection requires human experience and insight, and is typically an expensive and time-consuming task. In this situation, it would be desirable to have a feature extraction method which can automatically adapt to various kinds of texture images and learn effective features in a data-driven manner.

Besides the adaptivity issue, the absence of global contrast information in existing local features, whose computation is constrained inside small image patches, makes them blind to image context. The contrast information between a texture and its context in an image is never taken into account during the feature extraction process and is only utilized in the subsequent segmentation stage, which limits the generalization of the local features. In addition, the high dimensionality of existing local features increases the computational burden since the segmentation algorithms need to calculate feature distances. The FSEG method [131] introduces a matrix factorization model to bypass iterative high-dimensional distance calculations and achieves fast segmentation. The PCA-MS method [81] employs principal component analysis (PCA) to reduce the dimension explicitly, which reveals the inherent redundancy of those handcrafted local features. Therefore, a good solution is to design a compact texture feature that deals with various texture patterns adaptively by leveraging both local and global image information.
1.3.2 Texture Analysis and Synthesis
Texture, as a combination of regularity and randomness, provides essential characteristics for surface and object recognition. Texture analysis plays a critical role in the analysis of remote sensing photos, medical images, and many other images. A large amount of research has been conducted on texture analysis [17], classification [32, 50], segmentation [81, 131] and synthesis [38, 119, 126] in the last five decades. Despite these efforts, texture analysis is still a challenging problem in image processing.
One major difficulty in texture research is the lack of an effective mathematical tool for texture representation. Inspired by recent advances in neural-network-inspired image transforms such as the Saak transform [61] and the Saab transform [62], we adopt the Saak transform as the texture representation tool since it offers several advantages. First, different from Fourier and wavelet transforms, the Saak transform is a data-driven transform that learns transform kernels from training data samples. Second, all learnable parameters are determined by the second-order statistics of input images in an unsupervised manner. Neither data labels nor back-propagation is needed in transform kernel computation.

Another drawback of much previous texture research lies in the limited capacity of its texture representation. As is well known, most texture classification, segmentation and synthesis models adopt completely different texture representations. However, texture is believed to be a relatively simple kind of image due to its self-similarity and quasi-periodicity. It is natural to believe that texture should have a unified representation, which could be used for various texture-related tasks. For this reason, a unified texture representation, which could potentially combine discriminative and generative tasks, would be highly appreciated. By employing the Saak transform [61] and the Saab transform [62], we show that a unified texture model can be built on top of Saak transform coefficients [134].
1.3.3 Dynamic Texture Synthesis
Given a short video clip of a target dynamic texture as the reference, the dynamic texture synthesis task is to synthesize dynamic texture video of arbitrary length. Understanding, characterizing, and synthesizing temporal texture patterns has been a problem of interest in human perception, computer vision, and pattern recognition in recent years. Examples of such video patterns span from amorphous matter like flame, smoke and water to well-shaped items like waving flags, flickering candles and living fountains. A large amount of work has been done on dynamic texture patterns [37, 49, 55, 84, 127, 140], including dynamic texture classification and dynamic texture synthesis. Research on texture video sequences finds numerous real-world applications, including fire detection, foreground and background separation, and generic video analysis. Furthermore, characterizing these temporal patterns has theoretical significance in understanding the mechanism behind human perception of the temporal correlation of video data.
The first limitation is that existing models fail for texture examples that have long-range spatial correlation. Some examples are shown in the fourth, fifth and sixth columns of Fig. 1.1. As discussed in previous work [41, 117], this drawback is attributed to the fact that the current loss function (i.e., the Gram loss) is not efficient in characterizing long-range spatial correlation. It has been shown by experiments that the Gram loss provides excellent performance in capturing local image properties such as texture patterns. Yet, it cannot handle long-range image information well since such information is discarded in the modeling process. Indeed, we observe that existing solutions fail to provide satisfactory performance in synthesizing dynamic textures with long-range as well as mid-range correlations. A short sketch of the Gram matrix computation, which makes this limitation concrete, is given below.
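For reference, the standard Gram matrix computation underlying this loss is sketched below (the feature-map shapes are hypothetical placeholders). Note how the spatial dimensions are summed out, which is exactly why long-range structure is lost.

```python
import numpy as np

def gram_matrix(feature_map: np.ndarray) -> np.ndarray:
    """Standard Gram matrix of a CNN feature map of shape (C, H, W).

    Spatial positions are summed out, so the plain Gram loss captures
    local texture statistics but discards long-range spatial structure.
    """
    c, h, w = feature_map.shape
    f = feature_map.reshape(c, h * w)
    return f @ f.T / (h * w)            # (C, C) channel co-activation statistics

g_ref = gram_matrix(np.random.rand(64, 32, 32))   # reference-frame features
g_syn = gram_matrix(np.random.rand(64, 32, 32))   # synthesized-frame features
gram_loss = np.mean((g_ref - g_syn) ** 2)         # the loss term discussed above
```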
The second drawback is that current models generate textures with monotonous motion, i.e., relatively smooth motion between adjacent frames. In other words, the generated samples do not have diversified dynamics. Sometimes, they even appear to be periodic. This is because previous models primarily focus on dynamics between adjacent frames but ignore motion over a longer period. For example, Funke et al. [41] used the correlation matrix between frames to represent the motion information, and Tesfaldet et al. [117] adopted a network branch for optical flow prediction to learn dynamics. This shortcoming yields visible distortion in the perceptual quality of synthesized results.
Recently, research has been conducted to solve a similar problem in the context of static texture image synthesis, e.g., [10, 108, 143]. Berger and Memisevic [10] used a variation of the Gram loss that takes spatial co-occurrences of local features into account. Zhou et al. [143] proposed a generative adversarial network (GAN) that was trained to double the spatial extent of texture blocks extracted from a specific texture exemplar. Sendik and Cohen-Or [108] introduced a structural energy loss term that captures self-similar and regular characteristics of textures based on correlations among deep features.
However, in spite of these efforts, dynamic texture synthesis remains a challenging and non-trivial problem for several reasons. First, even if it is possible to capture and maintain the structural information in a single image, it is still difficult to preserve the structural information in an image sequence. Second, there exist more diversified patterns in dynamic textures due to various spatio-temporal arrangements. More analysis with illustrative examples will be elaborated in Sec. 4. To summarize, extending a static texture synthesis model to a dynamic one is not a straightforward problem, and this will be the focus of our current work.
1.3.4 Image Anomaly Detection and Localization: PEDENet
Image anomaly detection is a binary classification problem that decides whether an input image contains an anomaly or not. Image anomaly localization further localizes the anomalous region at the pixel level. Due to recent advances in deep learning and the availability of new datasets, recent research works are no longer limited to image-level anomaly detection results, but also show a significant interest in pixel-level localization of anomalous regions. Image anomaly detection and localization find real-world applications such as manufacturing process monitoring [107], medical image analysis [105, 106], and video surveillance analysis [104, 142].
Several methods that integrate local image features and anomaly detection models for anomaly localization have been proposed recently, e.g., [30, 94]. They first employ a deep neural network to extract local image features and then apply an anomaly detection technique to local regions. Although they offer good performance on some datasets, two issues remain. First, they rely on a large pretrained network, which demands higher computational complexity and memory requirements. Second, most of them are trained in two stages rather than in an end-to-end manner. That is, image features are extracted in the first stage and then fed into the subsequent stage that corresponds to an anomaly localization module. There is no guarantee that the information essential to anomaly localization is well preserved and passed from the first stage to the second stage.
To address these issues, we present a new neural network model, called PEDENet, for unsupervised image anomaly localization in this work. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network and an auxiliary network called the location prediction (LP) network. The PE network utilizes a deep encoder to get a low-dimensional embedding for local patches. Inspired by the Gaussian mixture model (GMM), the DE network models the distribution of patch embeddings and computes the membership of a patch belonging to certain modalities. It helps identify outlying artifact patches. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a multi-layer perceptron (MLP), which takes patch embeddings as input and predicts the relative location of the corresponding patches. The performance of the proposed PEDENet is evaluated and compared with state-of-the-art benchmarking methods by extensive experiments. A schematic sketch of the PE/DE interaction is given below.
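The following PyTorch sketch illustrates the PE/DE interaction described above under illustrative assumptions; the layer sizes, class names and cluster count are placeholders, not the actual PEDENet configuration, which is specified in Chapter 5.

```python
import torch
import torch.nn as nn

# Schematic sketch of the PE/DE pair described above. All sizes are
# illustrative guesses, not the actual PEDENet configuration (Chapter 5).
class PatchEmbedding(nn.Module):            # the PE network: deep encoder
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),       # low-dimensional patch embedding
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        return self.encoder(patch)

class DensityEstimation(nn.Module):         # the DE network: GMM-style memberships
    def __init__(self, embed_dim: int = 32, num_clusters: int = 10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, num_clusters),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # soft cluster-membership probabilities for each patch embedding
        return torch.softmax(self.mlp(z), dim=-1)

pe, de = PatchEmbedding(), DensityEstimation()
z = pe(torch.randn(8, 3, 32, 32))           # 8 local patches -> embeddings
membership = de(z)                          # (8, 10) membership probabilities
```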
1.3.5 Image Anomaly Localization: AnomalyHop
As introduced before, image anomaly localization is a technique that identifies the anomalous regions of input images at the pixel level. It finds real-world applications such as manufacturing process monitoring [107], medical image diagnosis [105, 106] and video surveillance analysis [83, 104]. It is often assumed that only normal (i.e., anomaly-free) images are available in the training stage, since anomalous samples are too few to be modeled effectively and are rare and/or expensive to collect.

State-of-the-art image anomaly localization methods adopt deep learning. Many of them employ complicated pretrained neural networks to achieve high performance, yet without a good understanding of the basic problem. To get marginal performance improvements, fine-tuning and other minor modifications are made on a trial-and-error basis.
A new image anomaly localization method, called AnomalyHop, based on the successive subspace learning (SSL) framework is proposed in this work. This is the first work that applies SSL to the anomaly localization problem. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distribution modeling via Gaussian models, and 3) anomaly map generation and fusion. Compared with deep-learning-based image anomaly localization methods, AnomalyHop is mathematically transparent, easy to train, and fast in its inference speed. Besides that, its area under the ROC curve (ROC-AUC) performance on the MVTec AD dataset is 95.9%, which is state-of-the-art performance. A sketch of the Gaussian normality-modeling step is given below.
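As a sketch of the second module, the following NumPy fragment fits a Gaussian to the features observed at one spatial location over normal training images and scores a test feature by Mahalanobis distance. This is a simplified reading of the module; the regularization constant is an illustrative assumption, and the full method is given in Chapter 6.

```python
import numpy as np

# Sketch of module 2: fit a Gaussian to the features observed at one spatial
# location across normal training images, then score a test feature by its
# Mahalanobis distance. Simplified; not the exact AnomalyHop implementation.
def fit_gaussian(train_feats: np.ndarray):
    # train_feats: (n_normal_images, feat_dim) features at one location
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 0.01 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)     # small diagonal term keeps cov invertible

def anomaly_score(feat: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    d = feat - mu
    return float(np.sqrt(d @ cov_inv @ d))   # larger = more anomalous

rng = np.random.default_rng(1)
mu, cov_inv = fit_gaussian(rng.normal(size=(200, 16)))
score = anomaly_score(rng.normal(size=16) + 3.0, mu, cov_inv)  # shifted = anomalous
```

Collecting such scores over all spatial locations yields an anomaly map; maps from different hops are then fused in module 3.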
1.4 Contributions of the Research
1.4.1 Texture Analysis and Synthesis

We propose a hierarchical spatial-spectral correlation (HSSC) method for texture analysis and synthesis in this work. The main contributions are summarized below.

1. We propose a data-driven feature extraction scheme based on Saak transform coefficients, called the Hierarchical Spatial-Spectral Correlation (HSSC) method. To demonstrate the effectiveness of the HSSC method, we conduct extensive experiments on texture classification and show that it offers very competitive results compared with state-of-the-art methods.

2. We also propose a texture synthesis method based on spatial-spectral correlation and Saak transform kernels from multiple stages, which shows that the HSSC method can offer a unified texture representation for both discriminative and generative tasks. Extensive experiments show that the proposed method offers excellent visual performance.
1.4.2 Unsupervised Texture Segmentation

We introduce a data-centric approach to efficiently extract and represent textural information, which adapts to a wide variety of textures. The main contributions are summarized below.

1. The proposed method encodes both local texture features and global contrast information to form powerful texture features. Experimental results on both textural and natural image segmentation show that the segmentation method using the proposed features achieves very competitive or even better performance compared with state-of-the-art methods.

2. The proposed method provides a compact feature representation with a tunable hyperparameter to control the feature dimension, which overcomes the high computational complexity introduced by the high dimensionality of existing features.

3. Compared with previous unsupervised texture segmentation models, the proposed method is a data-driven strategy operating in an unsupervised manner, which makes it adaptive to various textural structures. It is also generic and flexible enough to be integrated into various segmentation approaches easily.
1.4.3 Dynamic Texture Synthesis

In this work, we incorporate a new loss term, called the Shifted Gram loss, to capture the structural and long-range correlation of the reference texture video. Furthermore, we introduce a frame sampling strategy to exploit long-period motion across multiple frames. The main contributions are summarized below.

1. With the help of the proposed Shifted Gram loss, we generalize the previous baseline model to a new major category of dynamic textures for the first time, namely those containing complicated long-range spatial correlations. Thorough experimental results are provided to demonstrate that our proposed dynamic texture synthesis model offers state-of-the-art visual performance.

2. With the help of the proposed frame sampling strategy, we increase the temporal diversity of the synthesized results over both short and long periods. We also show that it helps remove some minor flaws, like local hopping, present in previous results, yielding substantial visual quality improvement.
1.4.4 Image Anomaly Detection and Localization: PEDENet

Our work has the following three main contributions.

• We propose PEDENet for unsupervised image anomaly localization. It can be trained in an end-to-end manner with only normal images (i.e., unsupervised learning).

• Inspired by the GMM, the DE network models the distribution of patch embeddings and computes the membership of a patch belonging to certain modalities. It helps find outlying artifact patches.

• Experiments show that PEDENet achieves state-of-the-art performance on unsupervised anomaly localization for the MVTec AD dataset.
1.4.5 Image Anomaly Localization: AnomalyHop

In this work, an image anomaly localization method based on the successive subspace learning (SSL) framework, called AnomalyHop, is proposed. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distribution modeling via Gaussian models, and 3) anomaly map generation and fusion. The main contributions are summarized below.

• We introduce successive subspace learning into the image anomaly localization field for the first time and propose AnomalyHop, which is mathematically transparent, easy to train, and fast in its inference speed.

• AnomalyHop achieves state-of-the-art performance, with an area under the ROC curve (ROC-AUC) of 95.9% on the MVTec AD dataset, which is among the best of several benchmarking deep-learning-based methods.
1.5 Organization of the Dissertation

The rest of the dissertation is organized as follows. In this chapter, we reviewed the research background, including related machine learning and deep learning techniques, texture analysis, unsupervised texture segmentation, dynamic texture synthesis, and image anomaly detection and localization. We also demonstrated the significance and contributions of the research. In Chapter 2, we propose a data-centric approach to efficiently extract and represent textural information, which adapts to a wide variety of textures. In Chapter 3, we propose a hierarchical spatial-spectral correlation (HSSC) method for texture analysis. In Chapter 4, we propose an updated two-stream CNN incorporating two novel techniques, the Shifted Gram loss and a frame sampling strategy, to capture the structural and long-range correlation of the reference texture video and to exploit long-period motion across multiple frames. In Chapter 5, we propose PEDENet for unsupervised image anomaly localization, which integrates image representation learning and local feature density estimation models. In Chapter 6, we propose a new image anomaly localization method, called AnomalyHop, based on the successive subspace learning (SSL) framework. Finally, concluding remarks and future research directions are given in Chapter 7.
Chapter 2
A Data-centric Approach to Unsupervised Texture Segmentation

2.1 Introduction
Image segmentation is one of the most fundamental tasks in image processing and computer vision research, with a wide range of applications such as autonomous driving, remote sensing and medical image diagnosis. Texture segmentation, which partitions an image into multiple regions with similar textural patterns, is a frequently occurring problem in various circumstances. However, due to the complexity of textures, their segmentation is more challenging than that of natural images, where the structures are more regular.

In this paper, we focus on unsupervised segmentation of textured images, where no training data with ground truth is available. In addition, we do not use domain-specific or scene-specific knowledge, which is typically not available in the real world.
In the past decades, texture analysis and segmentation [17, 57, 79, 124] have been well studied and numerous methods have been proposed. For these methods, local features representing image structural information play a crucial role in the segmentation task when integrated into machine learning frameworks. In particular, various handcrafted features have been designed to characterize textural patterns. Filter banks are very popular feature extraction schemes, such as Gabor filters [57], gradient filters, Laplacian filters, and Gaussian filters. The responses to such filter banks, as well as statistics based on them, are used to obtain local texture features [75]. Other types of textural features include Local Binary Patterns [85], co-occurrence matrices [22], and wavelet transforms [5, 17].

For unsupervised texture segmentation, those features should be further processed with well-defined algorithms, e.g., graph cuts, clustering [57, 120], and region merging. More recent works employing matrix factorization [131] and energy function minimization [81, 115] have achieved excellent performance in segmentation. However, these works mainly focus on the segmentation part without specific local feature design, and further improvement of their performance is prohibited by the limitations of existing local feature extraction strategies.
The first limitation of most existing local feature extractors, e.g., filter banks, is their inefficient adaptivity to various texture contents due to pre-defined filter parameters. As a consequence, the effectiveness of these local features varies significantly across different classes of texture, or even depends on a single texture image. To overcome this problem, a straightforward method [81, 131] is to manually select a subset of filters from the aforementioned large set of filter banks, with the belief that there are filters suitable for the textural patterns in the processed images. However, manual selection requires human experience and insight, and is typically an expensive and time-consuming task. In this situation, it would be desirable to have a feature extraction method which can automatically adapt to various kinds of texture images and learn effective features in a data-driven manner.

Besides the adaptivity issue, the absence of global contrast information in existing local features, whose computation is constrained inside small image patches, makes them blind to image context. The contrast information between a texture and its context in an image is never taken into account during the feature extraction process and is only utilized in the subsequent segmentation stage, which limits the generalization of the local features. In addition, the high dimensionality of existing local features increases the computational burden since the segmentation algorithms need to calculate feature distances. The FSEG method [131] introduces a matrix factorization model to bypass iterative high-dimensional distance calculations and achieves fast segmentation. The PCA-MS method [81] employs principal component analysis (PCA) to reduce the dimension explicitly, which reveals the inherent redundancy of those handcrafted local features. Therefore, a good solution is to design a compact texture feature that deals with various texture patterns adaptively by leveraging both local and global image information.
In this paper, we propose a data-centric textural feature extraction method using Principle Representative Patterns. The proposed method clusters image patches and selects the cluster centroids as major patterns, denoted Principle Representative Patterns (PRPs). Texture features are constructed by measuring the similarities between a local patch and those Principle Representative Patterns. Here, the contrast information is utilized in the proposed feature representation, which further improves its discriminative power. The proposed PRP features can perform the image segmentation task by being integrated into existing segmentation algorithms. Since the proposed method is a data-driven strategy operating in an unsupervised manner, it is adaptive to various textural structures. Moreover, it offers a compact feature representation with a tunable hyperparameter to control the feature dimension. Extensive experimental results on both textural and natural image segmentation tasks show the superiority of the proposed method.
2.2 Textural feature based on Principle Representative Patterns

2.2.1 Pixel-wise Texture Representation
To tackle the unsupervised texture segmentation problem, we first build a pixel-wise texture representation. Since textural information is based on local structures instead of single pixels, we model our texture representation on the basis of image patches. Given an image I, the texture information of each pixel i, denoted as T(i), is represented as

T(i) = \{ w(P_i, P_j) \}_{j \in I}, \quad \forall i \in I \qquad (2.1)
Figure 2.1: Segmentation results step by step. (a) A texture mosaic with 5 different components from the Brodatz texture dataset. (b) Raw segmentation results with the proposed PRP features. (c) Final segmentation mask after post-processing.
Herein, we use the function w(P_i, P_j) to describe the relationship between two patches, P_i and P_j, where P_i denotes a square patch of fixed size centered at pixel i.
It is obvious that T(i) \in R^N is a high-dimensional vector, where N is the number of pixels in image I. It would be computationally unacceptable to utilize T(i) directly. Inspired by the strong self-similarity and quasi-periodicity of texture images, we assume that there exists a set of Principle Representative Patterns (PRPs) in image I that contains all the major textural patterns of the image. Then, we can use only these PRPs to describe T(i), reducing the dimension of the feature vector in Eq. 2.1 with negligible information loss. Thus, the feature vector can be rewritten as

T(i) = \{ w(P_i, P_j) \}_{P_j \in R}, \quad \forall i \in I \qquad (2.2)

where R denotes the set of all Principle Representative Patterns selected for the image.
Our texture feature model T(i) above considers not only the local patch P_i, but also its contrast with patches across the whole image, both local and non-local. Moreover, the whole process is driven by the processed data in an unsupervised manner, and the usage of PRPs provides more freedom in controlling the dimension of T(i) by manipulating the cardinality of R. Another thing to be noted is that we use the raw image patch P_i here only to illustrate the idea; P_i could actually be extended to a generalized feature representation g(P_i) extracted from P_i.
Figure 2.2: Spatial distributions of four typical Principle Representative Patterns.
2.2.2 Principle Representative Pattern (PRP)
To select representative patterns among all possible candidates in image I, an effective method is to select one patch from a collection of very similar patches as the representative one. We perform K-means clustering to get cluster centroids \{u_i\}_{i=1}^{K} over all the image patches of I according to the following distance metric:

\arg\min \sum_{i=1}^{K} \sum_{j \in C_i} \mathrm{dist}(P_j, u_i) \qquad (2.3)

where C_i denotes the set of patches assigned to the i-th cluster.
Those cluster centroids are utilized to form the Principle Representative Pattern set R. The number of centroids, K, can be tuned as a hyperparameter in practice according to the performance requirement and computational resource limitations.

The idea of the Principle Representative Pattern is to directly take advantage of the unique properties of texture images, i.e., quasi-periodicity and self-similarity. In other words, it exploits the high degree of redundancy in texture images. In fact, natural images also contain different levels of redundancy, and the proposed feature representation can also be extended to natural images.
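To make this step concrete, the following minimal sketch clusters overlapping patches with K-means and keeps the centroids as the PRP set. It is an illustration only: the patch extraction routine, its stride, and the scikit-learn dependency are our own choices rather than the dissertation's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(image, patch_size=7, stride=2):
    """Collect vectorized square patches from a 2D (grayscale) image."""
    H, W = image.shape[:2]
    patches = [image[y:y + patch_size, x:x + patch_size].ravel()
               for y in range(0, H - patch_size + 1, stride)
               for x in range(0, W - patch_size + 1, stride)]
    return np.asarray(patches, dtype=np.float64)

def learn_prps(image, num_prps=50, patch_size=7):
    """Cluster all patches of the image; the K centroids form the PRP set R."""
    patches = extract_patches(image, patch_size)
    kmeans = KMeans(n_clusters=num_prps, n_init=10).fit(patches)
    return kmeans.cluster_centers_          # shape: (num_prps, patch_size**2)
```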
Figure 2.3: Visualization of features from different texture components and boundaries.
2.2.3 PRP Features and Segmentation
With the Principle Representative Patterns, we can compute T(i) by measuring the similarities between each image patch P_i and the PRPs. Inspired by the famous non-local means denoising algorithm [16], we employ a Gaussian weighting function to measure the similarity between two patches:

w(P_i, P_j) = \frac{1}{Z(i)} \exp\left( -\frac{\| P_i - P_j \|^2}{2\sigma^2} \right) \qquad (2.4)

where \sigma > 0 is the standard deviation of the Gaussian kernel, acting as a smoothing factor, and Z(i) is a normalization constant:

Z(i) = \sum_{j} \exp\left( -\frac{\| P_i - P_j \|^2}{2\sigma^2} \right) \qquad (2.5)
Based on the above formulation, the proposed textural feature T(i) can be interpreted as a probability mass function or an energy spectrum. Each entry of T(i) describes the probability that P_i matches the corresponding major textural pattern, or the energy that P_i projects onto the corresponding representative pattern. A higher value indicates a higher probability and more concentrated energy, while lower values offer complementary side evidence to better describe P_i.
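Continuing the sketch above, the PRP feature of Eqs. 2.4 and 2.5 can be computed as normalized Gaussian similarities between every patch and the PRP centroids. The vectorized implementation below is our own illustration; the σ value mirrors the scale parameter reported later in the experiments but is otherwise an assumption.

```python
def prp_features(image, prps, patch_size=7, sigma=0.1):
    """Encode each patch as a normalized similarity vector T(i) over the PRPs."""
    patches = extract_patches(image, patch_size, stride=1)
    # Squared distances between every patch and every PRP centroid.
    d2 = ((patches[:, None, :] - prps[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))           # Eq. (2.4), unnormalized
    return weights / weights.sum(axis=1, keepdims=True)  # Z(i) of Eq. (2.5)
```

Each row of the returned matrix sums to one, matching the probability-mass-function interpretation above.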
The proposed method can be further extended to a general feature encoding technique. Replacing the local patch P_i with any local feature, our method can encode those local features into a compact and context-aware form. The generated feature map \{T(i)\} is also completely flexible to be combined with any cutting-edge segmentation algorithm, such as the Mumford-Shah functional [81] and the matrix factorization method [131]. The above facts give our method full potential to evolve with emerging local features and segmentation algorithms in the future.

Figure 2.4: Segmentation results of natural scenes and animal images in the real world.

Figure 2.5: Segmentation results of ground terrain texture images in the real world.
2.3 Experimental Results
We show segmentation results on various natural and textural images in the real world and then provide quantitative evaluations by comparing the proposed method with other state-of-the-art methods on the Prague Unsupervised Texture Segmentation Benchmark [48, 82].
2.3.1 Qualitative Segmentation Results
Figure 2.6: Segmentation results of histology images in the real world.

We first test the proposed method on several natural scene and animal images. Fig. 2.4 illustrates some examples of the segmentation results using the proposed features; those images are from the Berkeley Segmentation Dataset (BSDS500) [80]. We can see that the proposed approach segments all of the main regions correctly, even without any object-specific knowledge. Although there are still some flaws, e.g., the missing goose's beak and the merging of trees with their shadows, those flaws are reasonable since no semantic meaning is included in our model.
We also test the proposed method on ground terrain textures, which can be seen everywhere and have many variations under different weather and lighting conditions. In Fig. 2.5, some examples from the Ground Terrain Outdoor Scenes Dataset (GTOS) [128] are shown with the corresponding segmentation results. The boundaries between two different terrains are sharp and fine in the segmentation map, which proves that the proposed method has an excellent ability to handle different terrain textures, even in situations where the texture is nonuniform and noisy.
Histology image segmentation, which provides precise location information of different tissues, is highly desirable for surgeons and researchers. Fig. 2.6 shows several histology images with segmentation results using the proposed method. We can see that the irregular boundaries between two tissues are well located, and both main regions and secondary regions can be well distinguished and separated.
2.3.2 Comparison Results on Texture Mosaics
We further compare our method with three state-of-the-art methods on the Prague segmentation benchmark, which contains 80 color texture mosaics of size 512×512. For each of the 80 texture mosaics, we learn a separate set of Principle Representative Patterns (PRPs) and compute the segmentation results subsequently.
Table 2.1: Comparison results of the proposed approach with various segmentation methods on the Prague Unsupervised Texture Segmentation Benchmark (Part I). Up arrows indicate that better results correspond to larger values, and down arrows the opposite. Boldface highlights the best, and a star denotes the second-best value in each column.

Method       CS↑    OS↓    US↓    O↓    C↓
PMCFA [86]   75.32  11.95  9.65   4.51  8.87
Ours         72.75  8.11   9.80   6.76  6.18
PCA-MS [81]  72.27  18.33  9.41   7.25  6.44
FSEG [131]   69.18  14.69  13.64  9.25  12.55
Table 2.2: Comparison results of the proposed approach with various segmentation methods on the Prague Unsupervised Texture Segmentation Benchmark (Part II). Up arrows indicate that better results correspond to larger values, and down arrows the opposite. Boldface highlights the best, and a star denotes the second-best value in each column.

Method       CO↑    CC↑    I.↓    II.↓  RM↓
PMCFA [86]   88.16  90.73  11.84  1.47  3.76
Ours         87.20  87.65  12.80  2.28  3.72
PCA-MS [81]  85.96  91.24  14.40  1.59  4.45
FSEG [131]   84.44  87.38  15.89  2.60  4.51
The parameters for learning the features are set empirically and remain fixed for all instances in the dataset. The images are converted from the RGB color space to the LAB color space, which is more perceptually uniform. The patch size is 7×7, the number of PRPs is 50, and the scale parameter of the Gaussian kernel is 0.1. To avoid using prior knowledge about the number of different texture components, we determine the number of clusters automatically by searching according to the overall intra-cluster distance. Furthermore, a Conditional Random Field (CRF) [65] with a local voting scheme is utilized to refine the segmentation boundary as post-processing.
Tables 2.1 and 2.2 show numerical results for various segmentation schemes, where the best and second-best results are highlighted by boldface fonts and asterisks, respectively. We see that the proposed method offers top-two results on most evaluation metrics. It achieves the best scores on the oversegmentation (OS) and commission error (C) metrics. In total, our method achieves 2 best scores and 5 second-best scores among those 10 indicators. Figs. 2.7, 2.8 and 2.9 show visualization results of segmentation on texture mosaics, where the number of texture components increases from 3 to 12, going from left to right. We can see that the segmentation results of our method are very close to the ground truth in most cases, and they clearly outperform FSEG. We also notice that errors mainly occur in mosaics with a higher number of components, e.g., the first and second images from the right side. This is because the optimal hyperparameter settings for image mosaics with different numbers of components would be quite different, but we have to make a tradeoff to get a unified hyperparameter setting for the whole dataset.

A detailed introduction of the PMCFA algorithm can be found online at:
https://sites.google.com/site/costaspanagiotakis/research/imagesegmentation

Figure 2.7: Example results on the Prague dataset (Part I). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.

Figure 2.8: Example results on the Prague dataset (Part II). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.
2.4 Conclusion
Figure 2.9: Example results on the Prague dataset (Part III). The first row shows original images, the second row shows ground truth, the third shows segmentation results from FSEG [131], and the last row shows results from the proposed method.

An effective textural feature extraction method for unsupervised texture segmentation was presented. Features are learned from data in an unsupervised manner. They encode local features as well as contrast information. Extensive experimental results showed that the proposed method offers state-of-the-art performance.
Chapter 3

Texture Analysis via Hierarchical Spatial-Spectral Correlation (HSSC)
3.1 Introduction
Texture, as a combination of regularity and randomness, provides essential characteristics for surface and object recognition. Texture analysis plays a critical role in the analysis of remote sensing photos, medical images, and many other images. A large amount of research has been conducted on texture analysis [17], classification [32, 50], segmentation [81, 131] and synthesis in the last five decades. Despite these efforts, texture analysis is still a challenging problem in image processing.
One major difficulty in texture research is the lack of an effective mathematical tool for texture representation. Inspired by recent advances in neural-network-inspired image transforms such as the Saak transform [61] and the Saab transform [62], we adopt the Saak transform as the texture representation tool since it offers several advantages. First, being different from the Fourier and wavelet transforms, the Saak transform is a data-driven transform that learns transform kernels from training data samples. Second, all learnable parameters are determined by second-order statistics of input images in an unsupervised manner. Neither data labels nor back-propagation is needed in transform kernel computation.
The main contribution of this work is the proposal of a new texture feature extraction scheme based on Saak transform coefficients, called the Hierarchical Spatial-Spectral Correlation (HSSC) method. The HSSC method conducts a correlation analysis on Saak transform coefficients to obtain texture features of high discriminant power. To demonstrate the effectiveness of the HSSC method, we conduct extensive experiments on texture classification and show that it offers very competitive results in comparison with state-of-the-art methods. The proposed HSSC method is generic and flexible, and it can be integrated into various texture analysis tasks.

The rest of this work is organized as follows. Related previous work is reviewed in Sec. 3.2. The HSSC method is described in Sec. 3.3. Experimental results are given in Sec. 3.4. Finally, concluding remarks are drawn in Sec. 3.5.
3.2 Review of Previous Work
Numerous methods have been proposed for texture analysis and feature representation. Many methods employ local features to represent the structural information of textures. One popular texture feature representation and extraction scheme is to convolve input textures with a set of filter banks. Examples include Laws filters [66], Gabor filters [40, 57], and wavelet filters [5, 17]. The responses to these filter banks as well as their statistics can be used to obtain local texture features. Other textural features include local binary patterns (LBP) [85] and co-occurrence matrices [22]. Although these feature extractors work to a certain degree, they share several common limitations.
Current filter-bank-based texture feature extractors are not efficient at adapting to different texture classes due to their pre-defined filter parameters; they are not data-driven. The effectiveness of texture features may vary significantly across multiple texture classes and even in different regions of a single texture image. Besides, some quasi-periodic textures exhibit similar patterns over a relatively large region, so analyzing textures in a local window may not be powerful enough. Local features can also be affected by randomness, so they are not as stable as one would expect. This can lead to inaccurate classification and segmentation results. In other words, it is desirable to analyze textures with a flexible window size, and a hierarchical multi-layer texture analysis method seems to be a natural choice.
A new data-driven transform, called the Saak (Subspace approximation with augmented kernels) transform, was proposed in [61]. It has a set of orthonormal transform kernels so that its inverse transform can be performed in a straightforward manner. The Saak transform has two main ingredients: principal-component-analysis-based (PCA-based) subspace approximation and kernel augmentation. The latter is needed to resolve the sign confusion problem. Being different from the wavelet transform, the Saak transform can capture features of very fine resolution over a large region. The wavelet transform is a linear transform of a single layer while the Saak transform is a nonlinear transform of multiple layers.
3.3 Proposed HSSC Method

The HSSC method can be categorized into two types: 1) class-specific modeling, and 2) class-independent modeling. We focus on the class-specific case first, and describe the single-layer spatial-spectral correlation (SSC) and its hierarchical (i.e., multi-layer) extension in Secs. 3.3.1 and 3.3.2, respectively. Then, class-independent modeling is presented in Sec. 3.3.3.

3.3.1 Correlation of PCA Coefficients

We view texture as a two-dimensional random field that exhibits quasi-periodic patterns. We can first conduct correlation analysis on image pixels of different lags to build the correlation matrix in the spatial domain. Then, we obtain eigenvalues and eigenvectors from the correlation matrix. This process is called principal component analysis (PCA) if we keep a subset of eigenvectors with the largest eigenvalues to span a signal subspace. The detailed procedure is elaborated below.
We first collect texture patches from a source texture T to form a collection of patches for texture T, denoted by

S = \{ P_i \mid i \in I \} \qquad (3.1)

where i is the position index, I is the position index set, P_i \in R^N is the patch located at position i, and N is the patch size. The patch size is a tunable parameter and can be adjusted according to the specific application need and computational efficiency considerations. Note that we can obtain different patch samples by shifting only several pixels and, as a result, we can get a rich set of patches from only a few source images.
We conduct PCA on the members of S to obtain a set of K principal components (of dimension N) and use them as convolution kernels. For each patch, we get K PCA coefficients. Next, we compute the correlation matrix, R, of these K PCA coefficients. The correlation matrix of PCA coefficients is a diagonal matrix:

R(i, j) = 0, \quad \forall i \neq j \qquad (3.2)

with decreasing diagonal elements:

R(i, i) \geq R(j, j), \quad \forall i \leq j \qquad (3.3)

These properties are direct consequences of using PCA kernels in the signal transform.
Then, we can use these diagonal elements as the texture feature. For an unknown input texture X, we can determine whether it belongs to texture class T by the following two steps:

1. Collect patches from X and represent these patches using the PCA associated with texture T;
2. Compute the correlation of the PCA coefficients obtained from the previous step.

If X belongs to class T, the correlation matrix should be diagonal and its diagonal elements should be similar to those obtained from the training textures of class T. Generally speaking, we can use the distance between two correlation matrices to measure the closeness of X and textures in class T.
Figure 3.1: Illustration of correlation matrices of PCA coefficients of a target texture class, its matched and mismatched ones. (a) Target texture, (b) matched one, (c) mismatched one.

Examples of matched and unmatched test cases are given in Fig. 3.1, where correlation matrices are visualized by pixel brightness (the brighter, the larger the value). Clearly, if two textures are visually similar, their correlation matrices in the PCA domain of the target texture class are more similar. Since the above-mentioned analysis considers correlations of spectral components of local patches, it is called the spatial-spectral correlation analysis.
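A minimal numpy/scikit-learn sketch of this single-stage spatial-spectral correlation analysis is given below; the function name and the choice of 16 components are illustrative assumptions, not the dissertation's code.

```python
import numpy as np
from sklearn.decomposition import PCA

def ssc_signature(patches, n_components=16, pca=None):
    """Correlation matrix of PCA coefficients of texture patches.

    patches: (num_patches, N) vectorized patches. If `pca` is given, reuse the
    kernels learned from the target texture class T; otherwise fit them here.
    """
    if pca is None:
        pca = PCA(n_components=n_components).fit(patches)
    coeffs = pca.transform(patches)            # K PCA coefficients per patch
    corr = coeffs.T @ coeffs / len(coeffs)     # (K, K) correlation matrix
    return corr, pca
```

The two-step hypothesis test then amounts to transforming patches of X with T's kernels and comparing the resulting correlation matrix against the training one, e.g., with a Frobenius distance.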
3.3.2 Correlation of Saak Coefficients

One limitation of the analysis using PCA coefficients is shown in Fig. 3.2. Textures T and T' are different, while the correlation matrices of their PCA coefficients are quite similar. It is desired to conduct multi-stage PCA as an extension. However, a straightforward cascade of multiple PCA transforms will lead to the sign confusion problem [60]. A kernel augmentation scheme was proposed in [25, 61] to address this issue.

For the example given in Fig. 3.2, we determine the two-stage Saak transform kernels based on the target texture T, which is Herringbone Weave, and compute the correlation matrices of the first-stage and second-stage Saak transform coefficients, respectively. The visualizations of these matrices are given in Fig. 3.2 (b) and (c), respectively.
For an input texture image X, we conduct a hypothesis test to check whether it belongs to texture class T or not. We apply the two-stage Saak transform to patches collected from X using kernels learned from texture class T, and build correlation matrices of the first- and second-stage Saak transform coefficients. If X happens to be Herringbone Weave as well, we will obtain correlation matrices similar to those in Figs. 3.2 (b) and (c). However, if it is a different texture class (say, Woolen Cloth), the visualizations of its first-stage and second-stage correlation matrices are those shown in Figs. 3.2 (e) and (f), respectively. Although its first-stage correlation matrix is similar to that of the target texture, its second-stage one is quite different. We will compare texture classification performance using multi-stage Saak transforms in Sec. 3.4.
Figure 3.2: Visualization of two-stage correlation matrices: (a) Herringbone Weave (texture T), and visualizations of its correlation matrices using (b) first-stage and (c) second-stage Saak coefficients; (d) Woolen Cloth (texture T'), and visualizations of its correlation matrices using (e) first-stage and (f) second-stage Saak coefficients based on T's Saak transform kernels.
To better understand and interpret our method, we further formalize our model as a statistical problem. PCA components or higher-stage Saak kernels, which are computed using texture patches, can be understood as a set of representative texture patterns. They are used to measure the similarity between an unknown texture X and the target texture T. Each channel of the filtered images contains the response of X against an associated pattern. Then, we compute the correlation between those responses. The diagonal terms are auto-correlations of individual channels while the off-diagonal terms are cross-correlations between different channels. Thus, our approach attempts to capture the energy distribution of representative texture patterns derived from PCA and the Saak transform.
3.3.3 Classification with Shared Transform Kernels

The HSSC method described in Secs. 3.3.1 and 3.3.2 is suitable for hypothesis testing; namely, checking whether an unknown texture X belongs to a target texture T. For a texture classification problem with M target textures, we need to conduct the test M times, so the computational cost is higher. To reduce the computational cost, we can collect patches from all texture classes in Eq. (3.1) and conduct the PCA or the Saak transforms based on transform kernels learned from the mixed classes.

Generally speaking, the class-independent transform reduces the computational cost at the expense of a lower classification performance. However, the classification performance can be compensated in another manner; that is, we can increase the number of principal components at each Saak transform stage. Although the complexity will still increase a little, it remains lower than the solution using different transform kernels for different target textures.
3.4 Experimental Results

In this section, we conduct experiments to demonstrate the effectiveness of the proposed HSSC method on several texture classification benchmark datasets.
Class-Specific Transforms. First, we report experimental results using class-specific transforms for the Brodatz and the VisTex datasets. The Brodatz texture dataset contains 155 monochrome images with standardized viewpoint and scale, which are collected from the Brodatz book [15]. The VisTex dataset (available at http://vismod.media.mit.edu/vismod/imagery/VisionTexture/) was initially built as an alternative to the Brodatz dataset. It does not conform to rigid frontal plane perspectives and studio lighting conditions. In this experiment, we select 17 and 20 classes from the Brodatz and VisTex datasets, respectively. Images with non-texture backgrounds and significant visual variance are discarded.

The kernel sizes are set to 8×8 and 12×12 for the Brodatz and the VisTex datasets, respectively. The responses from the first quarter of kernels are selected for correlation computation. These parameters remain the same throughout the experiments.

Figure 3.3: Per-class accuracy results for the Brodatz and the VisTex datasets: (a) Brodatz dataset, (b) VisTex dataset.
In the training phase, we compute correlation matrices of Saak coefficients at various stages for each target texture class. In the testing phase, we pass each test texture through each class-specific transform and determine the corresponding correlation matrices. Then, we compute the Frobenius distance between the derived correlation matrices and those obtained in the training phase. The texture class that gives the smallest Frobenius distance is the predicted class. The per-class accuracy results for Brodatz and VisTex are given in Figs. 3.3 (a) and (b), respectively.
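Assuming the ssc_signature sketch above and per-class models learned in the training phase, the class-specific testing rule reduces to an argmin over Frobenius distances; this rendering is our own.

```python
def predict_class(test_patches, class_models):
    """class_models: dict mapping class name -> (corr_T, pca_T) from training."""
    best_name, best_dist = None, np.inf
    for name, (corr_t, pca_t) in class_models.items():
        corr_x, _ = ssc_signature(test_patches, pca=pca_t)  # class-specific kernels
        dist = np.linalg.norm(corr_t - corr_x)              # Frobenius distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name
```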
We also show the averaged classification accuracy in Table 3.1. As the stage number becomes larger, the classification becomes better. Texture features have stronger discriminant power at higher stages because they can capture features in a larger receptive field. Features from the first stage are not powerful enough and tend to suffer from easy-to-confuse samples. As we proceed to the second and third stages, accuracy increases gradually. This is especially obvious for the more challenging VisTex dataset.
Table 3.1: Averaged classification accuracies for the Brodatz and the VisTex datasets, which improve as the number of stages increases. The best performance number is in bold.

Stage Number  I     II    III   IV
Brodatz       95.4  97.7  98.7  98.7
VisTex        78.8  89.1  96.2  97.4
Class-Independent Transforms. Next, we conduct experiments on the CUReT dataset [33] using shared (or class-independent) Saak transform kernels. The CUReT texture dataset contains 61 texture classes. Images in each class are taken from the same material but with different viewpoints and lighting conditions. Also, variations of background, shadowing and surface normals make the classification task challenging. We adopt a preprocessing step similar to that in [122]. A subset of images with a viewing angle of approximately less than 30 degrees is selected in our experiment. This yields about 40 images per class. A central region of size 200×200 is cropped from each image to discard non-texture background. The dataset is randomly split into training and testing sets. We apply the 3-stage Saak transform and compute correlation matrices based on the 3rd-stage Saak coefficients.
We set the number of Saak transform kernels to 8 at each stage, and the kernel size is 5×5. The coefficients of the correlation matrices are selected as the feature vector, which is fed into a linear SVM classifier. The classification accuracy is shown in Table 3.2. As shown in the table, the proposed HSSC method offers competitive results compared with other leading methods.
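In the class-independent setting, the correlation-matrix entries simply become a feature vector for the linear SVM. A hedged sketch follows, with our own helper name and scikit-learn's LinearSVC standing in for the linear SVM mentioned above:

```python
from sklearn.svm import LinearSVC

def corr_feature(patches, shared_pca):
    """Flatten the correlation matrix of shared-kernel coefficients into a vector."""
    corr, _ = ssc_signature(patches, pca=shared_pca)  # shared (class-independent) kernels
    return corr.ravel()

# X_train: one corr_feature(...) row per training image; y_train: class labels.
# clf = LinearSVC().fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```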
Table 3.2: Performance comparison of the proposed HSSC method and other state-of-the-art methods for the CUReT texture dataset.

Methods         Accuracy
Textons [50]    98.5
BIF [32]        98.6
VLAD            98.8
Histogram [14]  99.0
KCB             97.7
Ours            98.7
Many modern machine learning algorithms demand a large amount of labeled data. To examine this issue, we test the model capability with respect to the number of training patches for the CUReT dataset in Table 3.3. We see from this table that the classification accuracy of the proposed HSSC method becomes saturated while the number of training patches is still in a reasonable range.
Table 3.3: Classification accuracy as the number of training patches increases on the CUReT dataset.

No. of training patches  500   1000  2000  3000
Classification accuracy  90.4  96.9  98.2  98.7
3.5 Conclusion

An effective hierarchical spatial-spectral correlation (HSSC) method was proposed for texture analysis and classification. It applies a multi-stage Saak transform to input texture patches and then conducts correlation analysis on the Saak transform coefficients to obtain texture features of high discriminant power. Extensive experiments on texture classification with three benchmark datasets were conducted to demonstrate the effectiveness of the HSSC method. Both class-specific and class-independent transform kernels were examined.
Chapter 4

Dynamic Texture Synthesis via Long-Range Spatial and Temporal Correlation
4.1 Introduction
Given a short video clip of a target dynamic texture as the reference, the dynamic texture synthesis task is to synthesize a dynamic texture video of arbitrary length. Understanding, characterizing, and synthesizing temporal texture patterns has been a problem of interest in human perception, computer vision, and pattern recognition in recent years. Examples of such video patterns span from amorphous matter like flame, smoke, and water to well-shaped items like waving flags, flickering candles and flowing fountains. A large amount of work has been done on dynamic texture patterns [37, 49, 55, 84, 127], including dynamic texture classification and dynamic texture synthesis. Research on texture video sequences finds numerous real-world applications, including fire detection, foreground and background separation, and generic video analysis. Furthermore, characterizing these temporal patterns has theoretical significance in understanding the mechanism behind human perception of the temporal correlation of video data.
Figure 4.1: Examples for dynamic texture synthesis.

As compared with static texture image synthesis, the dynamic texture synthesis problem lies in 3D space. That is, it needs not only to generate each individual frame as a static texture image but also to process the temporal information to build a coherent image sequence. The main challenge in dynamic texture study is to model the motion behavior (or dynamics) of texture elements. It is a non-trivial and challenging problem. Thanks to the rapid development and superior performance of deep learning methods, many papers have been published with impressive visual effects in dynamic texture synthesis. Despite the progress, there are still several drawbacks in existing models, as pointed out in [117] and [41].
The first limitation is that existing models fail for texture examples that have long-range spatial correlation. Some examples are shown in the 4th, 5th and 6th columns of Fig. 4.1. As discussed in previous work [41, 117], this drawback is attributed to the fact that the current loss function (i.e., the Gram loss) is not efficient at characterizing long-range spatial correlation. It is shown by experiments that the Gram loss can provide excellent performance in capturing local image properties such as texture patterns. Yet, it cannot handle long-range image information well, since such information is discarded in the modeling process. Actually, we observe that existing solutions fail to provide satisfactory performance in synthesizing dynamic textures with long-range as well as mid-range correlations.
The second drawback is that current models generate textures with monotonous motion, i.e., relatively smooth motion between adjacent frames. In other words, the generated samples do not have diversified dynamics. Sometimes, they even appear to be periodic. This is because previous models primarily focus on dynamics between adjacent frames, but ignore motion over a longer period. For example, Funke et al. [41] used the correlation matrix between frames to represent the motion information, and Tesfaldet et al. [117] adopted a network branch for optical flow prediction to learn dynamics. This shortcoming yields visible distortion to the perceptual quality of synthesized results.

Figure 4.2: Three representative dynamic texture examples (one row per texture), where the textures are homogeneous. The first column shows one frame of the reference dynamic texture video, the second column shows synthesized results obtained by the baseline model, and the third column shows results obtained using our model.
Motivated by the above two drawbacks, we propose a new solution to address them in this work. First, we incorporate a new loss term, called the Shifted Gram loss, to capture the structural and long-range correlation of the reference texture video. Second, we introduce a frame sampling strategy to exploit long-period motion across multiple frames. The solution is implemented using an enhanced two-branch convolutional neural network. It can synthesize dynamic textures with long-range spatial and temporal correlations. As shown in Fig. 4.1, the proposed method handles both homogeneous and structured texture patterns well. Extensive experimental results will be given to show the superiority of the proposed method.

Figure 4.3: Three representative dynamic texture examples (one row per texture), where the textures are structured. The first column shows one frame of the reference dynamic texture video, the second column shows synthesized results obtained by the baseline model, and the third column shows results obtained using our model.
4.2 Related Work

Recently, research has been conducted to solve a similar problem in the context of static texture synthesis, e.g., [10, 108, 143]. Berger and Memisevic [10] used a variation of the Gram loss that takes spatial co-occurrences of local features into account. Zhou et al. [143] proposed a generative adversarial network (GAN) that was trained to double the spatial extent of texture blocks extracted from a specific texture exemplar. Sendik and Cohen-Or [108] introduced a structural energy loss term that captures self-similar and regular characteristics of textures based on correlations among deep features.
However, in spite of these efforts, dynamic texture synthesis remains a challenging and non-trivial problem for several reasons. First, even if it is possible to capture and maintain the structural information in a single image, it is still difficult to preserve the structural information in an image sequence. Second, there exist more diversified patterns in dynamic textures due to various spatio-temporal arrangements. More analysis with illustrative examples will be elaborated in Sec. 4.4. To summarize, extending a static texture synthesis model to a dynamic one is not a straightforward problem, and it is the focus of our current work.
4.3 Proposed Method

In this section, we briefly introduce the baseline model, including the network structure, the learning method and the synthesis process. Then, we propose several techniques to address long-range spatial and temporal correlations in the underlying reference texture sequence.
4.3.1 Two-Stream Convolutional Network

The baseline model used in our work is the two-stream CNN proposed by Tesfaldet et al. [117]. It is constructed from two convolutional neural networks (CNNs): an appearance stream and a dynamics stream. This two-stream design enables the model to factorize the appearance and dynamics of a dynamic texture and analyze them independently. Similar to previous work on texture synthesis, the networks summarize an input dynamic texture into a set of statistics computed from feature maps, and those feature statistics are then used as the reference to synthesize a new video similar to the input dynamic texture.
During the synthesis process, the model optimizes a randomly initialized noise pattern to drive the feature statistics of this noise input to match those of the input dynamic texture. To achieve that, it conducts gradient descent with back-propagation during the synthesis process, optimizing the loss function with respect to each pixel to match those feature statistics. Meanwhile, all parameters of the appearance stream and the dynamics stream are pre-trained and remain fixed during the synthesis process.
Appearance Stream. The design of the appearance stream follows the spatial texture model first introduced by [42]. We use the same publicly available normalized VGG19 network [112] that was first used by [42]. Previous research has shown that such a CNN pre-trained on an object classification task can be very effective at characterizing texture appearance.
To describe the appearance of an input dynamic texture, we first feed each frame of the input video into the network, and denote the feature maps at layer l as F^l_i \in R^{M_l}, where F^l_i is the i-th vectorized feature map of layer l, and M_l is the number of entries in each feature map of layer l. Then, the pair-wise product of the vectorized feature maps, i.e., the Gram matrix, is defined as

G^l_{ij} = \frac{1}{M_l} \sum_{k} F^l_{ik} F^l_{jk} = \frac{1}{M_l} \langle F^l_i, F^l_j \rangle \qquad (4.1)

where \langle \cdot, \cdot \rangle denotes the inner product of two vectors. In practice, Gram matrices are computed on several different layers to capture multi-scale features.
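As a small illustration of Eq. (4.1), here is our own numpy rendering, with the feature maps of one layer given as a channels-first array:

```python
import numpy as np

def gram_matrix(fmap):
    """Gram matrix of one layer's feature maps; fmap has shape (C, H, W)."""
    C = fmap.shape[0]
    F = fmap.reshape(C, -1)        # vectorize each of the C feature maps
    M_l = F.shape[1]               # number of entries per feature map
    return F @ F.T / M_l           # pair-wise inner products, Eq. (4.1)
```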
During the synthesis process, we initialize a random noise sequence \hat{X}, and then feed it into the network to obtain its Gram matrix representation \hat{G}. The appearance loss at layer l can then be defined as

\mathcal{L}^l_{appearance} = \sum_{ij} \left( G^l_{ij} - \hat{G}^l_{ij} \right)^2 \qquad (4.2)
In practice, we further normalize it by the number of feature maps in that layer and the number of input frames. The final appearance loss is the weighted sum over all selected layers, defined as

\mathcal{L}_{appearance} = \sum_{l} w_l \mathcal{L}^l_{appearance} \qquad (4.3)
Dynamics Stream. The dynamics stream follows the baseline model [117], employing a pre-trained optical flow prediction network which takes each pair of temporally consecutive frames as input. The structure of the dynamics stream is as follows. The first layer consists of 32 3D convolution filters of spatial size 11×11. Then, a squaring activation function and 5×5 spatial max-pooling with a stride of one are applied sequentially. Next, a 1×1 convolution layer with 64 filters follows. Finally, to remove local contrast dependence, an L1 divisive normalization is applied. To capture dynamics, i.e., texture motion, at multiple scales, an image pyramid is employed; the image at each scale is processed independently and the results are concatenated to get the final feature.
To describe the dynamics of an input dynamic texture, we feed each pair of consecutive frames into the network, and conduct exactly the same Gram matrix computation as in the appearance stream. Here, we use D to denote the Gram matrix in the dynamics stream. During the synthesis process, we can have \hat{D} computed from the generated dynamic texture \hat{X}. Then the dynamics loss can be defined as

\mathcal{L}^l_{dynamic} = \frac{1}{M_l} \sum_{ij} \left( D^l_{ij} - \hat{D}^l_{ij} \right)^2 \qquad (4.4)
The final dynamics loss is the weighted sum over all selected layers, defined as

\mathcal{L}_{dynamic} = \sum_{l} w_l \mathcal{L}^l_{dynamic} \qquad (4.5)
4.3.2 Global-aware Gram Loss

The effectiveness of Gram matrices can be explained by the fact that they capture the coherence across multiple feature maps at a single location. In other words, if we view the pre-trained CNN as a set of well-designed filters, the correlations between the responses of different filters describe the appearance of the texture. This kind of coherence is so powerful for texture that one can even use a shallow network with random weights to synthesize static texture [121], without help from a complicated pre-trained CNN. Similar methods have been adopted in many other computer vision topics, such as image style transfer and inpainting. However, as shown in Fig. 4.1, it still has non-negligible drawbacks.

Figure 4.4: This figure shows the importance of local feature correlation in structural texture. The red boxes show a pair of texture patches at the same locations. They can also be viewed as visualizations of the receptive field in the network.
From the equation, we can see that the Gram matrices are totally blind to the global arrangement of objects inside each frame of the reference video. Specifically, the pair-wise feature products are taken at each single location and then averaged spatially. As a result, all spatial information is erased during the inner product calculation, which makes the Gram loss fail to capture structural information at a relatively large scale. Herein, inspired by [10], we integrate the following shifted Gram matrices to replace the original Gram matrices:

\hat{G}^l_{ij} = \frac{1}{M_l} \langle T(F^l_i), T(F^l_j) \rangle \qquad (4.6)

where T is a spatial transform operator. For example, T could be a horizontal shift, a vertical shift, or even a shift along the diagonal direction. In this way, we can further capture the correlation between a local feature and features from its neighborhood, and we can adopt various spatial transforms with different angles and amplitudes. Plain Gram matrices only make use of correlations between feature maps, whereas image structures are represented by the correlations inside each feature map.
The importance and effectiveness of the spatial correlation between a local feature and its neighborhood can be illustrated with Fig. 4.4. If the dynamic texture has a homogeneous appearance, the two texture patches will have a very weak correlation. For example, in a homogeneous texture like smoke, if we fix the left patch during the synthesis process, many texture appearances are possible at the location of the right patch, as long as each is consistent with its local neighborhood in a natural way. On the contrary, in a structural texture like a flag, the two patches at the same locations have a much stronger dependency; if we fix one of them, the other must obey the constraint.

Then, an appearance loss with a spatial transform operator T can be computed accordingly. We can design a set of different T to obtain a comprehensive appearance loss.
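A sketch of one such shifted Gram matrix is given below. Following the spirit of Eq. (4.6) and [10], each feature map is paired with a spatially shifted counterpart; the cropping convention at the border is our own assumption.

```python
def shifted_gram(fmap, shift=8, horizontal=True):
    """Shifted Gram matrix: correlate features with features `shift` pixels away.

    Unlike the plain Gram matrix, this statistic retains part of the spatial
    arrangement, so large-scale structure is no longer averaged out.
    """
    C = fmap.shape[0]
    if horizontal:
        a, b = fmap[:, :, :-shift], fmap[:, :, shift:]
    else:
        a, b = fmap[:, :-shift, :], fmap[:, shift:, :]
    A, B = a.reshape(C, -1), b.reshape(C, -1)
    return A @ B.T / A.shape[1]
```

In the experiments, several shift distances are combined, so a set of such matrices enters the appearance loss.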
4.3.3 Temporal-aware Dynamic Loss

There are two critical requirements for synthesizing perceptually natural textures: spatial consistency and temporal consistency. The method proposed in the previous section is designed to render textures with long-range structure within each frame. In other words, it handles non-uniformity in the spatial domain. A very natural question, then, is whether non-uniformity also exists in the temporal domain, and whether the previous model can handle it. More specifically, here we use non-uniformity in the temporal domain to refer to long-time and non-local motion in dynamic texture.

Actually, this is a quite obvious problem when we try to generalize the model from homogeneous texture to structured texture. First, texture with long-range structure naturally owns more complicated motion patterns as time elapses. Second, keeping long-range structures stable and continuous across consecutive frames is much harder than for homogeneous texture. Instead, the synthesized result from the previous model often looks periodic, which can be a huge drawback and damage the perceptual quality of the synthesized results. Moreover, we also observe that the previous model can generate local motion hopping in consecutive frames, which also implies that the temporal consistency needs to be improved.
In the baseline model, an optical flow prediction network is used to get features that capture the dynamics between every two consecutive frames, and those features are further utilized in the synthesis process. So, in the previous method, all temporal dynamics in the synthesis results come from pairs of consecutive frames, and those temporal dynamic features are further averaged over the whole video. That makes the model learn the temporal motion/dynamics only over a very short time period, and ignore any potential long-range temporal pattern in the reference video.

Figure 4.5: This figure shows the importance of middle- and long-range motion. Using the flag sequence as an example, we can find that most meaningful motion actually crosses multiple frames. Such long-range motion, combined with long-range spatial structure, gives a vivid visual effect.
Here we employ a simple but effective approach to encode both short-range and long-range motion. We propose a multi-period frame sampling strategy to modify the dynamic loss. Rather than simply using each consecutive frame pair, we first set a sampling interval t, and then take the i-th frame together with the (i+t)-th frame as a pair to compute the dynamic loss with interval t as follows:

\mathcal{L}_{dynamic}(t) = \sum_{i,l} \beta_l \mathcal{L}^l_{dynamic}(t; i) \qquad (4.7)

The final dynamic loss is the weighted sum over all selected time intervals t:

\mathcal{L}_{dynamic} = \sum_{t} \alpha_t \mathcal{L}_{dynamic}(t) \qquad (4.8)

where \alpha_t is a pre-defined hyperparameter to control the trade-off between short-range and long-range motion.
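The sampling strategy of Eqs. (4.7) and (4.8) can be sketched as below, where dyn_loss_pair stands in for the dynamics-stream Gram loss of one frame pair and the interval weights are illustrative choices of ours:

```python
def multi_period_dynamic_loss(frames, dyn_loss_pair,
                              intervals=(1, 2, 4), alphas=(1.0, 0.5, 0.25)):
    """Sum the pairwise dynamics loss over several frame-sampling intervals t."""
    total = 0.0
    for t, alpha in zip(intervals, alphas):
        # Pair frame i with frame i + t, as in Eq. (4.7).
        pair_losses = [dyn_loss_pair(frames[i], frames[i + t])
                       for i in range(len(frames) - t)]
        total += alpha * sum(pair_losses)   # Eq. (4.8) weighting
    return total
```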
Figure 4.6: Examples of homogeneous dynamic texture.

Consequently, the new total loss is a weighted sum of the enhanced \mathcal{L}_{appearance} and \mathcal{L}_{dynamic}, and the synthesis proceeds following our baseline model [117] as described in Sec. 4.3.1.
4.4 Experiments

In this section, we show our experimental results on both homogeneous and structural dynamic texture videos, and then compare the proposed method with the baseline model. Given their temporal nature, our results are best viewed as videos in the supplementary materials.
4.4.1 Experiment Setting

Our two-stream architecture is implemented using TensorFlow. Results were generated using one NVIDIA Titan X GPU, and the synthesis time ranges between 10 and 30 minutes to generate 8 frames at an image resolution of 256 by 256. To achieve a fair comparison, all dynamic texture samples shown in this work are selected from the DynTex dataset [87], following the baseline model [117].

Figure 4.7: Dynamic textures with long-range correlations (Part I).

Dynamic textures are implicitly defined as local minima of the loss function. Textures are generated by optimizing with respect to the pixels of the video. Diversified results can be obtained by initializing the optimization process with I.I.D. Gaussian noise, and the non-convex property of the loss function also provides extra variation. Consistent with previous works [42, 117], we use L-BFGS [70] optimization.
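To make the optimization loop concrete, here is a self-contained toy in the same spirit: scipy's L-BFGS-B drives a noise signal to match the target Gram statistics of a fixed random linear "feature extractor". The real system matches the two-stream losses instead; everything below is an illustrative stand-in of ours, not the dissertation's code.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256))            # 8 toy "feature maps" over 256 pixels
reference = rng.standard_normal(256)         # toy "reference texture"

def gram(x):
    F = W * x                                # per-pixel responses of each filter
    return F @ F.T / x.size

G_ref = gram(reference)                      # target statistics

def loss(x):
    d = gram(x) - G_ref
    return np.sum(d * d)                     # Gram-matching loss

x0 = rng.standard_normal(256)                # I.I.D. Gaussian noise initialization
res = minimize(loss, x0, method="L-BFGS-B", options={"maxiter": 200})
```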
For the appearance stream, we use Conv layer 1 and Pooling layers 1 through 4 to compute \mathcal{L}_{appearance}. For \mathcal{L}_{appearance}, we use horizontal and vertical shifts as our T, and the shift distances are set to 8, 16, 32, and 128. For \mathcal{L}_{dynamic}, we choose time intervals of 1, 2, and 4. All other settings remain the same as the baseline to achieve a fair comparison.
4.4.2 Experimental Results

We further conduct experiments on several typical dynamic texture sequences which contain strong structural information. To achieve a fair comparison, we list synthesized results from both the baseline and our model, frame by frame.

Homogeneous Dynamic Texture. We first test our method on several typical homogeneous dynamic texture sequences. The purpose is to verify that the proposed method maintains the advantages of the baseline model and keeps the ability to synthesize excellent homogeneous dynamic textures. Fig. 4.6 shows some examples of synthesized dynamic textures, frame by frame.

Figure 4.8: Dynamic textures with long-range correlations (Part II).

Figure 4.9: Dynamic textures with long-range correlations (Part III).
Structure with long-range correlation. For dynamic textures with long-range structure, our proposed method outperforms the baseline model on all these sequences; several typical examples are shown in Figs. 4.7, 4.8 and 4.9. Noteworthily, the proposed techniques work well even though these sequences have quite different forms of long-range correlation. In the flag sequences, the structural information externalizes as the boundary between two colors, which is an irregular curve crossing the whole frame. For a natural synthesis, the boundary must be sharp and clear in each frame, and also have a wavy motion like the other parts of the flag. In the fountain sequence, the structural information is shown by the stable fountain skeleton behind the flowing water. Unlike the flag, the fountain sequence requires those edges to be stable, and any motion will lead to failure. In the shower sequence, the difficulty is how to teach the model to learn the discrete water trajectories as well as the uniform background. As more results on our project page show, there are many more forms of long-range correlation in dynamic texture sequences, but our proposed method handles them in a unified manner.

Figure 4.10: Dynamic textures with middle-range correlations (Part I).

Figure 4.11: Dynamic textures with middle-range correlations (Part II).
Structure with middle-range correlation. We also notice that there are many textures in between homogeneous texture and structural texture. As shown in Figs. 4.10, 4.11 and 4.12, some dynamic textures have structures which are not global and obvious, but such structures are still critical for the perceived quality of synthesized results. For example, in the candle sequence (Fig. 4.10), we must keep the edges of the candles approximately circular. Similarly, we need to keep the snake's torso natural and smooth in the snake sequence.

Figure 4.12: Dynamic textures with middle-range correlations (Part III).
Another interesting thing about these sequences, as we pointed out in the previous section, is that the results from the baseline model show many unexpected local hoppings, i.e., sudden changes of local structure in consecutive frames. Due to their temporal nature, we recommend viewing those results in video form to fully understand this phenomenon. The reason behind those flaws, as we analyzed, is that the lack of sufficient constraints on local structures leads to discontinuity in the time dimension. By introducing better activation statistics, our model shows better results on such sequences as well.
We have explored the proposed techniques thoroughly and found a few limitations, which we leave as potential future improvements. First, although dynamic textures are highly self-similar in the spatial domain, the temporal motion in a dynamic texture video is much more complicated. Second, like most previous models and some works in similar areas, the generated video only has a relatively low resolution. In other words, the generated videos are more like visual effects than real videos with vivid details. So, there is still a non-negligible distance to generating dynamic textures with better temporal motion at a higher resolution.
4.5 Conclusion

Two effective techniques for dynamic texture synthesis were presented and proved effective. Compared with the baseline model, the enhanced model can encode the coherence of local features as well as the correlation between a local feature and its neighbors, and it also captures more complicated motion in the time domain. Extensive experimental results showed that the proposed method offers state-of-the-art performance.
Chapter 5

PEDENet: Image Anomaly Localization via Patch Embedding and Density Estimation
5.1 Introduction
Image anomaly detection is a binary classification problem that decides whether an input image contains an anomaly or not. Image anomaly localization further localizes the anomalous region at the pixel level. Due to recent advances in deep learning and the availability of new datasets, recent research is no longer limited to image-level anomaly detection results, but also shows significant interest in pixel-level localization of anomalous regions. Image anomaly detection and localization find real-world applications such as manufacturing process monitoring [107], medical image analysis [105, 106], and video surveillance analysis [104, 142].

Most well-studied localization solutions (e.g., semantic segmentation) rely on heavy supervision, where a large number of pixel-level labels and many labeled images are needed. However, in the context of image anomaly detection and localization, a typical assumption is that only normal (i.e., artifact-free) images are available in the training stage. This is because anomalous samples are few in general. Besides, they are often hard to collect and expensive to label. To this end, image anomaly localization is usually done in an unsupervised manner, and traditional supervised solutions are not applicable.
Several methods that integrate local image features and anomaly detection models for anomaly localization have been proposed recently, e.g., [30, 94]. They first employ a deep neural network to extract local image features and then apply an anomaly detection technique to local regions. Although they offer good performance on some datasets, two issues remain. First, they rely on a large pretrained network, which demands higher computational complexity and memory requirements. Second, most of them are trained in two stages rather than in an end-to-end manner. That is, image features are extracted in the first stage and then fed into the subsequent stage that corresponds to an anomaly localization module. There is no guarantee that the information essential to anomaly localization is well preserved and passed from the first stage to the second stage.
To address these issues, we present a new neural network model, called PEDENet, for unsupervised image anomaly localization in this work. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network and an auxiliary network called the location prediction (LP) network. The PE network utilizes a deep encoder to get a low-dimensional embedding for local patches. Inspired by the Gaussian mixture model (GMM), the DE network models the distribution of patch embeddings and computes the membership of a patch belonging to certain modalities. It helps identify outlying artifact patches. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a multi-layer perceptron (MLP), which takes patch embeddings as input and predicts the relative location of the corresponding patches. The performance of the proposed PEDENet is evaluated and compared with state-of-the-art benchmarking methods through extensive experiments.
Our work has the following three main contributions.

• We propose PEDENet for unsupervised image anomaly localization. It can be trained in an end-to-end manner with only normal images (i.e., unsupervised learning).

• Inspired by the GMM, the DE network models the distribution of patch embeddings and computes the membership of a patch belonging to certain modalities. It helps find outlying artifact patches.
Figure 5.1: Image anomaly localization examples (from left to right): normal images, anomalous images, ground truth of the anomalous region and the anomalous region predicted by the proposed PEDENet, where the red region indicates the detected anomalous region. These examples are taken from the MVTec AD dataset.
• Experiments show that PEDENet achieves state-of-the-art performance in unsupervised anomaly localization on the MVTec AD dataset.

The rest of this chapter is organized as follows. Related previous work is reviewed in Sec. 5.2. The proposed PEDENet is presented in Sec. 5.3. Experimental results are shown in Sec. 5.4. Finally, concluding remarks are given in Sec. 5.5.
5.2 Related Work

Some related previous work is reviewed in this section. For image anomaly detection and localization, there are three commonly used approaches: 1) reconstruction-based, 2) pretrained network-based and 3) one-class classification-based approaches. They are reviewed in Secs. 5.2.1, 5.2.2 and 5.2.3, respectively. Finally, work on non-image anomaly data is briefly discussed in Sec. 5.2.4.
5.2.1 Reconstruction-based Approach

Based on the fact that only normal samples are available during training, the reconstruction-based approach utilizes the extraordinary capability of neural networks to focus on the characteristics of normal samples. Early work mainly uses the autoencoder and its variants [13, 73, 99, 103]. Since they are trained with only normal training data, it is unlikely for them to reconstruct abnormal images in the testing stage. As a result, the pixel-wise difference between an input abnormal image and its reconstructed image can indicate the region of abnormality. However, this approach is problematic due to inaccurate reconstructions or poorly calibrated likelihoods. More recently, several methods exploit factors on top of the reconstruction loss, such as the incorporation of an attention map [123] or a student-teacher network to exploit intrinsic uncertainty in reconstruction [12]. A similar idea is to adopt an inpainting model as an alternative [68], [133]. It first removes part of an image and then reconstructs the missing part based on the visible part. The difference between the removed region and its inpainted result indicates the abnormality level of that region. Although these methods show promising results, the resolution of their anomaly maps is somewhat coarse due to the heavy computational burden.
5.2.2 Pretrained Network-based Approach

The performance of some computer vision algorithms can be improved by transfer learning using discriminative embeddings from pretrained networks. Along this line of thought, some models combine image features obtained by pretrained networks with anomaly detection algorithms. For example, the nearest-neighbor algorithm is used in [30] to examine whether an image patch of a test image is similar to any known normal image patches in the training set. A Gaussian distribution is adopted in [94] to fit the distribution of local features extracted from a pretrained network.
Figure 5.2: An overview of the proposed PEDENet. Images are first divided into patches, which are then fed into the Patch Embedding (PE) network to compute their patch embeddings, while the Density Estimation (DE) network guides an implicit Gaussian Mixture Model (GMM)-inspired clustering in the embedding space. After training, normal patches are clustered, and outliers can be treated as abnormal patches at inference time. Three anomaly localization results from the Hazelnut class are shown as examples.

However, since the pretrained networks are not specially optimized for the image anomaly detection task, the resulting model usually has a large model size.
5.2.3 One-class Classification-based Approach

Another natural idea is to adopt the Support Vector Data Description (SVDD) classifier. SVDD is a classic one-class classification algorithm derived from the Support Vector Machine (SVM). It maps all normal training data into a kernel space and seeks the smallest hypersphere that encloses the data in that space. Anomalies are expected to be located outside the learned hypersphere. Ruff et al. [100] first incorporated this idea in a deep neural network for non-image data anomaly detection and then extended it to an unsupervised setting [101]. They used a neural network to mimic the kernel function and trained it with the radius of the hypersphere. This modification allows the encoder to learn a data-dependent transformation, thus enhancing detection performance on high-dimensional and structured data. Later, Liznerski et al. [77] generalized it to image anomaly detection by applying an SVDD-inspired pseudo-Huber loss to the output matrix of a Fully Convolutional Network (FCN). It offers further improvements in a semi-supervised setting.
5.2.4 Non-image Data

For anomaly detection on non-image data, a Deep Autoencoding Gaussian Mixture Model (DAGMM) was proposed in [144]. It combines dimensionality reduction and density estimation for unsupervised anomaly detection. Our proposed PEDENet differs from DAGMM in three aspects. First, PEDENet is designed for image anomaly localization while DAGMM targets identifying an anomalous entity. Second, DAGMM employs an autoencoder (AE) to reconstruct the input data and concatenates latent features with reconstruction errors for density estimation, whereas PEDENet adopts a PE network to learn local features and then applies the DE network to patch embeddings. Third, DAGMM demonstrates its performance on non-image data whose dimension is relatively low; for example, the dimension of the latent space can be as low as one. In contrast, the dimension of patch embedding features is significantly higher in PEDENet.
5.3 Proposed Method

5.3.1 PEDENet
An overview of the PEDENet is shown in Fig. 5.2, where the Hazelnut class from the MVTec AD dataset is used as an illustrative input. PEDENet conducts class-specific training and testing. It contains a patch embedding (PE) network, a density estimation (DE) network, and an auxiliary network called the location prediction (LP) network. The PE network takes an image patch as its input and performs dimension reduction to get low-dimensional patch embeddings via a deep encoder structure. The DE network takes a patch embedding as the input and predicts its cluster membership under the framework of the Gaussian Mixture Model (GMM). These two networks are shown in Fig. 5.2. The LP network is a multi-layer perceptron (MLP). It takes a pair of patch embeddings as the input and predicts their relative location as depicted in Fig. 5.3.
PE Network. With an image patch as the input, the PE network outputs its low-dimensional embedding. Inspired by [130], we adopt a hierarchical encoder that embodies a larger encoder with several smaller encoders. That is, we divide an input patch, $\mathbf{p}$, into $2 \times 2$ sub-patches and feed each of the smaller sub-patches into a smaller encoder. The outputs of the smaller sub-patches are aggregated based on their original positions. Then, they are encoded by a larger encoder to get the embedding of patch $\mathbf{p}$. The above process can be summarized as

$$\mathbf{z} = \text{PEN}(\mathbf{p}; \theta_{PEN}), \qquad (5.1)$$

where PEN denotes the PE network, $\mathbf{z} \in \mathbb{R}^{Z}$ is the low-dimensional embedding output of $\mathbf{p}$, and $Z$ is the dimensionality of the embedding space. The PE network is implemented as a composite mapping consisting of smaller encoders, a larger encoder and the aggregation process with learnable parameters $\theta_{PEN}$, as shown on the right-hand side of Eq. (5.1).
DE Network. Given the low-dimensional embedding of an input patch, the DE network predicts its cluster membership values with respect to multiple Gaussian modalities under the Gaussian Mixture Model (GMM). A GMM assumes data samples are generated from a finite number of Gaussian distributions with unknown parameters. The Expectation-Maximization (EM) algorithm can be used to optimize those parameters iteratively. In the EM algorithm, the Expectation step computes the posterior probability (i.e., the so-called cluster membership) of each sample while the Maximization step updates the GMM parameters, including the mean vector, the covariance matrix and the mixture coefficient of each Gaussian component.

Note that GMM may suffer from a sub-optimal solution. Furthermore, it is challenging to combine it with a neural network to achieve end-to-end learning. By following the idea in DAGMM [144], we propose the DE network to estimate the cluster membership for each patch embedding, which corresponds to the Expectation step of the EM algorithm:

$$\hat{\gamma} = \text{softmax}(\text{DEN}(\mathbf{z}; \theta_{DEN})), \qquad (5.2)$$

where DEN indicates the DE network, $\mathbf{z}$ is the low-dimensional embedding generated by the PE network, and $\theta_{DEN}$ denotes the parameters of the DE network. After the softmax normalization, $\hat{\gamma}$ is a vector in $\mathbb{R}^{K}$, where $K$ is a hyper-parameter representing the number of Gaussian components in the GMM.
After getting the membership prediction $\hat{\gamma}$, we update the GMM parameters. This corresponds to the Maximization step of the EM algorithm. Here, we use $\phi_k$, $\mu_k$ and $\Sigma_k$ to represent the mixture coefficient, the mean vector and the covariance matrix of the $k$-th component, $1 \le k \le K$. For a batch of $N$ patch embeddings, we have

$$\phi_k = \frac{1}{N}\sum_{i=1}^{N} \hat{\gamma}_{ik} \in \mathbb{R}, \qquad (5.3)$$

$$\mu_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik}\, \mathbf{z}_i}{\sum_{i=1}^{N} \hat{\gamma}_{ik}} \in \mathbb{R}^{Z}, \qquad (5.4)$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \hat{\gamma}_{ik}\, (\mathbf{z}_i - \mu_k)(\mathbf{z}_i - \mu_k)^{T}}{\sum_{i=1}^{N} \hat{\gamma}_{ik}} \in \mathbb{R}^{Z \times Z}. \qquad (5.5)$$
Then, we can express the probability of a patch embedding $\mathbf{z}_i$ in the form of

$$P(\mathbf{z}_i) = \sum_{k=1}^{K} \phi_k \frac{\exp\left(-\frac{1}{2}(\mathbf{z}_i - \mu_k)^{T} \Sigma_k^{-1} (\mathbf{z}_i - \mu_k)\right)}{\sqrt{|2\pi\Sigma_k|}} \in [0, 1], \qquad (5.6)$$

where $|\cdot|$ denotes the determinant of a matrix. If a patch embedding $\mathbf{z}_i$ has a large probability, the corresponding patch is likely to be a normal patch, and vice versa.
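To make the membership-to-probability pipeline concrete, the following is a minimal NumPy sketch of Eqs. (5.2)-(5.6). The random logits stand in for the DE network output, and all sizes and values are illustrative assumptions, not the ones used in PEDENet.

```python
# A minimal NumPy sketch of Eqs. (5.3)-(5.6): given soft memberships gamma
# (softmax over random logits, standing in for the DE network output),
# update the GMM parameters and evaluate P(z_i) for each embedding.
import numpy as np

N, Z, K = 128, 16, 5                       # batch size, embedding dim, #components
rng = np.random.default_rng(0)
z = rng.normal(size=(N, Z))                # toy patch embeddings from the PE network
logits = rng.normal(size=(N, K))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax, Eq. (5.2)

phi = gamma.sum(axis=0) / N                                  # Eq. (5.3), shape (K,)
mu = (gamma.T @ z) / gamma.sum(axis=0)[:, None]              # Eq. (5.4), shape (K, Z)

prob = np.zeros(N)
for k in range(K):
    d = z - mu[k]                                            # (N, Z)
    cov = (gamma[:, k, None, None] * np.einsum('ni,nj->nij', d, d)).sum(0) \
          / gamma[:, k].sum() + 1e-6 * np.eye(Z)             # Eq. (5.5), regularized
    inv, det = np.linalg.inv(cov), np.linalg.det(2 * np.pi * cov)
    maha = np.einsum('ni,ij,nj->n', d, inv, d)
    prob += phi[k] * np.exp(-0.5 * maha) / np.sqrt(det)      # Eq. (5.6)
print(prob[:5])
```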
LP Network. Inspired by Patch-SVDD [130], we employ the LP network as an auxiliary network that predicts the relative location of two neighboring patches in a self-supervised manner. For an arbitrary input patch, $\mathbf{p}$, we sample another patch $\mathbf{p}'$ from one of its eight neighbors as shown in Fig. 5.3. We feed patches $\mathbf{p}$ and $\mathbf{p}'$ into the PE network to get their patch embeddings, $\mathbf{z}$ and $\mathbf{z}'$, respectively. The relative location of patch $\mathbf{p}'$ against patch $\mathbf{p}$ is encoded into a one-hot vector $\mathbf{l} \in \mathbb{R}^{8}$, which is used as the label in the training:

$$\hat{\mathbf{l}} = \text{LPN}(\mathbf{z} - \mathbf{z}'; \theta_{LPN}), \qquad (5.7)$$

where LPN denotes the LP network and $\theta_{LPN}$ denotes its network parameters.
5.3.2 Loss Function

We propose the following loss function to train the PE, DE and LP networks jointly:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{DEN} + \lambda_2 \mathcal{L}_{LPN} + \lambda_3 \mathcal{L}_{reg}, \qquad (5.8)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are user-adjustable parameters that assign a different weight to each loss term. The three terms on the right-hand side of Eq. (5.8) are elaborated below.
The first loss term is needed for the DE network. The total probability $P(\mathbf{z}_i)$ models the likelihood of observing $\mathbf{z}_i$ as a normal patch. Maximizing the average total probability of input patches corresponds to minimizing

$$\mathcal{L}_{DEN} = -\frac{1}{N} \log \sum_{i=1}^{N} P(\mathbf{z}_i). \qquad (5.9)$$

With this term, the DE network is trained to describe the distribution of patch embeddings with an implicit GMM while the PE network is optimized simultaneously. The second loss term is the cross-entropy loss between $\mathbf{l}$ and $\hat{\mathbf{l}}$ for the LP network:

$$\mathcal{L}_{LPN} = -\sum_{i=1}^{8} l_i \log(\hat{l}_i). \qquad (5.10)$$
Finally, we add a regularization term to prevent singularity in the GMM, which occurs when the determinant of any $\Sigma_k$ degenerates to zero. The regularization loss term penalizes small values of the diagonal elements:

$$\mathcal{L}_{reg} = \sum_{k=1}^{K} \sum_{z=1}^{Z} \frac{1}{(\Sigma_k)_{zz}}. \qquad (5.11)$$
End-to-end Training. With the total loss function given in Eq. (5.8), all three networks can be jointly optimized using the back-propagation algorithm. The DE network and the LP network are two parallel branches concatenated to the PE network. That is, the output of the PE network, i.e., the patch embedding, serves as the input to the DE and the LP networks. The DE network takes a patch embedding directly. The LP network takes the difference of embeddings of a pair of adjacent patches and uses their relative position as the training label. In the training, the parameters of all three networks are updated simultaneously so as to achieve end-to-end training for the whole PEDENet.

Figure 5.3: Overview of the location prediction (LP) network.
5.3.3 Anomaly Localization

We obtain low-dimensional patch embeddings from the PE network after training, which can be used to localize anomalies. Low-dimensional patch embeddings can easily be integrated with various well-defined anomaly detection models such as the one-class SVM [26] and SVDD [116]. To better illustrate the effectiveness of the learned low-dimensional patch embeddings, we adopt nearest-neighbor retrieval in the embedding space to localize anomalous pixels, as also used in previous works [30, 130].

For every patch $\mathbf{p}$ with stride $S$ in a test image $X$, we use its $L_2$ distance to the nearest normal patch in the embedding space to define its anomaly score:

$$S(\mathbf{p}) = \min_{\mathbf{p}_{normal}} L_2(\mathbf{p} - \mathbf{p}_{normal}). \qquad (5.12)$$

The pixel-wise anomaly score is calculated by averaging the anomaly scores of all patches to which a pixel belongs. An approximate algorithm is adopted to mitigate the computational cost of the nearest neighbor search. The maximum anomaly score of pixels in an image is set to the image-level anomaly score.
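The scoring procedure can be summarized in a short sketch. The following NumPy code illustrates Eq. (5.12) together with the pixel-averaging and image-level max rules; the embeddings, patch size and stride are toy assumptions, and a real implementation would use an approximate nearest-neighbor index.

```python
# A minimal sketch of nearest-neighbor scoring (Eq. (5.12)) with pixel-level
# averaging and the max rule for the image-level score. All data are toy.
import numpy as np

rng = np.random.default_rng(0)
Z, P, S, H = 16, 64, 32, 256                     # embed dim, patch, stride, image size
normal_bank = rng.normal(size=(2000, Z))         # embeddings of normal patches
grid = (H - P) // S + 1                          # patches per side

score_sum = np.zeros((H, H))
count = np.zeros((H, H))
for i in range(grid):
    for j in range(grid):
        z = rng.normal(size=Z)                   # embedding of test patch at (i, j)
        s = np.min(np.linalg.norm(normal_bank - z, axis=1))   # Eq. (5.12)
        score_sum[i*S:i*S+P, j*S:j*S+P] += s     # accumulate over covering patches
        count[i*S:i*S+P, j*S:j*S+P] += 1

anomaly_map = score_sum / np.maximum(count, 1)   # pixel-wise average
image_score = anomaly_map.max()                  # image-level score
print(anomaly_map.shape, image_score)
```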
Table 5.1: Comparison of image anomaly localization performance, where the evaluation metric is pixel-wise AUC-ROC.

Class | AE L2 | AE SSIM | AnoGAN | VAE | SPADE | Patch-SVDD | FCDD | PEDENet (ours)
Bottle | 0.86 | 0.93 | 0.86 | 0.831 | 0.984 | 0.981 | 0.80 | 0.984
Cable | 0.86 | 0.82 | 0.78 | 0.831 | 0.972 | 0.968 | 0.80 | 0.971
Capsule | 0.88 | 0.94 | 0.84 | 0.817 | 0.990 | 0.958 | 0.88 | 0.943
Hazelnut | 0.95 | 0.97 | 0.87 | 0.877 | 0.991 | 0.975 | 0.96 | 0.970
Metal Nut | 0.86 | 0.89 | 0.76 | 0.787 | 0.981 | 0.980 | 0.88 | 0.973
Pill | 0.85 | 0.91 | 0.87 | 0.813 | 0.965 | 0.951 | 0.86 | 0.960
Screw | 0.96 | 0.96 | 0.80 | 0.753 | 0.989 | 0.957 | 0.87 | 0.972
Toothbrush | 0.93 | 0.92 | 0.90 | 0.919 | 0.979 | 0.981 | 0.90 | 0.979
Transistor | 0.86 | 0.90 | 0.80 | 0.754 | 0.941 | 0.970 | 0.80 | 0.982
Zipper | 0.77 | 0.88 | 0.78 | 0.716 | 0.965 | 0.951 | 0.81 | 0.962
All 10 Object Classes | 0.88 | 0.91 | 0.83 | 0.810 | 0.976 | 0.967 | 0.86 | 0.970
Carpet | 0.59 | 0.87 | 0.54 | 0.597 | 0.975 | 0.926 | 0.93 | 0.922
Grid | 0.90 | 0.94 | 0.58 | 0.612 | 0.937 | 0.962 | 0.87 | 0.959
Leather | 0.75 | 0.78 | 0.64 | 0.671 | 0.976 | 0.974 | 0.98 | 0.976
Tile | 0.51 | 0.59 | 0.50 | 0.513 | 0.874 | 0.914 | 0.92 | 0.926
Wood | 0.73 | 0.73 | 0.62 | 0.666 | 0.885 | 0.908 | 0.89 | 0.900
All 5 Texture Classes | 0.70 | 0.78 | 0.29 | 0.612 | 0.929 | 0.937 | 0.92 | 0.936
Average of 15 Classes | 0.82 | 0.87 | 0.74 | 0.744 | 0.965 | 0.957 | 0.88 | 0.959
Table 5.2: Image anomaly detection performance, where the evaluation metric is the image-level AUC-ROC.

Method | GANomaly | ITAE | Patch-SVDD | SPADE | MahalanobisAD | PEDENet (ours)
Image-Level AUC-ROC | 0.762 | 0.839 | 0.921 | 0.855 | 0.958 | 0.928
5.4 Experiments

5.4.1 Experimental Setup

To verify the effectiveness of the proposed PEDENet, we conducted experiments on the MVTec AD dataset [11], which is a comprehensive anomaly localization dataset collected from real-world scenarios. It consists of images belonging to 15 classes, including 10 object classes and 5 texture classes. For each class, there are 60-391 training images and 40-167 test images, where image sizes vary from 700×700 to 1024×1024. The training set contains only normal images while the test set contains both normal and anomalous images. Examples of normal and anomalous images are shown in Fig. 5.1. We train and evaluate anomaly localization algorithms for each class separately, which is known as class-specific evaluation. To train a model, all training images are first resized to 256×256 and, then, patches of size 64×64 are randomly sampled from these resized images.

Figure 5.4: Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 object classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.
Our proposed solution consists of three networks: 1) the PE network, 2) the DE network, and 3) the LP network. The PE network consists of eight convolutional layers and one output layer. The filter size in all convolutional layers is the same, i.e., 3×3. The filter numbers for the convolutional layers are 32, 64, 128, 128, 64, 64, 32, 32, 64, and LeakyReLU [54] with slope 0.1 is used as the activation function. The tanh activation is used in the output layer to normalize the output to the range of [-1.0, 1.0]. Both the DE and LP networks are multilayer perceptrons (MLPs) with the LeakyReLU activation of slope 0.1. The DE network has three hidden layers of 128, 64 and 32 neurons, respectively. The LP network has two hidden layers of 128 neurons per layer. The input to the LP network is the subtraction of features from two neighboring patches.

Figure 5.5: Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 object classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.
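For concreteness, the following PyTorch sketch stacks the stated layer widths into a plain (non-hierarchical) encoder. The strides are our own assumption, chosen so that a 64×64 patch maps to a single 64-dimensional embedding; the actual PEDENet uses the hierarchical encoder described in Sec. 5.3.1.

```python
# A sketch of a plain encoder with the stated layer widths: eight 3x3
# convolutions with LeakyReLU(0.1) and a tanh output layer producing a
# 64-dimensional embedding in [-1, 1]. Strides are assumptions so that a
# 64x64 input patch reduces to a 1x1x64 embedding.
import torch
import torch.nn as nn

widths = [32, 64, 128, 128, 64, 64, 32, 32]
layers, in_ch = [], 3
for w in widths:
    layers += [nn.Conv2d(in_ch, w, 3, stride=2, padding=1), nn.LeakyReLU(0.1)]
    in_ch = w
layers += [nn.Conv2d(in_ch, 64, 1), nn.Tanh()]    # output layer -> [-1.0, 1.0]
pe_net = nn.Sequential(*layers)

z = pe_net(torch.randn(8, 3, 64, 64))             # a batch of 64x64 patches
print(z.shape)                                    # torch.Size([8, 64, 1, 1])
```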
We train all networks using the Adam optimizer with learning rate 0.0001. The batch sizes are 128 image patches for the DE network and 36 pairs of adjacent patches for the LP network. All experiments are conducted on a machine equipped with an Intel i7-5930K CPU and an NVIDIA GeForce Titan X GPU.
5.4.2 Performance Evaluation

Anomaly Localization. We compare the image anomaly localization performance of several methods on the MVTec AD dataset [11] in Table 5.1, where the evaluation metric is the pixel-wise area under the receiver operating characteristic curve (AUC-ROC). The benchmarking methods are listed below.
Figure 5.6: Visualization of anomalous images, labeled ground truths and localization results of the proposed PEDENet for 5 texture classes in the MVTec AD dataset, where the red color is used to indicate detected anomaly regions.
• Reconstruction approach: AE L2 [11], AE SSIM [13], Variational Autoencoder (VAE) and AnoGAN [106].
• Pretrained network-based approach: SPADE [30].
• One-class classification approach: Patch-SVDD [130] and FCDD [77].
Our proposed PEDENet achieves 97.0% for the average of the 10 object classes (the 2nd best), 93.6% for the average of the 5 texture classes (the 2nd best), and 95.9% for the average of all 15 classes (the 2nd best). Note that there is no single method that performs the best in all cases. PEDENet is second to SPADE in the average of the 10 object classes while it is second to Patch-SVDD in the average of the 5 texture classes. The performance differences among SPADE, Patch-SVDD and PEDENet are actually quite small. Thus, it is fair to say that PEDENet is one of the state-of-the-art methods for image anomaly localization.
Anomaly Detection. We compare the image anomaly detection performance in Table 5.2, where the evaluation metric is the image-level area under the receiver operating characteristic curve (AUC-ROC). The benchmarking methods include GANomaly [2], ITAE [39], Patch-SVDD [130], SPADE [30] and MahalanobisAD [94]. Our proposed PEDENet reaches 92.8%. It outperforms both SPADE and Patch-SVDD and is only second to MahalanobisAD. Thus, PEDENet also offers state-of-the-art image anomaly detection performance.
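Both metrics can be computed with standard tooling. The following sketch uses scikit-learn's roc_auc_score on toy anomaly maps; the max-pixel rule from Sec. 5.3.3 converts pixel-level maps into image-level scores.

```python
# A minimal sketch of the evaluation metrics: pixel-wise AUC-ROC over
# flattened anomaly maps and image-level AUC-ROC over per-image scores.
# Ground truths and predictions below are toy data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
gt_masks = np.zeros((20, 64, 64), dtype=int)
gt_masks[10:, 20:30, 20:30] = 1                          # last 10 images have a defect
maps = gt_masks + 0.5 * rng.normal(size=gt_masks.shape)  # toy anomaly maps

pixel_auc = roc_auc_score(gt_masks.ravel(), maps.ravel())
image_labels = gt_masks.reshape(20, -1).max(axis=1)      # anomalous if any pixel is
image_scores = maps.reshape(20, -1).max(axis=1)          # max-pixel rule, Sec. 5.3.3
image_auc = roc_auc_score(image_labels, image_scores)
print(round(pixel_auc, 3), round(image_auc, 3))
```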
Visualization of Localized Anomalies. Anomalous images, labeled ground truths and localization results of the proposed PEDENet for the 10 object classes and the 5 texture classes in the MVTec AD dataset are visualized in Figs. 5.4 and 5.5 (object classes) and Fig. 5.6 (texture classes). Anomalous regions are highlighted in red. A region is more likely to be anomalous if it has a stronger red color. As shown in those figures, anomalous regions can be accurately detected and localized by the proposed PEDENet. Note that relatively small and inconspicuous defects can still be spotted, such as in the Capsule, Hazelnut and Pill examples in Figs. 5.4 and 5.5.
Model Size. Most state-of-the-art methods that give the best performance take the pretrained network approach, such as SPADE [30] and MahalanobisAD [94]. As a result, their model sizes are usually quite large. We compare the model sizes of these three methods in Table 5.3 in terms of the number of model parameters. The numbers of model parameters of SPADE and MahalanobisAD are 145× and 36× that of PEDENet, respectively. A smaller model size means a lower memory requirement, which is critical in real-world deployment. Besides being a lightweight network, PEDENet is trained using the training data in the MVTec AD dataset only. In contrast, the two benchmarking methods leverage giant models that are pretrained with millions of labeled images.
Table 5.3: Model size comparison.

Methods | Pretrained Model | # of Parameters
SPADE | WideResNet-50 | 68M
MahalanobisAD | EfficientNet-B4 | 17M
PEDENet (ours) | None | 0.47M
Table 5.4: Ablation study.

$\mathcal{L}_{DEN}$ | $\mathcal{L}_{LPN}$ | AUC-ROC
✓ | | 0.810
| ✓ | 0.894
✓ | ✓ | 0.959
Ablation Study. We conduct an ablation study to understand the impact of each loss term in Eq. (5.8) on the performance of the proposed PEDENet. That is, we remove either $\mathcal{L}_{DEN}$ or $\mathcal{L}_{LPN}$ and train the model under the same conditions. We see from Table 5.4 that the adoption of both $\mathcal{L}_{DEN}$ and $\mathcal{L}_{LPN}$ as shown in Eq. (5.8) improves the anomaly localization performance.
Challenge of Texture Classes. It is worthwhile to point out that almost all methods have an obvious performance gap between the object classes and the texture classes. As shown in Table 5.1, the performance gap varies from 3% [130] to almost 20% [13]. As mentioned in [130], the optimal hyper-parameters for objects and textures are likely to be different, and there is a trade-off to get the best average performance. Actually, texture and regular images are treated differently in image processing. There are many unique methods developed specifically for textures [17, 134-136]. Unlike regular images, there are strong self-similarity and quasi-periodicity in textures, which could be exploited to achieve further performance improvement.
5.5 Conclusion and Future Work

A new neural network model, called the PEDENet, was proposed for image anomaly detection and localization. It is a lightweight model and offers state-of-the-art anomaly detection and localization performance. There are some future research topics. First, the MVTec AD dataset is still too small. It is valuable to collect more samples and build a larger dataset. Second, it is desirable to find a more effective solution for texture anomaly detection and localization. Third, there are many hyper-parameters in the proposed PEDENet, which are finetuned by trial and error. A more systematic approach is needed. Finally, it is interesting to find an alternative image anomaly detection and localization solution that is effective and interpretable.
Chapter 6

AnomalyHop: An SSL-based Image Anomaly Localization Method
6.1 Introduction

Image anomaly localization is a technique that identifies the anomalous regions of input images at the pixel level. It finds real-world applications such as manufacturing process monitoring [107], medical image diagnosis [105, 106] and video surveillance analysis [83, 104]. It is often assumed that only normal (i.e., anomaly-free) images are available in the training stage since anomalous samples are too few to be modeled effectively and are rare and/or expensive to collect.
There is a growing interest in image anomaly localization due to the availability of a new dataset called the MVTec AD [11] (see Fig. 6.1). State-of-the-art image anomaly localization methods adopt deep learning. Many of them employ complicated pretrained neural networks to achieve high performance, yet without a good understanding of the basic problem. To get marginal performance improvements, finetuning and other minor modifications are made on a trial-and-error basis. Related work will be reviewed in Sec. 6.2.
A new image anomaly localization method, called AnomalyHop, based on the successive subspace learning (SSL) framework is proposed in this work. This is the first work that applies SSL to the anomaly localization problem. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distribution modeling via Gaussian models, and 3) anomaly map generation and fusion. They will be elaborated in Sec. 6.3. As compared with deep-learning-based image anomaly localization methods, AnomalyHop is mathematically transparent, easy to train and fast in its inference speed. Besides, as reported in Sec. 6.4, its area under the ROC curve (AUC-ROC) performance on the MVTec AD dataset is 95.9%, which is state-of-the-art. Finally, concluding remarks and possible future extensions will be given in Sec. 6.5.

Figure 6.1: Image anomaly localization examples taken from the MVTec AD dataset (from left to right): normal images, anomalous images, the ground truth and the predicted anomalous region by AnomalyHop, where the red region indicates the detected anomalous region.
Figure 6.2: The system diagram of the proposed AnomalyHop method.
6.2 Related Work

If the number of images in an image anomaly training set is limited, learning normal image features in local regions is challenging. We classify image anomaly localization methods into two major categories based on whether a method relies on external training data (say, the ImageNet) or not.
With External Training Data. Methods in the first category rely on pretrained deep learning models by leveraging external data. Examples include PaDiM [34], SPADE [30], DFR [129] and CNN-FD [83]. They employ a pretrained deep neural network (DNN) to extract local image features and then use various models to fit the distribution of features in normal regions. Although some offer impressive performance, they do rely on large pretrained networks such as the ResNet [53] and the Wide-ResNet [132]. Since these pretrained DNNs are not optimized for the image anomaly detection task, the associated image anomaly localization methods usually have large model sizes, high computational complexity and memory requirements.
Without External Training Data. Methods in the second category exploit neither pretrained DNNs nor external training data. They learn image local features based on normal images in the training set. For example, Bergmann et al. developed the MVTec AD dataset in [11] and used an autoencoder-like network to learn the representation of normal images. The network can reconstruct anomaly-free regions with high fidelity but not anomalous regions. As a result, the pixel-wise difference between the input abnormal image and its reconstructed image reveals the region of abnormality. A similar idea was developed using the image inpainting technique [68, 133]. Traditional machine learning models such as support vector data description (SVDD) [116] can also be integrated with neural networks, where novel loss terms are derived to learn local image features from scratch [77, 130]. Generally speaking, methods without external training data either fail to provide satisfactory performance or suffer from a slow inference speed [130]. This is attributed to the diversified contents of normal images. For example, the 10 object classes and the 5 texture classes in the MVTec AD dataset are quite different. The capability of these methods in representing features of local regions of different images is somewhat limited. On the other hand, over-parameterized DNN models pretrained with external data may overfit some datasets but may not be generalizable to other unseen contents such as new texture patterns. It is desired to find an effective and mathematically transparent learning method to address this challenging problem.
SSL and Its Applications. SSL is an emerging machine learning technique developed by Kuo et al. in recent years [23, 24, 60, 63, 96]. It has been applied to quite a few applications with impressive performance. Examples include image classification [23, 24], image enhancement [7], image compression [118], deepfake image/video detection [19], point cloud classification, segmentation and registration [58, 137-139], face biometrics [97, 98], texture analysis and synthesis [67, 134], 3D medical image analysis [74], etc.
6.3 AnomalyHop Method

AnomalyHop belongs to the second category of image anomaly localization methods. Its system diagram is illustrated in Fig. 6.2. It contains three modules: 1) feature extraction, 2) modeling of normality feature distributions, and 3) anomaly map generation. They are elaborated below.
6.3.1 SSL-based Feature Extraction

Deep-learning methods learn image features indirectly. Given a network architecture, the network learns the filter parameters first by minimizing a cost function end-to-end. Then, the network can be used to generate filter responses, and patch features are extracted as the filter responses at a certain layer. In contrast, the SSL framework extracts features of image patches directly using a data-driven approach. The basic idea is to study pixel correlations in a neighborhood (say, a patch) and use the principal component analysis (PCA) to define an orthogonal transform, also known as the Karhunen-Loève transform. However, a single-stage PCA transform is not sufficient to obtain powerful features. A sequence of modifications has been proposed in [23, 24, 60, 63] to make the SSL framework complete.

The first modification is to build a sequence of PCA transforms in cascade with max pooling inserted between two consecutive stages. The output of the previous stage serves as the input to the current stage. The cascaded transforms are used to capture short-, mid- and long-range correlations of pixels in an image. Since the neighborhood of a graph is called a hop (e.g., 1-hop neighbors, 2-hop neighbors, etc.), each transform stage is called a hop [23]. However, a straightforward cascade of multi-hop PCAs does not work properly due to the sign confusion problem, which was first pointed out in [60]. The second modification is to replace the linear PCA with an affine transform that adds a constant bias vector to the PCA response vector [63]. The bias vector is added to ensure that all input elements to the next hop are positive to avoid sign confusion. This modified transform is called the Saab (Subspace approximation with adjusted bias) transform. The input and the output of the Saab transform are 3D tensors (including 2D spatial components and 1D spectral components). By recognizing that the 1D spectral components are uncorrelated, the third modification was proposed in [24] to replace one 3D tensor input with multiple 2D tensor inputs. This is named the channel-wise Saab (c/w Saab) transform. The c/w Saab transform greatly reduces the model size of the standard Saab transform.
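The cascade idea can be sketched compactly. The code below is a simplified stand-in for the c/w Saab transform: plain PCA filters per hop, a constant bias to keep the next hop's inputs non-negative, and 2×2 max pooling. It omits the channel-wise decomposition and other Saab details, so it illustrates the structure rather than the exact transform.

```python
# A minimal sketch of the cascaded-PCA idea behind the Saab transform:
# each hop learns PCA filters over local neighborhoods, adds a bias so the
# next hop's inputs are non-negative, and max-pools between hops.
import numpy as np
from sklearn.decomposition import PCA

def hop(x, n_filters, win=3):
    H, W, C = x.shape
    # collect win x win neighborhoods as flattened vectors
    patches = np.stack([x[i:i + win, j:j + win].ravel()
                        for i in range(H - win + 1)
                        for j in range(W - win + 1)])
    pca = PCA(n_components=n_filters).fit(patches)
    resp = pca.transform(patches).reshape(H - win + 1, W - win + 1, n_filters)
    resp += -resp.min() + 1e-8          # constant bias -> non-negative responses
    h, w = resp.shape[0] // 2, resp.shape[1] // 2
    return resp[:2 * h, :2 * w].reshape(h, 2, w, 2, -1).max(axis=(1, 3))  # 2x2 pool

rng = np.random.default_rng(0)
img = rng.random((64, 64, 1))
feat = hop(hop(img, n_filters=4), n_filters=8)   # a two-hop cascade
print(feat.shape)
```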
6.3.2 Modeling of Normality Feature Distributions

We propose three Gaussian models to describe the distributions of features of normal images, which are extracted as described in Sec. 6.3.1.
6.3.2.1 Location-aware Gaussian Model

If the input images of an image class are well aligned in the spatial domain, we expect that features at the same location are close to each other. We use $X_{ij}^{n}$ to denote the feature vector extracted from a patch centered at location $(i, j)$ of a certain hop in the $n$-th training image. By following [34], we model the feature vectors of patches centered at the same location $(i, j)$ by a multivariate Gaussian distribution, $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$. Its sample mean is $\mu_{ij} = N^{-1} \sum_{n=1}^{N} X_{ij}^{n}$ and its sample covariance matrix is

$$\Sigma_{ij} = (N-1)^{-1} \sum_{n=1}^{N} (X_{ij}^{n} - \mu_{ij})(X_{ij}^{n} - \mu_{ij})^{T} + \epsilon I,$$

where $N$ is the number of training images of an image class and $\epsilon$ is a small positive number. The term $\epsilon I$ is added to ensure that the sample covariance matrix is positive semi-definite.
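A minimal NumPy sketch of this per-location fit is given below; the feature maps are toy data, and the epsilon value is an illustrative assumption.

```python
# A minimal sketch of the location-aware Gaussian model: fit a mean and a
# covariance per spatial location (i, j) over N training images, with a
# small epsilon*I added for numerical stability. All data are toy.
import numpy as np

rng = np.random.default_rng(0)
N, H, W, D = 50, 28, 28, 8                  # images, map height/width, feature dim
X = rng.normal(size=(N, H, W, D))           # X[n, i, j] = feature at (i, j), image n

mu = X.mean(axis=0)                         # sample means, (H, W, D)
diff = X - mu
cov = np.einsum('nijd,nije->ijde', diff, diff) / (N - 1) \
      + 1e-3 * np.eye(D)                    # sample covariances, (H, W, D, D)
print(mu.shape, cov.shape)
```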
6.3.2.2 Location-Independent Gaussian Model

Images of the same texture class have strong self-similarity. Besides, they are often shift-invariant. These properties can be exploited for texture-related tasks [134, 135]. For homogeneous fine-granular textures, we can use a single Gaussian model for all local image features at each hop and call it the location-independent Gaussian model. The model has mean $\mu = (NHW)^{-1} \sum_{i,j,n} X_{ij}^{n}$ and covariance matrix

$$\Sigma = (NHW - 1)^{-1} \sum_{i,j,n} (X_{ij}^{n} - \mu)(X_{ij}^{n} - \mu)^{T} + \epsilon I,$$

where $N$ is the number of training images in one texture class, and $H$ and $W$ are the pixel numbers along the height and the width of the texture images.
6.3.2.3 Self-reference Gaussian Model

Both the location-aware and location-independent Gaussian models utilize all training images to capture the normality feature distributions. However, images of the same class may have intra-class variations which cannot be well captured by these two models. One example is the grid class in the MVTec AD dataset. Different images may have different grid orientations and lighting conditions. To address this problem, we train a Gaussian model with the distribution of features from a single normal image and call it the self-reference Gaussian model. Again, we compute the sample mean as $\mu = (HW)^{-1} \sum_{i,j} X_{ij}$ and the sample covariance matrix as

$$\Sigma = (HW - 1)^{-1} \sum_{i,j} (X_{ij} - \mu)(X_{ij} - \mu)^{T} + \epsilon I.$$

For this setting, we only use normal images in the training set to determine the c/w Saab transform filters. The self-reference Gaussian model is learned from the test image at testing time. For more discussion, we refer to Sec. 6.4.
6.3.3 Anomaly Map Generation and Fusion

With the learned Gaussian models, we use the Mahalanobis distance,

$$M(X_{ij}) = \sqrt{(X_{ij} - \mu_{ij})^{T} \Sigma_{ij}^{-1} (X_{ij} - \mu_{ij})},$$

as the anomaly score to indicate the anomalous level of the corresponding patch. Higher scores indicate a higher likelihood of being anomalous. By calculating the scores over all locations of a hop, we form an anomaly map at each hop for an input test image. Finally, we re-scale all anomaly maps to the same spatial size and fuse them to yield the final anomaly map.
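The map generation step reduces to a batched Mahalanobis computation. The sketch below scores one hop's feature map against fitted per-location Gaussians and fuses it with a coarser hop's map by nearest-neighbor upscaling and averaging; all arrays and the simple averaging rule shown are illustrative assumptions.

```python
# A minimal sketch of anomaly map generation: the Mahalanobis distance of
# each test feature to the per-location Gaussian, followed by rescaling and
# fusion of maps from two hops (here, simple averaging). All data are toy.
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 28, 28, 8
mu = rng.normal(size=(H, W, D))             # fitted per-location means
cov_inv = np.tile(np.eye(D), (H, W, 1, 1))  # fitted per-location inverse covariances
X = rng.normal(size=(H, W, D))              # test-image features at this hop

d = X - mu
maha = np.sqrt(np.einsum('ijd,ijde,ije->ij', d, cov_inv, d))  # anomaly map, (H, W)

coarse = rng.random((14, 14))                                 # a coarser hop's map
coarse_up = np.kron(coarse, np.ones((2, 2)))                  # nearest-neighbor upscale
final_map = (maha + coarse_up) / 2.0                          # fused anomaly map
print(final_map.shape)
```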
6.4 Experiments

Dataset and Evaluation Metric. We evaluate our model on the MVTec AD dataset [11]. It has 5,354 images from 15 classes, including 5 texture classes and 10 object classes, collected from real-world applications. The resolution of input images ranges from 700×700 to 1024×1024. The training set consists of normal images only while the test set contains both normal and abnormal images. The ground truth of anomaly regions is provided for evaluation purposes. The area under the receiver operating characteristic curve (AUC-ROC) [11, 35] is chosen as the performance evaluation metric.
Experimental Setup and Benchmarking Methods. First, we resize images of different resolutions to the same resolution of 224×224. Next, we apply the 5-stage PixelHop++ to all classes for feature extraction as shown in Fig. 6.2. The spatial size, $b$, and the number, $k$, of filters at each hop are searched in the ranges of $2 \le b \le 7$ and $2 \le k \le 5$, respectively. The 2×2 max-pooling is used between hops. The optimal hyper-parameters at each hop are class dependent. A representative case for the leather class is given in Table 6.1. The optimal hyper-parameters of all 15 classes can be found in our GitHub codes. We compare AnomalyHop against seven benchmarking methods. Four of them belong to the first category that leverages external datasets: PaDiM [34], SPADE [30], DFR [129] and CNN-FD [83]. Three of them belong to the second category that solely relies on images in the MVTec AD dataset: AnoGAN [106], VAE-grad [35] and Patch-SVDD [130].
Table 6.1: The hyper-parameters of the spatial sizes and numbers of filters at each hop for the leather class.

Hop Index | 1 | 2 | 3 | 4 | 5
b | 5 | 5 | 3 | 2 | 2
k | 4 | 4 | 4 | 4 | 4
Table 6.2: Performance comparison of image anomaly localization methods in terms of AUC-ROC scores for the MVTec AD dataset, where the best results in each category are marked in bold.

Class | PaDiM [34] | SPADE [30] | DFR [129] | CNN-FD [83] | AnoGAN [106] | VAE-grad [35] | Patch-SVDD [130] | AnomalyHop
Carpet | 0.991 | 0.975 | 0.970 | 0.720 | 0.540 | 0.735 | 0.926 | 0.942
Grid | 0.973 | 0.937 | 0.980 | 0.590 | 0.580 | 0.961 | 0.962 | 0.984¢
Leather | 0.992 | 0.976 | 0.980 | 0.870 | 0.640 | 0.925 | 0.974 | 0.991
Tile | 0.941 | 0.874 | 0.870 | 0.930 | 0.500 | 0.654 | 0.914 | 0.932
Wood | 0.949 | 0.885 | 0.930 | 0.910 | 0.620 | 0.838 | 0.908 | 0.903
Avg. of Texture Classes | 0.969 | 0.929 | 0.946 | 0.804 | 0.576 | 0.823 | 0.937 | 0.950
Bottle | 0.983 | 0.984 | 0.970 | 0.780 | 0.860 | 0.922 | 0.981 | 0.975
Cable | 0.967 | 0.972 | 0.920 | 0.790 | 0.780 | 0.910 | 0.968 | 0.904
Capsule | 0.985 | 0.990 | 0.990 | 0.840 | 0.840 | 0.917 | 0.958 | 0.965
Hazelnut | 0.982 | 0.991 | 0.990 | 0.720 | 0.870 | 0.976 | 0.975 | 0.971
Metal Nut | 0.972 | 0.981 | 0.930 | 0.820 | 0.760 | 0.907 | 0.980 | 0.956
Pill | 0.957 | 0.965 | 0.970 | 0.680 | 0.870 | 0.930 | 0.951 | 0.970
Screw | 0.985 | 0.989 | 0.990 | 0.870 | 0.800 | 0.945 | 0.957 | 0.960¢
Toothbrush | 0.988 | 0.979 | 0.990 | 0.770 | 0.900 | 0.985 | 0.981 | 0.982
Transistor | 0.975 | 0.941 | 0.800 | 0.660 | 0.800 | 0.919 | 0.970 | 0.981
Zipper | 0.985 | 0.965 | 0.960 | 0.760 | 0.780 | 0.869 | 0.951 | 0.966
Avg. of Object Classes | 0.978 | 0.976 | 0.951 | 0.769 | 0.826 | 0.928 | 0.967 | 0.963
Avg. of All Classes | 0.975 | 0.960 | 0.949 | 0.781 | 0.743 | 0.893 | 0.957 | 0.959

AUC-ROC Performance. We compare the AUC-ROC scores of AnomalyHop and the seven benchmarking methods in Table 6.2. As shown in the table, AnomalyHop performs the best among all methods with no external training data. Although Patch-SVDD has close performance, especially
for the object classes, its inference speed is significantly slower as shown in Table 6.3. The best performance in Table 6.2 is achieved by PaDiM [34], which takes the pretrained 50-layer WideResNet as the feature extractor backbone. Its superior performance largely depends on the generalizability of the pretrained network. In practical applications, we often encounter domain-specific images which may not be covered by external training data. In contrast, AnomalyHop exploits the statistical correlations of pixels in short-, mid- and long-range neighborhoods and obtains the c/w Saab filters based on PCA. It can be tailored to a specific application domain using a smaller number of normal images. Furthermore, the Wide-ResNet-50-2 model has more than 60M parameters while AnomalyHop has only 100K parameters in PixelHop++, which is used for image feature extraction.
Three Gaussian models are adopted by AnomalyHop to handle the 15 different classes in Table 6.2, and the corresponding visualization results can be found in Figs. 6.4, 6.5 and 6.6. Results obtained using the self-reference Gaussian model are marked with ¢ in Table 6.2. The object classes are well-aligned in the dataset, so the location-aware Gaussian model is more suitable for them. For texture classes (e.g., the carpet and wood classes), the location-independent Gaussian model is the most favorable since the texture classes are usually homogeneous across the whole image and the location information is less relevant. The grid class is a special one. On the one hand, a grid image is homogeneous across the whole image. On the other hand, different grid images have different rotations, lighting conditions and viewing angles as shown in Fig. 6.3. As a result, the self-reference Gaussian model offers the best result.
Figure 6.3: Two anomalous grid images (from left to right): input images, ground truth labels, predicted heat maps, and predicted and segmented anomaly regions.
Inference Speed. The inference speed is another important performance metric in real-world image anomaly localization applications. We compare the inference speed of AnomalyHop and the other three high-performance methods in Table 6.3, where all experiments are conducted with an Intel i7-5930K@3.5GHz CPU. We see that AnomalyHop has the fastest inference speed. It has speed-up factors of 4x, 22x and 28x with respect to PaDiM, Patch-SVDD and SPADE, respectively. SPADE and Patch-SVDD are significantly slower because of the expensive nearest neighbor search. For DNN-based methods, feature extraction can be accelerated using GPU hardware, which applies to AnomalyHop, too. On the other hand, image anomaly localization is often conducted by edge computing devices in manufacturing lines, where a GPU could be too expensive. Although training complexity is often ignored since training has to be done only once, it is worthwhile to mention that the training of AnomalyHop is very efficient. It takes only 2 minutes to train an AnomalyHop model for each class with the above-mentioned CPU.
Table 6.3: Average inference time (in sec.) per image with an Intel i7-5930K@3.5GHz CPU.

Methods | Inference Time | Speed-Up
SPADE [30] | 6.80 | 1×
Patch-SVDD [130] | 5.23 | 1.3×
PaDiM [34] | 0.91 | 7.5×
AnomalyHop | 0.24 | 28.3×
6.5 Conclusion and Future Work

An SSL-based image anomaly localization method, called AnomalyHop, was proposed in this work. It is interpretable, effective and fast in both inference and training. Besides, it offers state-of-the-art anomaly localization performance. AnomalyHop has great potential to be used in real-world environments due to its high performance as well as its low implementation cost.

Although the SSL-based feature extraction in AnomalyHop is powerful, its feature distribution modeling (module 2) and anomaly localization decision (module 3) are still primitive. These two modules can be further improved. For example, it is interesting to leverage effective one-class classification methods such as SVDD [116], subspace SVDD [113] and multimodal subspace SVDD [114]. This is a new topic under our current investigation.
Figure 6.4: Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 object classes in the MVTec AD dataset.

Figure 6.5: Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 object classes in the MVTec AD dataset.

Figure 6.6: Visualization of anomalous images, ground truths, predicted heat maps, predicted masks, and segmentation results of the proposed AnomalyHop for 5 texture classes in the MVTec AD dataset.
Chapter 7

Conclusions and Future Work

7.1 Summary of the Research

In this thesis, we focused on data-driven image analysis, modeling, synthesis and anomaly localization techniques, covering two major research topics. The first one is texture analysis, modeling and synthesis, and the second one is image anomaly detection and localization.

For texture analysis, modeling and synthesis, our research focuses on three problems: unsupervised texture segmentation, texture analysis and synthesis, and dynamic texture synthesis. For image anomaly detection and localization, we proposed two state-of-the-art methods, which utilize deep learning techniques and successive subspace learning techniques, respectively.
Unsupervised Texture Segmentation. An effective textural feature extraction method for unsupervised texture segmentation was presented. Features are learned from data in an unsupervised manner. They encode local features as well as contrast information. It was shown by extensive experimental results that the proposed method offers state-of-the-art performance.

Texture Analysis and Synthesis. An effective hierarchical spatial-spectral correlation (HSSC) method was proposed for texture analysis and classification. It applies a multi-stage Saak transform to input texture patches and then conducts correlation analysis on the Saak transform coefficients to obtain texture features of high discriminant power. Extensive experiments on texture classification with three benchmark datasets were conducted to demonstrate the effectiveness of the HSSC method. Both class-specific and class-independent transform kernels were examined.
Dynamic Texture Synthesis. Two effective techniques for dynamic texture synthesis were presented. Compared with the baseline model, the enhanced model can encode the coherence of local features as well as the correlation between a local feature and its neighbors, and it can also capture more complicated motion in the time domain. It was shown by extensive experimental results that the proposed method offers state-of-the-art performance.

Image Anomaly Detection and Localization: PEDENet. A new neural network model, called the PEDENet, was proposed for image anomaly detection and localization. It can be trained in an end-to-end manner with only normal images (i.e., unsupervised learning). PEDENet jointly learns local image features and conducts density estimation in the feature space. It is a lightweight model and offers state-of-the-art anomaly detection and localization performance.
Image Anomaly Detection and Localization: AnomalyHop. An SSL-based image anomaly localization method, called AnomalyHop, was proposed in this work. It is interpretable, effective and fast in both inference and training. Besides, it offers state-of-the-art anomaly localization performance. AnomalyHop has great potential to be used in real-world environments due to its high performance as well as its low implementation cost.
7.2 Future Research Directions

Based on our current results, we propose the following future research directions.
7.2.1 Texture Synthesis

There has been a significant improvement in the quality of example-based texture synthesis techniques by taking advantage of convolutional neural networks. However, this only works well when the semantically significant features in the image are at the correct scale for the network; in practice, the receptive field of a feature at an intermediate layer of common CNN architectures is relatively small. The popular CNN architectures, like the VGG used by [42] and others, are trained on 224×224 pixel images, in which the relevant features will be quite a bit smaller. These limitations of CNNs prevent current texture synthesis models from offering excellent results for high-resolution texture images that have complicated structures at multiple scales.
More specifically, given a high-resolution source image, for optimal results with both fine details and a correct global arrangement, a model must scale down that image until the pixel scale of the features of interest matches the receptive field of the appropriate semantic layer of the network. This limits the resolution of the rendered image and further breaks down for source images with textures at multiple scales. The model must choose to capture one scale of texture at the expense of the other.

On the one hand, rich and vivid details are critical for human perception of synthesized texture, and exploring the mechanisms behind such complicated phenomena will be extremely interesting. On the other hand, with the help of recent advances in neural-network-inspired image transforms, we set our final goal as updating the texture representation model to achieve high-resolution multi-scale texture synthesis.
7.2.2 Image Anomaly Detection and Localization

The first potential direction is to build a large-scale dataset for the image anomaly detection and localization task. The current popular benchmark dataset, the MVTec AD dataset, is still too small. It only contains a few hundred images for each class, and it only covers 15 classes of different images. However, image anomaly detection and localization is considered to be one of the most practical computer vision tasks, and it has numerous real-world applications. A small dataset is not able to provide enough diversity and easily leads to saturation and over-fitting. So, it is valuable to collect more samples and build a larger dataset.
The second potential direction is to solve the slow inference speed problem of the proposed PEDENet model. As discussed in Chapter 5, the Density Estimation network mimics a Gaussian Mixture Model and outputs the total probability of input patch embeddings. Here, we mainly use the output total probability as a novel loss term that helps training; however, the Density Estimation network could also serve as a direct and fast inference approach for anomaly localization. The predicted total probability could be a direct indicator of the anomalous level of an input patch. When the predicted total probability is lower than a pre-chosen threshold, the corresponding patch is likely to be an anomaly.

Compared with the nearest neighbor search, this approach shows obvious advantages in inference speed, since pixel-wise anomaly scores can be obtained by a single forward pass of the Patch Embedding network and the Density Estimation network. Also, it does not rely on any other successive model, which makes training more concise and easier.
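A minimal sketch of this direct-inference idea is given below; the probabilities and the threshold are toy values used only for illustration.

```python
# A minimal sketch of the proposed fast-inference idea: treat the total
# probability predicted by the DE network as a direct anomaly indicator and
# threshold it, avoiding any nearest-neighbor search. All values are toy.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random(100)           # P(z_i) for 100 test patches, one forward pass
threshold = 0.1                   # pre-chosen threshold (illustrative)
is_anomalous = probs < threshold  # low probability -> likely anomalous patch
print(is_anomalous.sum(), "patches flagged as anomalous")
```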
The third direction is to further explore successive subspace learning-based approaches on the basis of AnomalyHop. Although the SSL-based feature extraction in AnomalyHop is powerful, its feature distribution modeling and anomaly localization decision are still primitive. These two modules can be further improved. For example, it is interesting to leverage effective one-class classification methods such as SVDD [116] and subspace SVDD [113], which are expected to model the feature distribution more accurately and efficiently. This is a new topic under our current investigation.
Bibliography
1. Ahonen, T. & Pietikäinen, M. Soft histograms for local binary patterns in Proceedings of the Finnish Signal Processing Symposium, FINSIG 5 (2007), 1.
2. Akcay, S., Atapour-Abarghouei, A. & Breckon, T. P. Ganomaly: Semi-supervised anomaly detection via adversarial training in Asian Conference on Computer Vision (2018), 622–637.
3. Amadasun, M. & King, R. Textural features corresponding to textural properties. IEEE Transactions on Systems, Man, and Cybernetics 19, 1264–1274 (1989).
4. Andrearczyk, V. & Whelan, P. F. Using filter banks in convolutional neural networks for texture classification. Pattern Recognition Letters 84, 63–69 (2016).
5. Arivazhagan, S. & Ganesan, L. Texture segmentation using wavelet transform. Pattern Recognition Letters 24, 3197–3203 (2003).
6. Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks in International Conference on Machine Learning (2017), 214–223.
7. Azizi, Z., Lei, X. & Kuo, C.-C. J. Noise-aware texture-preserving low-light enhancement in 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP) (2020), 443–446.
8. Bell, S., Upchurch, P., Snavely, N. & Bala, K. Material recognition in the wild with the Materials in Context database in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3479–3487.
9. Bell, S., Upchurch, P., Snavely, N. & Bala, K. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (TOG) 32, 111 (2013).
10. Berger, G. & Memisevic, R. Incorporating long-range consistency in CNN-based texture generation. arXiv preprint arXiv:1606.01286 (2016).
11. Bergmann, P., Fauser, M., Sattlegger, D. & Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), 9592–9600.
12. Bergmann, P., Fauser, M., Sattlegger, D. & Steger, C. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 4183–4192.
13. Bergmann, P., Löwe, S., Fauser, M., Sattlegger, D. & Steger, C. Improving unsupervised defect segmentation by applying structural similarity to autoencoders in Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2019, Volume 5: VISAPP, Prague, Czech Republic, February 25-27, 2019 (eds Trémeau, A., Farinella, G. M. & Braz, J.) (SciTePress, 2019), 372–380. doi:10.5220/0007364503720380.
14. Broadhurst, R. E. Statistical estimation of histogram variation for texture classification in Proc. Intl. Workshop on Texture Analysis and Synthesis (2005), 25–30.
15. Brodatz, P. Textures: A Photographic Album for Artists and Designers (Dover Pubns, 1966).
16. Buades, A., Coll, B. & Morel, J.-M. A non-local algorithm for image denoising in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on 2 (2005), 60–65.
17. Chang, T. & Kuo, C.-C. J. Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing 2, 429–441 (1993).
18. Chellappa, R. & Chatterjee, S. Classification of textures using Gaussian Markov random fields. IEEE Transactions on Acoustics, Speech, and Signal Processing 33, 959–963 (1985).
19. Chen, H.-S., Rouhsedaghat, M., Ghani, H., Hu, S., You, S. & Kuo, C.-C. J. DefakeHop: A light-weight high-performance deepfake detector. arXiv preprint arXiv:2103.06929 (2021).
20. Chen, J., Shan, S., He, C., Zhao, G., Pietikainen, M., Chen, X., et al. WLD: A robust local image descriptor. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1705–1720 (2009).
21. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 834–848 (2017).
22. Chen, P. C. & Pavlidis, T. Segmentation by texture using a co-occurrence matrix and a split-and-merge algorithm. Computer Graphics and Image Processing 10, 172–182 (1979).
23. Chen, Y. & Kuo, C.-C. J. Pixelhop: A successive subspace learning (SSL) method for object recognition. Journal of Visual Communication and Image Representation 70, 102749 (2020).
24. Chen, Y., Rouhsedaghat, M., You, S., Rao, R. & Kuo, C.-C. J. Pixelhop++: A small successive-subspace-learning-based (SSL-based) model for image classification in 2020 IEEE International Conference on Image Processing (ICIP) (2020), 3294–3298.
25. Chen, Y., Xu, Z., Cai, S., Lang, Y. & Kuo, C.-C. J. A Saak transform approach to efficient, scalable and robust handwritten digits recognition in 2018 Picture Coding Symposium (PCS) (2018), 174–178.
26. Chen, Y., Zhou, X. S. & Huang, T. S. One-class SVM for learning in image retrieval in Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205) 1 (2001), 34–37.
27. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S. & Vedaldi, A. Describing textures in the wild in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), 3606–3613.
28. Cimpoi, M., Maji, S., Kokkinos, I. & Vedaldi, A. Deep filter banks for texture recognition, description, and segmentation. International Journal of Computer Vision 118, 65–94 (2016).
29. Cimpoi, M., Maji, S. & Vedaldi, A. Deep filter banks for texture recognition and segmentation in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3828–3836.
30. Cohen, N. & Hoshen, Y. Sub-image anomaly detection with deep pyramid correspondences. arXiv preprint arXiv:2005.02357 (2020).
31. Conners, R. W. & Harlow, C. A. A theoretical comparison of texture algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 204–222 (1980).
32. Crosier, M. & Griffin, L. D. Using basic image features for texture classification. International Journal of Computer Vision 88, 447–460 (2010).
33. Dana, K. J., Van Ginneken, B., Nayar, S. K. & Koenderink, J. J. Reflectance and texture of real-world surfaces. ACM Transactions on Graphics (TOG) 18, 1–34 (1999).
34. Defard, T., Setkov, A., Loesch, A. & Audigier, R. PaDiM: A patch distribution modeling framework for anomaly detection and localization. arXiv preprint arXiv:2011.08785 (2020).
35. Dehaene, D., Frigo, O., Combrexelle, S. & Eline, P. Iterative energy-based projection on a normal data manifold for anomaly localization. International Conference on Learning Representations (2020).
36. Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
37. Doretto, G., Chiuso, A., Wu, Y. N. & Soatto, S. Dynamic textures. International Journal of Computer Vision 51, 91–109 (2003).
38. Efros, A. A. & Leung, T. K. Texture synthesis by non-parametric sampling in Proceedings of the Seventh IEEE International Conference on Computer Vision 2 (1999), 1033–1038.
39. Fei, Y., Huang, C., Jinkun, C., Li, M., Zhang, Y. & Lu, C. Attribute restoration framework for anomaly detection. IEEE Transactions on Multimedia (2020).
40. Fogel, I. & Sagi, D. Gabor filters as texture discriminator. Biological Cybernetics 61, 103–113 (1989).
41. Funke, C. M., Gatys, L. A., Ecker, A. S. & Bethge, M. Synthesising dynamic textures using convolutional neural networks. arXiv preprint arXiv:1702.07006 (2017).
42. Gatys, L., Ecker, A. S. & Bethge, M. Texture synthesis using convolutional neural networks in Advances in Neural Information Processing Systems (2015), 262–270.
43. Gilboa, G., Sochen, N. & Zeevi, Y. Y. Texture preserving variational denoising using an adaptive fidelity term in Proc. VLsM 3 (2003).
44. Girshick, R. Fast R-CNN in Proceedings of the IEEE International Conference on Computer Vision (2015), 1440–1448.
45. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), 580–587.
46. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
47. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. Generative adversarial nets in Advances in Neural Information Processing Systems (2014), 2672–2680.
48. Haindl, M. & Mikes, S. Texture segmentation benchmark in Proceedings of the 19th International Conference on Pattern Recognition, ICPR 2008 (IEEE Computer Society, Tampa, FL, USA, 2008), 1–4. doi:10.1109/ICPR.2008.4761118.
49. Han, T., Lu, Y., Wu, J., Xing, X. & Wu, Y. N. Learning generator networks for dynamic patterns in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (2019), 809–818.
50. Hayman, E., Caputo, B., Fritz, M. & Eklundh, J.-O. On the significance of real-world conditions for material classification in European Conference on Computer Vision (2004), 253–266.
51. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN in Proceedings of the IEEE International Conference on Computer Vision (2017), 2961–2969.
52. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 770–778.
53. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 770–778.
54. He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification in Proceedings of the IEEE International Conference on Computer Vision (2015), 1026–1034.
55. Heeger, D. J. & Pentland, A. P. Seeing structure through chaos in Proceedings of the IEEE Motion Workshop: Representation and Analysis (1986), 131–136.
56. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., et al. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies (2001).
57. Jain, A. K. & Farrokhnia, F. Unsupervised texture segmentation using Gabor filters. Pattern Recognition 24, 1167–1186 (1991).
58. Kadam, P., Zhang, M., Liu, S. & Kuo, C.-C. J. R-PointHop: A green, accurate and unsupervised point cloud registration method. arXiv preprint arXiv:2103.08129 (2021).
59. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks in Advances in Neural Information Processing Systems (2012), 1097–1105.
60. Kuo, C.-C. J. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation 41, 406–413 (2016).
61. Kuo, C.-C. J. & Chen, Y. On data-driven Saak transform. Journal of Visual Communication and Image Representation 50, 237–246 (2018).
62. Kuo, C.-C. J., Zhang, M., Li, S., Duan, J. & Chen, Y. Interpretable convolutional neural networks via feedforward design. arXiv preprint arXiv:1810.02786 (2018).
63. Kuo, C.-C. J., Zhang, M., Li, S., Duan, J. & Chen, Y. Interpretable convolutional neural networks via feedforward design. Journal of Visual Communication and Image Representation 60, 346–359 (2019).
64. Kylberg, G. Kylberg Texture Dataset v. 1.0 (Centre for Image Analysis, Swedish University of Agricultural Sciences and..., 2011).
65. Lafferty, J., McCallum, A. & Pereira, F. C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001).
66. Laws, K. I. Rapid texture identification in Image Processing for Missile Guidance 238 (1980), 376–382.
67. Lei, X., Zhao, G. & Kuo, C.-C. J. NITES: A non-parametric interpretable texture synthesis method in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) (2020), 1698–1706.
68. Li, Z., Li, N., Jiang, K., Ma, Z., Wei, X., Hong, X., et al. Superpixel masking and inpainting for self-supervised anomaly detection in 31st British Machine Vision Conference (2020), 7–10.
69. Lin, G., Milan, A., Shen, C. & Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 1925–1934.
70. Liu, D. C. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 503–528 (1989).
71. Liu, L., Chen, J., Fieguth, P., Zhao, G., Chellappa, R. & Pietikäinen, M. From BoW to CNN: Two decades of texture representation for texture classification. International Journal of Computer Vision 127, 74–109 (2019).
72. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., et al. SSD: Single shot multibox detector in European Conference on Computer Vision (2016), 21–37.
73. Liu, W., Li, R., Zheng, M., Karanam, S., Wu, Z., Bhanu, B., et al. Towards visually explaining variational autoencoders in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 8642–8651.
74. Liu, X., Xing, F., Yang, C., Kuo, C.-C. J., Babu, S., Fakhri, G. E., et al. VoxelHop: Successive subspace learning for ALS disease classification using structural MRI. arXiv preprint arXiv:2101.05131.
75. Liu, X. & Wang, D. Image and texture segmentation using local spectral histograms. IEEE Transactions on Image Processing 15, 3066–3077 (2006).
76. Liu, X. & Wang, D. Texture classification using spectral histograms. IEEE Transactions on Image Processing 12, 661–670 (2003).
77. Liznerski, P., Ruff, L., Vandermeulen, R. A., Franks, B. J., Kloft, M. & Müller, K.-R. Explainable deep one-class classification. arXiv preprint arXiv:2007.01760 (2020).
78. Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3431–3440.
79. Lu, J., Wang, G. & Pan, Z. Nonlocal active contour model for texture segmentation. Multimedia Tools and Applications 76, 10991–11001 (2017).
80. Martin, D., Fowlkes, C., Tal, D. & Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics in Proc. 8th Int'l Conf. Computer Vision 2 (2001), 416–423.
81. Mevenkamp, N. & Berkels, B. Variational multi-phase segmentation using high-dimensional local features in Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on (2016), 1–9.
82. Mikes, S., Haindl, M., Scarpa, G. & Gaetano, R. Benchmarking of remote sensing segmentation methods. Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of 8, 2240–2248. doi:10.1109/JSTARS.2015.2416656 (2015).
83. Napoletano, P., Piccoli, F. & Schettini, R. Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors 18, 209 (2018).
84. Nelson, R. C. & Polana, R. Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding 56, 78–89 (1992).
85. Ojala, T. & Pietikäinen, M. Unsupervised texture segmentation using feature distributions. Pattern Recognition 32, 477–486 (1999).
86. Panagiotakis, C., Grinias, I. & Tziritas, G. Natural image segmentation based on tree equipartition, Bayesian flooding and region merging. IEEE Transactions on Image Processing 20, 2276–2287 (2011).
87. Péteri, R., Fazekas, S. & Huiskes, M. J. DynTex: A comprehensive database of dynamic textures. Pattern Recognition Letters 31, 1627–1632 (2010).
88. Pickup, L. C., Roberts, S. J. & Zisserman, A. A sampled texture prior for image super-resolution in Advances in Neural Information Processing Systems (2004), 1587–1594.
89. Raad, L., Davy, A., Desolneux, A. & Morel, J.-M. A survey of exemplar-based texture synthesis. Annals of Mathematical Sciences and Applications 3, 89–148 (2018).
90. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
91. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 779–788.
92. Reed, T. R. & Dubuf, J. H. A review of recent texture segmentation and feature extraction techniques. CVGIP: Image Understanding 57, 359–372 (1993).
93. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks in Advances in Neural Information Processing Systems (2015), 91–99.
94. Rippel, O., Mertens, P. & Merhof, D. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. arXiv preprint arXiv:2005.14140 (2020).
95. Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation in International Conference on Medical Image Computing and Computer-Assisted Intervention (2015), 234–241.
96. Rouhsedaghat, M., Monajatipoor, M., Azizi, Z. & Kuo, C.-C. J. Successive subspace learning: An overview. arXiv preprint arXiv:2103.00121 (2021).
97. Rouhsedaghat, M., Wang, Y., Ge, X., Hu, S., You, S. & Kuo, C.-C. J. Facehop: A light-weight low-resolution face gender classification method. arXiv preprint arXiv:2007.09510 (2020).
98. Rouhsedaghat, M., Wang, Y., Hu, S., You, S. & Kuo, C.-C. J. Low-resolution face recognition in resource-constrained environments. arXiv preprint arXiv:2011.11674 (2020).
99. Rudolph, M., Wandt, B. & Rosenhahn, B. Same same but differnet: Semi-supervised defect detection with normalizing flows in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021), 1907–1916.
100. Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., et al. Deep one-class classification in International Conference on Machine Learning (2018), 4393–4402.
101. Ruff, L., Vandermeulen, R. A., Görnitz, N., Binder, A., Müller, E., Müller, K.-R., et al. Deep semi-supervised anomaly detection. arXiv preprint arXiv:1906.02694 (2019).
102. Sajjadi, M. S., Scholkopf, B. & Hirsch, M. Enhancenet: Single image super-resolution through automated texture synthesis in Proceedings of the IEEE International Conference on Computer Vision (2017), 4491–4500.
103. Salehi, M., Eftekhar, A., Sadjadi, N., Rohban, M. H. & Rabiee, H. R. Puzzle-AE: Novelty detection in images through solving puzzles. arXiv preprint arXiv:2008.12959 (2020).
104. Saligrama, V. & Chen, Z. Video anomaly detection based on local statistical aggregates in 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), 2112–2119.
105. Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G. & Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, 30–44 (2019).
106. Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U. & Langs, G. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery in International Conference on Information Processing in Medical Imaging (2017), 146–157.
107. Scime, L. & Beuth, J. Anomaly detection and classification in a laser powder bed additive
manufacturing process using a trained computer vision algorithm. Additive Manufacturing
19,114–126(2018).
108. Sendik, O. & Cohen-Or, D. Deep correlations for texture synthesis. ACM Transactions on
Graphics(TOG)36,161(2017).
109. Shaham,T.R.,Dekel,T.&Michaeli,T.Singan:Learningagenerativemodelfromasingle
natural image in Proceedings of the IEEE International Conference on Computer Vision
(2019),4570–4580.
110. Shotton, J., Winn, J., Rother, C. & Criminisi, A. Textonboost for image understanding:
Multi-class object recognition and segmentation by jointly modeling texture, layout, and
context.InternationalJournalofComputerVision81, 2–23(2009).
111. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image
recognition.arXivpreprintarXiv:1409.1556 (2014).
112. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image
recognition.arXivpreprintarXiv:1409.1556 (2014).
113. Sohrab, F., Raitoharju, J., Gabbouj, M. & Iosifidis, A. Subspace support vector data de-
scription in 2018 24th International Conference on Pattern Recognition (ICPR) (2018),
722–727.
114. Sohrab, F., Raitoharju, J., Iosifidis, A. & Gabbouj, M. Multimodal subspace support vector
data description.PatternRecognition 110,107648(2021).
115. Storath, M., Weinmann, A. & Unser, M. Unsupervised texture segmentation using mono-
genic curvelets and the Potts model in Image Processing (ICIP), 2014 IEEE International
Conferenceon(2014),4348–4352.
116. Tax, D. M. & Duin, R. P. Support vector data description. Machine learning 54, 45–66
(2004).
117. Tesfaldet, M., Brubaker, M. A. & Derpanis, K. G. Two-stream convolutional networks for
dynamic texture synthesis in Proceedings of the IEEE Conference on Computer Vision and
PatternRecognition(2018),6703–6712.
118. Tseng, T.-W., Yang, K.-J., Kuo, C.-C. J. & Tsai, S.-H. An interpretable compression and
classificationsystem:Theoryandapplications.IEEEAccess 8,143962–143974(2020).
119. Ulyanov, D., Vedaldi, A. & Lempitsky, V. Improved texture networks: Maximizing quality
and diversity in feed-forward stylization and texture synthesis in Proceedings of the IEEE
ConferenceonComputerVisionandPatternRecognition(2017),6924–6932.
120. Unser,M.Textureclassificationandsegmentationusingwaveletframes.IEEETransactions
onimageprocessing4, 1549–1560(1995).
121. Ustyuzhaninov, I., Brendel, W., Gatys, L. A. & Bethge, M. What does it take to generate
naturaltextures?in5thInternationalConferenceonLearningRepresentations,ICLR2017,
Toulon,France,April24-26,2017,ConferenceTrackProceedings(OpenReview.net,2017).
122. Varma, M. & Zisserman, A. A statistical approach to material classification using image
patchexemplars.IEEEtransactionsonpatternanalysisandmachineintelligence31,2032–
2047(2009).
123. Venkataramanan,S.,Peng,K.-C.,Singh,R.V.&Mahalanobis,A.AttentionGuidedAnomaly
LocalizationinImagesin EuropeanConferenceonComputerVision(2020),485–503.
96
124. Wang, G., Lu, J., Pan, Z. & Miao, Q. Color texture segmentation based on active con-
tour model with multichannel nonlocal and Tikhonov regularization. Multimedia Tools and
Applications76,24515–24526(2017).
125. Wang, X., Yu, K., Dong, C. & Change Loy, C. Recovering realistic texture in image super-
resolution by deep spatial feature transform in Proceedings of the IEEE Conference on
ComputerVisionandPatternRecognition(2018),606–615.
126. Wei, L.-Y. & Levoy, M. Fast texture synthesis using tree-structured vector quantization in
Proceedingsofthe27thannualconferenceonComputergraphicsandinteractivetechniques
(2000),479–488.
127. Xie, J., Gao, R., Zheng, Z., Zhu, S.-C. & Wu, Y. N. Learning dynamic generator model by
alternatingback-propagationthroughtime.arXivpreprintarXiv:1812.10587 (2018).
128. Xue, J., Zhang, H., Dana, K. J. & Nishino, K. Differential Angular Imaging for Material
Recognition.in CVPR (2017),6940–6949.
129. Yang, J., Shi, Y. & Qi, Z. DFR: Deep Feature Reconstruction for Unsupervised Anomaly
Segmentation.arXivpreprintarXiv:2012.07122(2020).
130. Yi,J.&Yoon,S.Patch-SVDD:Patch-levelSVDDforAnomalyDetectionandSegmentation
inProceedingsoftheAsianConferenceonComputerVision(2020).
131. Yuan, J., Wang, D. & Cheriyadat, A. M. Factorization-based texture segmentation. IEEE
TransactionsonImageProcessing24, 3488–3497(2015).
132. Zagoruyko,S.&Komodakis,N.Wideresidualnetworks. arXivpreprintarXiv:1605.07146
(2016).
133. Zavrtanik, V., Kristan, M. & Skočaj, D. Reconstruction by inpainting for visual anomaly
detection.PatternRecognition112, 107706(2021).
134. Zhang, K., Chen, H.-S., Wang, Y., Ji, X. & Kuo, C.-C. J. Texture Analysis Via Hierarchical
Spatial-Spectral Correlation (HSSC) in 2019 IEEE International Conference on Image
Processing(ICIP)(2019),4419–4423.
135. Zhang, K., Chen, H.-S., Zhang, X., Wang, Y. & Kuo, C.-C. J. A Data-centric Approach
to Unsupervised Texture Segmentation Using Principle Representative Patterns in ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP)(2019),1912–1916.
136. Zhang, K., Wang, B., Chen, H.-S., Wang, Y., Mou, S. & Kuo, C.-C. J. Dynamic Texture
Synthesis By Incorporating Long-range Spatial and Temporal Correlations. arXiv preprint
arXiv:2104.05940(2021).
137. Zhang, M., Kadam, P., Liu, S. & Kuo, C.-C. J. Unsupervised Feedforward Feature (UFF)
Learning for Point Cloud Classification and Segmentation in 2020 IEEE International
ConferenceonVisualCommunicationsandImageProcessing(VCIP)(2020),144–147.
138. Zhang,M.,Wang,Y.,Kadam,P.,Liu,S.&Kuo,C.-C.J.Pointhop++:Alightweightlearning
model on point sets for 3d classification in 2020 IEEE International Conference on Image
Processing(ICIP)(2020),3319–3323.
139. Zhang, M., You, H., Kadam, P., Liu, S. & Kuo, C.-C. J. Pointhop: An explainable machine
learningmethodforpointcloudclassification.IEEETransactionsonMultimedia22,1744–
1755(2020).
140. Zhao, G. & Pietikainen, M. Dynamic texture recognition using local binary patterns with
an application to facial expressions. IEEE Transactions on Pattern Analysis & Machine
Intelligence,915–928(2007).
97
141. Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network in Proceedings
oftheIEEEconferenceoncomputervisionandpatternrecognition(2017),2881–2890.
142. Zhou, J. T., Du, J., Zhu, H., Peng, X., Liu, Y. & Goh, R. S. M. AnomalyNet: An anomaly
detection network for video surveillance. IEEE Transactions on Information Forensics and
Security14, 2537–2550(2019).
143. Zhou,Y.,Zhu,Z.,Bai,X.,Lischinski,D.,Cohen-Or,D.&Huang,H.Non-stationarytexture
synthesisbyadversarialexpansion.arXivpreprintarXiv:1805.04487 (2018).
144. Zong,B.,Song,Q.,Min,M.R.,Cheng,W.,Lumezanu,C.,Cho,D.,etal.Deepautoencoding
gaussianmixturemodelforunsupervisedanomalydetectioninInternationalConferenceon
LearningRepresentations(2018).
145. Zuo, W., Zhang, L., Song, C. & Zhang, D. Texture enhanced image denoising via gradient
histogram preservation in Proceedings of the IEEE Conference on Computer Vision and
PatternRecognition(2013),1203–1210.
146. Zuo, W., Zhang, L., Song, C., Zhang, D. & Gao, H. Gradient histogram estimation and
preservationfortextureenhancedimagedenoising.IEEEtransactionsonimageprocessing
23,2459–2472(2014).
98
Abstract
Images are a high-dimensional and complex data source that carries rich information. Image analysis and modeling is one of the most fundamental yet important topics in computer vision and pattern recognition, and it has attracted extensive research attention over the last several decades. This thesis investigates and proposes methods for several important aspects of image analysis, including texture representation, unsupervised texture segmentation, dynamic texture synthesis, and image anomaly detection and localization.

• For texture representation, a hierarchical spatial-spectral correlation (HSSC) method is proposed for texture analysis. The HSSC method first applies a multi-stage spatial-spectral transform, known as the Saak transform, to input texture patches. It then conducts a correlation analysis on the Saak transform coefficients to obtain texture features of high discriminant power. To demonstrate the effectiveness of the HSSC method, we conduct extensive experiments on texture classification and show that it offers very competitive results compared with state-of-the-art methods.

• For unsupervised texture segmentation, we introduce a data-centric approach that efficiently extracts and represents textural information and adapts to a wide variety of textures. Exploiting the strong self-similarity and quasi-periodicity of texture images, the proposed method first constructs a representative texture pattern set for the given image using a patch clustering strategy (see the patch-clustering sketch after this list). Pixel-wise texture features are then derived from the similarities between local patches and the representative textural patterns. The proposed feature is generic and flexible, and it can be integrated easily into various segmentation approaches. Extensive experimental results on both textural and natural image segmentation show that segmentation with the proposed features achieves very competitive or even better performance than state-of-the-art methods.

• For dynamic texture synthesis, the main challenge lies in maintaining spatial and temporal consistency in the synthesized video. The major drawback of existing dynamic texture synthesis models is their poor treatment of long-range texture correlation and motion information. To address this problem, we incorporate a new loss term, called the Shifted Gram loss, to capture the structural and long-range correlation of the reference texture video (a sketch of this idea follows the list). Furthermore, we introduce a frame sampling strategy to exploit long-period motion across multiple frames. With these two techniques, the application scope of existing texture synthesis models is extended: they can synthesize not only homogeneous but also structured dynamic texture patterns. Thorough experimental results demonstrate that the proposed dynamic texture synthesis model offers state-of-the-art visual performance.

• For image anomaly detection and localization, a neural network targeting unsupervised image anomaly localization, called PEDENet, is proposed. PEDENet contains a patch embedding (PE) network, a density estimation (DE) network, and an auxiliary network called the location prediction (LP) network. The PE network takes local image patches as input and performs dimension reduction to obtain low-dimensional patch embeddings via a deep encoder structure. Inspired by the Gaussian Mixture Model (GMM), the DE network takes these patch embeddings and predicts the cluster membership of each embedded patch. The sum of membership probabilities is used as a loss term to guide the learning process. The LP network is a multi-layer perceptron (MLP) that takes embeddings from two neighboring patches as input and predicts their relative location. The performance of the proposed PEDENet is evaluated and compared with state-of-the-art benchmarking methods through extensive experiments.

• For image anomaly localization, a new method based on the successive subspace learning (SSL) framework, called AnomalyHop, is proposed. AnomalyHop consists of three modules: 1) feature extraction via successive subspace learning (SSL), 2) normality feature distribution modeling via Gaussian models, and 3) anomaly map generation and fusion (a Gaussian-scoring sketch follows this list). Compared with state-of-the-art image anomaly localization methods based on deep neural networks (DNNs), AnomalyHop is mathematically transparent, easy to train, and fast at inference. Moreover, its area under the ROC curve (ROC-AUC) performance on the MVTec AD dataset is 95.9%, which is among the best of several benchmarking methods.
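As an illustration of the patch-clustering step in the texture segmentation work, the following is a minimal Python sketch. The function name prp_features, the patch size, the cluster count, and the use of scikit-learn's KMeans are assumptions for illustration, not the thesis implementation; a grayscale image array is assumed.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.image import extract_patches_2d

def prp_features(img, patch=15, k=8):
    # Hypothetical sketch: cluster local patches of a grayscale image into
    # k representative texture patterns, then describe each patch by its
    # distance to every pattern (smaller distance = stronger similarity).
    p = extract_patches_2d(img, (patch, patch)).reshape(-1, patch * patch)
    km = KMeans(n_clusters=k, n_init=10).fit(p)
    return km.transform(p)  # (n_patches, k): distances to the k patterns

In practice, the per-patch distances would be mapped back to pixel positions and smoothed before being fed to a segmentation back end.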
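To make the Shifted Gram idea concrete, here is a minimal PyTorch-style sketch; the helper names, the shift set, and the normalization are assumptions, and the loss in the thesis may differ in its details. A feature map is correlated with a spatially shifted copy of itself, so structure at a given offset contributes to the Gram statistics being matched.

import torch

def shifted_gram(feat, dx, dy):
    # feat: (C, H, W) feature map from one layer of a pretrained CNN.
    # Correlate the map with a copy of itself shifted by (dx, dy) pixels;
    # the resulting C x C Gram matrix encodes correlations at that offset.
    C, H, W = feat.shape
    a = feat[:, :H - dy, :W - dx].reshape(C, -1)
    b = feat[:, dy:, dx:].reshape(C, -1)
    return (a @ b.t()) / a.shape[1]

def shifted_gram_loss(feat_syn, feat_ref, shifts=((0, 0), (8, 0), (0, 8))):
    # Sum of squared differences between shifted Gram matrices of the
    # synthesized and reference feature maps over a small set of shifts;
    # the (0, 0) shift reduces to the classical Gram loss.
    loss = feat_syn.new_zeros(())
    for dx, dy in shifts:
        diff = shifted_gram(feat_syn, dx, dy) - shifted_gram(feat_ref, dx, dy)
        loss = loss + (diff ** 2).mean()
    return loss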
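The Gaussian-modeling and scoring steps of AnomalyHop's modules 2) and 3) can be sketched as follows for a single spatial location and a single hop. This is a simplification under assumed names; AnomalyHop fuses anomaly maps across locations and hops.

import numpy as np

def fit_gaussian(train_feats):
    # train_feats: (N, C) SSL features collected at one spatial location
    # from N normal training images; fit a Gaussian with a small ridge
    # term so the covariance is invertible.
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-3 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    # Mahalanobis distance of a test feature vector x from the normality
    # Gaussian; larger values indicate more anomalous content.
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

Computing this score at every spatial location yields one anomaly map; upsampling and fusing the maps from several hops gives the final localization result.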