MACHINE LEARNING TECHNIQUES FOR PERCEPTUAL QUALITY ENHANCEMENT AND SEMANTIC IMAGE SEGMENTATION

by

Qin Huang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2017

Copyright 2017 Qin Huang

Contents

List of Tables
List of Figures
Abstract

1 Introduction
   1.1 Significance of the Research
   1.2 Contributions of the Research
      1.2.1 False Contour Detection and Removal
      1.2.2 Perceptual Quality Driven Frame-rate Selection
      1.2.3 Intra Class Difference Guided Semantic Segmentation
      1.2.4 Semantic Segmentation with Reverse Attention
   1.3 Organization of the Dissertation

2 Understanding and Removal of False Contour in HEVC Compressed Images
   2.1 Introduction
   2.2 Review of Previous Work
   2.3 Understanding False Contours
      2.3.1 False Contour and Spatial Stimulus
      2.3.2 Support Regions and Monotonicity
      2.3.3 Effects of HEVC Settings
      2.3.4 Effects of Luminance and Color
   2.4 False Contour Detection and Removal
      2.4.1 False Contour Detection with Evolution of False Contour Candidate (FCC) Maps
      2.4.2 False Contour Removal
      2.4.3 Complexity Analysis
   2.5 Experimental Results
      2.5.1 False Contour Detection Results
      2.5.2 False Contour Removal Results
      2.5.3 More Comparison Results
      2.5.4 Subjective Test Results
   2.6 Conclusion and Future Work

3 Perceptual Quality Driven Frame-Rate Selection (PQD-FRS) for High-Frame-Rate Video
   3.1 Introduction
   3.2 Review of Previous Work
   3.3 FRD-VQA Dataset
      3.3.1 Motivation
      3.3.2 Description of FRD-VQA Dataset
   3.4 Proposed PQD-FRS Method
      3.4.1 Basic Decision Unit
      3.4.2 Feature Extraction via Influence Map Construction
      3.4.3 Feature Representation
      3.4.4 Satisfied User Ratio (SUR) Prediction and Frame-Rate Selection (FRS)
   3.5 Experimental Results
      3.5.1 Visualization of Influence Maps
      3.5.2 SUR Prediction Results
      3.5.3 Subjective Test on PQD-FRS Performance
      3.5.4 Influence of Compression
   3.6 Conclusion and Future Work

4 Learning from Fine-to-Coarse: Intra-Class Difference Guided Semantic Segmentation
   4.1 Introduction
   4.2 Related Work
   4.3 Proposed ICD-FCN Method and FCL Procedure
      4.3.1 Overview of ICD-FCN
      4.3.2 Proposed FCL Procedure
      4.3.3 Exploring Intra-Class-Difference (ICD)
   4.4 Analyze Fine-to-Coarse Learning
      4.4.1 Experimental Setup
      4.4.2 ResNet-101 versus Res-FCN
      4.4.3 Effectiveness of Clustering
      4.4.4 Benefits of Fine-to-Coarse
      4.4.5 Other Experiments
   4.5 Experimental Result
   4.6 Conclusion and Future Work

5 Semantic Segmentation with Reverse Attention
   5.1 Introduction
   5.2 Related Work
   5.3 Proposed Reverse Attention Network (RAN)
      5.3.1 Motivation
      5.3.2 Proposed RAN System
   5.4 Experiments
   5.5 Conclusion

6 Conclusion and Future Work
   6.1 Summary of the Research
   6.2 Future Research Directions

Bibliography

List of Tables

2.1 PSNR and SSIM scores of subjective test images. Four groups of scores are provided, corresponding to the compressed image without decontouring, LL's method from [61], JG's method from [53] and the proposed FCDR method.
3.1 Properties of 25 sequences in the FRD-VQA dataset.
3.2 Comparison of predicted SUR values in terms of percentages with different feature vectors, where the SVR with a linear or RBF kernel is adopted as the machine learning algorithm.
3.3 Subjective SUR results of homogeneous test sequences from the "Netflix Chimera" dataset.
3.4 Subjective test results comparing the half- and dynamic-rate sequences against their full-rate counterparts, along with the prediction mean and GoP numbers in two categories.
4.1 Experiments on different learning procedures and baseline models with the corresponding mean IU performance on the reduced PASCAL 2012 validation set.
4.2 Experiments with different sets of [LT, HT] and the corresponding mean IU performance with different sub-class numbers.
4.3 Additional experiments to explore the learning in Step 3.
4.4 Mean IU (%) benchmark performance with detailed per-class scores on PASCAL VOC test 2011.
4.5 Mean IU (%) benchmark performance with detailed per-class scores on PASCAL VOC test 2012.
4.6 Mean IU (%) performance on the reduced VOC 12 validation set and the VOC 2011-12 test set with augmented PASCAL training data and selected COCO images.
4.7 Mean IU (%) performance on the NYUDv2 validation set with RGB training images, compared with FCN-32s [69] and the baseline Res-FCN.
4.8 Mean IU (%) performance on the PASCAL Context dataset, compared with FCN-8s [69], CRF-RNN [109] and the baseline Res-FCN.
5.1 Comparison of semantic image segmentation performance scores (%) on the 5,105 test images of the PASCAL Context dataset.
5.2 Ablation study of different RANs on the PASCAL-Context dataset to evaluate the benefit of the proposed RAN. We compare the results under different network set-ups employing dilated decision conv filters, data augmentation, the MSC design and CRF post-processing.
5.3 Comparison of the Mean IU scores (%) of several benchmarking methods for the PASCAL PERSON-Part dataset.
5.4 Comparison of the Mean IU scores (%) per object class of several methods for the PASCAL VOC2012 test dataset.
5.5 Comparison of the Mean IU scores (%) of several benchmarking methods on the NYU-Depth2 dataset.
5.6 Comparison of the Mean IU scores (%) of several benchmarking methods on the ADE20K dataset.

List of Figures

2.1 Illustration of the local support of a contour point.
2.2 Comparison of three smooth synthesized images: their original images are shown in (a)-(c), their associated horizontal intensity profiles are shown in (d)-(f), and their associated horizontal gradient profiles are shown in (g)-(i), respectively.
2.3 A source image patch in (a) with its coded counterpart in (b), an exemplary intensity profile along the normal direction in (c), and an intensity profile along the contour direction in (d).
2.4 Illustration of a monotonic intensity profile along the normal direction for (a) a false contour point, and (b) a real contour point.
2.5 Illustration of four representative false contour scenarios: (a) 1-pixel width, (b) narrow width with a constant gradient, (c) narrow width with a different gradient, and (d) broad width with a constant gradient.
2.6 Illustration of the false contour artifact: (a) the sunset image of size 1920×1080 with a marked smooth area of size 900×600, (b) the original image of the marked area, and (c)-(e) compressed images of the marked area using HM15.0-rext-8.0 with QP=32 with (c) MaxCU=64, CUdepth=4, MaxTU=32, MinTU=4 (PSNR=47.87dB), (d) MaxCU=32, CUdepth=3, MaxTU=16, MinTU=4 (PSNR=46.72dB), (e) MaxCU=16, CUdepth=2, MaxTU=8, MinTU=4 (PSNR=44.68dB).
2.7 Resultant images with the same CU and TU set-up given in Fig. 2.6(e) yet with different deblocking configurations: (a) with the deblocking filter of parameters LoopFilterBetaOffset=6, LoopFilterTcOffset=6 and DeblockingFilterControlPresent=1 (PSNR=44.72dB), (b) disabled deblocking filter with LoopFilterDisable=1 (PSNR=44.88dB).
2.8 The effect of luminance and color channels on the false contour artifact, with (a) the result of the flipped luminance channel in Fig. 2.6(e), and the results where the same texture map is adopted for (b) the Y channel, (c) the Cb channel and (d) the Cr channel.
2.9 (a) Labels of one target pixel and its eight neighbors with k = 0, 1, ..., 8; (b) gradient in direction k.
2.10 The system diagram of the proposed false contour detection and removal (FCDR) method, where the first three steps are conducted for false contour detection and the detection result is used to guide the false contour removal process.
2.11 (a) The luminance channel of the coded Sunset image using HM15.0-rext-8.0 with QP=32 (PSNR=42.51dB), and (b)-(d): the resulting FCC (false contour candidate) maps after Steps 1-3, respectively.
2.12 Detection of the coded sunset image of flipped luminance in (a), with the detection result using Weber's Law in (b).
2.13 The processed sunset image using (a) probabilistic dithering only and (b) both probabilistic dithering and averaging.
2.14 The coded airplane image is shown in (a) and the FCC maps after the first, the second and the third steps are shown in (b), (c) and (d), respectively.
2.15 Detected false contour regions for the coded airplane image using (a) the LL method and (b) the JG method.
2.16 Comparison of false contour detection results on an exemplary image without visible false contours: (a) the input building image, (b) the detected result by the LL method, (c) the detected result by the JG method, and (d) the detected result by the FCDR method.
2.17 Comparison of false contour removal results: (a) the coded airplane image (PSNR=38.88dB, SSIM=0.947), (b) the decontouring result by the LL method (PSNR=38.92dB, SSIM=0.949), (c) the decontouring result by the JG method (PSNR=38.79dB, SSIM=0.944), (d) the decontouring result by the FCDR method (PSNR=38.82dB, SSIM=0.949).
2.18 Comparison of false contour removal results in the sky region inside the blue window: (a) the coded image patch (PSNR=38.78dB, SSIM=0.9941), (b) the decontouring result by the LL method (PSNR=38.94dB, SSIM=0.9945), (c) the decontouring result by the JG method (PSNR=40.37dB, SSIM=0.9922), (d) the decontouring result by the FCDR method (PSNR=40.48dB, SSIM=0.9954).
2.19 Comparison of false contour removal results in the airplane tail and trail region inside the red window: (a) the coded image patch (PSNR=38.37dB, SSIM=0.945), (b) the decontouring result by the LL method (PSNR=38.34dB, SSIM=0.945), (c) the decontouring result by the JG method (PSNR=37.99dB, SSIM=0.945), (d) the decontouring result by the FCDR method (PSNR=38.35dB, SSIM=0.946).
2.20 An exemplary image coded by H.264 with QP=36 in (a), with (b) the dance image coded by H.264 inside the blue window, (c) the decontouring result of the LL method inside the blue window, (d) the decontouring result of the JG method inside the blue window, (e) the decontouring result of the FCDR method inside the blue window.
2.21 Six more images used for the subjective test.
2.22 Subjective test results: (a) comparison between the proposed FCDR method and the LL method, and (b) comparison between the proposed FCDR method and the JG method.
3.1 Examples of four FRD-VQA sequences: (a) Fountain: fast motion with a complex background, (b) Tunnel: fast motion with a simple background, (c) Walking: slow motion with a complex background and (d) Slow City: slow motion with a simple background.
3.2 The MOS values plotted as functions of the frame rate: (a) sequences with frame rates 60, 30 and 15 fps, and (b) sequences with frame rates 50, 25 and 12.5 fps.
3.3 Comparison of the viewing experience on video sequences with the maximum frame rate and the half maximum frame rate: (a) sequences with a maximum frame rate of 60 fps, and (b) sequences with a maximum frame rate of 50 fps.
3.4 Illustration of spatial randomness prediction.
3.5 Temporal randomness prediction with d = 4 in a GoP.
3.6 Spatial feature maps of HD videos with (a) the luminance channel of Fig. 3.1(a), (b) the spatial randomness map with block size and block stride equal to 32, (c) the smoothness map with block size 32 and T_lc = 1, (d) the spatial contrast map, (e) the salient map and (f) the spatial-influence map.
3.7 Three consecutive frames of the HD video "Fountain" following the frame in Fig. 3.6(a) are shown in (a)-(c) and their temporal randomness map, spatio-temporal influence map and weighted spatio-temporal influence map are shown in (d), (e) and (f), respectively.
3.8 The block STIM value distribution of exemplary frames of the four sequences in Fig. 3.1.
3.9 Feature heat maps of three exemplary frames of the Tunnel (first row), Running Horses (second row) and Honeybee (third row) sequences: frame shots (first column), TRM (second column), STIM (third column) and WSTIM (last column).
3.10 Examples of homogeneous sequences for testing: (a) Barscene: a small object sliding on the table, (b) Windturbine: swirling wind turbines, (c) Driving: a first-person view of a driving scene and (d) Toddler: a small kid running in the fountain.
3.11 Examples of inhomogeneous sequences for dynamic reconstruction: (a) Traffic: very fast moving traffic and a changing background, (b) Dance: a scene of group dancing, (c) Biking: a group of people biking and skateboarding on the road and (d) Jockey: a man riding a running horse.
3.12 Subjective test results for (a) comparison between the full- and the half-rate cases, and (b) comparison of the full- and the dynamic-rate cases.
3.13 SUR results of the subjective test and prediction corresponding to different qualities of the video samples in Fig. 3.1.
4.1 The proposed fine-to-coarse learning (FCL) procedure with the intra-class difference guided fully convolutional neural network (ICD-FCN). A designed clustering scheme relabels the original ground truth into detailed sub-classes.
4.2 The network architecture of the proposed 3-step FCL procedure: a baseline model with the original (or coarse) class labels is obtained in Step 1; features associated with the sub-class (or fine) characteristics are derived and used for training in Step 2; and the fine and coarse information is combined using end-to-end training.
4.3 Illustration of the ICD-class feature extraction, normalization and clustering procedures to generate finer ICD-class labels. First, the response maps of all training data are collected at the output of conv5. Then, a normalization procedure is conducted with respect to the size of the label region. Finally, k-means clustering is performed on the normalized 1D feature vectors.
4.4 Qualitative results on the PASCAL VOC 2012 validation set. We present the results from FCN-8s [69], the Res-FCN baseline from Step 1 and the proposed ICD-FCN in columns 2-4, with the corresponding ground truth in column 5.
4.5 Sample images close to the centroid of each ICD-class for (a) the 'aeroplane' class with ResNet-101, (b) the 'aeroplane' class with the pre-trained Res-FCN, (c) the 'boat' class with the pre-trained Res-FCN, and (d) the 'bicycle' class with the pre-trained Res-FCN.
4.6 Plots of the validation performance in terms of the mean IU with different training procedures and baseline models as stated in Table.
4.7 Plots of (a) training losses corresponding to different cluster numbers in Step 2 and (b) the cluster estimation using the Hopkins score and the Silhouette method for the 'aeroplane' class, where the Hopkins score provides a reference class number of 3 while the Silhouette estimation method gives the optimal cluster number as 4.
4.8 The filter responses of the 'aeroplane' class in the baseline Res-FCN and the proposed ICD-FCN. The ICD-FCN enables learning of distinctive 'aeroplane' features, which contribute to a more accurate segmentation result.
5.1 An illustration of the proposed reverse attention network (RAN), where the lower and upper branches learn features and predictions that are and are not associated with a target class, respectively. The mid-branch focuses on local regions with complicated spatial patterns whose object responses are weaker and provides a mechanism to amplify the response. The predictions of all three branches are fused to yield the final prediction for the segmentation task.
5.2 Observations on FCN's direct learning. The normalized feature response of the last conv5 layer is presented along with the class-wise probability maps for 'dog' and 'cat'.
5.3 The system diagram of the reverse attention network (RAN), where the CONV_org and CONV_rev filters are used to learn features associated and not associated with a particular class, respectively. The reverse object class knowledge is then highlighted by an attention mask to generate the reverse attention of a class, which is then subtracted from the original prediction score as a correction.
5.4 Qualitative results on the PASCAL-Context validation set with: the input image, the DeepLabv2-LargeFOV baseline, our RAN-s result, and the ground truth.
5.5 Qualitative results on the NYU-DepthV2 validation set with: the input image, the DeepLabv2-LargeFOV baseline, our RAN-s result, and the ground truth.

Abstract

Understanding the content of images and videos is important in both image processing and computer vision. Since digital image and video content takes the mere form of two-dimensional pixel arrays, the computer needs to be guided to "see" and to "understand". To resolve these issues, traditional methods include two important steps: extracting effective feature representations and designing an efficient decision system. However, with the emergence of big visual data, the recent development of deep learning techniques provides a more effective way to simultaneously acquire feature representations and generate the desired output.

In this dissertation, we contribute four works that gradually develop from traditional methods to deep learning based methods. Based on their applications, the works can be divided into two major categories: perceptual quality enhancement and semantic image segmentation. In the first part, we focus on enhancing the quality of images and videos by considering related perceptual properties of the human visual system. To begin with, we deal with a type of compression artifact referred to as the "false contour". We then focus on the visual experience of videos and its relationship with the displayed frame rate. In the second part, we target generating both low-level detailed segmentation and high-level semantic meaning for given images, which requires a more detailed understanding of images.

Based on the desired targets, our solutions are specifically designed and can efficiently resolve these problems with both traditional machine-learning frameworks and state-of-the-art deep learning techniques.

Chapter 1
Introduction

1.1 Significance of the Research

The development of modern computing resources and algorithms enriches our daily life with both convenience and efficiency. We now rely on the help of computers and smart phones in every aspect of our lives. It seems as if the computer is able to do anything as long as you ask for it. However, one major issue is that computers cannot "see" the world the way we do, and it is therefore important to guide the computer in understanding digital visual content.

Research on image processing and computer vision problems can generally be divided into two major steps: extracting powerful feature representations and designing an efficient decision system. Traditional methods rely on hand-crafted features, as well as pre-defined thresholds, to generate a necessary condition for the desired target. A more robust system can be designed by taking advantage of machine learning techniques if enough training samples are provided.
Thanks to the development of big data, millions of images and videos are now available for training. To better utilize this information during training, convolutional neural network (CNN) based deep learning systems have become popular in recent years. Specifically, CNN-based methods demonstrate a better ability to acquire powerful feature representations and decision rules simultaneously. However, CNN-based training has high hardware requirements and demands subtle process design, and it should therefore be carefully explored to obtain the desired results.

In this dissertation, we contribute four works that gradually develop from traditional methods to deep learning based methods. Based on their applications, the works can be divided into two major categories: perceptual quality enhancement and semantic image segmentation. In the first part, we focus on enhancing the quality of images and videos by considering related perceptual properties of the human visual system. To begin with, we deal with a type of compression artifact referred to as the "false contour". We then focus on the visual experience of videos and its relationship with the displayed frame rate. In the second part, we target generating both low-level detailed segmentation and high-level semantic meaning for given images, which requires a more detailed understanding of images.

1.2 Contributions of the Research

In this dissertation, we propose four methods to solve four important problems. Firstly, we design a two-stage system to detect and remove false contours in HEVC encoded images. Secondly, we propose a machine learning based system that combines perceptual properties with spatial image features. Thirdly, we design a fine-to-coarse learning procedure which prompts the state-of-the-art fully convolutional neural network to understand the intra-class difference in semantic segmentation. Fourthly, we propose a reverse attention network architecture that trains the network to capture what is not associated with a target class in semantic segmentation. In brief summary, our proposed methods make the following contributions.

1.2.1 False Contour Detection and Removal

A contour-like artifact called the false contour is often observed in large smooth areas of decoded images and video. Without loss of generality, we focus on the detection and removal of false contours resulting from the state-of-the-art HEVC codec.

First, we identify the cause of false contours by explaining the human perceptual experience of them with specific experiments. Next, we propose a precise pixel-based false contour detection method based on the evolution of a false contour candidate (FCC) map. The number of points in the FCC map becomes fewer as more constraints are imposed step by step. Special attention is paid to separating false contours from real contours such as edges and textures in the video source. Then, a decontouring method is designed to remove false contours at the exact contour positions while preserving edge/texture details. Extensive experimental results are provided to demonstrate the superior performance of the proposed false contour detection and removal (FCDR) method on both compressed images and videos.

1.2.2 Perceptual Quality Driven Frame-rate Selection

Video of a higher frame rate (HFR) reduces visual artifacts on large screen displays at the cost of a higher coding bit rate (or transmission bandwidth). In this work, we propose a perceptual quality driven frame rate selection (PQD-FRS) method that assigns a time-varying frame rate to a sequence so as to reduce its transmission cost. The objective of the PQD-FRS method is to offer a perceptually indistinguishable experience for a certain percentage of viewers.
We first conduct a subjective test to characterize the relationship between human-perceived quality and video content, and build a frame-rate-dependent video quality assessment (FRD-VQA) dataset to serve as the ground truth. Then, we use a machine learning approach to design the key module of the PQD-FRS method, called "satisfied user ratio (SUR) prediction". The SUR prediction module predicts the percentage of satisfied viewers, who cannot differentiate the video quality of a lower and a higher frame rate, using support vector regression (SVR). We also employ simple semantic features that describe the appearance, layout and visual patterns of the semantic ingredients in a scene, aided by robust rough semantic segmentation results. Experimental results confirm that the proposed SUR module offers such accurate prediction that the PQD-FRS system can dynamically assign a proper frame rate to video without perceptual quality degradation for a majority of viewers.

1.2.3 Intra Class Difference Guided Semantic Segmentation

With the development of the fully convolutional neural network (FCN) and its descendants, the field of semantic segmentation has advanced rapidly in recent years. The FCN-based solutions are able to summarize features across training images and generate matching templates for the desired object classes, yet they overlook the intra-class difference (ICD) among multiple instances of the same class.

Therefore, we present a novel fine-to-coarse learning (FCL) procedure, which first guides the network with designed 'finer' sub-class labels, and later combines its decision with the original 'coarse' object category through end-to-end learning. A sub-class labeling strategy is designed with deep convolutional features, and the proposed FCL procedure enables a balance between the fine-scale (i.e., sub-class) and the coarse-scale (i.e., class) knowledge.

Consistent and significant improvements in semantic segmentation are observed across the PASCAL VOC, NYUDv2 and PASCAL Context datasets. The proposed method achieves state-of-the-art performance under both small and large training datasets.

1.2.4 Semantic Segmentation with Reverse Attention

Recent developments in fully convolutional neural networks enable efficient end-to-end learning of semantic segmentation. Traditionally, convolutional classifiers are taught to learn the representative semantic features of labeled semantic objects. In this work, we propose a reverse attention network (RAN) architecture that trains the network to capture the opposite concept (i.e., what is not associated with a target class) as well. The RAN is a three-branch network that performs the direct, reverse and reverse-attention learning processes simultaneously.

Extensive experiments are conducted to show the effectiveness of the RAN in semantic segmentation. Built upon DeepLabv2-LargeFOV, the RAN achieves the state-of-the-art mIoU score (48.1%) on the challenging PASCAL-Context dataset. Significant performance improvements are also observed on the PASCAL-VOC, Person-Part, NYUDv2 and ADE20K datasets.

1.3 Organization of the Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we introduce our work on enhancing the quality of compressed images by detecting and removing the false contours in HEVC encoded images.
In Chapter 3, we examine the relationship between the perceptual quality of videos and the displayed frame rate, and propose a machine-learning based dynamic frame-rate assignment strategy. In Chapter 4, we look into the state-of-the-art fully convolutional neural network and propose a fine-to-coarse learning procedure to prompt the learning of intra-class difference. A reverse attention network structure is proposed and described in Chapter 5, where the network explicitly learns what does not belong to the target object class. Finally, we summarize the work and discuss future research directions in Chapter 6.

Chapter 2
Understanding and Removal of False Contour in HEVC Compressed Images

2.1 Introduction

One major image/video compression artifact is contouring in smooth surface areas. Such contours do not exist in the source image/video but appear in the coded image/video. For this reason, they are called false contours, and they are attributed to the quantization effect of the compression codec. Although the false contour is a common phenomenon in coded image/video content, it is the dominant artifact in high-resolution HD (high-definition) and UHD (ultra-high-definition) video, since other artifacts are well suppressed at higher bit-rates with smaller quantization parameters (QPs).

As compared with other popular coding standards such as MPEG-2 and H.264/AVC, the high efficiency video coding (HEVC) standard [94] offers better performance in suppressing blocking artifacts by exploiting larger-sized CTBs (coding tree blocks) and an in-loop deblocking filter [74]. However, the problem of false contours still exists due to the unbalanced quantization of smooth regions. Moreover, although of a small magnitude, the false contour is not negligible, since it can degrade the perceptual quality of decoded video significantly, especially for high-resolution content viewed on large screens [85, 86]. Furthermore, because of the large CTB sizes, the appearance of false contours is not as easy to identify as it is in H.264, which makes their detection and removal more difficult. Thus, it is important to understand the mechanism of false contours and to develop an effective method to detect and remove them, especially in HEVC decoded video.

Research on false contour detection and removal has been conducted by a few researchers in the past, and a review of related work is provided in Sec. 2.2. Although previous methods show reasonable performance in removing false contours in H.264 or JPEG images, they fail to do well in more general cases and especially in HEVC. Our work, on the other hand, shows robust performance in detecting the most noticeable false contours and removing them effectively. Therefore, our major contributions lie in three aspects: 1) we provide a thorough understanding of false contours and discuss the related perceptual properties as well as the HEVC settings under which false contours appear; 2) we design a false contour detection method which accurately detects the exact positions of visible false contours in HEVC and which relies heavily on our designed monotonicity feature of the local support region; and 3) we propose a removal method which efficiently reduces false contours while preserving the edge/texture details of the source.

To achieve these objectives, we start by exploring the factors contributing to the false contour, with extensive illustrative examples.
Specifically, we discuss the appearance of false contours with regard to the related perceptual properties and HEVC parameters. Afterwards, we propose a false contour detection method that imposes a sequence of constraints such as low-amplitude stimulus, local smoothness and monotonicity. Among these properties, the explored spatial monotonicity turns out to be extremely important for generating a clean false contour detection result compared to previous methods. We then define a false contour candidate (FCC) map and show that the number of points in the FCC map becomes smaller and smaller as more constraints are imposed step by step. Special attention is paid to separating false contours from real contours such as edges and textures in the original image/video source. Then, we propose a decontouring method that removes false contours while preserving edge/texture details using probabilistic dithering and averaging. Finally, extensive experimental results, including a subjective test on several false contour scenarios, are given to demonstrate the superior performance of the proposed false contour detection and removal (FCDR) method.

The rest of this chapter is organized as follows. Related previous work is reviewed in Section 2.2. The cause of false contours and their properties are examined in Section 2.3. A false contour detection method and a decontouring method are proposed in Sections 2.4.1 and 2.4.2, respectively, to build the full false contour detection and removal (FCDR) system. Experimental results are presented in Section 2.5. Finally, concluding remarks and future extensions are given in Section 2.6.

2.2 Review of Previous Work

Research on false contour detection and decontouring has been conducted in the past, and several methods have been proposed to solve this problem. They can be classified into two categories depending on the order of their application. If a decontouring method is applied to the source video before its coding, it is a pre-processing technique designed to prevent the contouring artifact prior to quantization. Examples of adopting decontouring as a pre-processing module include [80, 22, 34, 75, 54]. On the other hand, if a decontouring method is applied to coded video, it works as a post-processing module, designed to detect and remove false contours during or after the video decoding process. Examples of adopting decontouring as a post-processing module include [61, 23, 4, 10, 53, 106]. Although these works have spent much effort on discussing the detection and removal of false contours, most of them fail to precisely locate the positions of the false contours and to provide effective methods to suppress them.

Several decontouring methods perform pixel-by-pixel processing, e.g., [61]-[10]. For false contour position prediction, Daly and Feng [23] proposed to filter an image using a low-pass filter and then quantize the filtered image. The difference between these two images consists of predicted contours and, thus, they are subtracted from the input image for decontouring. This method is effective for noise-free images such as computer-generated images, yet not for natural images. Ahn and Kim [4] used features such as entropy and statistics to detect flat regions and then applied shuffling, low-pass filtering and dithering to these regions. It can effectively suppress the contouring artifact, yet some texture details may be detected as flat regions and be smoothed as a result.
Lee and Lim [61] proposed a two-stage method for false contour detection. First, smooth regions are removed by bit-depth reduction. Next, false contours are separated from texture regions using directional contrast features, and a false contour map is obtained accordingly. With such a map, variable-size directional smoothing is applied to the labeled regions for decontouring. This method focuses on the difference between very smooth and textured regions but fails to detect false contours under more complex conditions. Later, in [18], Choi and Lee extended the work of [61] by acquiring an expanded area of the contour map and applying multi-dimensional weighted filtering to preserve the edges while removing the contours. However, since the method acquires the positions of false contours in the same way as [61], it is not applicable to complex images. In [10], Bhagavathy and Llach proposed a multi-scale probabilistic dithering method to detect and dither contour pixels based on the probability distribution of their neighborhood. This method is nevertheless computationally expensive in calculating the scale of the contours.

Some decontouring methods adopt block-based processing. Yoo and Song [106] proposed a 16x16 macroblock (MB)-based in-loop decontouring method. The MBs with false contours are first detected using the coding configuration (such as the QP and the MB mode) and gradients in four directions. Then, a dithering method is used to remove the false contours by adding pseudo-random noise. Jin and Goto [53] proposed a block-based contour detection method that exploits the discrete cosine transform (DCT) coefficients. Blocks with few textures are identified, and the ratio of brightness and variance is calculated to determine whether a block contains false contours or not. For detected blocks, decontouring is applied by dithering the DC values with a trade-off between gradient smoothness and block-edge smoothness taken into account. Although of lower complexity in comparison with the pixel-based methods, the work of [53] is difficult to optimize when considering both the gradient smoothness and the block-edge smoothness at the same time. Besides, texture detail may get smoothed together with the contour artifact if they co-exist in the same block.

More recently, several methods have been proposed to deal with false contours in HEVC and high-resolution content. In [98], Yanxiang et al. propose a block-based method to derive contour regions and apply dithering with blocks of two sizes (32x32 and 4x4) to remove the false contours. However, similar to the other block-based methods, this method cannot precisely detect false contour positions and might include real texture information due to the block-based operations. In [55], Seyun, Sungkwon and Nam proposed a bit-depth expansion method that uses Bayesian estimation to reconstruct the image with a high bit-depth. However, since this method applies its estimators to smooth regions determined by bit-depth expansion, it cannot reflect the precise positions of the false contours either.

2.3 Understanding False Contours

It is important to understand the properties and causes of false contours so that they can be accurately detected and removed. As is commonly known, the false contour is more likely to appear in smooth regions without complex texture of high gradient. However, we find spatial monotonicity to be another crucial criterion for false contours to appear after compression, one which has not been explored in any previous work.
In this section, we first study the conditions under which false contours appear. Then, we discuss the appearance of false contours in HEVC video and examine the influence of different coding parameters. Finally, we describe several perceptual characteristics that govern human perception of false contours.

2.3.1 False Contour and Spatial Stimulus

The false contour phenomenon is best explained by a simple example. The local support of a contour point is a fan-shaped region whose sides lie along its normal and contour directions, as shown in Fig. 2.1. The fan-shaped region can be approximated by a parallelogram if the curvature at the contour point is small. As shown in Fig. 2.1, the local fan regions are smooth and without any sharp edges, and thus serve as a candidate background for a false contour. However, the low-gradient changes across the two fan areas are of interest, and we look further into what makes the human visual system (HVS) sensitive to them.

Figure 2.1: Illustration of the local support of a contour point.

In Figs. 2.2(a) to (c), we present three synthesized images with generally smooth backgrounds, whose intensity profiles are shown in Figs. 2.2(d) to (f) and whose gradient profiles are shown in (g) to (i). It can be seen that Figs. 2.2(a) and (b) have a uniform horizontal gradient of 1 and 2, respectively, whereas Fig. 2.2(c) has a change of gradient at the center position of the image. Although the intensity changes throughout all three synthesized images, a contour can be observed only in the last one. This shows that a change of intensity does not necessarily produce a contour; rather, it is the breaking of monotonicity in the distribution (i.e., a change in the otherwise constant gradient) that results in a contour the HVS is sensitive to.

Figure 2.2: Comparison of three smooth synthesized images: their original images are shown in (a)-(c), their associated horizontal intensity profiles are shown in (d)-(f), and their associated horizontal gradient profiles are shown in (g)-(i), respectively.
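This observation is easy to reproduce numerically. The short sketch below is an illustration added by the editor, not part of the original work: it builds three 1-D horizontal intensity profiles analogous to Fig. 2.2 (the ramp lengths and gradient values are arbitrary choices for the example) and reports the positions where the horizontal gradient changes, which are exactly the positions where a contour becomes visible.

```python
import numpy as np

def ramp(segments):
    """Build a 1-D horizontal intensity profile from (length, gradient) segments."""
    profile, level = [], 0.0
    for length, grad in segments:
        for _ in range(length):
            profile.append(level)
            level += grad
    return np.array(profile)

# Three profiles mimicking Fig. 2.2: gradient 1, gradient 2, and a gradient change.
profiles = {
    "uniform gradient 1": ramp([(100, 1.0)]),
    "uniform gradient 2": ramp([(100, 2.0)]),
    "gradient 1 then 2":  ramp([(50, 1.0), (50, 2.0)]),
}

for name, p in profiles.items():
    grad = np.diff(p)                      # horizontal gradient profile
    breaks = np.nonzero(np.diff(grad))[0]  # positions where the gradient itself changes
    print(f"{name}: gradient-change positions -> {breaks + 1}")
```

Only the third profile reports a gradient-change position, matching the fact that a contour is perceived only in Fig. 2.2(c).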
2.3.2 Support Regions and Monotonicity

In this subsection, we identify the influence of the local support region and highlight the importance of spatial monotonicity for false contours to appear in HEVC compressed results. As discussed later in Step 3 of Section 2.4, the monotonicity proves to be especially effective in removing non-visible candidate contours.

To demonstrate the fan-shaped region property in real-world images, we compare a patch of a source image and its coded counterpart in Fig. 2.3. The image patch is encoded with HM15.0-rext-8.0 with QP=32 and the deblocking filter on.

Figure 2.3: A source image patch in (a) with its coded counterpart in (b), an exemplary intensity profile along the normal direction in (c), and an intensity profile along the contour direction in (d).

As shown in the result, the original patch does not have any visible contour while its coded counterpart does. To find out the difference, we compare the intensity values of the original and coded patches along the normal and contour directions of a contour point in its local support. A representative case is shown in Figs. 2.3(c) and (d). The profiles of the original and coded patches behave similarly along the normal direction, as shown in Fig. 2.3(c). However, the marked contour positions always reside at positions with intensity changes in the compressed profile, where the monotonicity along the surface normal direction is broken.

On the other hand, the two profiles behave rather differently along the contour direction. That is, the intensity profile of the coded patch is uniformly distributed while that of the original patch has small fluctuations, as shown in Fig. 2.3(d). As a result, in addition to the uniform smooth distribution on both sides of the contour in the fan-shaped supporting area, the pixel profile along the contour direction should also be uniform across multiple connected pixels of sufficient length.

In addition, this uniform property also resides in the overall intensity distribution across the false contour. More specifically, if we partition the local support of a false contour point into three sub-regions, denoted by subregions A, B and C as shown in Fig. 2.4(a), the average intensity of subregion B is about the mid-value of the average intensities of subregions A and C, so that the average intensities of these three regions are distributed in a monotonic way. Since a false contour appears only by breaking the distribution of an originally smooth region, this property does not always hold for a real object boundary, as the contents on the two sides of a real edge do not necessarily have to be similar. Therefore, this overall monotonicity can also serve as a necessary condition for a false contour.

Figure 2.4: Illustration of a monotonic intensity profile along the normal direction for (a) a false contour point, and (b) a real contour point.

Furthermore, we found that the visibility of false contours is also related to the size of the fan-shaped supporting regions. Here, we consider a simple example that takes only the width of the false contour region into account. Four synthesized smooth ramp images are shown in Fig. 2.5. The four images have a similar pattern in that each image has two gradient-changing points near the center position. However, only one false contour is observed in Figs. 2.5(a)-(c), while two false contours between three smooth areas are observed in Fig. 2.5(d). As a matter of fact, the gap between the two gradient-changing points in Fig. 2.5(a) has a width of only one pixel, while the contours in Figs. 2.5(b) and (c) have a gap wider than one pixel, yet they are so narrow that humans perceive only one contour. However, when the contour region becomes wide enough, it becomes a smooth region by itself and two contours appear at its two boundaries.

Figure 2.5: Illustration of four representative false contour scenarios: (a) 1-pixel width, (b) narrow width with a constant gradient, (c) narrow width with a different gradient, and (d) broad width with a constant gradient.

To conclude, in addition to the conventional criterion of simply looking for smooth regions, we identify a false contour as the pixel positions where there is a gradient change in the surface normal direction that breaks the original monotonic distribution, with large enough uniformly-distributed supporting areas both along and orthogonal to the false contour.
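The sub-region monotonicity test described above can be summarized in a short sketch. The helper below is an editor-added illustration only; the window sizes, the tolerance, and the assumption that the normal direction is horizontal are all choices made for the example rather than values given in this chapter. It checks whether the mean intensities of subregions A, B and C around a candidate contour pixel are monotonic, with B near the midpoint of A and C.

```python
import numpy as np

def monotonic_support(row, x, half_width=8, band=3, tol=2.0):
    """Check the A/B/C monotonicity of a 1-D profile taken along the
    (assumed horizontal) normal direction of a candidate contour pixel at x.
    A and C are the two flanking windows, B is a narrow band around x."""
    a = row[max(0, x - half_width - band): x - band]
    b = row[x - band: x + band + 1]
    c = row[x + band + 1: x + band + 1 + half_width]
    if len(a) == 0 or len(c) == 0:
        return False
    mean_a, mean_b, mean_c = a.mean(), b.mean(), c.mean()
    monotonic = (mean_a <= mean_b <= mean_c) or (mean_a >= mean_b >= mean_c)
    near_mid = abs(mean_b - 0.5 * (mean_a + mean_c)) <= tol
    return monotonic and near_mid

# A quantization staircase (false-contour-like) passes; a strong real edge does not.
ramp = np.repeat([96.0, 100.0, 104.0], 12)   # staircase from coarse quantization
edge = np.array([90.0] * 18 + [140.0] * 18)  # genuine object boundary
print(monotonic_support(ramp, 18), monotonic_support(edge, 18))
```

This mirrors the argument above: for a quantized ramp the three averages line up monotonically with B near the mid-value, whereas a real boundary between different contents generally violates that relation.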
Some common artifacts, such as blocking artifacts near CU block boundaries are efficiently suppressed by HEVC because of its 16 (a) (b) (c) (d) (e) Figure 2.6: Illustration of the false contour artifact: (a) the sunset image of size 1920 1080 with a marked smooth area of size 900 600, (b) the original image of the marked area, and (c)-(e) compressed images of the marked area using HM15.0-rext-8.0 with QP=32 with (c) MaxCU=64, CUdepth=4, MaxTU=32, MinTU=4 (PSNR=47.87dB), (d) MaxCU=32, CUdepth=3, MaxTU=16, MinTU=4 (PSNR=46.72dB), (e) MaxCU=16, CUdepth=2, MaxTU=8, MinTU=4 (PSNR=44.68dB). adoption of large CU sizes and in-loop deblocking filter design. We want to see whether these new coding tools can help remove false contours as well. In the following examples, we use HM15.0-rext-8.0 for the HEVC coding with the sample adaptive offset (SAO) tool on. We first enable the deblocking filters of standard settings yet without Dbl control, and examine the influence of pre-defined CTB sizes. Specifically, we encode the image content with different coding unit (CU) and transform 17 unit (TU) configurations to find out the influence of multi-size CTBs on the false contour effect. An example is given in Fig. 2.6, which shows a sunset image and its HEVC- coded result of a marked smooth region above the sun. The coded result of the marked region in Fig. 2.6(a) is shown in Fig. 2.6(b). We see that the false contour artifact caused by the HEVC compression is still visible even at high bit-rate (with QP=32 and overall PSNR=42.88dB). For the same test sunset image, the coded smooth region of different parameter set- tings are shown in Figs. 2.6(d) and (e). Generally speaking, the false contour becomes more apparent by decreasing the largest CTB sizes and the corresponding TU sizes. Therefore, even though HEVC offers good performance in suppressing the blocking artifact by adopting the in-loop deblocking filter and larger CU/TU sizes, these tools fail to remove false contours effectively, especially when the video is displayed on a large screen. Furthermore, we examine the influence of the in-loop deblocking filter. We keep the coding profile nearly the same as that in Fig. 2.6(e) except for the parameters related to the deblocking filter. Two coded images are shown in Fig. 2.7 with different deblocking parameters. We modify the beta-offset and the Tc-offset of the loop filter and enable the Dbl control in Fig. 2.7(a). We disable the deblocking filter in Fig. 2.7(b). Although there are some minor PSNR differences in these two decoded images, the change of these parameters does not help remove false contours. The false contour cannot be removed by disabling the deblocking filter either. 2.3.4 Effects of Luminance and Color We finally examine the perceptual related properties to false contour artifact. Specifi- cally, we study the influence of color channels and luminance condition, which is well related to Weber’s Law [6] and masking effect [100, 35]. 18 (a) (b) Figure 2.7: Resultant images with the same CU and TU set-up given in Fig. 2.6(e) yet with different deblocking configurations: (a) with the deblocking filter of param- eters LoopFiterBetaOffset=6, LoopFilterTcOffset=6 and DeblockingFilterControlPre- sent=1 (PSNR=44.72dB), (b) disabled deblocking filter with LoopFilterDisable=1 (PSNR=44.88dB). First, we examine the effect of luminance on false contours. The false contour is more visible in the darker region than the brighter region as demonstrated by the fol- lowing experiment. 
We flip the luminance channel by setting Y' = 255 - Y for the coded result of Fig. 2.6(e) and show the corresponding result in Fig. 2.8(a). The false contours in the darker region of Fig. 2.6(e) are less visible in Fig. 2.8(a), while those in the brighter region of Fig. 2.6(e) are more visible in Fig. 2.8(a). This can be explained by Weber's Law [6]. That is, the just noticeable difference (JND) value, K, can be expressed as K = ΔI / I, where I and ΔI are the background intensity and the intensity difference, respectively. Therefore, the same amount of distortion in a dark area (with a smaller I value) results in more visible distortion than in a bright area (with a larger I value).

Next, we examine the effect of color on false contours. As discussed in [84], the HVS is more sensitive to luminance than to chrominance, and the contrast sensitivity of the chrominance components is much lower than that of luminance. Therefore, we examine the influence of the color channels on false contours in a controlled environment by creating three images, each with the exact same pattern in one dominating channel. To do so, we extract the luminance channel of the image patch in Fig. 2.6(e) and then code it using HEVC with its mono-channel setting (4:0:0) to obtain a texture map, which is denoted by Y_c(x, y). Then, Figs. 2.8(b)-(d) show three resulting images where only one of the Y, Cb and Cr channels is replaced by the texture map. We see that the false contour artifact is more visible through the Y channel than through the Cb or Cr channels. Based on this observation, we focus on the luminance channel for false contour detection and removal in Sections 2.4.1 and 2.4.2, respectively. Decontouring of color images is only considered in Section 2.5.

Figure 2.8: The effect of luminance and color channels on the false contour artifact, with (a) the result of the flipped luminance channel in Fig. 2.6(e), and the results where the same texture map is adopted for (b) the Y channel, (c) the Cb channel and (d) the Cr channel.

2.4 False Contour Detection and Removal

Based on the study in the last section, an accurate false contour detection method and a dithering-based method to counter the monotonicity of false contours are proposed in this section. Unlike previous methods that primarily search for candidate false contour areas in smooth regions, the proposed detection method can identify the exact location of a false contour, and the proposed removal method can remove it efficiently without affecting the underlying texture region. The block diagram of the proposed false contour detection and removal (FCDR) method is shown in Fig. 2.10.

2.4.1 False Contour Detection with Evolution of False Contour Candidate (FCC) Maps

In this section, we introduce our step-by-step detection method based on the evolution of a false contour candidate (FCC) map, which records the changes of all possible false contour positions. In the first step, we remove very smooth areas without spatial stimulus; in the second step, we remove areas with sharp edges and texture; and in the last step, most importantly, we check the monotonicity along and orthogonal to the previous candidate positions to derive the final false contour positions, which are expected to be sensitive to the HVS. All positions in the FCC map are initialized to '1'. We show below how some of them are changed from '1' to '0' step by step.
Step 1: Exclude Areas with Negligible Gradient Change

If the gradient changes within a neighborhood are too small, its pixels will be quantized to the same value and no false contour will arise, so these pixels can be eliminated in the first step. To do so, we calculate the difference between a pixel and each of its eight neighbors. If the magnitudes of all differences are less than a pre-determined threshold, we set the pixel to zero in the FCC map. Mathematically, we label a target pixel, denoted by p, and its eight neighbors with k = 0, 1, ..., 8 (see Fig. 2.9(a)). The gradient of p along direction k = 1, ..., 8 is defined as

    G_k(p) = |I_k(p) - I_0(p)|,   (2.1)

where I_k(p) is the intensity value of the pixel labeled by k in Fig. 2.9(a). Also, we define the sum of the gradient magnitudes in opposite directions (denoted by k and its opposite) as

    G_{k,k'}(p) = G_k(p) + G_{k'}(p).   (2.2)

Figure 2.9: (a) Labels of one target pixel and its eight neighbors with k = 0, 1, ..., 8; (b) gradient in direction k.

All values of the initial FCC map, denoted by M_0(p), are set to '1'. The value of pixel p is changed from '1' to '0' if it has negligible gradient change, as specified by either of the following two conditions:

    G_{1,5}(p) < T_1 and G_{3,7}(p) < T_1,   (2.3)
    G_{2,6}(p) < T_1 and G_{4,8}(p) < T_1,   (2.4)

where T_1 is a pre-defined threshold. The FCC map after this step is denoted by M_1(p) and shown in Fig. 2.11(b), where a black dot (with M(p) = 1) indicates that the associated pixel still remains in the FCC set and a white dot (with M(p) = 0) indicates that it has been removed.

Step 2: Exclude Areas with Textures and Sharp Edges

Next, by exploiting the visual masking effect, we remove regions with textures and sharp edges from the FCC map. That is, a non-zero pixel p in the FCC map M_1(p) is changed from '1' to '0' if either of the following two conditions does not hold:

    max{G_{1,5}(p), G_{2,6}(p), G_{3,7}(p), G_{4,8}(p)} < T_2,   (2.5)
    G_{1,5}(p) + G_{2,6}(p) + G_{3,7}(p) + G_{4,8}(p) < T_3,   (2.6)

where T_2 and T_3 are two thresholds. These conditions alone can be used to build a texture map M_t, which is later used in the false contour removal algorithm. The resulting FCC map is denoted by M_2(p), and an example is given in Fig. 2.11(c).

Figure 2.10: The system diagram of the proposed false contour detection and removal (FCDR) method, where the first three steps are conducted for false contour detection and the detection result is used to guide the false contour removal process.

Step 3: Exclude Areas without Monotonicity

As introduced in Section 2.3.2, a false contour point must lie in a smooth region yet with a monotonically increasing/decreasing amplitude change. To measure this monotonicity properly, one way is to count the numbers of pixels with the same gradient values along and orthogonal to the contour direction at the candidate contour position, which indicates the general monotonicity of the local region. Therefore, within the adjacent W_m-by-1 area in each direction, a false contour point should have sufficiently large N_c and N_n values, which denote the numbers of pixels with the same gradient profile along and orthogonal to the contour direction, respectively, namely:

    N_c > T_4 or N_n > T_5,   (2.7)

where T_4 and T_5 are two thresholds. The value of a non-zero pixel in M_2(p) is changed from '1' to '0' if it does not meet the above conditions. Afterwards, the monotonicity in a local support as given in Fig. 2.4 is further checked. This check copes with a special situation where a remaining false contour point is actually a real contour point between two objects; its value has to be changed from '1' to '0' as well. The resulting FCC map is denoted by M_3(p), and an example is given in Fig. 2.11(d). By comparing Figs. 2.11(b)-(d), we see that the number of points in the FCC map is reduced significantly, and the designed monotonicity measurement further refines the false contour detection result.

Figure 2.11: (a) The luminance channel of the coded Sunset image using HM15.0-rext-8.0 with QP=32 (PSNR=42.51dB), and (b)-(d): the resulting FCC (false contour candidate) maps after Steps 1-3, respectively.
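To make the first two screening steps concrete, the sketch below (an editor-added, illustrative re-implementation, not the authors' code) expresses Eqs. (2.1)-(2.6) in NumPy: it computes the eight directional gradients of every pixel, sums them over opposite-direction pairs, and then zeroes out FCC positions with negligible gradient change (Step 1) or with textures/sharp edges (Step 2). The threshold values T1, T2 and T3 are placeholders, since the actual values are chosen empirically as discussed below.

```python
import numpy as np

# Offsets (dy, dx) of the eight neighbors; opposite directions are k and k+4.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def fcc_after_steps_1_2(luma, T1=2, T2=8, T3=24):
    """Evolve the FCC map through Steps 1 and 2 (Eqs. 2.1-2.6).
    luma: 2-D array of luminance values; T1, T2, T3 are illustrative thresholds."""
    y = luma.astype(np.float64)
    pad = np.pad(y, 1, mode="edge")
    h, w = y.shape
    # G_k(p) = |I_k(p) - I_0(p)| for k = 1..8                         (Eq. 2.1)
    G = np.stack([np.abs(pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] - y)
                  for dy, dx in OFFSETS])
    # Sums over opposite direction pairs (1,5), (2,6), (3,7), (4,8)   (Eq. 2.2)
    Gpair = G[:4] + G[4:]
    fcc = np.ones((h, w), dtype=bool)                 # M_0(p) = 1 everywhere
    # Step 1: negligible gradient change              (Eqs. 2.3-2.4)
    smooth = ((Gpair[0] < T1) & (Gpair[2] < T1)) | ((Gpair[1] < T1) & (Gpair[3] < T1))
    fcc &= ~smooth
    # Step 2: textures and sharp edges; also yields the texture map M_t (Eqs. 2.5-2.6)
    texture = ~((Gpair.max(axis=0) < T2) & (Gpair.sum(axis=0) < T3))
    fcc &= ~texture
    return fcc, texture
```

The Step 3 monotonicity counts (Eq. 2.7) would then be evaluated only on the surviving FCC positions, which is what keeps the overall detection cost low.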
This local-support check copes with a special situation in which a remaining false contour point is actually a real contour point between two objects; the value of such a point is changed from '1' to '0' as well. The resulting FCC map is denoted by $M_3(p)$, and an example is given in Fig. 2.11(d). By comparing Figs. 2.11(b)-(d), we see that the number of points in the FCC map is reduced significantly, and the designed monotonicity measurement further refines the false contour detection result.

There are several thresholds to select in the above three steps. They are determined empirically based on experimental results and are closely related to the image resolution. Furthermore, their values can be made region-dependent by considering Weber's Law. That is, we can take the higher sensitivity of the human visual system (HVS) to false contours in dark regions into account. Take the flipped-luminance sunset image as an example. Visually, its right part has no visible artifacts due to the high luminance value. This region is successfully excluded from the FCC map, as shown in Fig. 2.12, by taking Weber's Law into account.

Figure 2.12: Detection on the coded sunset image with flipped luminance in (a), with the detection result using Weber's Law in (b).

2.4.2 False Contour Removal

In this section, we propose a decontouring method to remove the detected false contours. It consists of two steps: probabilistic dithering and averaging, labeled as Steps 4 and 5 in Fig. 2.10, respectively. Dithering is often used to randomize quantized constant values to prevent large smooth regions in images [10]. With probabilistic dithering, we use the neighboring luminance distribution to reconstruct pixel values in the false contour region. This introduces local randomness and suppresses the visibility of false contours by exploiting the masking effect [43, 63].

Step 1: Two-side Probabilistic Dithering

Here, we propose a probabilistic dithering procedure that differs from the traditional dithering method in that it uses two windows: one in subregion A and the other in subregion C, as shown in Fig. 2.4. The two separate windows cope with the intensity monotonicity on either side of the contour. If only one window were used, values from region A might be randomly chosen to replace pixels in region C, which would introduce a new artifact. To achieve this, we first derive the direction orthogonal to the contour, $k_n$, similarly to Step 3 of the detection algorithm. For each candidate contour pixel, two windows are placed along the contour's normal direction, and the pixels inside each window that belong to neither fine textures nor sharp edges are collected and denoted by $p_i$, $i \in \{1, \ldots, N_w\}$, where $N_w$ is the total number of candidate pixels within that window. Then, for each window, we draw a random index $r$ from $[1, N_w]$ and replace the original intensity value of each $p_i$ as

$Y_{dith}(p_i) = Y(p_r).$  (2.8)

In contrast with the work in [10], which calculated the probability of dithering values $\{+1, -1, 0\}$ at each detected contour point, our dithering method operates on regions along the normal direction of the contour segment. The regional masking effect helps suppress the false contour even more effectively. Based on the detected false contour positions, the result of probabilistic dithering is shown in Fig. 2.13(a). We see that most false contours have been well suppressed.
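The two-side probabilistic dithering of Eq. (2.8) can be sketched as follows. Everything here is illustrative: the window length, the per-pixel redrawing of the random index r, and the assumption that the contour normal has been quantized to one of the eight neighbor directions are choices made for this sketch, not details given in the text.

import numpy as np

def two_side_dither(Y, contour_pts, normals, texture_map, win_len=8, rng=None):
    """
    Illustrative two-side probabilistic dithering (Eq. 2.8).
    Y            : luminance image (2-D float array).
    contour_pts  : list of (row, col) detected false-contour positions.
    normals      : list of integer unit steps (drow, dcol), one of the eight neighbor
                   directions, orthogonal to the contour at each point.
    texture_map  : boolean M_t; True marks sharp-edge / texture pixels that are skipped.
    win_len      : assumed window length along the normal on each side of the contour.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = Y.astype(np.float64).copy()
    h, w = Y.shape
    for (r, c), (dr, dc) in zip(contour_pts, normals):
        for side in (+1, -1):                          # one window on each side of the contour
            window = []
            for step in range(1, win_len + 1):
                rr, cc = r + side * step * dr, c + side * step * dc
                if 0 <= rr < h and 0 <= cc < w and not texture_map[rr, cc]:
                    window.append((rr, cc))
            if not window:
                continue
            # Replace each candidate inside the window by a randomly drawn window value.
            values = np.array([Y[p] for p in window])
            for (rr, cc) in window:
                out[rr, cc] = values[rng.integers(len(values))]
    return out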
Step 2: Averaging Smoothing

Due to the application of random dithering, the neighborhood around the contours may appear slightly noisy, so we apply an averaging operation for noise removal. Note that a straightforward averaging method may not work well, since the quantization difference might reappear elsewhere and introduce new contours.

Figure 2.13: The processed sunset image using (a) probabilistic dithering only and (b) both probabilistic dithering and averaging.

Since the dithered values follow the original local intensity distribution, we use a simple averaging filter around the dithered positions to remove the random noise. To simplify the presentation, we use the texture map $M_t$ that marks pixels of sharp edges and complex textures according to Eqs. (2.5) and (2.6). Then, the pixel intensities within a square window $W_i$ of size $l_w$ around each $p_i$ are summed over the pixels that are not in the texture map $M_t$, and the dithered pixel intensity of $p_i$ is replaced by the corresponding average:

$Y_{avg}(p_i) = \sum_{q \in W_i} w(q, M_t) Y(q) \Big/ \sum_{q \in W_i} w(q, M_t),$  (2.9)

where $w(\cdot)$ is the indicator function defined as

$w(p, M) = \begin{cases} 1, & \text{if } p \notin M, \\ 0, & \text{if } p \in M. \end{cases}$  (2.10)

The final decontoured result after both probabilistic dithering and averaging is shown in Fig. 2.13(b). We see that the false contours are well removed and the decontoured region is smooth and clean.

2.4.3 Complexity Analysis

In this subsection, we evaluate the computation and memory complexities of the proposed FCDR method. For an image of size $N \times N$, its gradient information has to be calculated and stored, which demands $O(N^2)$ in computation and $O(N^2)$ in memory. Then, the exclusion of negligible-gradient areas and the texture information are computed in a local $3 \times 3$ window centered at each pixel, which requires an extra $O(N^2)$. Since both the texture map and the intermediate results are needed for further detection and removal, we need an extra $O(2N^2)$ memory to store them. The monotonicity check is conducted only on the pixels that remain valid after the first two steps. Suppose there are $M_1$ valid pixels remaining in the candidate map; the third detection step then takes up to $O(M_1 W_m)$, where $W_m$ is the size of the monotonicity check window. Since the result directly updates the map from the previous step, the final false contour detection result does not require extra memory. As a result, the computation and memory complexities of the detection process are $O(3N^2 + M_1 W_m)$ and $O(3N^2)$, respectively.

Based on the detection result, a probabilistic dithering operation is conducted at every candidate pixel in the detection map. Suppose that there are $M_2$ pixels in the map; the computation complexity is then $O(M_2 N_w)$, where $N_w$ is the total number of candidate pixels in each dithering window, as discussed in Section 2.4.2. The dithering step only requires a temporary memory of $O(N_w)$ to store adjacent pixel values and $O(N^2)$ to save the result. The last averaging step demands a computation complexity of $O(M_2 N_w W_i^2)$, as the averaging is conducted in a square window of size $W_i \times W_i$ for each pixel with a dithered intensity value. A temporary memory of $O(W_i^2)$ is required for this step. The final result needs to be stored in an extra $O(N^2)$ space to keep it separate from the dithering results. Thus, the proposed removal method requires $O(M_2 N_w + M_2 N_w W_i^2)$ in computation and $O(N_w + W_i^2 + 2N^2)$ in memory, respectively.
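A hedged sketch of the averaging step of Eqs. (2.9)-(2.10) is given below; the window size l_w and the use of the dithered frame as the averaging source are assumptions made for illustration.

import numpy as np

def average_smooth(Y_dith, dithered_pts, texture_map, l_w=5):
    """
    Averaging of Eq. (2.9): each dithered pixel is replaced by the mean of the
    non-texture pixels in an l_w x l_w window around it (l_w is an assumed size).
    """
    out = Y_dith.astype(np.float64).copy()
    h, w = Y_dith.shape
    half = l_w // 2
    for r, c in dithered_pts:
        r0, r1 = max(0, r - half), min(h, r + half + 1)
        c0, c1 = max(0, c - half), min(w, c + half + 1)
        patch = Y_dith[r0:r1, c0:c1]
        keep = ~texture_map[r0:r1, c0:c1]              # indicator w(q, M_t) of Eq. (2.10)
        if keep.any():
            out[r, c] = patch[keep].mean()
    return out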
As discussed above, the computation complexity depends on the selected window sizes and the image content. For images with fewer candidate contour positions, $M_1$ and $M_2$ are smaller, leading to lower computation complexity. Since our method removes false candidate pixels efficiently, both $M_1$ and $M_2$ remain small.

2.5 Experimental Results

In this section, we compare the proposed FCDR method with the methods of Lee and Lim [61] and of Jin and Goto [53]. For convenience, we call them the LL and the JG methods. First, we show the false contour detection and removal results in detail in Sections 2.5.1 and 2.5.2, respectively. Then, we present more results using standard video sequences from Derf's collection and the Netflix UHD database.

Figure 2.14: The coded airplane image is shown in (a) and the FCC maps after the first, the second and the third steps are shown in (b), (c) and (d), respectively.

2.5.1 False Contour Detection Results

In this subsection, we show that the proposed detection method can locate false contours accurately while excluding genuine textures in the source image frames.

An exemplary image consisting of a flying airplane against the sky is shown in Fig. 2.14(a). It is encoded using HM15.0-rext-8.0 with QP=36 and the deblocking filter on. We show the step-by-step evolution of the FCC map in Figs. 2.14(b)-(d). The results show that the proposed detection method targets very smooth regions in the first step, removes the airplane region in the second step, and checks the local support to obtain accurate contour positions in the third step.

For comparison, the false contour detection results obtained by the LL method and the JG method are shown in Figs. 2.15(a) and (b), respectively, where the detected contour regions are shown in black. The LL method is a pixel-based method. As shown in Fig. 2.15(a), it identifies most contour areas in the background, yet the detected contour regions are not as accurate as those of the proposed FCDR method. Furthermore, when zooming into the airplane region, some false positives remain for the LL method but are completely removed by the proposed FCDR method. The JG method is a block-based method. As a result, its detected false contours are coarser than those obtained by the FCDR method and the LL method, as shown in Fig. 2.15(b). We see that part of the aircraft trail region is wrongly included in the detected blocks. This wrong detection will introduce an additional artifact after the decontouring algorithm is applied.

Figure 2.15: Detected false contour regions for the coded airplane image using (a) the LL method and (b) the JG method.

To further demonstrate the capability of the proposed FCDR method to preserve genuine texture in the source image, we conduct an experiment using an image without visible false contours. An accurate detection algorithm should not return any detected false contour locations. We encode a complex image at a high bit rate (QP=10) so that no visible false contours can be observed in the decoded image, as shown in Fig. 2.16(a).

Figure 2.16: Comparison of false contour detection results on an exemplary image without visible false contours: (a) the input building image, (b) the detected result by the LL method, (c) the detected result by the JG method, and (d) the detected result by the FCDR method.
The detected false contours obtained by the LL method, the JG method and the proposed FCDR method are shown in black in Figs. 2.16(b)-(d), respectively. We see that the FCDR method returns only a few isolated false points. This example shows that the proposed FCDR method can detect false contours accurately without confusing them with genuine texture content. Thus, it allows us to use a more aggressive removal method in the designated areas.

Figure 2.17: Comparison of false contour removal results: (a) the coded airplane image (PSNR=38.88dB, SSIM=0.947), (b) the decontouring result by the LL method (PSNR=38.92dB, SSIM=0.949), (c) the decontouring result by the JG method (PSNR=38.79dB, SSIM=0.944), (d) the decontouring result by the FCDR method (PSNR=38.82dB, SSIM=0.949).

2.5.2 False Contour Removal Results

In this subsection, we present the final decontoured results of the proposed method. The coded airplane image of Fig. 2.14(a) with two marked windows is shown in Fig. 2.17(a). The blue window focuses on the contour area in the background, while the red window focuses on the area near the aircraft tail and trail to demonstrate the decontouring effect on real contours. We apply three false contour removal methods to this image, namely the LL method, the JG method and the proposed FCDR method. Their results are shown in Figs. 2.17(b)-(d), respectively. Clearly, the proposed FCDR method provides better results than the LL and JG methods. Although the LL method and the JG method alleviate the false contour effect in Fig. 2.17(a), they introduce an undesired decontouring artifact in the trail region. This is especially obvious for the JG method since it applies block-based processing.

Figure 2.18: Comparison of false contour removal results in the sky region inside the blue window: (a) the coded image patch (PSNR=38.78dB, SSIM=0.9941), (b) the decontouring result by the LL method (PSNR=38.94dB, SSIM=0.9945), (c) the decontouring result by the JG method (PSNR=40.37dB, SSIM=0.9922), (d) the decontouring result by the FCDR method (PSNR=40.48dB, SSIM=0.9954).

Although our FCDR method has better perceptual performance, it does not necessarily have the best objective scores for the whole image in terms of the PSNR value or the SSIM quality metric [99]. This is because we use dithering and averaging in false contour removal, so the modified pixel values of the decontoured image do not always resemble the original image. Nevertheless, for the contour regions in Fig. 2.18, both the PSNR and SSIM values improve compared to the coded image patch, and they outperform those of the other methods.

Furthermore, we zoom into the blue and the red windows and show the corresponding results in Figs. 2.18 and 2.19, respectively. We see clearly that the proposed FCDR method offers the best performance in restoring the original image quality: it smooths the false contour in the sky region without introducing side effects on real contours (i.e., the decontouring artifact in the airplane tail and trail region).

Figure 2.19: Comparison of false contour removal results in the airplane tail and trail region inside the red window: (a) the coded image patch (PSNR=38.37dB, SSIM=0.945), (b) the decontouring result by the LL method (PSNR=38.34dB, SSIM=0.945), (c) the decontouring result by the JG method (PSNR=37.99dB, SSIM=0.945), (d) the decontouring result by the FCDR method (PSNR=38.35dB, SSIM=0.946).
2.5.3 More Comparison Results

In this section, we present more experimental results. Since there is no common false contour benchmark dataset, we select image frames from Derf's collection and the Netflix UHD dataset. We select test images using the following criteria: (i) possible false contours after compression; (ii) textures, edges or real contours in the source images; and (iii) diverse background and foreground objects.

We encode the images in the same way as we did for Fig. 2.14(a), using HM15.0-rext-8.0 with the deblocking filter on. The results are shown in Fig. 2.5.4. We see that the proposed FCDR method clearly outperforms the LL and JG methods in suppressing false contours and preserving textured areas. These images are also used for the subjective test, whose results are reported in Section 2.5.4.

Figure 2.20: An exemplary image coded by H.264 with QP=36 in (a), with (b) the dance image coded by H.264 inside the blue window, (c) the decontouring result of the LL method inside the blue window, (d) the decontouring result of the JG method inside the blue window, (e) the decontouring result of the FCDR method inside the blue window.

Since the LL and JG methods both target the H.264 environment, we conduct another experiment by applying the LL, JG and FCDR methods to an image coded by H.264 in Fig. 2.20 for a fair comparison. The false contour artifact is very apparent in the dance image due to the dark background. We zoom into the blue window and compare the coded image and the decontouring results obtained by the three methods in Figs. 2.20(b)-(e), respectively. Again, the proposed FCDR method detects and removes the false contours well even in the H.264 environment.

We extend the image-based decontouring framework to HEVC compressed video by applying it frame by frame. An exemplary video clip is provided in the supplementary material, encoded using HM15.0-rext-8.0 with Random Access with B slices (RAB) and the deblocking loop filter (LP) on. We observe a similar improvement in perceived subjective quality, as reported in the following subsection.

2.5.4 Subjective Test Results

Since objective quality metrics do not reflect human subjective quality experience well, we conduct a subjective test on 9 test images with 20 subjects. The nine test images are: 1) tree and sky, 2) Ferris wheel, 3) old city, 4) plane in dawn, 5) yacht and sea, 6) airport night scene, 7) wall and cat, 8) car and seashore, 9) church and sky. Test images #1-3 are shown in Fig. while test images #4-9 are shown in Fig. 2.21. They were coded using HM15.0-rext-8.0 with intra coding and all available CTU sizes, with QP=36 and the deblocking filter enabled. The decontoured results are obtained by applying the LL method, the JG method and the proposed FCDR method; they are provided in the supplementary material. We display two decontoured results on a UHD TV in a random spatial order: (i) results obtained by the LL method versus the FCDR method, and (ii) results obtained by the JG method versus the FCDR method. In each test case, we asked the subject to give one of three choices: 'Left is better', 'Right is better' or 'Both look the same'. The objective PSNR and SSIM scores are shown in Table 2.1. The subjective test results are plotted in Fig. 2.22, which indicates the percentage of each preferred choice in every test case. The comparison between the LL method and the FCDR method is given in Fig. 2.22(a),
while the comparison between the JG method and the FCDR method is given in Fig. 2.22(b). As shown in Table 2.1, the PSNR and SSIM scores of all three methods are similar to each other. The overall objective score of the proposed FCDR method tends to decrease slightly after decontouring. However, the subjective test results show that the FCDR method clearly outperforms both the LL method and the JG method. This can also be verified by watching the video in the supplemental material.

Table 2.1: PSNR and SSIM scores of the subjective test images. Four groups of scores are provided, corresponding to the compressed image without decontouring, the LL method from [61], the JG method from [53] and the proposed FCDR method.

More specifically, as shown in Fig. 2.22, the gap between FCDR and LL is bigger than that between FCDR and JG. For the latter, about 10% of the subjects prefer the JG method, while another 10% fail to recognize the difference between the two. We observe that the JG method performs well in detecting and removing false contours in smooth regions, but its performance drops in complex areas with real contours and textures. The LL method has the poorest decontouring performance, since its removal step only averages over detected contour positions without examining their neighborhoods. In contrast, the proposed FCDR method takes the relationship between the masking effect and the false contour appearance into account and offers the best quality.

2.6 Conclusion and Future Work

Figure 2.21: Six more images used for the subjective test.

Figure 2.22: Subjective test results: (a) comparison between the proposed FCDR method and the LL method, and (b) comparison between the proposed FCDR method and the JG method.

A false contour detection and removal (FCDR) method was proposed in this work. Our investigation began with a clear understanding of the cause of false contours. Based on this understanding, we presented an FCDR system that includes both a false contour detection module and a false contour removal module. Experimental results showed that the FCDR method can effectively detect and suppress contouring artifacts resulting from HEVC and H.264 alike, while preserving edges and textures in the image well. The effectiveness of FCDR was demonstrated as a post-processing module applied to decoded images in this work. It will be interesting to adopt it as part of the in-loop decoding process, an idea that demands further investigation and verification.

Chapter 3
Perceptual Quality Driven Frame-Rate Selection (PQD-FRS) for High-Frame-Rate Video

3.1 Introduction

Most entertainment videos, such as movies and TV programs, are nowadays displayed at a rate of 30 frames per second (fps) or lower. However, the demand for higher-resolution and higher-frame-rate (HFR) video is increasing due to the availability of 4K and 8K UHD video content and larger display screens. Although the artifacts of lower frame rates can be mitigated via frame interpolation and other post-processing techniques, the result is often not satisfactory. An alternative is to shoot and encode high-quality video at a higher frame rate, which results in a higher bit rate for HFR video delivery. In practice, human viewers do not need a fixed HFR for the entire video because the dynamics within a sequence vary. For example, one may lower the coding frame rate if the dynamics of the underlying video are low [92].
Generally speaking, there exists an upper bound on the useful frame rate in each short video interval: most people cannot discern any perceived quality difference if the frame rate goes higher. Therefore, one would like to encode video at a variable frame rate depending on its content. Since subjective visual experience varies from one viewer to another, it is appropriate to adopt a statistical approach in the analysis and prediction of viewers' satisfaction with respect to time-varying frame rates. One application of this study is to allow content providers to adjust the frame rate of streaming video dynamically by balancing the quality of the viewing experience against the transmission cost.

In this research, we attempt to find a relation between the display frame rate of video content and human perceived visual quality, and use it to design an automatic frame-rate selection algorithm. To achieve this objective, the characteristics of the human visual system (HVS) are exploited and a machine learning approach is adopted in our work. First, we build a frame-rate-dependent video quality assessment (FRD-VQA) dataset based on subjective test results. It consists of twenty-five 5-second high-frame-rate HD and UHD video sequences with a wide variety of contents, and each sequence is evaluated by 20 subjects. The FRD-VQA dataset provides the ground truth in the training stage and can also be used for performance evaluation in the testing stage. Then, we adopt a machine learning approach in the design of the core component of the PQD-FRS method, the "satisfied user ratio (SUR) prediction" module. The SUR prediction module predicts the percentage of satisfied viewers, i.e., those who cannot differentiate the video quality of a lower and a higher frame rate. We derive several feature vectors from the spatial and temporal characteristics of video sequences and use support vector regression (SVR) to implement the SUR prediction module. Experimental results confirm that the proposed SUR module offers highly accurate predictions effectively and robustly. As a result, the proposed PQD-FRS system is able to dynamically assign a proper frame rate to each part of the video without any perceptual quality degradation for a majority of viewers.

The rest of this chapter is organized as follows. Related previous work is reviewed in Sec. 3.2. The FRD-VQA dataset is introduced and related HVS properties are examined in Sec. 3.3, along with the subjective test designed to establish the ground truth of the dataset. The proposed PQD-FRS method is explained in Sec. 3.4. Experimental results are presented in Sec. 3.5. Finally, concluding remarks and future extensions are given in Sec. 3.6.

3.2 Review of Previous Work

Research on frame-rate-related video quality issues and applications has been conducted for almost two decades. These efforts can be roughly classified into two major categories based on their objectives. The first is concerned with different viewing conditions and the artifacts perceived when watching video displayed at various frame rates [2, 29, 30, 76, 77]. The second focuses on effective video compression methods that reduce coding bit rates with little quality degradation [92, 50, 36, 79, 96, 105].

Several efforts have been made to model the relationship between video content and its display frame rate. For example, the ITU P.910 document [2] defines the spatial information (SI) and the temporal information (TI) measures and uses them to characterize the properties of a video sequence.
However, since SI and TI are simple yet coarse tools, they are neither accurate nor robust. More recently, the impact of higher frame rates on perceived quality in wide field-of-view (FOV) video was studied by Emoto, Kusakabe and Sugawara [29]. They proposed a second-order regression quality model based on angular velocity and observed that humans are less sensitive to very slow and very fast angular motion. They also found that the viewing distance does not influence the perceptual quality of high-frame-rate video, although it does influence the critical fusion frequency (CFF), which is the threshold frequency of the flicker artifact [30]. Ou et al. [76] proposed a method to model the relationship between frame rate and quality, claiming that the temporal correction factor follows an inverted falling exponential function; they used the normalized motion vector and the normalized frame difference to represent the motion content of each video sequence. Ou et al. [77] extended the model in [76] by including the quantization factor and pointed out that the quantization effect on coded frames can be captured accurately by a sigmoid function of the peak signal-to-noise ratio.

Several methods were developed to choose representative frames and skip redundant ones in the context of low-bit-rate coding. Song and Kuo [92] developed a method to achieve a good balance between spatial quality and temporal quality so as to enhance the overall perceptual experience at low bit rates. Hwang and Wu [50] proposed a dynamic frame-skipping method using bilinear interpolation of estimated motion vectors and adjusted the number of skipped frames according to the accumulated magnitude of the motion vectors. Pejhan, Chiang and Zhang [79] proposed a scheme to vary the frame rate of pre-encoded video dynamically using the coded motion information; however, it is limited to low-bit-rate video (typically of QCIF resolution and 30 fps or even lower). Fung, Chan and Siu [36] extended this work to the DCT domain by adding the DCT coefficients and using an error-compensation feedback loop so that errors were reduced through re-encoding. These methods focus primarily on low-bit-rate video and exploit the motion information encoded in the bit-stream, but this information cannot reflect the perceptual quality of the video content accurately.

Thammineni, Raman and Vadapalli [96] proposed a scheme to assess the impact of a new frame rate by encoding skipped frames. It maintains the motion-adaptive frame-skipping framework and offers a smoother spatio-temporal quality trade-off at the operating bit rate. Their method, however, only evaluates the influence of a single frame and ignores the motion information in neighboring frames. Yeh et al. [105] proposed a method that analyzes visual complexity and temporal coherence and uses the object's and the camera's motion information to select proper frames for a better fit to the human perceptual model; however, their work is mainly targeted at mobile video. Although these two methods used extra information besides embedded motion vectors, they were developed primarily for low-bit-rate video coding, and their models were limited to frame-level manipulation.

In this work, we focus on higher-frame-rate HD and UHD videos, which are completely different from the targets of all the above-mentioned work.
It is worthwhile to emphasize that most existing work cannot be generalized to address our problem in a straightforward manner due to the large computational burden demanded by HD and UHD videos. Furthermore, none of these models takes the HVS characteristics into account and, as a result, they cannot handle differences among individual viewers. For these reasons, our work is uniquely positioned and a completely new framework is demanded.

3.3 FRD-VQA Dataset

In this section, we first explain the motivation for building the FRD-VQA dataset and then describe the subjective test procedure used to build it. The objective of our subjective test is to evaluate the proper frame rate demanded by a certain video clip. In other words, we would like to answer the following question: will 60 fps and 30 fps versions of a certain video content make a noticeable difference to a group of viewers (rather than to a single viewer)? The answer has to be interpreted in a statistical sense; for example, 80% cannot tell the difference while 20% can.

3.3.1 Motivation

We have three main observations on human perception of video of multiple resolutions under multiple frame rates. The first concerns the artifact caused by an insufficient frame rate. The major artifact in video of a reduced frame rate is the judder effect, which is most often visualized as sudden jumps in an object's position and as motion discontinuity. This artifact can be suppressed by increasing the frame rate. The second is related to visual attention. People tend to pay more attention to moving and distinguishable objects that have good contrast against the background. Also, given that a great amount of visual information is presented within a short period of time, only visually salient parts get enough attention [43]. The third concerns the masking effect, which refers to people's reduced capability to perceive stimuli such as edges, motion and distortion in the presence of a complex spatial or temporal background [100, 35].

To give an example, a screenshot of one sequence in the FRD-VQA dataset is shown in Fig. 3.1(a). It contains a running fountain and a wandering kid. Although the falling water drops move very fast, people find it difficult to tell the difference between the HFR version and its low-frame-rate version. This is because the complex background consisting of water drops masks the artifact in the lower-frame-rate video. In contrast, the screenshot of another sequence in the FRD-VQA dataset is shown in Fig. 3.1(b). It is a scene of a tunnel composed of arc frames viewed from a passing vehicle, and the frames move rapidly with respect to the viewer. There is a huge difference in the perceptual smoothness of this sequence at different frame rates since the motion is regular and predictable. Thus, it is worthwhile to emphasize that not all motion patterns influence perceptual video quality in the same way. Furthermore, it was observed in [30] that increasing the frame rate tends to be more beneficial for medium-speed movement, but less so for very fast or very slow movements.

Since the relationship between perceptual quality and video content is complicated, it is difficult to develop a theoretical model that covers all different contents. To address this challenge, we consider an alternative solution instead.
That is, we select a set of representative video sequences that cover different motion scenarios and conduct a subjective test to collect statistical samples of human visual experience. After building such a dataset, we can adopt a machine learning approach to learn and predict the appropriate frame rate for a given video content. The process of building this subjective video quality assessment (VQA) dataset is elaborated below.

Figure 3.1: Examples of four FRD-VQA sequences: (a) Fountain: fast motion with complex background, (b) Tunnel: fast motion with simple background, (c) Walking: slow motion with complex background and (d) Slow City: slow motion with simple background.

3.3.2 Description of FRD-VQA Dataset

We describe the process of building the FRD-VQA dataset in this subsection. It includes three parts: 1) sequence selection, 2) the subjective test procedure, and 3) the subjective test results.

Sequence Selection

We collect twenty-five HFR HD and UHD video clips, each with a duration of about five seconds. The source video contents are from Netflix El Fuente, TUTUHD and XIPH. These sequences are scaled to a constant spatial resolution (1920x1080) but kept at multiple frame rates. Their highest frame rates are 60 fps and 50 fps, as required by the US and the European standards, respectively. The down-sampled frame rates are 30 and 15 fps for the US standard and 25 and 12.5 fps for the European standard. In the experiments, we obtain lower-frame-rate (LFR) video via frame skipping and frame repeating; that is, we keep one frame out of every two or four frames of the original video and repeat it according to the downsampling rate (a minimal sketch of this operation is given at the end of this subsection). In this way, all videos are displayed at their native refresh rates (50/60 fps), which avoids any influence of the TV refresh behavior on sequences of different frame rates. In total, the selected dataset contains 17 sequences with an original frame rate of 60 fps and 8 sequences with an original frame rate of 50 fps.

In order to cover a wide variety of contents, we select sequences with either fast or slow moving objects against simple or complex backgrounds. Objects include humans, animals, vehicles, etc.; scenes include parks, roads, sports arenas, etc. The properties of the sequences in the FRD-VQA dataset are summarized in Table 3.1. There are three properties of concern: the motion speed, the motion type and the spatial contrast, with three choices under each property category:
Motion Speed: fast (F), medium (M), slow (S);
Motion Type: object only (O), camera only (C) and both (B);
Spatial Contrast: high (H), medium (M) and low (L).
As shown in the table, the set of selected sequences is quite diversified.

Table 3.1: Properties of the 25 sequences in the FRD-VQA dataset.
Motion Speed:      F = 3,  M = 10, S = 12
Motion Type:       O = 9,  C = 4,  B = 12
Spatial Contrast:  H = 6,  M = 12, L = 7

For a long video sequence containing inhomogeneous contents, we cut it manually and use a homogeneous portion to generate the test sequence. For example, for a sequence with a bee flying around and then staying still in the air, we choose the flying part only.

Figure 3.2: The MOS values plotted as functions of the frame rate: (a) sequences with frame rates 60, 30 and 15 fps, and (b) sequences with frame rates 50, 25 and 12.5 fps.
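As a small illustration of the frame-skip-and-repeat procedure mentioned above, the sketch below produces the lower-frame-rate versions that are then displayed at the native refresh rate; the function name and the list-based frame handling are assumptions made for this example.

def downsample_framerate(frames, factor):
    """
    Produce a lower-frame-rate version of a clip for display at the native rate:
    keep one frame out of every `factor` frames and repeat it `factor` times,
    as done when building the FRD-VQA test clips (factor = 2 or 4).
    `frames` is a list or array of frames in display order.
    """
    out = []
    for i in range(0, len(frames), factor):
        out.extend([frames[i]] * factor)   # repeat the kept frame to preserve duration
    return out[:len(frames)]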
Figure 3.3: Comparison of the viewing experience on video sequences with the maximum frame rate and the half maximum frame rate: (a) sequences with a maximum frame rate of 60 fps, and (b) sequences with a maximum frame rate of 50 fps.

Subjective Test Procedure

The screen used in the subjective test was a 65-inch Samsung UN65F9000 4K UHD TV with a refresh rate of 240 Hz. Following the standard in [1], all participants were seated at a distance of 3 times the screen height and within a viewing angle of 30 degrees. We conducted the subjective test with 20 subjects using pairwise comparison [60, 26] with double stimulus. The 20 subjects were USC undergraduate and graduate students, including both males and females.

Each subject made a decision between two stimuli, say, sequences A and B, choosing among three options: "A is better", "B is better" or "they look the same". Compared with the traditional Absolute Category Rating (ACR) method, pairwise comparison offers more reliable and direct results via simple relative ranking, especially when the quality difference between the two stimuli is small [78]. We implemented the double stimulus test by playing the two sequences of a pair sequentially in random order with an idle period of 5 seconds in between. We tested the following two pairs for each video sequence: i) the full frame rate versus the half frame rate, and ii) the half frame rate versus the quarter frame rate. Thus, there were 50 pairs in comparison (two for each of the 25 sequences) in the whole test procedure. It took about half an hour for each subject to complete the test.

Subjective Test Results

There are several well-known methods to convert pairwise comparison results to mean opinion scores (MOS), such as the Thurstone model [90] and the Bradley-Terry model [51]. For the Thurstone model and the Bradley-Terry model, the subjective visual quality of the same sequence at different frame rates is assumed to follow the Gaussian and the Gumbel distribution, respectively. In this work, we tailor the Thurstone model to our three-choice test (including the tie). In the original form of the Thurstone model with only two choices, the mean difference $\hat{\mu}_{AB}$ between sequences A and B in terms of MOS scores can be estimated via

$\hat{\mu}_{AB} = \sigma_{AB} \, \Phi^{-1}\!\left( \frac{C_{A,B}}{C_{A,B} + C_{B,A}} \right),$  (3.1)

where $\sigma_{AB}$ is the normalized standard deviation (of unit value), $\Phi^{-1}$ is the inverse normal distribution function, and $C_{A,B}$ and $C_{B,A}$ represent the number of people choosing "Sequence A is better" and "Sequence B is better", respectively. For the three-choice test, the values of $C_{A,B}$ and $C_{B,A}$ are modified to

$C_{A,B} = N_A + \frac{N_S}{2}, \qquad C_{B,A} = N_B + \frac{N_S}{2},$  (3.2)

where $N_A$, $N_B$ and $N_S$ denote the number of people choosing "Sequence A is better", "Sequence B is better" and "They look the same", respectively.
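A small sketch of the tailored three-choice Thurstone computation in Eqs. (3.1)-(3.2) is given below; the SciPy inverse-normal call and the example vote counts are illustrative assumptions.

from scipy.stats import norm

def mos_difference(n_a, n_b, n_same, sigma_ab=1.0):
    """
    Tailored Thurstone estimate of the MOS gap between sequences A and B
    (Eqs. 3.1-3.2): ties are split evenly between the two preferences.
    """
    c_ab = n_a + n_same / 2.0
    c_ba = n_b + n_same / 2.0
    return sigma_ab * norm.ppf(c_ab / (c_ab + c_ba))   # Phi^{-1} of the preference ratio

# Hypothetical example: 12 of 20 viewers prefer A (the full rate), 3 prefer B, 5 see no difference.
delta = mos_difference(12, 3, 5)   # positive gap => A judged better on average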
For the same sequence, the degree of quality drop from the full rate to the one half rate is typically less than that from the half rate to the quarter rate. To understand the content-dependent factor, we show bar charts of the pairwise com- parison results between 60 fps and 30 fps in Fig. 3.3 (a) and between 50 fps and 25 fps in Fig. 3.3 (b). For each source sequence, we plot the ratio of people who vote for the maximum frame rate sequence (in blue), the half maximum frame rate sequence (in gray) and the same (in orange) using different bar lengths. In particular, we focus on the blue portion that represents the group of people who can distinguish the quality differ- ence accurately. We arrange the sequences from the left to right according to the portion of the blue portion. The larger the blue portion, the more distinguishable quality caused by the frame rate difference. We see from Fig. 3.3(a) that the four exemplary sequences given in Fig. 3.1 rank in the following order: Walking, Slow City, Fountain and Tunnel. 50 As a result, we can choose 30 fps for Walking and Slow City and 60 fps for Fountain and Tunnel. Although we choose a frame rate of 30 fps for Walking and Slow City, the perceived visual quality is about the same as that of 60 fps for most viewers. We will propose an automatic frame rate selection (FRS) algorithm in the next section. 3.4 Proposed PQD-FRS Method In this section, we propose a method to select a suitable frame rate for each short inter- val of an input video sequence dynamically. The objective of this dynamic frame rate selection (FRS) method is to preserve the same perceptual experience for a large per- centage of viewers. Since its design is based on the subjective test results in Sec. 3.3, it is called the perceptual quality driven frame rate selection (PQD-FRS) method. The proposed method can be incorporated into encoder end as it provides with an indication of appropriate frame-rate for every group of pictures (GoP). 3.4.1 Basic Decision Unit We only apply the FRS method to the luminance channel for simplicity and efficiency. The basic FRS unit is chosen to be a group of pictures (GoP) consisting of 8 frames. There are three reasons for this choice. First, since the GoP structure is the basic unit of H.264 and high efficiency video coding (HEVC) [94] coding standards, this choice can be easily integrated into the state-of-the-art codecs. Second, the delay of 8 frames is about 0.13 ( 8=60) second, which is not too severe in practical applications. Third, we intend to use the machine learning approach to learn the decision rule, which demands a sufficient amount of training data. By dividing each sequence into GoPs of 8 frames, we can obtain more training and testing samples while preserving stable motion information in each sample. When the maximum frame rate is 8 frames per GoP, the half and the 51 quarter rates become 4 frames and 2 frames per GoP. In operation, the PQD-FRS method divides each input sequence into GoPs of 8 frames, and applies the learned FRS model to each target GoP for a decision among 8, 4 or 2 frames. 3.4.2 Feature Extraction via Influence Maps Construction In this subsection, we examine the feature extraction problem by constructing several influence maps. To build influence maps, we partition each frame into blocks of size 32 32 and extract spatial and temporal features from each block accordingly. The detailed procedure is described below. 
Spatial Influence Map (SIM)

We consider spatial masking, spatial stimuli and visual saliency jointly in developing the spatial influence map. The procedure is detailed below.

Spatial Masking. A complex background has a spatial masking effect on the main objects in the scene, which can be measured by spatial randomness [62, 63]. Traditional methods explore spatial properties by exploiting the contrast sensitivity function (CSF) [70, 7] and DCT coefficients [36]. These methods are effective in analyzing local complexity because they transform the spatial information to the frequency domain. However, they are not accurate enough to represent perceptual sensitivity and cannot respond well to pattern regularity, which also turns out to be an influential factor for perceptual quality [97]. In contrast, Hu et al. [47] proposed a mathematical formulation that evaluates spatial randomness directly in the spatial domain. Although it is not as computationally efficient as traditional methods, we find it most suitable for our purpose of designing an accurate prediction system.

Figure 3.4: Illustration of spatial randomness prediction.

Following the idea in [47], we first divide the image into blocks and predict the central block of size $W_b$ using its spatial neighbors with stride width $W_s$. The resulting prediction residual serves as a spatial randomness index, which measures the similarity between the central block and its neighboring information. A simple illustration is shown in Fig. 3.4, where the red pixel $Y$ with $W_b = 1$ is to be predicted from its blue neighbor dots $X$ with $W_s = 1$. The prediction is conducted directly on pixel intensities as

$\hat{Y}(u) = H X(u),$  (3.3)

where $u$ is a spatial location index. The optimal prediction $\hat{Y}(u)$ is obtained by optimizing the transform matrix $H$ under the minimum mean-squared error (MMSE) criterion:

$H = R_{XY} R_X^{-1},$  (3.4)

where $R_{XY} = E[Y(u) X(u)^T]$ is the cross-correlation matrix between $X(u)$ and $Y(u)$, and $R_X = E[X(u) X(u)^T]$ is the correlation matrix of $X(u)$. Note that $R_{XY}$ and $R_X$ are also calculated directly from the pixel intensities within the regions of $Y$ and $X$, which are set with $W_b = 32$ and $W_s = 32$ in our experiments for HD videos. The correlation matrix captures the local spatial structure of the image, and the cross-correlation matrix reflects the similarity between the neighboring blocks and the center block. Since $R_X$ is not always invertible, we can replace $R_X^{-1}$ by its pseudo-inverse

$\hat{R}_X^{-1} = U_m \Lambda_m^{-1} U_m^T,$  (3.5)

where $\Lambda_m$ is the eigenvalue matrix holding the top $m$ non-zero eigenvalues of $R_X$ and $U_m$ is the corresponding eigenvector matrix. Based on Eqs. (3.3)-(3.5), we obtain the optimal prediction $\hat{Y}$. The prediction error between the predicted value $\hat{Y}(u)$ and the ground truth $Y(u)$ defines the randomness map, which can be expressed as

$SR(u) = |Y(u) - R_{XY} R_X^{-1} X(u)|.$  (3.6)

To give an example, for the frame of the Fountain sequence shown in Fig. 3.6(a), the spatial randomness map is given in Fig. 3.6(b). A brighter value indicates higher spatial randomness.
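The block-wise MMSE prediction of Eqs. (3.3)-(3.6) can be sketched as follows. The exact vectorization of the center block Y(u) and its neighbors X(u) is not fully specified in the text, so the arrangement below (co-located pixels of the eight neighboring blocks as the predictor vector) is an assumption; NumPy's pinv plays the role of the pseudo-inverse in Eq. (3.5).

import numpy as np

def spatial_randomness_block(img, r0, c0, Wb=32, Ws=32):
    """
    Illustrative spatial-randomness residual for one Wb x Wb block whose top-left
    corner is (r0, c0), predicted from its neighboring blocks at stride Ws.
    """
    img = img.astype(np.float64)
    center = img[r0:r0 + Wb, c0:c0 + Wb].reshape(1, -1)          # Y(u), u = pixel index in the block
    neighbors = []
    for dr in (-Ws, 0, Ws):
        for dc in (-Ws, 0, Ws):
            if dr == 0 and dc == 0:
                continue
            blk = img[r0 + dr:r0 + dr + Wb, c0 + dc:c0 + dc + Wb]
            if blk.shape == (Wb, Wb):                            # skip neighbors falling off the frame
                neighbors.append(blk.reshape(-1))
    X = np.stack(neighbors)                                      # X(u): one row per neighbor block
    Rx = X @ X.T / X.shape[1]                                    # R_X  = E[X X^T]
    Rxy = center @ X.T / X.shape[1]                              # R_XY = E[Y X^T]
    H = Rxy @ np.linalg.pinv(Rx)                                 # MMSE predictor of Eq. (3.4),
                                                                 # pseudo-inverse as in Eq. (3.5)
    residual = np.abs(center - H @ X)                            # SR(u), Eq. (3.6)
    return residual.reshape(Wb, Wb)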
Spatial Stimuli. Spatial stimuli such as edges are more visible against a smooth background, since there is little masking in smooth regions. To capture this property, we first measure background smoothness using spatial randomness in a block-wise fashion. That is, we call pixels whose SR value in Eq. (3.6) is lower than a threshold $T_{lc}$ low-complexity pixels, and calculate the percentage of low-complexity pixels in each block as the smoothness measure:

$SM = N_{lc} / W_b^2,$  (3.7)

where $N_{lc}$ is the number of low-complexity pixels in a block and $W_b \times W_b$ is the block size. As with spatial randomness, we set $W_b = 32$ for HD video in our experiments. Since the derived spatial randomness ranges from 0 to very large values, we found through experiments that pixels with a prediction error smaller than 1 can be regarded as sufficiently similar to their neighbors. We therefore set $T_{lc} = 1$ in our experiments to filter out smooth regions. An example of the smoothness measure is shown in Fig. 3.6(c), where a brighter area indicates a smoother background. We compute the edge magnitude using the Sobel filter and normalize its value to [0,1] with a predefined maximum value. Then, the normalized edge magnitude is multiplied by the SM given in Eq. (3.7) to yield the spatial contrast map. An example is shown in Fig. 3.6(d). The spatial contrast map considers the masking effect and the spatial stimulus jointly.

Visual Saliency. It is essential to include a visual saliency feature in the prediction model. Here, we generate a saliency map using the graph-based visual saliency (GBVS) method [40], which first forms activation maps on several feature channels and then normalizes them to predict human visual fixations. This frame-based method is effective and computationally efficient compared with other image-based and video-based saliency methods. One example of the saliency map is shown in Fig. 3.6(e), where every pixel takes a value in [0,1] indicating the degree of visual attention; regions with higher values are more salient. Finally, we weight the spatial contrast map by the corresponding value in the saliency map to obtain the spatial influence map, as shown in Fig. 3.6(f). The brighter regions in the spatial influence map are more noticeable than the darker regions.

Temporal Randomness Map (TRM)

Random and unpredictable motion distracts people's attention. We use temporal randomness to characterize this phenomenon. Following the idea in [48], we use the temporal prediction residual as an indicator of temporal randomness. Unlike traditional video coding methods, which use motion vectors to represent motion information, our temporal feature directly exploits the similarity of neighboring frames to predict the motion structure. In fact, the derived temporal features respond well not only to motion speed but also to motion regularity and repetitive movements, which is hard to achieve with traditional methods.

Figure 3.5: Temporal randomness prediction with d = 4 in a GoP.

We use $Y_k^l$ to denote an interval of a sequence from the $k$th frame to the $l$th frame, with $k + 1 < l$. We divide the interval into two sub-intervals, $Y_k^{l-d}$ and $Y_{k+d}^l$, separated by a temporal distance $d$. As discussed in Section 3.4.1, our basic decision unit is the GoP, and our motion prediction unit is shown in Fig. 3.5, where we set $d = 4$ along with $k = 1$ and $l = 8$ for each GoP. Since a single frame difference is very small, especially for slow-motion sequences, we aggregate four frames to obtain a more concrete motion structure over a certain time interval. Moreover, using the leading four frames to predict the last four without overlap allows us to consider all the information in the GoP without confusion between the sub-intervals.
After this division, each of the two sub-intervals of a GoP stores its own motion structure. We then evaluate the motion change between the two sub-intervals via

$Y_{k+d}^l = A\,Y_k^{l-d} + T,$  (3.8)

Figure 3.6: Spatial feature maps of HD videos with (a) the luminance channel of Fig. 3.1(a), (b) the spatial randomness map with block size and block stride equal to 32, (c) the smoothness map with block size 32 and $T_{lc} = 1$, (d) the spatial contrast map, (e) the saliency map and (f) the spatial influence map.

where $Y_{k+d}^l$ and $Y_k^{l-d}$ are data matrices storing concatenated frame intensity values, $A$ is the prediction matrix representing the sub-interval transformation, and $T$ is a prediction residual matrix that indicates motion randomness. If the motion structures within the two sub-intervals are similar to each other, the prediction residual is small, and vice versa. Similar to spatial randomness, we can obtain the optimal prediction matrix via

$\hat{A} = Y_{k+d}^l (\hat{Y}_k^{l-d})^{-1},$  (3.9)

where $(\hat{Y}_k^{l-d})^{-1}$ denotes the pseudo-inverse of $Y_k^{l-d}$.

Figure 3.7: Three consecutive frames of the HD video "Fountain" following the frame in Fig. 3.6(a) are shown in (a)-(c), and their temporal randomness map, spatio-temporal influence map and weighted spatio-temporal influence map are shown in (d), (e) and (f), respectively.

However, due to the large dimensions of $Y_{k+d}^l$ and $Y_k^{l-d}$, which store all the information of several frames, computing the prediction matrix $\hat{A}$ is expensive. This can be simplified by decomposing the data matrix $Y$ as [45]

$Y = CX + W,$  (3.10)

where $C$ is a matrix encoding the spatial information and $X$ is the state matrix of $Y$. Let $Y = CDV^T$ be the singular value decomposition of $Y$. Then, the optimal state matrix can be estimated via

$\hat{X} = DV^T.$  (3.11)

The state matrix $X$ greatly reduces the dimension of the data matrix $Y$ while keeping the key motion information. Although the singular value decomposition itself adds computational complexity when decoupling the interval information $Y$, we find it extremely beneficial for the prediction of the motion residual. In fact, we conduct the SVD on the raw frame information because we want to capture the exact differences between the sub-intervals introduced by different motions; prediction with reduced frame dimensions obtained by pixel averaging or bilinear interpolation turns out to be less effective for the designed system. Finally, we can substitute Eqs. (3.9)-(3.11) into Eq. (3.8) to obtain the temporal prediction residual as

$T_{k+d}^l = |Y_{k+d}^l - C \hat{A} \hat{X}_k^{l-d}|,$  (3.12)

which is called the temporal randomness map (TRM) corresponding to the frame interval from $k$ to $k+d-1$.

To demonstrate the motion information, three consecutive frames following the frame of Fig. 3.6(a) are shown in Figs. 3.7(a)-(c). We then conduct the "4-4" motion prediction on the eight consecutive frames starting from Fig. 3.6(a). The corresponding temporal randomness map of Fig. 3.6(a), obtained with the designed prediction module, is given in Fig. 3.7(d). A brighter area indicates a higher degree of randomness. We see that the randomness in the area of falling water drops is higher than in other areas.
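For illustration only, the temporal prediction of Eqs. (3.8)-(3.12) could be realized in the reduced state space obtained from the SVD of the leading sub-interval, as sketched below. The exact operator composition used in the dissertation may differ; in particular, reshaping the frames into an (H*W) x d data matrix and projecting the trailing sub-interval onto the leading sub-interval's spatial basis C are assumptions of this sketch.

import numpy as np

def temporal_randomness(frames, d=4):
    """
    Hedged sketch of the GoP-level temporal-randomness residual: the leading d
    frames predict the trailing d frames through a low-dimensional state obtained
    by SVD. `frames` is a NumPy array of shape (2*d, H, W).
    """
    assert frames.shape[0] == 2 * d
    H, W = frames.shape[1:]
    Y1 = frames[:d].reshape(d, -1).T.astype(np.float64)   # Y_k^{l-d}: (H*W) x d
    Y2 = frames[d:].reshape(d, -1).T.astype(np.float64)   # Y_{k+d}^{l}: (H*W) x d
    # SVD of the leading sub-interval: Y1 = C D V^T, state matrix X1 = D V^T (Eq. 3.11).
    C, S, Vt = np.linalg.svd(Y1, full_matrices=False)     # C: spatial basis, (H*W) x d
    X1 = np.diag(S) @ Vt                                  # d x d state matrix
    X2 = C.T @ Y2                                         # trailing sub-interval in the same basis
    A = X2 @ np.linalg.pinv(X1)                           # small d x d prediction matrix (cf. Eq. 3.9)
    residual = np.abs(Y2 - C @ (A @ X1))                  # temporal randomness residual (cf. Eq. 3.12)
    return residual.mean(axis=1).reshape(H, W)            # average over the d predicted frames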
Spatio-Temporal Influence Map (STIM)

A straightforward way to integrate the spatial and temporal information is to perform a pixel-wise multiplication of the spatial influence map and the temporal randomness map; the resulting map is called the spatio-temporal influence map (STIM). Moreover, to take temporal masking into consideration, where people are more likely to notice a single moving object against a still background (as in Honeybee) and are less sensitive to a specific object in a massively moving scene, we assign larger weights to moving parts in still backgrounds and smaller weights otherwise. Specifically, we divide the derived STIM value by the frame average and call the result the weighted spatio-temporal influence map (WSTIM). Examples of the spatio-temporal influence map and the weighted spatio-temporal influence map are shown in Figs. 3.7(e) and (f), respectively. In the next subsection, we use the temporal randomness map, the spatio-temporal influence map and the weighted spatio-temporal influence map as the three main input feature maps for the machine learning system.

3.4.3 Feature Representation

The dimension of the influence maps is large, so it is desirable to derive a spatially invariant feature vector of lower dimension from each map to facilitate the machine learning task. To achieve this, we divide a map into non-overlapping blocks of size 64x64 and calculate the average feature value in each block. Then, we arrange the blocks in increasing order of their feature values. As an example, we plot the block STIM value distribution in Fig. 3.8 for an exemplary frame of each of the Tunnel, Fountain, Slow City and Walking sequences. Each curve corresponds to a feature vector of dimension 480 (30x16) for an HD video, with 30 blocks horizontally and 16 blocks vertically given that the integrating block size is 64. This feature vector indicates the distribution of block movement without considering the blocks' spatial positions. We see from this figure that the Tunnel and Fountain sequences have more blocks with larger STIM values. It is worth mentioning that the block-value distribution feature vector bears similarity to a histogram-based feature vector: if we partition the y-axis in Fig. 3.8 into bins and count the number of blocks in each bin, we obtain the histogram-based feature vector. However, there are very few blocks with a large STIM value, and the resolution of these blocks can be blurred in the histogram-based feature vector. For this reason, we use the block-value distribution feature vector instead.

Figure 3.8: The block STIM value distribution of exemplary frames of the four sequences in Fig. 3.1.

3.4.4 Satisfied User Ratio (SUR) Prediction and Frame-Rate Selection (FRS)

On one hand, a content provider would like to deliver lower-frame-rate video to save transmission bandwidth (or bit rate). On the other hand, this choice may cause user dissatisfaction. A good frame rate selection algorithm should strike a balance between the transmission bandwidth and the degree of user satisfaction. We will show the SUR prediction results for the same sequences displayed at 60 fps and 30 fps in Sec. 3.5.2. If a user cannot tell the difference between these two frame rates, we say that he/she is satisfied with the lower rate of 30 fps; otherwise, we say that he/she is not satisfied. We call the percentage of viewers who cannot see the difference between these two frame rates the "satisfied user ratio (SUR)". Its prediction can be formulated as a regression problem. Based on the block-value distribution feature vector, we use support vector regression (SVR) [91] to predict the SUR. The SUR prediction can be conducted for each GoP.
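As a concrete illustration of the feature representation of Sec. 3.4.3 that feeds the SVR just described, the sketch below averages an influence map over 64x64 blocks and sorts the block means; dropping the partial border blocks (which yields 30 x 16 = 480 values for a 1920x1080 map) is an assumption of this sketch.

import numpy as np

def block_value_distribution(feature_map, block=64):
    """
    Sorted block-average feature vector: average the map over non-overlapping
    block x block tiles and arrange the averages in increasing order.
    """
    h, w = feature_map.shape
    h, w = h - h % block, w - w % block                    # drop partial border blocks
    tiles = feature_map[:h, :w].reshape(h // block, block, w // block, block)
    means = tiles.mean(axis=(1, 3)).ravel()                # one average per block
    return np.sort(means)                                  # spatially invariant ordering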
Based on the predicted SUR, a content provider can decide the proper number of frames to be encoded in each GoP. We use an example to explain the frame-rate selection (FRS) algorithm. If the maximum frame rate is 60 fps and the maximum number of frames per GoP is 8, we can encode 8, 4, 2 or 1 frames per GoP so that the frame rate ranges over 60, 30, 15 and 7.5 fps. The FRS algorithm can be conducted at the GoP level. Without loss of generality, we consider the frame rate selection between 60 fps and 30 fps. If the predicted SUR between the two rates is X% while the target is to satisfy Y% of the viewers, the operator should deliver this GoP with 8 frames if X < Y, and with 4 frames if X ≥ Y.

Figure 3.9: Feature heat maps of three exemplary frames of the Tunnel (first row), Running Horses (second row) and Honeybee (third row) sequences: frame shots (first column), TRM (second column), STIM (third column) and WSTIM (last column).

3.5 Experimental Results

In this section, we show visualization results of the influence maps, the SUR prediction accuracy and the performance of the proposed PQD-FRS system.

3.5.1 Visualization of Influence Maps

As mentioned in Sec. 3.4, both STIM and WSTIM provide valuable information for perceptual frame rate selection. We presented the STIM and WSTIM of the Fountain sequence in Fig. 3.7. Here, we offer more examples by providing heat maps of TRM, STIM and WSTIM for three motion sequences (namely, Tunnel, Running Horses and Honeybee) in Fig. 3.9. To compare them across different sequences, we normalize the heat maps along each column so that the brightness of the same feature type lies in the same range. The results further demonstrate that TRM captures plain motion information, while STIM and WSTIM tend to capture the perceptually sensitive parts. Specifically, for the Tunnel sequence, the central complex area, which is not sensitive to the HVS, is removed in the STIM and WSTIM results. Similarly, for the Running Horses sequence, the fast movements of the rattling legs are removed in the weighted maps, as irregular movements within a complex background are less noticeable. On the other hand, the slow motion in the Honeybee sequence is relatively highlighted: movements of the flying bees and drifting flowers are enhanced in the WSTIM result in comparison with the still background.

3.5.2 SUR Prediction Results

We focus on frame-rate selection between 60 fps and 30 fps in this subsection. In particular, we demonstrate the performance of the SUR prediction module. Since the basic decision unit is the GoP, the input to the SUR module is a GoP while its output is the predicted SUR. For feature extraction, we use the leading four frames to predict the last four frames, as illustrated previously in Fig. 3.5. We obtain TRMs for the last four frames and SIMs for all frames and then calculate the STIMs and WSTIMs for each GoP.
We adopt a 3-fold cross-validation scheme (i.e., using 2/3 of all data samples for training and the remaining 1/3 for testing, repeating the process three times, and then taking the average).

Table 3.2: Comparison of predicted SUR values (in percentages) obtained with different feature vectors, where an SVR with a linear or RBF kernel is adopted as the machine learning algorithm.

Feature | Kernel | Scene | Walking | Slowcity | CarNight | Skating | Honeybee | Crew | Rollercoaster | Yacht | Fastcity | Boxing | Fountain | Park | Run | Metro | Readysetgo | Tunnel
SI/TI | RBF | 80.1 | 86.0 | 77.4 | 68.7 | 72.6 | 66.7 | 76.7 | 68.2 | 64.4 | 65.2 | 67.8 | 67.4 | 63.8 | 61.7 | 63.2 | 62.7 | 57.6
SIM | Linear | 98.2 | 91.6 | 89.6 | 83.6 | 72.7 | 85.5 | 76.4 | 75.2 | 70.2 | 79.2 | 75.3 | 78.0 | 43.4 | 49.5 | 53.9 | 49.0 | 17.2
TRM | Linear | 99.6 | 94.2 | 94.5 | 85.7 | 80.2 | 82.0 | 82.1 | 81.0 | 66.6 | 74.4 | 63.7 | 58.6 | 53.3 | 31.6 | 62.0 | 58.5 | 30.2
STIM | Linear | 99.0 | 93.1 | 93.1 | 88.3 | 84.7 | 77.8 | 93.5 | 84.6 | 56.1 | 75.3 | 81.2 | 73.1 | 57.0 | 46.3 | 24.6 | 35.1 | -35.6*
WSTIM | Linear | 102* | 93.1 | 93.0 | 77.3 | 68.6 | 72.8 | 63.7 | 80.9 | 69.4 | 70.2 | 76.9 | 93.8 | 49.5 | 45.7 | 75.4 | 49.8 | 17.6
CSTIM | RBF | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 67.3 | 65.0
CSTIM | Linear | 99.9 | 94.2 | 94.6 | 83.3 | 82.4 | 79.1 | 82.1 | 80.8 | 75.8 | 75.5 | 70.5 | 64.5 | 50.5 | 42.8 | 47.6 | 31.1 | 7.0
Ground Truth | | 100 | 95 | 95 | 85 | 85 | 80 | 80 | 80 | 75 | 75 | 70 | 60 | 50 | 45 | 45 | 25 | 10

We show the ground truth and the predicted SUR in terms of percentages under different feature sets and different SVR kernels for the 17 sequences in Table 3.2. Recall that 20 subjects participated in the test, so the ground truth is a multiple of 5%. It is listed in the last row. For slow-motion sequences such as Scene, no viewers could tell the difference between display rates of 30 fps and 60 fps. In contrast, only 10% of viewers (2 out of 20) could not tell the difference for the fast-moving Tunnel sequence with a predictable motion pattern. If the goal of the content provider is to satisfy the need of 75% of viewers, the content provider can display 30 fps for the first 10 sequences and 60 fps for the remaining 7 sequences in Table 3.2 based on the sequence-level ground truth.

We compare six feature vectors (i.e., SI/TI, SIM, TRM, STIM, WSTIM and CSTIM), as shown in the first column. For the SI/TI feature specified in ITU P.910 [2], a single value is used to represent the content complexity in one GoP. The feature is too simple to produce good predictions. Feature vectors derived from SIM, TRM, STIM and WSTIM were discussed in Secs. 3.4.2 and 3.4.3. Furthermore, we cascade the three feature vectors derived from TRM, STIM and WSTIM into a long feature vector and call it the CSTIM, which has dimension 1440 (480×3) for a GoP. Note that the final feature dimension of CSTIM can be further adjusted based on the video dimension and the accuracy requirements of the system user.

The predictor using the CSTIM feature and the RBF kernel does not perform well. Due to the large dimension of the CSTIM feature vector, the non-linear kernel would only work if its parameters were fine-tuned so that it behaves like a linear kernel. Actually, we see from Table 3.2 that the CSTIM feature vector with the linear kernel gives the most accurate prediction in almost all cases (except for three), and its predicted percentages of satisfied users are very close to those of the ground truth. The averaged error rate is about 1.77% for all testing samples.
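To make the evaluation protocol concrete, the sketch below (scikit-learn and NumPy assumed; the feature and label arrays here are random stand-ins) cascades the TRM, STIM and WSTIM block-distribution vectors into a 1440-dimensional CSTIM feature and evaluates a linear-kernel SVR with 3-fold cross-validation, mirroring the setup behind Table 3.2.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import KFold

# Illustrative stand-ins: per-GoP 480-dim block-distribution vectors.
n_gops = 300
trm = np.random.rand(n_gops, 480)
stim = np.random.rand(n_gops, 480)
wstim = np.random.rand(n_gops, 480)
sur = np.random.uniform(0, 100, n_gops)   # GoP-level SUR labels (%)

# CSTIM: cascade of the three feature vectors (480 x 3 = 1440 dimensions).
cstim = np.hstack([trm, stim, wstim])

errors = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(cstim):
    model = SVR(kernel='linear', C=1.0)
    model.fit(cstim[train_idx], sur[train_idx])
    pred = model.predict(cstim[test_idx])
    errors.append(np.mean(np.abs(pred - sur[test_idx])))

print('mean absolute SUR error (%%): %.2f' % np.mean(errors))
```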
It is rare but possible that the predicted SUR may go beyond the range of 0-100% due to a small number of data samples and the regression nature of the SVR predic- tion. Two such predicted values are labeled by a asterisk sign to indicate abnormality. Generally speaking, most predictors can perform well for sequences with slow motion since there are a sufficient number of data samples. However, for fast motion sequences with complex information, the performance of most predictors drops significantly. The CSTIM with the linear kernel can still offer reasonable results. We should emphasize that the frame rate selection algorithm can be conducted dynamically at the GoP level since the SUR prediction result is available at the GoP level. We show the sequence-level SRU results in the subsection only for the purpose of performance benchmarking against the ground truth values. 65 3.5.3 Subjective Test on PQD-FRS Performance Table 3.3: Subjective SUR results of homogeneous test sequences from “Netflix Chimera” dataset. Comparison Aerial Barscene Dancer Dinnerscene Driving Rollercoast2 Toddlerfountain Windturbine Subjective Test Results 28 84 92 88 68 48 56 84 Prediction Mean 24.3 85.1 90.9 87.9 69.1 46.8 54.2 84.0 SUR (%) Prediction STD 6.3 2.4 6.6 2.1 6.3 9.4 8.7 2.1 In this section, we display two sets of results by applying the pre-trained PQD- FRS model with all 17 sequences from FRD-VQA dataset on both homogeneous and inhomogeneous sequences. Subjective Tests on Homogeneous Sequences (a) (b) (c) (d) Figure 3.10: Examples of homogeneous sequences for testing: (a) Barscene: a small object sliding on the table, (b) Windturbine: swirling wind turbines, (c) Driving: a first- person view of driving scene and (d) Toddler: a small kid running in the fountain. In the first experiment, we collect a new set of homogeneous sequences with 60 fps from the newly-released “Netflix Chimera” dataset. A subjective test is conducted to derive their corresponding ground truth SUR following the procedure described in Sec- tion 3.3 with 25 participants. Now that our focus is on the quality differences between 60fps and 30fps, we only downsample the sequence once using frame-skipping and 66 frame-repeating as described in Section 3.3.2. Examples of the acquired video clips are shown in Fig. 3.10. As a result, we select 8 sequences with homogeneous motions of various speeds. We then derive the prediction SUR by applying the pre-trained PQD-FRS model, and test on the selected sequences. In order to get a more accurate prediction, we use all the 17 60fps sequences in the proposed FRD-VQA dataset for training the model. Since the prediction is conducted on GoPs, we calculate the mean and standard deviation of the derived SURs for a given clip. Table. 3.3 presents the comparison between the ground truth SUR and prediction performance. And the results show that the mean average of predicted SUR are very close to the ground truth value. Note that although we pick out clips of sequences with homogeneous movements, they might be still varying a lot in GoP-level. For example, the movements in “Roller- coaster2” and “Toddlerfountain” have direction changes in the sequence, which would most likely introduce a couple of GoPs with large STIM values. Nevertheless, the STD for prediction is still reasonably small considering the confidence interval of ground truth is also 4% (100/25). This further proves that, our algorithm works accurately and steadily on homogeneous contents. 
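The frame-skipping and frame-repeating operations mentioned above (Sec. 3.3.2) can be sketched as follows; the array shapes and helper names are illustrative.

```python
import numpy as np

def skip_frames(frames, factor=2):
    """Keep every `factor`-th frame (60 fps -> 30 fps for factor=2)."""
    return frames[::factor]

def repeat_frames(frames, factor=2):
    """Repeat each frame `factor` times so the clip is displayed at the
    original rate again (30 fps content shown at 60 fps)."""
    return np.repeat(frames, factor, axis=0)

# Illustrative 60 fps clip: 8 frames of size 4x4 (a tiny stand-in).
clip_60fps = np.arange(8 * 4 * 4).reshape(8, 4, 4)
clip_30fps = skip_frames(clip_60fps)      # 4 coded frames
displayed = repeat_frames(clip_30fps)     # 8 displayed frames
print(clip_30fps.shape, displayed.shape)  # (4, 4, 4) (8, 4, 4)
```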
Reconstruction of Inhomogeneous Sequences with Dynamic Frame-rates

Table 3.4: Subjective test results comparing the half-rate and dynamic-rate sequences against their full-rate versions, along with the prediction means and GoP counts in the two categories.

Comparison | Cow | Dance | Gym | Jockey | Traffic | Biking | Church | Block
Subjective test SUR (%), case 2 versus 1 | 84 | 64 | 36 | 20 | 76 | 12 | 64 | 76
Subjective test SUR (%), case 3 versus 1 | 96 | 88 | 72 | 68 | 80 | 76 | 92 | 92
Prediction mean, above 75% SUR (No. of GoPs) | 83.5 (34) | 78.9 (26) | 81.9 (18) | 82.6 (29) | 87.9 (82) | 81.3 (17) | 87.2 (48) | 87.6 (40)
Prediction mean, below 75% SUR (No. of GoPs) | 73.5 (3) | 68.4 (36) | 59.7 (25) | 56.9 (8) | 72.0 (5) | 66.3 (25) | 69.7 (2) | 72.3 (3)
No. of coded frames, case 2 | 150 | 249 | 175 | 150 | 350 | 170 | 200 | 175
No. of coded frames, case 3 | 162 | 393 | 275 | 182 | 370 | 270 | 208 | 187

Figure 3.11: Examples of inhomogeneous sequences for dynamic reconstruction: (a) Traffic: very fast moving traffic and a changing background, (b) Dance: a scene of group dancing, (c) Biking: a group of people biking and skateboarding on the road and (d) Jockey: a man riding a running horse.

Figure 3.12: Subjective test results for (a) the comparison between the full- and the half-rate cases, and (b) the comparison between the full- and the dynamic-rate cases.

In the second experiment, we apply the dynamic frame-rate selection algorithm to eight sequences that contain inhomogeneous motion patterns and are not in the FRD-VQA dataset. Examples of the selected sequences are shown in Fig. 3.11. We adopt the 25% rule of the Just Noticeable Difference (JND) [93]. That is, we set up the following quality criterion: at most 25% of viewers can see the difference when the same sequence is played twice, once coded at a fixed rate of 60 fps and once coded at a dynamically adjusted frame rate based on the proposed PQD-FRS method. Through this experiment, we intend to bring up the SUR by detecting the sensitive contents and restoring them at the native frame rate while keeping the rest at the half frame rate.

We convert the criterion to the GoP level by setting the SUR target of each GoP to 75% (at least 75% of viewers cannot tell the difference). If the predicted SUR is higher than 75%, we lower the frame number per GoP from 8 to 4 by temporal down-sampling. Otherwise, we keep 8 frames per GoP. For the former case, we adopt the frame-repetition scheme to keep the number of displayed frames per GoP the same. In other words, we only need to encode/decode 4 frames per GoP and display 8 frames per GoP by repeating each frame twice. In this way, we construct a new sequence by assigning frame rates dynamically based on the underlying video content.

For each test sequence, we asked 25 subjects to compare the visual quality of the same sequence encoded in the following three cases:
1. Full Rate: Each sequence is encoded at a fixed rate of 60 fps;
2. Half Rate: Each sequence is downsampled from 60 fps to 30 fps, encoded at a fixed rate of 30 fps, and displayed at 60 fps by frame repetition;
3. Dynamic Rate: Each GoP is encoded at the full or half rate by applying PQD-FRS at the GoP level.

We use the full-rate sequence as the reference and ask each subject whether he/she can see the difference between "cases 1 and 2" and between "cases 1 and 3". Since it is easy to reconstruct and display a video with a dynamic frame rate by simply repeating frames at the appropriate positions, we also display all the videos at the full frame rate (60 fps).
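The per-GoP dynamic-rate decision just described can be summarized in a short sketch (plain Python; the 75% threshold follows the text, while the function name and the example predictions are illustrative): GoPs whose predicted SUR reaches the threshold are coded with 4 frames and displayed by frame repetition, and the remaining GoPs keep all 8 frames.

```python
def dynamic_rate_schedule(gop_surs, threshold=75.0, gop_size=8):
    """Return the number of coded frames for each GoP and the total.

    gop_surs: predicted SUR (%) for each GoP of the sequence.
    """
    coded = [gop_size // 2 if sur >= threshold else gop_size
             for sur in gop_surs]
    return coded, sum(coded)

# Illustrative predictions for a short sequence of 6 GoPs.
predicted = [92.0, 81.5, 74.0, 60.3, 88.0, 79.9]
per_gop, total = dynamic_rate_schedule(predicted)
print(per_gop)   # [4, 4, 8, 8, 4, 4]
print(total)     # 32 coded frames instead of 48 at the full rate
```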
The subjective test results are shown in the first major row of Table 3.4, where we show the SUR values for the eight test sequences in two comparison cases: 1) case 2 versus case 1, and 2) case 3 versus case 1. We see from the table that the SUR of the dynamic rate is significantly higher than that of the half rate. Six sequences encoded with the dynamic frame rate meet the original goal of 75% SUR. Although the SUR values of Gym and Jockey do not meet the 75% target, they both show an improvement of around 40% in SUR and are still close to the target SUR, considering that the ground-truth confidence interval is 4%.

Detailed SUR prediction results are shown in the second major row of Table 3.4. We present two sets of average values corresponding to the two quality categories, since a simple average over all GoPs can no longer represent the video quality of the whole sequence. In addition, we indicate the number of GoPs in each category in brackets next to the average SUR value. The results show that both the number of noticeable GoPs and the severity of the motion difference influence the sequence-level SUR. Specifically, a small group of very noticeable GoPs can pull the sequence-level SUR down considerably even if most parts of the sequence are smooth.

We did not design a fusion method to derive the subjective quality score of a complex sequence, since our main target is to detect the sensitive fast-motion content and provide it with an appropriate frame rate. However, it should be feasible to take advantage of the derived features and learn a separate system that maps the GoP-level SURs to the sequence-level SUR.

Furthermore, we list the number of encoded frames for the half rate (case 2) and the dynamic rate (case 3) in the third major row of Table 3.4. The results indicate that the SUR value can go up significantly with a modest extra coding cost when using the dynamic rate.

Finally, we provide two bar charts in Fig. 3.12 to show the preference between (a) the full rate (case 1) and the half rate (case 2), and (b) the full rate (case 1) and the dynamic rate (case 3). The information in Fig. 3.12 is comparable to that in Table 3.4 except that we split the SUR into two groups: "tie" (in orange) and "better" (in blue). The advantage of the PQD-FRS method is clearly demonstrated.

3.5.4 Influence of Compression

In this subsection, we examine the influence of compression on the SUR results as well as on the performance of the proposed method. Since we are working on HD and UHD sequences with high frame rates, we use HEVC to encode the target videos. Specifically, we use HM 15.0 for HEVC coding and encode with the random-access main configuration, with the sample adaptive offset (SAO) tool and asymmetric motion partitions (AMP) turned on. The available coding unit (CU) sizes are the default ones, from 8 up to 64, and the GoP size is 8.

Due to the length of the subjective tests, as well as the complexity and variety of compression artifacts, we only provide a glimpse into this sub-topic. We use the HM tool to encode 4 sequences, each with 5 different QPs: 22, 27, 32, 37 and 42. To diversify the content properties of the selected sequences, we choose the four example videos shown in Fig. 3.1 from the FRD-VQA dataset, which represent four important types of video content. We then conduct another set of subjective tests following the procedure in Section 3.3 with 25 subjects and apply the pre-trained PQD-FRS model to all 6 quality versions of the sequences.
Note that here we do not re-train or finetune the model with extra compression contents, but only with uncoded 17 sequences from proposed FRD- VQA dataset as in Section 3.5.3. The test results are plotted in Fig. 3.13 with QP values and SUR percentage as corre- sponding coordinates. The ground truth SURs are labeled in solid lines and the predic- tion results in dashed lines. As shown in the subjective test results, higher compression rates tend to gradually alleviate the influence of fast motion on human perception, and thus increase the SUR. For contents with originally sharp edges and smooth background, 71 Figure 3.13: SURs results of subjective test and prediction corresponding to different qualities of the video samples in Fig. 3.1. blurriness caused by compression would decrease the sensitiveness of moving objects to HVS. For contents with originally complex background, compression helps decrease the spatial randomness, making the moving objects more noticeable than uncoded ver- sion. And for originally slow moving objects, the temporal properties tend to dominate the perceptual quality, which keeps the SUR at very high level. As indicated by Fig. 3.13, our prediction results match with the general variations of SUR values in different QP levels. One one hand, our proposed method is robust to quality difference and it captures the temporal information steadily. On the other hand, it is able to adapt itself to recognize the spatial differences and generate a reasonable SUR prediction. However, we believe that more experiments should be included to completely study the influence of compression on frame-rate sensitiveness. And we assume that by adopt- ing the proposed PQD-FRS system, a more general model can be learnt by adding com- pression videos into the training pool. 72 3.6 Conclusion and Future Work A novel dynamic frame rate selection algorithm was proposed for the coding of high frame rate video in this work. We adopted a statistical approach to measure and predict the percentage of satisfied users between two frame rates (say, 30 fps versus 60 fps). The satisfied user ratio (SUR) is highly dependent upon the underlying video content. We developed several influence maps to characterize the video property, converted them into feature vectors, and used the support vector regression (SVR) algorithm to predict the number of coded frames per GoP that meet the predefined SUR requirement. It was verified by experimental results that the SUR can be accurately predicted and the coded sequences using the dynamic frame rate can meet the SUR requirement by lowering the number of coded frames. Further study can be conducted including developing a subjective quality metrics for complex sequence, and designing more robust system considering videos with different qualities. Moreover, since the number of high-frame-rate (HFR) ultra high definition (UHD) video sequences accessible to us was still limited, it is desired to conduct a larger scale test on the proposed dynamic frame rate algorithm for more HFR/UHD video sequences, which will be extremely valuable to the broadcasting industry in delivering extremely high quality programs with a reasonable bandwidth. 73 Chapter 4 Learning from Fine-to-Coarse: Intra-Class Difference Guided Semantic Segmentation 4.1 Introduction Semantic segmentation is an important task for image understanding and object local- ization. 
With the development of the fully convolutional neural network (FCN) [69], there has been significant advancement in the semantic segmentation field. The FCN-based methods take advantage of the discriminative capability of the deep convolutional network for accurate pixel-level labeling by incorporating the deconvolution network and the skip architecture [11]. To enhance prediction details, a dilation network was proposed in [14] to increase the resolution of feature maps by removing certain pooling layers while keeping the field-of-view (FoV). Other approaches have been developed to combine the convolutional neural network (CNN) framework with traditional features, such as the conditional random field [14, 109], super-pixels [5], and the edge information [9, 13, 49].

Recent developments in unsupervised and semi-supervised learning shed light on ways to improve the classification and detection performance. Specifically designed training targets prompt the network to learn more efficiently [19, 21], and the training procedures have been tailored so that the network pays more attention to less confident cases [87, 102, 101]. While a CNN has the capability to capture the common object features and templates within the same category, it also has a tendency to overlook the distinctive characteristics among object instances.

Figure 4.1: The proposed fine-to-coarse learning (FCL) procedure with the intra-class difference guided fully convolutional neural network (ICD-FCN). A designed clustering scheme relabels the original ground truth into detailed sub-classes.

In this work, we propose a novel fine-to-coarse learning (FCL) procedure to better exploit the intra-class variability and guide the underlying FCN to learn the differences. Thus, the system is called the intra-class-difference guided FCN (ICD-FCN). An overview of the proposed ICD-FCN is shown in Fig. 5.1. It will be explained in detail in Sec. 4.3.1. The FCL procedure guides the network with designed 'finer' sub-class labels and combines its decision with the original object class labels through end-to-end training. The fine sub-class labels are generated from the network features. The proposed FCL procedure enables a balance between the fine-scale (i.e., sub-class) and the coarse-scale (i.e., class) information.

The proposed ICD-FCN achieves 78.0% and 77.8% in the mean intersection-over-union (IU) performance on the PASCAL VOC 11 and 12 tests, respectively. Consistent and significant improvements in semantic segmentation are observed across the PASCAL VOC, NYUDv2 and PASCAL Context datasets as compared with the straightforward ResNet [42] based FCN [102].

4.2 Related Work

Semantic segmentation requires the accurate pixel-wise labeling of traditional segmentation [107, 83, 33] as well as high-level object labels for the segmented areas. Recent developments in FCNs have brought significant advancement to the field, and many traditional computer vision techniques have been deployed to boost the FCN performance. In this section, we briefly review the basic FCN structure and introduce several recent modifications made to the FCN and its training procedure. We also review several unsupervised and semi-supervised methods that prove to be efficient by adding extra labeling information.

Fully Convolutional Neural Network. The development of deep CNNs [58, 89, 95] and the availability of a large amount of image data [31, 67, 24] have improved computer vision tasks such as object classification and detection.
To extend the powerful tool to the semantic segmentation task, an FCN was proposed in [69] for pixel-wise classification, which converts the fully-connected layers at the final stage to be fully-convolutional to offer a 2D image output. The deconvolution layer allows an image output to have the same size as the input. The skip architecture [11] enables the FCN to combine results from different pooling layers of different resolutions. Descendants of the FCN. To improve the pixel-wise prediction accuracy and relieve the confusion between different classes, several computer vision techniques were 76 proposed to be integrated with the FCN. For example, MRF/CRF-driven FCN methods target at training classifiers and graphical models simultaneously [82, 56, 14, 109]. The original FCN demands multi-stage training by gradually combining the high-level class information at deeper layers with low-level details at shallower layers, a dilation based network was proposed in [14] to remove this constraint. By dilating the feature map and removing certain down-sampling layers, the network can maintain both a large FoV and a small filter size. Recent developments in [102, 101] extend this design to deep Residual Network [42], which offers further improvements. Unsupervised Learning. Recently, many methods have been developed to generate extra labels which are not originally provided. One way is to add the geometric infor- mation to the label by grouping patterns with a similar geometric direction into the same class [25, 19, 21]. Another way is to apply the hard-case learning so that the network can focus on less confident cases in the original network [87, 102, 101]. However, it is less effective in semantic segmentation [102]. In addition, generating sub-category labels was proven to be efficient in object classification [28, 27], where the HoG-based features [32] was exploited and the late fusion was adopted to fuse decisions of each sub-class. 4.3 Proposed ICD-FCN Method and FCL Procedure 4.3.1 Overview of ICD-FCN It is observed that the FCN learning tends to maximize its ability to recognize the com- mon discriminative features across the training data. The learning could be even more 77 effective if one can exploit varieties existing in an object class. To achieve this objec- tive, we propose a fine-to-coarse learning (FCL) procedure to guide the FCN to be more aware of the intra-class difference (ICD). As shown in Fig. 5.1, we first train a fine-scale FCN based on sub-class labels, which are created by clustering the original segmentation ground truth. Then, we map the fine-scale decisions to the original learning target using designed ICD-FCN. The proposed ICD-FCN is a multi-task fully convolutional network (MFCN) with both the coarse-scale and fine-scale labels as training targets. It can be trained end-to-end by back-propagation using a standard stochastic gradient descent (SGD) method. For example, the original ’aeroplane’ class in Fig. 5.1 can be further divided into sub-classes based on visual characteristics, such as the shape, pose and viewing angle. By generating a fine-scale ground truth based on the created sub-classes, the ICD-FCN can learn fine-tuned feature representations and provide more accurate matching tem- plates for each sub-class of the ’aeroplane’. 4.3.2 Proposed FCL Procedure Training Procedure. We present the overall FCL procedure in Fig. 4.2. 
It consists of three steps: 1) obtain a baseline model with the original (or coarse) class labels; 2) derive features associated with the sub-class (or fine) characteristics and 3) combine the fine and coarse information using the end-to-end training. They are detailed below. In the first step, we adopt an FCN [69] with the dilation structure [14] to train a baseline model by following [102]. This offers a good baseline FCN that learns object classes from the training data. Then, we use the convolutional features and the corre- sponding responses of this FCN to cluster samples with original labels into sub-classes, called the ICD-classes. This clustering process will be explained in Sec. 4.3.3. 78 Figure 4.2: The network architecture of the proposed 3-step FCL procedure: a baseline model with the original (or coarse) class labels is obtained in Step 1; features associated with the sub-class (or fine) characteristics are derived and used for training in Step 2; and the fine and coarse information is combined using the end-to-end training. By following [102], the baseline FCN adopts the ResNet structure [42] (rather than the VGG [89]) based on the ResNet model pretrained on the ImageNet [24]. A dilation setup is added in conv4 and conv5 layers to increase the FoV while keeping the resolu- tion of the feature space. This results in a decision scale of 1=8 (each side) after conv5. To further increase the FoV , a kernel size of 5 5 with dilation of 12 is adopted for the convolutional ICD-class classifier. As a result, a FoV of 392 392 can be derived after deconvolution, which is close to the input image dimension 500 500. In the second step, we use a clustering scheme to split the original object categories into finer ICD-classes. To give an example, samples of the ’aeroplane’ class in Fig. 4.2 are re-labeled into three different aeroplane ICD-classes with distinctive appearance. 79 When we train an FCN with these fine-scale labels, it will guide the network to differen- tiate these ICD-classes. Finer feature representations and fine-tuned matching templates for all ICD-classes can be derived through this process. In the third step, we add a second branch to the FCN in Step 2 and make the whole network a multitask FCN (MFCN). This new branch takes the output from the ICD-class classifier and passes it through two more layers. The output of the second branch is the ground truth class label in training and the desired output in testing. The first layer at the second branch is designed to transform fine-scale decisions from the output of the ICD- class classifier in Step 2 to the target object classes. Thus, it is called the Transform Classifier. The second layer is the deconvolutional layer. As a result, both fine and coarse labels are provided for the end-to-end training of the MFCN. The corresponding losses are backpropagated to the network. The network parameters of Step 2 and Step 3 are initialized by the trained models from their previous steps. Thus, the learned knowledge of images and objects are passed on step by step. The final segmentation result is obtained directly from the deconvolved output of the transform classifier. 4.3.3 Exploring Intra-Class-Difference (ICD) There are many ways to generate sub-class information. In this work, we use the base- line FCN model, collect the conv5 responses from all training data of the same object class, and split them into multiple ICD-classes in an unsupervised manner. 
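Before detailing the ICD-class generation, the dilated classification head of Sec. 4.3.2 can be illustrated with a PyTorch-style sketch (the thesis implementation used Caffe; the module name ICDHead, the 1×1 transform-classifier kernel, the 75 ICD classes taken from the count reported later in Table 4.2, and the deconvolution settings are illustrative assumptions rather than the exact network): a 5×5 convolution with dilation 12 maps the 2048-channel conv5 features to ICD-class scores, a transform classifier maps those scores to the original classes, and both outputs are upsampled by deconvolution.

```python
import torch
import torch.nn as nn

class ICDHead(nn.Module):
    """Fine (ICD-class) and coarse (object-class) prediction branches on
    top of dilated conv5 features (a sketch, not the exact thesis model)."""
    def __init__(self, in_channels=2048, num_icd=75, num_classes=21, up=8):
        super().__init__()
        # 5x5 ICD-class classifier with dilation 12 keeps the spatial size
        # (padding = dilation * (kernel_size - 1) / 2 = 24).
        self.icd_classifier = nn.Conv2d(in_channels, num_icd, 5,
                                        padding=24, dilation=12)
        # Transform classifier: maps ICD-class scores to the original classes.
        self.transform = nn.Conv2d(num_icd, num_classes, 1)
        # Deconvolution (stride-8 upsampling) back toward the input resolution.
        self.up_icd = nn.ConvTranspose2d(num_icd, num_icd, 2 * up,
                                         stride=up, padding=up // 2)
        self.up_cls = nn.ConvTranspose2d(num_classes, num_classes, 2 * up,
                                         stride=up, padding=up // 2)

    def forward(self, conv5_features):
        icd_scores = self.icd_classifier(conv5_features)
        cls_scores = self.transform(icd_scores)
        return self.up_icd(icd_scores), self.up_cls(cls_scores)

head = ICDHead()
feats = torch.randn(1, 2048, 63, 63)   # conv5 output at 1/8 resolution
icd_out, cls_out = head(feats)
print(icd_out.shape, cls_out.shape)
```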
The ground-truth label provides the location of the object, and the convolutional responses in that area represent the characteristics of the instance. Since the resolution of the collected responses can be quite large and each object has a different size, we normalize the convolutional responses to turn them into a 1D vector.

Figure 4.3: Illustration of the ICD-class feature extraction, normalization and clustering procedures used to generate finer ICD-class labels. First, the response maps of all training data are collected at the output of conv5. Then, a normalization procedure is conducted with respect to the size of the label region. Finally, k-means clustering is performed on the normalized 1D feature vectors.

Feature Collection. To begin with, all training images are passed through the baseline FCN trained in Step 1. We gather all response maps at the output of the conv5 (res5c) layer as the features. They are selected because they contain the high-level object information. Since only the object class label is provided in semantic segmentation, all object instances of the same class in an image are treated as one sample. Treating each image as a basic sample unit is reasonable, as most images have one object instance or several object instances of similar appearance. However, we may get a better clustering result if the instance information is provided.

Feature Normalization. Next, we normalize the R × R × N responses at conv5 into an N × 1 feature vector V_l for each class, where R is the resolution of the feature map and N is the number of filters. In our current context, they are equal to 63 and 2048, respectively. For each dimension V_l(n), n ∈ [1, N], we sum up the convolutional responses in the labeled region of the class and normalize the sum by its area. Mathematically, we have

V_l(n) = \sum_{L(i,j)=l} F_n(i,j) / \sum_{L(i,j)=l} 1,    (4.1)

where L(i,j) denotes the ground-truth label at position (i,j) and F_n(i,j) is the response of the n-th filter at position (i,j). The proposed normalization procedure is simple yet effective. The normalized response of each filter reflects the matched object class, the visual characteristics of the particular instance such as its shape and pose, and the statistical distribution of various instances. The effectiveness of this feature representation and normalization will be elaborated in Sec. 4.4.3.

Feature Clustering. With the normalized feature vector obtained above, the remaining task is to generate the ICD-class label using a clustering algorithm. As shown in Fig. 4.3, for each object class, we further split its samples by clustering their normalized feature vectors. The created ICD-class labels are then used as the training target in Steps 2 and 3.

To proceed, we need to decide the optimal ICD-class number and select a clustering algorithm. One way to evaluate a clustering result is to calculate the Hopkins statistic [44] for a cluster of data. This index examines whether samples in a dataset differ significantly from the assumption of being uniformly distributed in a multi-dimensional space [12, 59]. As a result, a larger Hopkins score indicates that the data cluster is more likely to be split further. Another way to assess the optimal cluster number is the Silhouette method [81, 68]. The silhouette of an instance is a measure of how closely it is matched to other data in the same cluster and how loosely it is matched to data of its neighboring clusters. However, a direct application of either of these two methods does not serve our purpose well.
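Before describing the clustering procedure in detail, the normalization of Eq. (4.1) can be sketched in NumPy (array shapes and names are illustrative; the label map is assumed to be available at the conv5 resolution): for one object class, the conv5 responses are averaged over the pixels whose label equals that class.

```python
import numpy as np

def normalized_class_feature(conv5, label_map, class_id):
    """Compute V_l as in Eq. (4.1): the per-filter average of the conv5
    responses over the region labeled with `class_id`.

    conv5:     responses of shape (N, R, R), N filters, R x R resolution.
    label_map: ground-truth labels of shape (R, R) at the same resolution.
    """
    mask = (label_map == class_id)
    area = mask.sum()
    if area == 0:
        return np.zeros(conv5.shape[0], dtype=conv5.dtype)
    # Sum each filter's responses inside the labeled region, divide by area.
    return conv5[:, mask].sum(axis=1) / area

# Illustrative inputs: 2048 filters on a 63x63 map, labels in {0, ..., 20}.
conv5 = np.random.rand(2048, 63, 63).astype(np.float32)
labels = np.random.randint(0, 21, size=(63, 63))
v_l = normalized_class_feature(conv5, labels, class_id=1)
print(v_l.shape)  # (2048,)
```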
In the proposed ICD-FCN solution, we choose the unsupervised k-means algorithm as the clustering algorithm and estimate the optimal ICD-class number for each class by combining the Hopkins statistic and the Silhouette method. The detailed procedure is given in Algorithm A. In this algorithm, we set up two thresholds, LT and HT, on the Hopkins statistic to evaluate the cluster tendency for further splitting. If the data distribution in the current cluster is close to the uniform one (H < LT), the splitting stops. On the other hand, if the current cluster has a strong tendency for further splitting (H > HT), the splitting process continues until a pre-defined maximum cluster number (MAXN) is reached. The Silhouette method is then used to check whether the result is an optimal clustering solution by considering the whole data space.

Algorithm A: Determine the optimal cluster number for each object class for the k-means clustering algorithm.
  while n < MAXN do
    V_icd(l) = kmeans(V_all, n), l ∈ [1, n]
    if max(Hopkins(V_icd(l))) > HT: n ← n + 1
    if min(Hopkins(V_icd(l))) < LT: break
  end while
  return Silhouette(n, MAXN)

Since the cluster tendency is greatly influenced by the data size, only a certain number of randomly selected training samples are used in Algorithm A. This ensures the computational efficiency of the Silhouette estimation. The optimal ICD-class number is thus derived for each original object category, and we use this information to cluster all the training data. The importance of this clustering strategy will be further demonstrated in Sec. 4.4.3.

Implementation Details. We implemented the proposed ICD-FCN system using the Caffe [52] library. All training was conducted on Titan-X GPUs, where the ResNet FCN baseline (Res-FCN) with dilated conv4 and conv5 was trained. Similar to the FCN training setup, we set the base learning rate to 10^-10 with the batch size equal to 1. The momentum was chosen to be 0.99 while the weight decay was kept at 0.005. We adopted the standard softmax loss function [57] at each training step and added the losses of the two branches in the last stage. To preserve the ICD information learned from Step 2, an initialization scheme was designed for the transform classifier in Step 3; namely,

ω(k, m, :, :) = a, if m ∈ C_icd(k);  0, otherwise,    (4.2)

where a is a non-zero constant and C_icd(k) represents the ICD-classes derived from the original object category k. With this initialization, the MFCN is able to connect the sub-class decisions from Step 2 to their original class labels. This results in faster convergence and better final performance. We set the training period to 80 epochs for the baseline FCN training in Step 1 and then observed that the following two steps would need one more such period to reach the optimal solution.

4.4 Analyze Fine-to-Coarse Learning

In this section, we evaluate the proposed FCL procedure by comparing it with the standard FCN training. Examples of clustering results are also presented to demonstrate the advantage of finding the intra-class difference and generating ICD-class labels.

Figure 4.4: Qualitative results on the PASCAL VOC 2012 validation set. We present the results from FCN-8s [69], the Res-FCN baseline from Step 1 and the proposed ICD-FCN in columns 2-4, with the corresponding ground truth in column 5.

4.4.1 Experimental Setup

We conducted several experiments to evaluate the performance of the standard FCN training and the proposed FCL training.
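Before presenting the experiments, here is a sketch of the cluster-number search in the spirit of Algorithm A (scikit-learn and NumPy assumed; the Hopkins-statistic helper is a standard textbook implementation, and the thresholds, sample sizes and example data are illustrative): k-means splits are grown while the Hopkins statistic indicates a strong cluster tendency, and the silhouette score then picks the final number.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_size=50, seed=0):
    """Hopkins statistic: ~0.5 for uniformly spread data, close to 1 when
    the data show a strong cluster tendency (i.e., could be split further)."""
    if len(X) < 2:
        return 0.0
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(X) - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Nearest-data distances from uniform random points in the bounding box.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()
    # Nearest-neighbor distances of sampled real points (skip the point itself).
    sample = X[rng.choice(len(X), n, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]
    return u.sum() / (u.sum() + w.sum())

def estimate_cluster_number(X, low=0.8, high=0.9, max_n=8):
    """Grow the k-means cluster count while some cluster still shows a strong
    split tendency (Hopkins > high), stop early if any cluster looks uniform
    (Hopkins < low), then pick the final number by the silhouette score."""
    n = 2
    while n < max_n:
        labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
        scores = [hopkins(X[labels == k]) for k in range(n)]
        if min(scores) < low:
            break
        if max(scores) > high:
            n += 1
        else:
            break
    sil = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                         random_state=0).fit_predict(X))
           for k in range(n, max_n + 1)}
    return max(sil, key=sil.get)

# Illustrative data: normalized feature vectors of one class, three groups.
X = np.vstack([np.random.randn(60, 16) + 4 * c for c in range(3)])
print(estimate_cluster_number(X))
```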
All experiments in this section were conducted on the PASCAL VOC 2012 dataset. We used 2,567 images for training and 346 images for validation, following the data division strategy described in [109]. We first compare results with different baseline models and examine the clustering effect. Then, we compare the proposed FCL procedure with simultaneous learning of the two object labels (i.e., both fine and coarse labels). To provide a qualitative evaluation of the performance of the proposed ICD-FCN method, we show its semantic segmentation results for several exemplary images in Fig. 4.4.

4.4.2 ResNet-101 versus Res-FCN

We demonstrate the importance of a baseline network (Res-FCN). In the first step of our design, a baseline model is trained on top of the publicly available ResNet model. This ResNet-101 model by itself has an excellent capability to discern objects and recognize discriminative features thanks to the large amount of training images in ImageNet. However, it is still not powerful enough for the target PASCAL VOC 2012 dataset in the semantic segmentation task. As a matter of fact, there would be a domain adaptation problem [8]. Since we depend on feature responses to acquire the sub-label information, the raw network is less effective and representative than the trained network.

An example of the clustering results for the 'aeroplane' class is given in Figs. 4.5(a) and (b), where we show the images closest to each cluster center, as determined by the ResNet-101 and the pre-trained Res-FCN models, respectively. As indicated in these two columns, the Res-FCN pre-trained on the target dataset has a better knowledge of an object instance.

The intra-class diversity can be easily understood from the viewpoint of ICD-class clustering. We present two other sets of cluster centers obtained with the Res-FCN in Figs. 4.5(c) and (d). It can be seen that the ICD-classes in the 'aeroplane' class are due to different viewing angles, those in the 'boat' class are caused by the variety of object shapes, while those in the 'bicycle' class are attributed to different partial views, due to the frequent occlusion in bicycle images.

Figure 4.5: Sample images close to the centroid of each ICD-class for (a) the 'aeroplane' class with ResNet-101, (b) the 'aeroplane' class with the pre-trained Res-FCN, (c) the 'boat' class with the pre-trained Res-FCN, and (d) the 'bicycle' class with the pre-trained Res-FCN.

Table 4.1: Experiments on different learning procedures and baseline models with the corresponding mean IU performance on the reduced PASCAL VOC 2012 validation set.

# | Learning Procedure | Base Model | mean IU
1 | Res-FCN Baseline | ResNet-101 | 66.1
2 | Fine-and-Coarse | ResNet-101 | 67.1
3 | Fine-and-Coarse | Res-FCN Baseline | 68.8
4 | Fine-to-Coarse | ResNet-101 | 67.5
5 | Fine-to-Coarse | Res-FCN Baseline | 69.4

In addition, as the original ResNet does not have the dilation set-up, it is challenging to train it from an ImageNet classification network into a fully convolutional semantic segmentation network. To evaluate and resolve this issue, we tested several training strategies with different baseline models. We summarize the experimental settings and test results in terms of the mean IU in Table 4.1, where "Fine-and-Coarse" means that we train the two branches of the MFCN simultaneously. Furthermore, we plot the corresponding validation performance in Fig. 4.6. We see that the training without the baseline model in the first step (see Case #2 and Case #4) has more difficulty converging to an optimal solution.
87 Figure 4.6: Plots of the validation performance in terms of the mean IU with different training procedures and baseline models as stated in Table . Table 4.2: Experiment with different sets of [LT, HT] and corresponding mean IU per- formance with different sub-class numbers. # [LT, HT] Sub-class Number mean IU 6 [0.8, 0.9] 75 69.4 7 [0.75, 0.85] 44 68.5 8 [0.9, 0.95] 107 68.9 4.4.3 Effectiveness of Clustering We evaluate the effectiveness of the proposed clustering scheme as described in Sec. 4.3.3 by conducting several experiments with different threshold values (LT , HT ) in Algorithm A. We provide several clustering setups in Table 4.2, where the corresponding training loss is plotted in Fig. 4.7(a). The results show that a smaller splitting number would be relatively easier to train, yet it may not be powerful enough to explore the intra-class difference (see Case #6). On the other hand, too many sub-classes would influence the clustering performance and influence the learning of the network (see Case #8). To demonstrate the importance of setting up a Hopkins-score based thresholds, we plotted the minimum and maximum Hopkins score value with respect to the different 88 (a) (b) Figure 4.7: Plots of (a) training losses corresponding to different cluster numbers in Step 2 and (b) the cluster estimation using the Hopkins score and the Silhouette method for the ’aeroplane’ class, where the Hopkins score provides a reference class number at 3 while the Silhouette estimation method offers the optimal cluster number at 4. cluster numbers for class ’aeroplane’ in Fig. 4.7(b). We then use the corresponding cluster number to conduct Silhouette estimation and draw it in Fig. 4.7(b) accordingly. In addition, we label the corresponding class mean IU near the marks to demonstrate the influence of cluster number on the final segmentation performance. Results show that the Hopkins Score itself can go out of control if the clustering becomes unstable, and the Silhouette Method is not aggressive enough to derive optimal clustering. Therefore, the designed thresholds are able to push the clustering to be more effective while keeping the sub-class stable and easier to learn, which provided relatively effective range of cluster number for better segmentation result. 4.4.4 Benefits of Fine-to-Coarse Here, we show that it is better to first train with the fine-scale (ICD-class) label and then combine its decision with the coarse-scale (class) label. As shown in Fig. 4.6, the train- ing in Step 3 takes around 30 epochs to converge (Case #5), and its final performance is better than the simultaneousness learning (Case #3). On the other hand, as shown in 89 Figure 4.8: The filter responses of the ’aeroplane’ class in baseline Res-FCN and the proposed ICD-FCN. The ICD-FCN enables learning of distinctive ’aeroplane’ features, which contribute to more accurate segmentation result. Fig. 4.7(a), the training in Step 2 converges quickly after 20 epochs. As a result, we use the model at 40-epoch to initialize and conduct the training in Step 3. To demonstrate the benefits of combining the fine and coarse information, we plot the response maps of the classification layer in Fig. 4.8 in Case #5. The results show that the FCN baseline only provides a relative high response for the ’airplane’ object while the proposed ICD-FCN can distribute the learning into four ICD-classes, which are combined into a more complete segmentation result later. Table 4.3: Additional experiments to explore the learning in Step 3. 
# Experiment Operation mean IU 9 Weighted Loss Doubled ICD loss 68.5 10 Doubled 20-class loss 68.1 11 Remove ICD-loss Fix conv layers 66.9 12 Enable back-propagation 67.3 13 Transform Classifier Intialization Without learning 65.5 14 Random initialize 67.7 4.4.5 Other Experiments We conducted more experiments to shed light on the ICD learning in Step 3. Their settings are summarized in Table 4.3. We first assigned weights to the losses of the 90 Table 4.4: Mean IU (%) benchmark performance with detailed per-class score in PAS- CAL VOC test 2011. method train set mean IU back aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv Res-FCN 1 PASCAL 73.7 93.5 88.7 40.3 83.9 61.3 74.7 90.0 81.9 88.3 35.0 81.6 50.7 82.4 83.6 83.0 80.8 63.4 86.5 47.3 84.1 66.3 ICD-FCN 2 77.1 94.4 89.3 49.4 85.0 65.1 75.3 91.0 84.0 90.0 36.5 90.0 63.7 86.3 89.8 85.9 82.3 64.7 87.6 55.5 84.5 68.0 Res-FCN 3 PASCAL COCO 75.3 93.8 88.9 41.9 83.8 63.8 69.5 90.5 82.6 87.6 39.8 90.1 60.8 84.3 85.9 85.2 84.0 58.2 85.2 51.6 83.5 70.7 ICD-FCN 4 78.0 94.7 89.4 46.3 85.3 67.4 77.3 91.9 84.9 90.4 37.3 93.0 66.8 86.9 91.9 85.7 84.0 63.9 88.4 57.1 85.8 70.1 Table 4.5: Mean IU (%) benchmark performance with detailed per-class score in PAS- CAL VOC test 2012. method train set mean IU back aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv Res-FCN 1 PASCAL 73.8 93.5 88.5 40.9 85.1 63.7 74.0 88.3 81.8 88.8 32.6 79.2 52.2 82.3 83.7 83.4 81.6 60.9 86.8 52.8 82.2 67.6 ICD-FCN 2 76.8 94.3 89.1 51.5 85.8 66.3 75.4 88.8 84.0 90.3 34.4 86.0 64.0 85.4 86.4 85.8 82.6 62.8 88.5 59.6 82.3 68.7 Res-FCN 3 PASCAL COCO 74.3 93.7 85.8 41.0 84.9 62.4 69.0 86.2 81.5 86.2 37.2 86.5 62.0 83.2 86.4 84.2 83.9 56.0 85.7 56.1 77.7 71.2 ICD-FCN 4 77.8 94.6 89.5 47.4 85.9 68.9 76.8 89.3 84.8 91.0 34.8 90.0 67.9 86.0 89.1 85.5 84.6 62.1 89.9 60.8 83.4 71.3 two branches of the MFCN but could not achieve better results. Then, we evaluate the performance by removing the ICD-loss branch. Specifically, we only trained the transform classifier (#11) in Step 3. It works as a post-processing procedure as late- fusion in [28, 27], and some improvement is observed. Furthermore, we enable the back-propagation for the whole network and observe 0:4% further improvement. These two setups demonstrate that the network has a tendency to overwrite the pre-trained ICD information if its corresponding loss is ignored. Finally, we evaluate the initialization in Step 3. The initialized classifier in Step 3 without training has certain performance degradation. This indicates that the result in Step 2 cannot be directly mapped to the 20- class decision. We also examined the performance of a randomly initialized classifier in Step 2, and only see 2% improvement as compared to the baseline FCN. 4.5 Experimental Result In this section, we present benchmarking results across several segmentation datasets. To begin with, we augmented training images from the PASCAL VOC dataset and the MS COCO dataset [67], and evaluated the performance on the VOC 2011-12 test sets [31]. Then, we applied the proposed ICD-FCN and the FCL procedure to two other segmentation datasets: the NYUDepthV2 dataset [73] and the PASCAL Context dataset 91 [71]. PASCAL VOC dataset. Following [69] and [109], we considered an augmented PAS- CAL dataset [41] with a total of 11,685 images. 
We evaluate the performance on the same reduced validation set, and observe improvements in both the Res-FCN baseline and proposed ICD-FCN method. Then, we select the best models based on the valida- tion performance, and apply them to the PASCAL VOC 2011-12 test by submitting the results to the online evaluation server. For both validation and testing, about 3% mean IU improvement over the baseline Res-FCN is observed. Table 4.6: Mean IU (%) performance on the reduced VOC 12 validation set and the VOC 2011-12 test set with augmented PASCAL training data and selected COCO images. train data VOC 2012 validation VOC 2011 test VOC 2012 test FCN-8s PASCAL 61.3 62.7 62.2 CRF-RNN 69.6 72.4 72.0 Res-FCN 70.9 73.7 73.8 ICD-FCN 74.0 77.1 76.8 Res-FCN PASCAL COCO 72.0 75.3 74.3 ICD-FCN 74.9 78.0 77.8 To further demonstrate the robustness of the proposed ICD-FCN for large datasets, we added part of the training data in the MS COCO dataset. Due to our limited computational resources, we randomly selected the same amount of training samples (11,685) from the COCO dataset and combined them with the PASCAL data for train- ing. Consistent performance improvement was observed and the proposed ICD-FCN reached a performance of 78:0% and 77:8% for the PASCAL 2011 and 2012 tests, respectively. The mean IU results are summarized in Table 4.6. and the corresponding performance for each class is presented in Table 4.4 and Table 4.5, where significant improvements are observed for most classes. 92 NYUDv2 dataset. The NYUDv2 dataset [73] is an indoor scene dataset with 795 train- ing images and 654 testing images, which has a coalesced labels of 40 classes based on [38]. We show the mean IU results in Table 4.7. Since the original FCN in [69] only provides the result for FCN-32s, we tested with two different dilation setting and compare the performance. For the first experiment, we enable only conv5 with dilation size 2 (dil 12). For the second one, we choose a dilation step of 2 and 4 for conv4 and conv5 (dil 24), respectively. The ICD-FCN with the FCL procedure offers 2% improve- ment over the baseline FCN. This shows that the proposed ICD-FCN method can also be applied to a small-size dataset with a large class number. Table 4.7: Mean IU (%) performance on NYUDv2 validation set with RGB training images, compared with FCN-32s [69] and baseline Res-FCN. FCN-32s Res-FCN dil 12 ICD-FCN dil 12 Res-FCN dil 24 ICD-FCN dil 24 mean IU 29.2 29.5 31.9 30.3 32.7 PASCAL Context dataset. We conducted the experiment on the PASCAL Context dataset [71], which has 4,995 training images, 5,105 testing images and 59-class seg- mentation masks. We observe similar performance gain as compared with the baseline Res-FCN. As a result, the proposed ICD-FCN achieves 42:5% compared with the pre- vious state-of-the-art. Table 4.8: Mean IU (%) performance on PASCAL Context dataset, compared with FCN- 8s [69], CRF-RNN [109] and baseline Res-FCN. FCN-8s CRF-RNN Res-FCN ICD-FCN mean IU 37.78 39.28 40.2 42.46 93 4.6 Conclusion and Future Work An intra-class-difference guided fully convolutional neural network (ICD-FCN) and a fine-to-coarse learning (FCL) procedure were proposed in this work. By defining ICD- class labels, we can partition training images into several sub-categories of similar fea- tures. The proposed FCL procedure can integrate the knowledge of both fine and coarse labels. Our proposed solution offers around 3% mean IU improvement over the ResNet- based FCN baseline. 
Consistent performance improvement has been observed across several segmentation datasets of a small or large number of training data. It appears that the learning procedure can be generalized to other computer vision tasks as well, which will be further examined as a future work item. 94 Chapter 5 Semantic Segmentation with Reverse Attention 5.1 Introduction Semantic segmentation is an important task for image understanding and object local- ization. With the development of fully-convolutional neural network (FCN) [69], there has been a significant advancement in the field using end-to-end trainable networks. The progress in deep convolutional neural networks (CNNs) such as the VGGNet [89], Inception Net [95], and Residual Net [42] pushes the semantic segmentation perfor- mance even higher via comprehensive learning of high-level semantic features. Besides deeper networks, other ideas have been proposed to enhance the semantic segmentation performance. For example, low-level features can be explored along with the high-level semantic features [11] for performance improvement. Holistic image understanding can also be used to boost the performance [65, 108, 46]. Furthermore, one can guide the network learning by generating highlighted targets [25, 19, 21, 87, 102, 101]. Gener- ally speaking, a CNN can learn the semantic segmentation task more effectively under specific guidance. In spite of these developments, all existing methods focus on the understanding of the features and prediction of the target class. However, there is no mechanism to specif- ically teach the network to learn the difference between classes. The high-level seman- tic features are sometimes shared across different classes (or between an object and its 95 Figure 5.1: An illustration of the proposed reversed attention network (RAN), where the lower and upper branches learn features and predictions that are and are not associated with a target class, respectively. The mid-branch focuses on local regions with compli- cated spatial patterns whose object responses are weaker and provide a mechanism to amplify the response. The predictions of all three branches are fused to yield the final prediction for the segmentation task. background) due to a certain level of visual similarity among classes in the training set. This will yield a confusing results in regions that are located in the boundary of two objects (or object/background) since the responses to both objects (or an object and its background) are equally strong. Another problem is caused by the weaker responses of the target object due to a complicated mixture of objects/background. It is desirable to develop a mechanism to identify these regions and amplify the weaker responses to capture the target object. We are not aware of any effective solution to address these two problems up to now. In this work, we propose a new semantic segmentation architecture called the reverse attention network (RAN) to achieve these two goals. A conceptual overview of the RAN system is shown in Fig. 5.1. The RAN uses two separate branches to learn features and generate predictions that are and are not associated with a target class, respectively. To further highlight the knowledge learnt from reverse-class, we design a reverse attention structure, which generates per-class mask to amplify the reverse-class response in the confused region. 96 The predictions of all three branches are finally fused together to yield the final predic- tion for the segmentation task. 
We build the RAN upon the state-of-the-art Deeplabv2- LargeFOV with the ResNet-101 structure and conduct comprehensive experiments on many datasets, including PASCAL VOC, PASCAL Person Part, PASCAL Context, NYU-Depth2, and ADE20K MIT datasets. Consistent and significant improvements across the datasets are observed. We implement the proposed RAN in Caffe [52], and the trained network structure with models are available to the public. 5.2 Related Work A brief review on recent progresses in semantic segmentation is given in this section. Semantic segmentation is a combination of the pixel-wise localization task [107, 83] and the high-level recognition task. Recent developments in deep CNNs [58, 89, 95] enable comprehensive learning of semantic features using a large amount of image data [31, 67, 24]. The FCN [69] allows effective end-to-end learning by converting fully- connected layers into convolutional layers. Performance improvements have been achieved by introducing several new ideas. One is to integrate low- and high-level convolutional features in the network. This is motivated by the observation that the pooling and the stride operations can offer a larger filed of view (FOV) and extract semantic features with fewer convolutional layers, yet it decreases the resolution of the response maps and thus suffers from inaccurate local- ization. The combination of segmentation results from multiple layers was proposed in [69, 88]. Fusion of multi-level features before decision gives an even better perfor- mance as shown in [16, 65]. Another idea, as presented in [15], is to adopt a dilation architecture to increase the resolution of response maps while preserving large FOVs. In addition, both local- and long-range conditional random fields can be used to refine 97 segmentation details as done in [109, 13]. Recent advances in the RefineNet [65] and the PSPNet [108] show that a holistic understanding of the whole image [46] can boost the segmentation performance furthermore. Another class of methods focuses on guiding the learning procedure with highlighted knowledge. For example, a hard-case learning was adopted in [87] to guide a network to focus on less confident cases. Besides, the spatial information can be explored to enhance features by considering coherence with neighboring patterns [25, 19, 21]. Some other information such as the object boundary can also be explored to guide the segmen- tation with more accurate object shape prediction [13, 49]. All the above-mentioned methods strive to improve features and decision classifiers for better segmentation performance. They attempt to capture generative object match- ing templates across training data. However, their classifiers simply look for the most likely patterns with the guidance of the cross-entropy loss in the softmax-based output layer. This methodology overlooks characteristics of less common instances, and could be confused by similar patterns of different classes. In this work, we would like to address this shortcoming by letting the network learn what does not belong to the target class as well as better co-existing background/object separation. 5.3 Proposed Reverse Attention Network (RAN) 5.3.1 Motivation Our work is motivated by observations on FCN’s learning as given in Fig. 5.2, where an image is fed into an FCN network. Convolutional layers of an FCN are usually repre- sented as two parts, the convolutional features network (usually conv1-conv5), and the 98 Figure 5.2: Observations on FCN’s direct learning. 
The normalized feature response of the last conv5 layer is presented along with the class-wise probability map for ’dog’ and ’cat’. class-oriented convolutional layer (CONV) which relates the semantic features to pixel- wise classification results. Without loss of generality, we use an image that contains a dog and a cat as illustrated in Fig. 5.2 as an example in our discussion. The segmentation result is shown in the lower-right corner of Fig. 5.2, where dog’s lower body in the circled area is misclassified as part of a cat. To explain the phe- nomenon, we show the heat maps (i.e. the corresponding filter responses) for the dog and the cat classes, respectively. It turns out that both classifiers generate high responses in the circled area. Classification errors can arise easily in these confusing areas where two or more classes share similar spatial patterns. To offer additional insights, we plot the normalized filter responses in the last CONV layer for both classes in Fig. 5.2, where the normalized response is defined as the sum of all responses of the same filter per unit area. For ease of visualization, we only show the filters that have normalized responses higher than a threshold. The decision on a target class is primarily contributed by the high response of a small number of filters while a large number of filters are barely evoked in the decision. For examples, there are about 20 filters (out of a total of 2048 filters) that have high responses to the dog or the cat 99 classes. We can further divide them into three groups - with a high response to both the dog and cat classes (in red), with a high response to the dog class only (in purple) or the cat class (in dark brown) only. On one hand, these filters, known as the Grand Mother Cell (GMC) filter [37, 3], capture the most important semantic patterns of target objects (e.g., the cat face). On the other hand, some filters have strong responses to multiple object classes so that they are less useful in discriminating the underlying object classes. Apparently, the FCN is only trained by each class label yet without being trained to learn the difference between confusing classes. If we can let a network learn that the confusing area is not part of a cat explicitly, it is possible to obtain a network of higher performance. As a result, this strategy, called the reverse attention learning, may contribute to better discrimination of confusing classes and better understanding of co- existing background context in the image. 5.3.2 Proposed RAN System To improve the performance of the FCN, we propose a Reverse Attention Network (RAN) whose system diagram is depicted in Fig. 5.3. After getting the feature map, the RAN consists of three branches: the original branch (the lower path), the attention branch (the middle path) and the reverse branch (the upper path). The reverse branch and the attention branch merge to form the reverse attention response. Finally, decisions from the reverse attention response is subtracted from the the prediction of original branch to derive the final decision scores in semantic segmentation. The FCN system diagram shown in Fig. 5.2 corresponds to the lower branch in Fig. 5.3 with the “original branch” label. As described earlier, its CONV layers before the feature map are used to learn object features and itsCONV org layers are used to help decision classifiers to generate the class-wise probability map. 
5.3.2 Proposed RAN System

To improve the performance of the FCN, we propose a Reverse Attention Network (RAN) whose system diagram is depicted in Fig. 5.3. After obtaining the feature map, the RAN consists of three branches: the original branch (the lower path), the attention branch (the middle path) and the reverse branch (the upper path). The reverse branch and the attention branch merge to form the reverse attention response. Finally, the decision from the reverse attention response is subtracted from the prediction of the original branch to derive the final decision scores in semantic segmentation.

The FCN system diagram shown in Fig. 5.2 corresponds to the lower branch in Fig. 5.3 with the "original branch" label. As described earlier, its CONV layers before the feature map are used to learn object features, and its CONV_org layers are used to help the decision classifiers generate the class-wise probability map. Here, we use CONV_org to denote the layers obtained from the original FCN through a straightforward direct learning process.

Figure 5.3: The system diagram of the reverse attention network (RAN), where CONV_org and CONV_rev filters are used to learn features associated and not associated with a particular class, respectively. The reverse object class knowledge is then highlighted by an attention mask to generate the reverse attention of a class, which is then subtracted from the original prediction score as a correction.

For the RAN system, we introduce two more branches: the reverse branch and the attention branch. The need for these two branches is explained below.

Reverse Branch. The upper path in Fig. 5.3 is the reverse branch. We train another CONV_rev layer to learn the reverse object class explicitly, where the reverse object class is the reversed ground truth for the object class of concern. To obtain the reversed ground truth, we can set the corresponding class region to zero and the remaining region to one, as illustrated in Fig. 5.1. The remaining region includes the background as well as other classes. However, this would result in a specific reverse label for each object class. There is an alternative way to implement the same idea: we reverse the sign of all class-wise response values before feeding them into the softmax-based classifiers. This operation is indicated by the NEG block in the reverse branch. Such an implementation allows the CONV_rev layer to be trained with the same, original class-wise ground-truth labels.

Reverse Attention Branch. One simple way to combine the results of the original and the reverse branches is to directly subtract the reverse prediction from the original prediction (in terms of object class probabilities). We can interpret this operation as finding the difference between the predicted decision of the original FCN and the predicted decision due to reverse learning. For example, the lower part of the dog gives strong responses to both the dog and the cat in the original FCN. However, the same region gives a strong negative response to the cat class but an almost zero response to the dog class in the reverse learning branch. The combination of the two branches therefore reduces the response to the cat class while preserving the response to the dog class.

However, directly applying element-wise subtraction does not necessarily lead to better performance. Sometimes the reverse prediction may not do as well as the original prediction in confident areas. We therefore propose a reverse attention structure that further highlights the regions overlooked in the original prediction, including confusion and background areas. The reverse attention structure generates a class-oriented mask to amplify the reverse response map.

As shown in Fig. 5.3, the input to the reverse attention branch is the prediction result of CONV_org. We flip the sign of the pixel values with the NEG block, feed the result to the sigmoid function and, finally, filter the sigmoid output with an attention mask. The sigmoid function converts the response attention map to the range [0, 1]. Mathematically, the pixel value in the reverse attention map I_ra can be written as

I_ra(i, j) = Sigmoid( -F_CONV_org(i, j) ),    (5.1)

where (i, j) denotes the pixel location and F_CONV_org denotes the response map of CONV_org.
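The fusion described above can be summarized in a few lines. The sketch below is a minimal NumPy illustration of the RAN-s combination, i.e., Eq. (5.1) followed by the element-wise multiplication with the reverse-branch response and the subtraction from the original prediction. The array names and shapes are illustrative; this is not the actual Caffe implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ran_fuse(score_org, score_rev):
    """Minimal sketch of the RAN-s fusion.
    score_org: (C, H, W) class-wise responses of CONV_org (original branch).
    score_rev: (C, H, W) class-wise responses of CONV_rev (reverse branch).
    Returns the combined class-wise scores fed to the final softmax."""
    # Reverse attention mask: NEG followed by sigmoid, so pixels with small or
    # negative original responses are highlighted (values close to 1).
    att = sigmoid(-score_org)                 # Eq. (5.1)
    # Amplify the reverse response with the attention mask, then subtract it
    # from the original prediction as a correction.
    reverse_attention = att * score_rev       # element-wise multiplication
    return score_org - reverse_attention
```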
Note that regions with small or negative responses in F_CONV_org will be highlighted due to the cascade of the NEG and the sigmoid operations. In contrast, areas of positive response (or confident scores) will be suppressed in the reverse attention branch.

After obtaining the reverse attention map, we combine it with the CONV_rev response map using element-wise multiplication, as shown in Fig. 5.3. The multiplied response score is then subtracted from the original prediction, contributing to our final combined prediction.

Several variants of the RAN architecture have been investigated. The following normalization strategy offers a faster convergence rate while providing similar segmentation performance:

I_ra(i, j) = Sigmoid( 1 / (Relu(F_CONV_org(i, j)) + 0.125) - 4 ),    (5.2)

where F_CONV_org is normalized to lie within [-4, 4], which results in a more uniform distribution before being fed into the sigmoid function. We also clip all negative scores of F_CONV_org to zero by applying the Relu operation and control the inverse scores to stay within the range [-4, 4] using the parameters 0.125 and -4. In the experiment section, we will compare results of the reverse attention set-ups given in Equations (5.1) and (5.2). They are denoted by RAN-simple (RAN-s) and RAN-normalized (RAN-n), respectively.
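For completeness, the sketch below illustrates the RAN-n attention mask of Eq. (5.2). Since the text only states that F_CONV_org is normalized to [-4, 4], the min-max mapping used here is an assumption; the constants 0.125 and -4 are those given in the equation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reverse_attention_ran_n(score_org):
    """Sketch of the RAN-n attention mask, Eq. (5.2).
    score_org: (C, H, W) class-wise responses of CONV_org."""
    # Assumption: bring the responses into [-4, 4] with a min-max mapping;
    # the dissertation only states that F_CONV_org is normalized to this range.
    f = -4.0 + 8.0 * (score_org - score_org.min()) / (score_org.max() - score_org.min() + 1e-8)
    # Clip negative scores with Relu, then map through 1/(x + 0.125) - 4,
    # which keeps the sigmoid input roughly within [-4, 4]:
    # confident pixels (f near 4) map to values near -3.76, while small or
    # negative responses (f = 0) map to 4 and are highlighted.
    f = np.maximum(f, 0.0)                    # Relu
    return sigmoid(1.0 / (f + 0.125) - 4.0)   # Eq. (5.2)
```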
RAN Training. In order to train the proposed RAN, we back-propagate the cross-entropy losses at the three branches simultaneously and adopt softmax classifiers at the three prediction outputs. All three losses are needed to ensure a balanced end-to-end learning process. The original prediction loss and the reverse prediction loss allow CONV_org and CONV_rev to learn the target classes and their reverse classes in parallel. Furthermore, the loss of the combined prediction allows the network to learn the reverse attention. The proposed RAN can be effectively trained from a pre-trained FCN, which indicates that the RAN is a further improvement of the FCN obtained by adding more relevant guidance to the training process.

5.4 Experiments

To show the effectiveness of the proposed RAN, we conduct experiments on five datasets: PASCAL Context [72], PASCAL Person-Part [17], PASCAL VOC [31], NYU-Depth-v2 [73] and MIT ADE20K [110]. We implemented the RAN using the Caffe [52] library and built it upon the available DeepLab-v2 repository [15]. We adopted the initial network weights provided by the repository, which were pre-trained on the COCO dataset with ResNet-101. The proposed reverse attention architecture is implemented entirely with standard Caffe layers: we use the Power layer to flip, shift and scale the responses, and the provided Sigmoid layer to perform the sigmoid transformation.

We employ the "poly" learning rate policy with power = 0.9 and a base learning rate of 0.00025. Momentum and weight decay are set to 0.9 and 0.0001, respectively. We adopted the DeepLab data augmentation scheme with random scaling factors of 0.5, 0.75, 1.0, 1.25 and 1.5 and with mirroring for each training image. Following [15], we adopt the multi-scale (MSC) input with max fusion in both training and testing. Although we did not apply the atrous spatial pyramid pooling (ASPP) due to limited GPU memory, we do observe a significant improvement in the mean intersection-over-union (mean IU) score over both the baseline DeepLab-v2 LargeFOV and the ASPP set-up.

PASCAL-Context. We first present results on the challenging PASCAL-Context dataset [72]. The dataset has 4,995 training images and 5,105 test images. There are 59 labeled categories, including foreground objects and background context scenes. We compare the proposed RAN method with a group of state-of-the-art methods in Table 5.1, where RAN-s and RAN-n use equations (5.1) and (5.2) in the reverse attention branch, respectively. The mean IU values of RAN-s and RAN-n improve significantly over that of their baseline DeepLabv2-LargeFOV. Our RAN-s and RAN-n achieve state-of-the-art mean IU scores (i.e., around 48.1%) that are comparable with those of the RefineNet [65] and the Wider ResNet [103].

Methods | feature | pixel acc. | mean acc. | mean IU
FCN-8s [69] | VGG16 | 65.9 | 46.5 | 35.1
BoxSup [20] | | - | - | 40.5
Context [66] | | 71.5 | 53.9 | 43.3
VeryDeep [101] | ResNet-101 | 72.9 | 54.8 | 44.5
DeepLabv2-ASPP [15] | | - | - | 45.7
RefineNet-101 [65] | | - | - | 47.1
Holistic [46] | ResNet-152 | 73.5 | 56.6 | 45.8
RefineNet-152 [65] | | - | - | 47.3
Model A2, 2conv | Wider ResNet | 75.0 | 58.1 | 48.1
DeepLabv2-LFOV (baseline) [15] | ResNet-101 | - | - | 43.5
RAN-s (ours) | | 75.3 | 57.1 | 48.0
RAN-n (ours) | | 75.3 | 57.2 | 48.1
Table 5.1: Comparison of semantic image segmentation performance scores (%) on the 5,105 test images of the PASCAL Context dataset.

We compare the performance of the dual-branch RAN (without reverse attention), RAN-s, RAN-n and their baseline DeepLabv2 in a set of ablation studies in Table 5.2, where a sequence of techniques is employed step by step: dilated classification, data augmentation, MSC with max fusion, and the fully connected conditional random field (CRF). We see that the performance of the RANs keeps improving, and they always outperform their baseline under all settings. Qualitative results are provided in Fig. 5.4, which shows that the proposed reverse learning can correct some mistakes in the confusion areas and results in more uniform predictions for the target object.

Methods | Dil=0 | LargeFOV | +Aug | +MSC | +CRF
DeepLabv2 (baseline) [15] | 41.6 | 42.6 | 43.2 | 43.5 | 44.4
Dual-Branch RAN | 42.8 | 43.9 | 44.4 | 45.2 | 46.0
RAN-1 | 44.4 | 45.6 | 46.2 | 47.2 | 48.0
RAN-2 | 44.5 | 45.6 | 46.3 | 47.3 | 48.1
Table 5.2: Ablation study of different RANs on the PASCAL-Context dataset to evaluate the benefit of the proposed RAN. We compare the results under different network set-ups, employing dilated decision conv filters, data augmentation, the MSC design and the CRF post-processing.

Figure 5.4: Qualitative results on the PASCAL-Context validation set: the input image, the DeepLabv2-LargeFOV baseline, our RAN-s result, and the ground truth.

PASCAL Person-Part. We also conducted experiments on the PASCAL Person-Part dataset [17]. It includes labels of six body parts of persons (i.e., Head, Torso, Upper/Lower Arms and Upper/Lower Legs). There are 1,716 training images and 1,817 validation images. As observed in [15], the dilated decision classifier provides little performance improvement on this dataset. Thus, we adopted the MSC structure with 3-by-3 decision filters without dilation for the RANs. The mean IU results of several benchmarking methods are shown in Table 5.3. The results demonstrate that both RAN-s and RAN-n outperform the baseline DeepLabv2 and achieve state-of-the-art performance on this fine-grained dataset.

Method | Attention [16] | HAZN [104] | Graph LSTM [64] | RefineNet [65] | DeepLabv2 [15] | RAN-s | RAN-n
mean IU | 56.4 | 57.5 | 60.2 | 68.6 | 64.9 | 66.6 | 66.5
Table 5.3: Comparison of the mean IU scores (%) of several benchmarking methods for the PASCAL Person-Part dataset.
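The mean intersection-over-union (mean IU) reported in Tables 5.1-5.3 and in the remaining experiments can be computed from a class confusion matrix. The sketch below is a minimal NumPy version with hypothetical array names; the ignore label of 255 follows the common PASCAL convention and is an assumption here, not a detail stated in this work.

```python
import numpy as np

def mean_iu(pred, gt, num_classes, ignore_label=255):
    """Mean intersection-over-union over all classes.
    pred, gt: integer label maps of the same shape."""
    valid = gt != ignore_label
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        num_classes * gt[valid].astype(int) + pred[valid].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iu = inter / np.maximum(union, 1)     # per-class IU
    return iu[union > 0].mean()           # average over classes that appear
```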
PASCAL VOC2012. Furthermore, we conducted experiments on the popular PASCAL VOC2012 test set [31]. We adopted the augmented ground truth from [41] with a total of 12,051 training images and submitted our segmentation results to the evaluation website. We find that, on the VOC dataset, our DeepLab-based network does not improve as much as specifically designed networks such as [65, 108]. However, we still observe about a 1.4% improvement over the baseline DeepLabv2-LargeFOV, which also outperforms DeepLabv2-ASPP.

Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | potted | sheep | sofa | train | tv | mean
FCN-8s [69] | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.2
Context [66] | 94.1 | 40.7 | 84.1 | 67.8 | 75.9 | 93.4 | 84.3 | 88.4 | 42.5 | 86.4 | 64.7 | 85.4 | 89.0 | 85.8 | 86.0 | 67.5 | 90.2 | 63.8 | 80.9 | 73.0 | 78.0
VeryDeep [101] | 91.9 | 48.1 | 93.4 | 69.3 | 75.5 | 94.2 | 87.5 | 92.8 | 36.7 | 86.9 | 65.2 | 89.1 | 90.2 | 86.5 | 87.2 | 64.6 | 90.1 | 59.7 | 85.5 | 72.7 | 79.1
DeepLabv2-LFOV [15] | 93.0 | 41.6 | 91.0 | 65.3 | 74.5 | 94.2 | 88.8 | 91.7 | 37.2 | 87.9 | 64.6 | 89.7 | 91.8 | 86.7 | 85.8 | 62.6 | 88.6 | 60.1 | 86.6 | 75.4 | 79.1
DeepLabv2-ASPP [15] | 92.6 | 60.4 | 91.6 | 63.4 | 76.3 | 95.0 | 88.4 | 92.6 | 32.7 | 88.5 | 67.6 | 89.6 | 92.1 | 87.0 | 87.4 | 63.3 | 88.3 | 60.0 | 86.8 | 74.5 | 79.7
RAN-s | 92.7 | 44.7 | 91.9 | 68.2 | 79.3 | 95.4 | 91.2 | 93.3 | 42.8 | 87.8 | 66.9 | 89.1 | 93.2 | 89.5 | 88.4 | 61.6 | 89.8 | 62.6 | 87.8 | 77.8 | 80.5
RAN-n | 92.5 | 44.6 | 92.1 | 68.8 | 79.1 | 95.5 | 91.0 | 93.1 | 43.1 | 88.3 | 66.6 | 88.9 | 93.4 | 89.3 | 88.3 | 61.2 | 89.7 | 62.5 | 87.7 | 77.6 | 80.4
Table 5.4: Comparison of the mean IU scores (%) per object class of several methods for the PASCAL VOC2012 test dataset.

NYUDv2. The NYUDv2 dataset [73] is an indoor scene dataset with 795 training images and 654 test images. It has coalesced labels of 40 classes provided by [38]. The mean IU results of several benchmarking methods are shown in Table 5.5. We see that RAN-s and RAN-n improve on their baseline DeepLabv2-LargeFOV by a large margin (around 3%). A visual comparison of the segmentation results for two images is shown in Fig. 5.5.

Method | Gupta et al. [39] | FCN-32s [69] | Context [66] | Holistic [46] | RefineNet [65] | DeepLabv2-ASPP [15] | DeepLabv2-LFOV [15] | RAN-s | RAN-n
feature | VGG16 | | | ResNet-152 | | ResNet-101 | | |
mean IU | 28.6 | 29.2 | 40.6 | 38.8 | 46.5 | 37.8 | 37.3 | 41.2 | 40.7
Table 5.5: Comparison of the mean IU scores (%) of several benchmarking methods on the NYU-Depth2 dataset.

Figure 5.5: Qualitative results on the NYU-DepthV2 validation set: the input image, the DeepLabv2-LargeFOV baseline, our RAN-s result, and the ground truth.

MIT ADE20K. The MIT ADE20K dataset [110] was released recently. The dataset has 150 labeled classes for both object and background scene parsing. There are about 20K and 2K images in the training and validation sets, respectively. Although our baseline DeepLabv2 does not perform as well in global scene parsing as [46, 108], we still observe about a 2% improvement in the mean IU score, as shown in Table 5.6.

Method | FCN-8s [110] | DilatedNet [110] | DilatedNet Cascade [110] | Holistic [46] | PSPNet [108] | DeepLabv2-ASPP [15] | DeepLabv2-LFOV [15] | RAN-s | RAN-n
feature | VGG16 | | | ResNet-101 | ResNet-152 | ResNet-101 | | |
mean IU | 29.39 | 32.31 | 34.9 | 37.93 | 43.51 | 34.0 | 33.1 | 35.2 | 35.3
Table 5.6: Comparison of the mean IU scores (%) of several benchmarking methods on the ADE20K dataset.

5.5 Conclusion

A new network, called the RAN, designed for reverse learning was proposed in this work. The network explicitly learns what is and is not associated with a target class in its direct and reverse branches, respectively.
To further enhance the reverse learning effect, the sigmoid activation function and an attention mask were introduced to build the reverse attention branch as the third branch. The three branches were integrated in the RAN to generate the final results. The RAN provides a significant performance improvement over its baseline network and achieves state-of-the-art semantic segmentation performance on several benchmark datasets.

Chapter 6
Conclusion and Future Work

6.1 Summary of the Research

In this dissertation, we looked into two important topics: perceptual quality enhancement and semantic image segmentation.

In the first part, a false contour detection and removal (FCDR) method was proposed. Our investigation began with a clear understanding of the cause of false contours. Based on this understanding, we presented an FCDR system that includes both a false contour detection module and a false contour removal module. It was shown by experimental results that the FCDR method can effectively detect and suppress contouring artifacts resulting from HEVC and H.264 alike. At the same time, it preserves edges and textures in the image well.

In the second part, a novel dynamic frame-rate selection algorithm was proposed for the coding of high-frame-rate video. We adopted a statistical approach to measure and predict the percentage of satisfied users between two frame rates (say, 30 fps versus 60 fps). The satisfied user ratio (SUR) is highly dependent upon the underlying video content. We developed several influence maps to characterize the video properties, converted them into feature vectors, and used the support vector regression (SVR) algorithm to predict the number of coded frames per GoP that meets the predefined SUR requirement. It was verified by experimental results that the SUR can be accurately predicted and that the coded sequences using the dynamic frame rate can meet the SUR requirement while lowering the number of coded frames.

In the third part, an intra-class-difference guided fully convolutional neural network (ICD-FCN) and a fine-to-coarse learning (FCL) procedure were proposed. By defining ICD-class labels, we can partition training images into several sub-categories of similar features. The proposed FCL procedure can integrate the knowledge of both fine and coarse labels. Our proposed solution offers around a 3% mean IU improvement over the ResNet-based FCN baseline. Consistent performance improvement has been observed across several segmentation datasets with either a small or a large number of training images.

In the last part, a new network, called the RAN, designed for reverse learning was proposed. The network explicitly learns what is and is not associated with a target class in its direct and reverse branches, respectively. To further enhance the reverse learning effect, the sigmoid activation function and an attention mask were introduced to build the reverse attention branch as the third branch. The three branches were integrated in the RAN to generate the final results. The RAN provides a significant performance improvement over its baseline network and achieves state-of-the-art semantic segmentation performance on several benchmark datasets.

6.2 Future Research Directions

To extend our research, we identify the following directions for further improving the proposed approaches.

Semantic Segmentation with Multiple Inputs.
Current solutions for semantic segmentation and other image-based computer vision problems rely on RGB image inputs. However, additional information could be pre-processed and provided during training. For example, one can add depth information as well as edge information in addition to the RGB input. We expect multi-input training to achieve better performance and more efficient learning.

Instance Segmentation. Instance segmentation is a combination of object detection and semantic segmentation. It requires detailed shape information and semantic object labeling, as well as bounding-box-based object separation. As a result, another research direction could be built upon our current understanding of semantic segmentation and fully convolutional neural networks. We expect to explore the topic of object detection and use it to help with instance-based segmentation.

Bibliography

[1] International Telecommunication Union, ITU standard BT.500. 2002.
[2] International Telecommunication Union, ITU standard P.910. 2008.
[3] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision, pages 329-344. Springer, 2014.
[4] W. Ahn and J.-S. Kim. Flat-region detection and false contour removal in the digital TV display. pages 1338-1341, 2005.
[5] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, pages 524-540. Springer, 2016.
[6] H. Barlow. Optic nerve impulses and Weber's law. In Cold Spring Harbor Symposia on Quantitative Biology, volume 30, pages 539-546. Cold Spring Harbor Laboratory Press, 1965.
[7] P. G. Barten. Contrast Sensitivity of the Human Eye and Its Effects on Image Quality, volume 72. SPIE Press, 1999.
[8] Y. Bengio et al. Deep learning of representations for unsupervised and transfer learning. ICML Unsupervised and Transfer Learning, 27:17-36, 2012.
[9] G. Bertasius, J. Shi, and L. Torresani. Semantic segmentation with boundary neural fields. arXiv preprint arXiv:1511.02674, 2015.
[10] S. Bhagavathy, J. Llach, and J. Zhai. Multiscale probabilistic dithering for suppressing contour artifacts in digital images. Image Processing, IEEE Transactions on, 18(9):1936-1945, 2009.
[11] C. Bishop. Bishop pattern recognition and machine learning, 2001.
[12] V. Centner, D. Massart, and O. De Noord. Detection of inhomogeneities in sets of NIR spectra. Analytica Chimica Acta, 330(1):1-17, 1996.
[13] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform. arXiv preprint arXiv:1511.03328. Accepted by CVPR 2016.
[14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[15] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[16] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3640-3649, 2016.
[17] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille.
Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971-1978, 2014.
[18] H.-R. Choi, J. Lee, R.-H. Park, and J.-S. Kim. False contour reduction using directional dilation and edge-preserving filtering. Consumer Electronics, IEEE Transactions on, 52(3):1099-1106, 2006.
[19] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. arXiv preprint arXiv:1603.08678, 2016.
[20] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1635-1643, 2015.
[21] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
[22] S. J. Daly and X. Feng. Bit-depth extension using spatiotemporal microdither based on models of the equivalent input noise of the visual system. In Electronic Imaging 2003, pages 455-466. International Society for Optics and Photonics, 2003.
[23] S. J. Daly and X. Feng. Decontouring: Prevention and removal of false contour artifacts. pages 130-149, 2004.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.
[25] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422-1430, 2015.
[26] R. A. Doherty, A. C. Younkin, and P. J. Corriveau. Paired comparison analysis for frame rate conversion algorithms. In Proc. Int. Workshop Video Processing and Quality Metrics for Consumer Electronics, 2009.
[27] J. Dong, Q. Chen, J. Feng, K. Jia, Z. Huang, and S. Yan. Looking inside category: subcategory-aware object recognition. IEEE Transactions on Circuits and Systems for Video Technology, 25(8):1322-1334, 2015.
[28] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan. Subcategory-aware object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 827-834, 2013.
[29] M. Emoto, Y. Kusakabe, and M. Sugawara. High-frame-rate motion picture quality and its independence of viewing distance. Journal of Display Technology, 10(8):635-641, 2014.
[30] M. Emoto and M. Sugawara. Critical fusion frequency for bright and wide field-of-view image display. Display Technology, Journal of, 8(7):424-429, 2012.
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, June 2010.
[32] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8. IEEE, 2008.
[33] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, 2004.
[34] R. Floyd and L. Steinberg. An adaptive technique for spatial grayscale. In Proc. Soc. Inf. Disp, volume 17, pages 78-84, 1976.
[35] J. M. Foley. Human luminance pattern-vision mechanisms: masking experiments require a new model. JOSA A, 11(6):1710-1719, 1994.
[36] K.-T. Fung, Y.-L. Chan, and W.-C. Siu.
New architecture for dynamic frame-skipping transcoder. Image Processing, IEEE Transactions on, 11(8):886-900, 2002.
[37] C. G. Gross. Genealogy of the grandmother cell. The Neuroscientist, 8(5):512-518, 2002.
[38] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 564-571, 2013.
[39] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, pages 345-360. Springer, 2014.
[40] J. Harel, C. Koch, and P. Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems, pages 545-552, 2006.
[41] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Computer Vision - ECCV 2014, pages 297-312. Springer, 2014.
[42] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[43] S. He, P. Cavanagh, and J. Intriligator. Attentional resolution and the locus of visual awareness. Nature, 383(6598):334-337, 1996.
[44] B. Hopkins and J. G. Skellam. A new method for determining the type of distribution of plant individuals. Annals of Botany, 18(2):213-227, 1954.
[45] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
[46] H. Hu, Z. Deng, G.-T. Zhou, F. Sha, and G. Mori. Recalling holistic information for semantic segmentation. arXiv preprint arXiv:1611.08061, 2016.
[47] S. Hu, L. Jin, H. Wang, Y. Zhang, S. Kwong, and C.-C. J. Kuo. Compressed image quality metric based on perceptually weighted distortion. Accepted for publication in the IEEE Trans. on Image Processing.
[48] S. Hu, L. Jin, H. Wang, Y. Zhang, S. Kwong, and C.-C. J. Kuo. Objective video quality assessment based on perceptually weighted mean squared error. Submitted to the IEEE Trans. on Circuits and Systems for Video Technology.
[49] Q. Huang, C. Xia, W. Zheng, Y. Song, H. Xu, and C.-C. J. Kuo. Object boundary guided semantic segmentation. arXiv preprint arXiv:1603.09742, 2016.
[50] J.-N. Hwang and T.-D. Wu. Motion vector re-estimation and dynamic frame-skipping for video transcoding. 2:1606-1610, 1998.
[51] P. B. Imrey. Bradley-Terry model. Encyclopedia of Biostatistics, 1998.
[52] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.
[53] X. Jin, S. Goto, and K. N. Ngan. Composite model-based DC dithering for suppressing contour artifacts in decompressed video. Image Processing, IEEE Transactions on, 20(8):2110-2121, 2011.
[54] G. Joy and Z. Xiang. Reducing false contours in quantized color images. Computers & Graphics, 20(2):231-242, 1996.
[55] S. Kim, S. Choo, and N. I. Cho. Removing false contour artifact for bit-depth expansion. IEIE Transactions on Smart Processing & Computing, 2(2):97-101, 2013.
[56] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. arXiv preprint arXiv:1210.5644, 2012.
[57] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 513-521, 2013.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.
[59] R. G. Lawson and P. C. Jurs. New index for clustering tendency and its application to chemical problems. Journal of Chemical Information and Computer Sciences, 30(1):36-41, 1990.
[60] J.-S. Lee, F. De Simone, and T. Ebrahimi. Subjective quality evaluation via paired comparison: application to scalable video coding. Multimedia, IEEE Transactions on, 13(5):882-893, 2011.
[61] J. W. Lee, B. R. Lim, R.-H. Park, J.-S. Kim, and W. Ahn. Two-stage false contour detection using directional contrast and its application to adaptive false contour reduction. Consumer Electronics, IEEE Transactions on, 52(1):179-188, 2006.
[62] S. Li, L. Ma, and K. N. Ngan. Full-reference video quality assessment by decoupling detail losses and additive impairments. Circuits and Systems for Video Technology, IEEE Transactions on, 22(7):1100-1112, 2012.
[63] S. Li, F. Zhang, L. Ma, and K. N. Ngan. Image quality assessment by separately evaluating detail losses and additive impairments. Multimedia, IEEE Transactions on, 13(5):935-949, 2011.
[64] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In European Conference on Computer Vision, pages 125-143. Springer, 2016.
[65] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv preprint arXiv:1611.06612, 2016.
[66] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3194-3203, 2016.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014, pages 740-755. Springer, 2014.
[68] R. Lletí, M. C. Ortiz, L. A. Sarabia, and M. S. Sánchez. Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes. Analytica Chimica Acta, 515(1):87-100, 2004.
[69] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.
[70] J. L. Mannos and D. J. Sakrison. The effects of a visual fidelity criterion of the encoding of images. Information Theory, IEEE Transactions on, 20(4):525-536, 1974.
[71] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[72] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 891-898, 2014.
[73] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[74] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand. Comparison of the coding efficiency of video coding standards—including High Efficiency Video Coding (HEVC). Circuits and Systems for Video Technology, IEEE Transactions on, 22(12):1669-1684, 2012.
[75] V. Ostromoukhov.
A simple and efficient error-diffusion algorithm. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 567-572. ACM, 2001.
[76] Y.-F. Ou, T. Liu, Z. Zhao, Z. Ma, and Y. Wang. Modeling the impact of frame rate on perceptual quality of video. City, 70(80):90, 2008.
[77] Y.-F. Ou, Z. Ma, T. Liu, and Y. Wang. Perceptual quality assessment of video considering both frame rate and quantization artifacts. Circuits and Systems for Video Technology, IEEE Transactions on, 21(3):286-298, 2011.
[78] A. Parducci and L. F. Perrett. Category rating scales: Effects of relative spacing and frequency of stimulus values. Journal of Experimental Psychology, 89(2):427, 1971.
[79] S. Pejhan, T.-H. Chiang, and Y.-Q. Zhang. Dynamic frame rate control for video streams. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), pages 141-144. ACM, 1999.
[80] L. Roberts. Picture coding using pseudo-random noise. Information Theory, IRE Transactions on, 8(2):145-154, 1962.
[81] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.
[82] C. Russell, P. Kohli, P. H. Torr, et al. Associative hierarchical CRFs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on, pages 739-746. IEEE, 2009.
[83] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888-905, 2000.
[84] Y. Q. Shi and H. Sun. Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms, and Standards. CRC Press, 1999.
[85] T. Shigeta, N. Saegusa, H. Honda, T. Nagakubo, and T. Akiyama. 19.4: Improvement of moving-video image quality on PDPs by reducing the dynamic false contour. In SID Symposium Digest of Technical Papers, volume 29, pages 287-290. Wiley Online Library, 1998.
[86] T. Shinoda and K. Awamoto. Plasma display technologies for large area screen and cost reduction. Plasma Science, IEEE Transactions on, 34(2):279-286, 2006.
[87] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. arXiv preprint arXiv:1604.03540, 2016.
[88] B. Shuai, T. Liu, and G. Wang. Improving fully convolution network for semantic segmentation. arXiv preprint arXiv:1611.08986, 2016.
[89] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[90] J. T. Sims and T. Hammack. TS-V: Thurstone's law of comparative judgment (Case V). JMR, Journal of Marketing Research (pre-1986), 13(000002):161, 1976.
[91] A. Smola and V. Vapnik. Support vector regression machines. Advances in Neural Information Processing Systems, 9:155-161, 1997.
[92] H. Song and C.-C. J. Kuo. Rate control for low-bit-rate video via variable-encoding frame rates. Circuits and Systems for Video Technology, IEEE Transactions on, 11(4):512-521, 2001.
[93] M. K. Stern and J. H. Johnson. Just noticeable difference. Corsini Encyclopedia of Psychology, 2010.
[94] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the High Efficiency Video Coding (HEVC) standard. Circuits and Systems for Video Technology, IEEE Transactions on, 22(12):1649-1668, 2012.
[95] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[96] A. Thammineni, A. Raman, S. C. Vadapalli, and S. Sethuraman. Dynamic frame-rate selection for live LBR video encoders using trial frames. In Multimedia and Expo, 2008 IEEE International Conference on, pages 817-820. IEEE, 2008.
[97] S. Varadarajan and L. J. Karam. A no-reference texture regularity metric based on visual saliency. Image Processing, IEEE Transactions on, 24(9):2784-2796, 2015.
[98] Y. Wang, C. Abhayaratne, R. Weerakkody, and M. Mrak. Multi-scale dithering for contouring artefacts removal in compressed UHD video sequences. In Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, pages 1014-1018. IEEE, 2014.
[99] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. The SSIM index for image quality assessment. MATLAB implementation available online from: http://www.cns.nyu.edu/lcv/ssim, 23(66):6, 2003.
[100] A. B. Watson, R. Borthwick, and M. Taylor. Image quality and entropy masking. In Electronic Imaging '97, pages 2-12. International Society for Optics and Photonics, 1997.
[101] Z. Wu, C. Shen, and A. v. d. Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
[102] Z. Wu, C. Shen, and A. v. d. Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339, 2016.
[103] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.
[104] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision, pages 648-663. Springer, 2016.
[105] C.-H. Yeh, S.-J. F. Jiang, C.-Y. Lin, and M.-J. Chen. Temporal video transcoding based on frame complexity analysis for mobile video communication. Broadcasting, IEEE Transactions on, 59(1):38-46, 2013.
[106] K. Yoo and H. Song. In-loop selective processing for contour artifact reduction in video coding. Electron. Lett., 45:1020-1022, 2009.
[107] Y. J. Zhang. A survey on evaluation methods for image segmentation. Pattern Recognition, 29(8):1335-1346, 1996.
[108] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. arXiv preprint arXiv:1612.01105, 2016.
[109] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529-1537, 2015.
[110] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ADE20K dataset. arXiv preprint arXiv:1608.05442, 2016.
Abstract
Understanding the contents of images and video is important in both image processing and computer vision. Since digital image and video contents are in the mere form of two-dimensional pixel arrays, the computer needs to be guided to "see" and to "understand". To resolve these issues, traditional methods include two important steps: extracting effective feature representations and designing an efficient decision system. However, with the emergence of big visual data, the recent development of deep learning techniques provides a more effective way to simultaneously acquire feature representations and generate the desired output.

In this dissertation, we contribute four works that gradually develop from traditional methods to deep learning based methods. Based on their applications, the works can be divided into two major categories: perceptual quality enhancement and semantic image segmentation. In the first part, we focus on enhancing the quality of images and videos by considering related perceptual properties of the human visual system. To begin with, we deal with a type of compression artifact referred to as "false contour". We then focus on the visual experience of videos and its relationship with the display frame rate. In the second part, we target generating both low-level detailed segmentation and high-level semantic meaning on given images, which requires a more detailed understanding of images.

Based on the desired targets, our solutions are specifically designed and can efficiently resolve these problems with both traditional machine learning frameworks and state-of-the-art deep learning techniques.