A LEARNING-BASED APPROACH TO IMAGE QUALITY ASSESSMENT

by Tsung-Jung Liu

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2014

Copyright 2014 Tsung-Jung Liu

To my family and especially to my parents, Ming-Yuan Liu and Hsiu-Hsia Chen.

Acknowledgments

First, I would like to thank my advisor, Prof. C.-C. Jay Kuo, for his vision and guidance throughout my Ph.D. study. Because of his advice and encouragement, I was able to complete the Ph.D. study at USC within four years. I really learned a lot from him. Also, I would like to thank both Prof. C.-C. Jay Kuo and Prof. Weisi Lin for spending their time with me on research problem discussions and paper revisions. Second, I would like to thank Prof. Alexander Sawchuk, Prof. Keith Jenkins, Prof. Antonio Ortega, and Prof. David D'Argenio for their useful suggestions during my Qualifying Exam. Thirdly, I would like to express my gratitude to Prof. Panayiotis Georgiou and Prof. Aiichiro Nakano for their valuable comments on my Ph.D. Defense. Lastly, I would also like to thank my colleagues in the Media Communications Lab for their consultation and sharing during the Ph.D. process. In addition, I would like to say "Thank you" to my family. With their financial and emotional support, I was able to make it to the Ph.D. degree. Especially, I am grateful to have had my brother, Kuan-Hsien Liu, stay with me the whole time during the Ph.D. study. He encouraged and helped me in both research discussions and daily life.

Table of Contents

Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures x
Abstract xii
Chapter 1: Introduction 1
1.1 Significance of the Research 1
1.2 Review of Previous Work 3
1.3 Contributions of the Research 7
1.4 Organization of the Dissertation 10
Chapter 2: Background Review 12
2.1 Introduction 12
2.2 Classification of Objective Visual Quality Assessment Methods 15
2.2.1 Classification Based upon the Availability of Reference 15
2.2.2 Classification Based upon Methodology for Assessment 16
2.3 Recent Developments in IQA 18
2.3.1 Image Quality Databases 18
2.3.2 Major IQA Metrics 21
2.3.3 Application in Perceptual Image Coding 26
2.4 Recent Developments in VQA 30
2.4.1 Video Quality Databases 30
2.4.2 Major VQA Metrics 33
2.4.3 Application in Perceptual Video Coding 38
2.5 Performance Comparison 45
2.5.1 Image Quality Metric Benchmarking 46
2.5.2 Video Quality Metric Benchmarking 46
2.6 Applications of Machine Learning Method on Visual Quality Assessment 49
2.6.1 Classification 49
2.6.2 Regression 51
2.7 Conclusion 51
Chapter 3: Image Quality Assessment Using Multi-Method Fusion (MMF) 53
3.1 Introduction 53
3.2 Review of Previous Work 55
3.3 Multi-Method Fusion (MMF) 57
3.3.1 Motivation 57
3.3.2 Support Vector Regression (SVR) 58
3.3.3 MMF Scores 61
3.3.4 Data Scaling and Cross-Validation 62
3.3.5 Training Stage 62
3.3.6 Testing Stage 63
3.4 Contexts for MMF Scheme 63
3.4.1 Context Definition in CD-MMF 64
3.4.2 Automatic Context Determination in CD-MMF 65
3.5 Fused IQA Methods Selection 67
3.5.1 Sequential Forward Method Selection (SFMS) 68
3.5.2 Biggest Index Ranking Difference (BIRD) 69
3.5.3 Complexity Analysis 71
3.5.4 Discussion 72
3.6 Performance Evaluation 72
3.6.1 Databases 72
3.6.2 Test Methodology and Performance Measure 73
3.6.3 CF-MMF 74
3.6.4 CD-MMF 77
3.6.5 Performance Comparison between MMF and the Existing Methods 79
3.6.6 Cross-Database Evaluation 81
3.6.7 Further Discussion 91
3.7 Conclusion and Future Work 93
Chapter 4: A ParaBoost Method to Image Quality Assessment 96
4.1 Introduction 96
4.2 Review of Previous Work 98
4.3 Feature Extraction for Image Quality Scorers 99
4.3.1 Features for Basic Image Quality Scorers (BIQSs) 100
4.3.2 Features of Auxiliary Image Quality Scorers (AIQSs) 102
4.4 IQS Evaluation, Training and ParaBoost 110
4.4.1 Contribution Evaluation of IQSs 110
4.4.2 Training of BIQS and AIQS Models 113
4.4.3 ParaBoost 114
4.5 Scorer Selection in ParaBoost IQA System 116
4.6 Experimental Results 121
4.6.1 Image Quality Databases 121
4.6.2 Performance Measures for IQA Methods 122
4.6.3 Performance Comparison 123
4.7 Conclusion and Future Work 125
Chapter 5: No-Reference Image Quality Assessment by Ensemble Method 128
5.1 Introduction 128
5.2 Review of Previous Work 130
5.3 Multi-Perceptual-Domain Features 133
5.3.1 Brightness 133
5.3.2 Contrast 134
5.3.3 Color 135
5.3.4 Distortion 137
5.3.5 Texture 140
5.4 Multi-Perceptual-Domain Scorer Ensemble 142
5.4.1 Building Scorer Models 142
5.4.2 Ensemble System 143
5.4.3 IQS Selection 146
5.5 Experiments 147
5.5.1 Databases 147
5.5.2 Performance Measure Indices 148
5.5.3 Performance Comparisons 149
5.6 Conclusion 151
Chapter 6: Conclusion and Future Work 153
6.1 Summary of the Research 153
6.2 Future Research Directions 156
6.2.1 PSNR or SSIM-modified Metrics 156
6.2.2 Multiple Strategies or Multi-Metric Fusion Approaches 157
6.2.3 Migration from IQA to VQA 157
6.2.4 Audiovisual Quality Assessment for 4G Networks 158
6.2.5 Perceptual Image/Video Coding 159
6.2.6 No-Reference (NR) Quality Metrics 160
Bibliography 161

List of Tables

2.1 Comparison of Image Quality Databases (notes: '-' means no information available; 'proprietary' means the testing method is designed by the authors, not in [21] and [25].) 20
2.2 Classification of IQA Models Based on Reference Availability and Assessment Methodology 27
2.3 Comparison of Video Quality Databases 32
2.4 Classification of VQA Models Based on Reference Availability and Assessment Methodology 39
2.5 Performance Comparison among IQA Models in CSIQ Database 47
2.6 Performance Comparison among IQA Models in LIVE Image Database 47
2.7 Performance Comparison among IQA Models in TID2008 Database 47
2.8 Performance Comparison of VQA Models in LIVE Video Database 48
2.9 Performance Comparison of VQA Models in EPFL-PoliMI Database [103] 48
3.1 Ten Better-Recognized IQA Methods 54
3.2 Image Distortion Types in TID2008 Database 57
3.3 Top Three Quality Indices for Image Distortion Types in Table 3.2 (in Terms of the PCC Performance) 58
3.4 The Context Definition for Each Database 64
3.5 The Context Classification Rate for Each Database 66
3.6 Complexity Comparison of the Fused IQA Methods Selection Algorithms in Terms of Required Arithmetic Operations 71
3.7 Performance Measure of CF-MMF with SFMS in TID2008 Database 74
3.8 Selected Fused IQA Methods for CF-MMF in Six Databases 76
3.9 Performance Measure for CF-MMF in Six Databases 76
3.10 Performance Measure of CD-MMF with SFMS in TID2008 Database 77
3.11 Selected Fused IQA Methods for CD-MMF in Six Databases 78
3.12 Performance Measure for CD-MMF in Six Databases 78
3.13 Performance Comparison among 15 IQA Models in Six Databases 80
3.14 Cross-Database PCC Performance of MMF (in Terms of Distortion Groups) 90
3.15 Cross-Database PCC Performance of MMF (in Terms of Whole Database) 90
3.16 n-Fold Cross-Validation Performance of CF-MMF with SFMS in TID2008 Database 91
3.17 Performance Comparison among Fused IQA Methods Selection Algorithms for CF-MMF in LIVE Database 91
3.18 Performance Comparison among Different Objective Functions J(Y_N) for CF-MMF (with SFMS) in LIVE Database 92
3.19 Performance Comparison in TID2008 Database 92
3.20 Comparison of Computation Time among 15 IQA Models in CSIQ Database 93
4.1 Image Distortion Types in TID2008 Database 103
4.2 SROCC Performance of BIQSs vs. Distortion Types in TID2008 Database 103
4.3 Performance of Each IQS in TID2008 Database 111
4.4 SROCC Performance of AIQSs vs. Distortion Types in TID2008 Database 111
4.5 SROCC Performance Boosting by Considering Both BIQS and AIQS in TID2008 Database 113
4.6 IQS Categorization 113
4.7 Training Image Sets (Distortion Types) for BIQSs 114
4.8 Training Image Sets (Distortion Types) for AIQSs 114
4.9 1-Way ANOVA and K-W Statistic for Each IQS in TID2008 Database 119
4.10 Performance of Selected IQSs by Methods 1 and 2 in TID2008 Database 120
4.11 Normal Distribution Test for Each IQS in TID2008 Database 121
4.12 Performance of Each IQS in the Other Three Databases 123
4.13 Selected IQSs by Methods 1 and 2 in Four Databases 123
4.14 Performance Comparisons for Different Combinations of IQSs 125
4.15 Performance Comparison among 12 IQA Models in Four Databases 126
4.16 Cross-Database SROCC Performance of ParaBoost System 126
5.1 Mapping Table for Each Color Channel 136
5.2 List of Multi-Perceptual-Domain Features 143
5.3 Image Distortion Types in TID2013 148
5.4 Performance of Each IQS in LIVE and TID2013 149
5.5 Selected IQS by SFSS in LIVE and TID2013 150
5.6 Performance Comparisons for Different Combinations of IQSs 150
5.7 Performance Comparison among 7 IQA Models in LIVE and TID2013 151
5.8 SROCC Performance of 7 IQA Models w.r.t. Distortion Types in LIVE Database 151
5.9 SROCC Performance of 7 IQA Models w.r.t. Distortion Types in TID2013 Database 152

List of Figures

2.1 Classifications of objective visual quality assessment (QA) metrics. 18
3.1 The block diagram of the proposed CF-MMF (without block A) and CD-MMF (with block A) quality assessment system. 67
3.2 PCC performance of CF-MMF in TID2008 database. 75
3.3 PCC comparison of CD-MMF in TID2008 database. 77
3.4 Comparison of the PCC measure of 15 IQA models in the TID2008 database. 81
3.5 Scatter plots and best fitting logistic function of objective IQA (MS-SSIM) scores vs. MOS for the TID2008 database. 82
3.6 Scatter plots and best fitting logistic function of objective IQA (SSIM) scores vs. MOS for the TID2008 database. 82
3.7 Scatter plots and best fitting logistic function of objective IQA (VIF) scores vs. MOS for the TID2008 database. 83
3.8 Scatter plots and best fitting logistic function of objective IQA (VSNR) scores vs. MOS for the TID2008 database. 83
3.9 Scatter plots and best fitting logistic function of objective IQA (NQM) scores vs. MOS for the TID2008 database. 84
3.10 Scatter plots and best fitting logistic function of objective IQA (PSNR-HVS) scores vs. MOS for the TID2008 database. 84
3.11 Scatter plots and best fitting logistic function of objective IQA (IFC) scores vs. MOS for the TID2008 database. 85
3.12 Scatter plots and best fitting logistic function of objective IQA (PSNR) scores vs. MOS for the TID2008 database. 85
3.13 Scatter plots and best fitting logistic function of objective IQA (FSIM) scores vs. MOS for the TID2008 database. 86
3.14 Scatter plots and best fitting logistic function of objective IQA (MAD) scores vs. MOS for the TID2008 database. 86
3.15 Scatter plots and best fitting logistic function of objective IQA (IW-SSIM) scores vs. MOS for the TID2008 database. 87
3.16 Scatter plots and best fitting logistic function of objective IQA (CF-MMF (SFMS)) scores vs. MOS for the TID2008 database. 87
3.17 Scatter plots and best fitting logistic function of objective IQA (CF-MMF (BIRD)) scores vs. MOS for the TID2008 database. 88
3.18 Scatter plots and best fitting logistic function of objective IQA (CD-MMF (SFMS)) scores vs. MOS for the TID2008 database. 88
3.19 Scatter plots and best fitting logistic function of objective IQA (CD-MMF (BIRD)) scores vs. MOS for the TID2008 database. 89
4.1 Spatial relationship of pixel of interest in GLCM. 104
4.2 Circularly symmetric neighbor set in LBP. 106
4.3 Extracting features of the 4th and 5th AIQSs. 107
4.4 Extracting features of the 10th AIQS. 110
4.5 The original and distorted hat images and their corresponding Sobel edge maps. 112
4.6 The ParaBoost IQA system. 115
5.1 Spatial relationship of pixel of interest in GLCM. 135
5.2 Histogram of channel Y and corresponding characteristics. 136
5.3 Circularly symmetric neighbor set in LBP. 141
5.4 Extraction procedure of the 5th and 6th texture features. 143
5.5 The MPDSE IQA system. 144

Abstract

Research on visual quality assessment has been active during the last decade. This dissertation consists of six parts centered on this subject.

In Chapter 1, we highlight the significance and contributions of our research work. The previous work in this area is also thoroughly reviewed.

In Chapter 2, we provide an in-depth review of recent developments in the field. As compared with others' work, our survey has several contributions. First, besides image quality databases and metrics, we put emphasis on video quality databases and metrics, since this is a less investigated area. Second, we discuss the application of visual quality evaluation to perceptual coding as an example of applications. Thirdly, we compare the performance of state-of-the-art visual quality metrics with experiments. Finally, we introduce the machine learning methods that can be applied to visual quality assessment.

In Chapter 3, a new methodology for objective image quality assessment (IQA) with multi-method fusion (MMF) is proposed. The research is motivated by the observation that there is no single method that can give the best performance in all situations. To achieve MMF, we adopt a regression approach. The new MMF score is set to be a nonlinear combination of scores from multiple methods with suitable weights obtained by a training process. In order to improve the regression results further, we divide distorted images into three to five groups based on the distortion types and perform regression within each group, which is called "context-dependent MMF" (CD-MMF). One task in CD-MMF is to determine the context automatically, which is achieved by a machine learning approach. To further reduce the complexity of MMF, we perform algorithms to select a small subset from the candidate method set. The result is very good even if only 3 quality assessment methods are included in the fusion process. The proposed MMF method using support vector regression (SVR) is shown to outperform a large number of existing IQA methods by a significant margin when tested on six representative databases.

In Chapter 4, an ensemble method for full-reference image quality assessment (IQA) based on the parallel boosting (or ParaBoost in short) idea is proposed. We first extract features from existing image quality metrics and train them to form basic image quality scorers (BIQSs). Then, we select additional features to address specific distortion types and train them to construct auxiliary image quality scorers (AIQSs). Both BIQSs and AIQSs are trained on small image subsets of certain distortion types and, as a result, they are weak performers with respect to a wide variety of distortions. Finally, we adopt the ParaBoost framework to fuse the scores of BIQSs and AIQSs to evaluate images containing a wide range of distortion types.
This ParaBoost methodology can be easily extended to images of new distortion types. Extensive experiments are conducted to demonstrate the superior performance of the ParaBoost method, which outperforms existing IQA methods by a significant margin. Specifically, the Spearman rank order correlation coefficients (SROCCs) of the ParaBoost method with respect to the LIVE, CSIQ, TID2008 and TID2013 image quality databases are 0.98, 0.97, 0.98 and 0.96, respectively.

In Chapter 5, a no-reference learning-based approach to assess image quality is presented. The developed features are extracted from multiple perceptual domains, including brightness, contrast, color, distortion, and texture. The features are then trained to become a model (scorer) which can predict scores. A scorer selection algorithm is utilized to help simplify the proposed system. In the final stage, the ensemble method is used to combine the prediction results from all scorers. Differing from other existing image quality assessment (IQA) methods based on natural scene statistics (NSS) or distortion-dependent features, the proposed quality prediction model is robust with respect to more than 24 image distortion types. The extensive experiments on two well-known databases confirm the performance robustness of our proposed model.

Chapter 6 summarizes the work presented in the dissertation. In addition, we point out and discuss several possible directions for future visual signal quality assessment, i.e., PSNR or SSIM-modified metrics, multiple-strategy and multi-metric fusion approaches, migration of IQA to VQA, joint audiovisual assessment, perceptual image/video coding, and NR quality assessment, with reasoning based upon our experience and understanding of the related research.

Chapter 1 Introduction

1.1 Significance of the Research

During recent years, digital images and videos have played more and more important roles in our work and life because of their increasing availability and accessibility. Thanks to the rapid advancement of new technology, people can easily have an imaging device, such as a digital camera, camcorder, or cellular phone, to capture what they see and what happens in daily life. In addition, with the development of social networks and mobile devices, photo and video sharing on the Internet has become much more popular than before. Quality assessment and assurance for digital images and videos in an objective manner have become an increasingly useful and interesting topic in the research community.

Generally speaking, visual quality assessment can be divided into two categories. One is subjective visual quality assessment, and the other is objective visual quality assessment. As the name implies, the former is done by humans. It represents the most realistic opinion of humans towards an image or a video, and also the most reliable measure of visual quality among all available means (if the pool of subjects is sufficiently large and the nature of the circumstances allows such assessments). However, the subjective method is time-consuming and not applicable to real-time processing, since the test has to be performed carefully in order to obtain meaningful results. Moreover, it is not feasible to have human intervention in in-loop and on-service processes (like video encoding, transmission, etc.). Thus, most research has been focused on automatic assessment of quality for an image or a video.
Decades ago, peak signal-to-noise ratio (PSNR) was the most well-known and trustworthy method to gauge the quality of an image or a video. Although some researchers started to have doubts about the credibility of PSNR [51], it was still widely used in the multimedia field. The main reason for this is that PSNR is probably the easiest image quality assessment (IQA) method for computing visual quality, and no better yet equally simple approach had been brought into this area to replace it. In 2004, Wang et al. [133] proposed the Structural SIMilarity (SSIM) index to measure image quality, and the resulting performance of SSIM proved to be more promising than that of most other methods. Accordingly, visual-quality-related research has boomed and made greater progress than before.

As we know, various image quality assessment methods have been developed to reflect human visual quality experience, including MS-SSIM [139], SSIM [133], VIF [117], VSNR [33], NQM [41], PSNR-HVS [43], IFC [118], FSIM [148], and MAD [63]. However, with the increasing demand for image quality assurance and assessment, more and more databases have been made publicly available in recent years, such as LIVE [9], TID2008 [16], and CSIQ [2], to facilitate the development of image quality metrics. SSIM and its variant IW-SSIM [135] cannot continue to work the best across all participating databases. The feature similarity (FSIM) index [148] seems to be the one that can beat SSIM in several databases, but it still cannot perform better than all other quality metrics with respect to all kinds of image contents and distortion types. Hence, the formula-based IQA method has gradually been replaced by the learning-oriented approach, because it is sometimes hard to predict visual quality with only a formula under such rich image distortion types and diversifying image contents [76, 79]. In addition, there is no single quality index that can significantly outperform the others so far. Some methods may be superior for one image distortion type but inferior for others. Thus, the idea of multi-method fusion (MMF) arises naturally.

The methods for objective IQA can be classified into three categories [44]: full-reference (FR), reduced-reference (RR), and no-reference (NR). An FR IQA method needs a distortion-free image as the reference for the distorted one to compare with. Some well-known FR methods include SSIM [133], FSIM [148], MAD [63], and MMF [75, 78]. They are well-established and have been proved to work very well on several databases. If the information of the reference image is partially available (e.g., in the form of a set of extracted features), then this is the so-called RR IQA method. Moreover, if the original image is not available for us to compare with distorted ones, then we call this approach the NR (or blind) IQA method. The NR method does not always perform as well as the FR one because it judges the quality solely based on the distorted image without any reference. However, it can be used in a wider scope of applications (e.g., multimedia over wired/wireless networks, image/video retargeting, and computer graphics/animation), and the computational requirement is usually lower since there is no need to process the reference. Therefore, more and more researchers have started to dive into the development of NR IQA methods.
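To make the multi-method fusion idea above concrete, the following minimal sketch shows one way such a score-level fusion could be set up with off-the-shelf tools: the per-image scores of several existing metrics form a feature vector, and a support vector regressor is trained against subjective MOS. The metric scores, MOS values, and SVR parameters below are illustrative placeholders, not the configuration used in this dissertation.

# Minimal sketch of score-level multi-method fusion with support vector
# regression (SVR). Each row holds the scores that several existing IQA
# metrics (e.g., PSNR, SSIM, VIF) assign to one distorted image; values
# here are placeholders for illustration only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

metric_scores = np.array([   # rows = distorted images, columns = candidate methods
    [31.2, 0.91, 0.78],
    [24.5, 0.72, 0.41],
    [28.9, 0.85, 0.63],
    [21.3, 0.60, 0.30],
])
mos = np.array([7.8, 4.1, 6.5, 3.2])   # subjective scores for the same images

# Scale each metric to [0, 1] so that no single method dominates the kernel.
scaler = MinMaxScaler()
X = scaler.fit_transform(metric_scores)

# A nonlinear (RBF-kernel) regressor learns the fusion weights implicitly.
fusion = SVR(kernel="rbf", C=10.0, epsilon=0.1)
fusion.fit(X, mos)

# The fused score of a new image is simply the regressor's prediction.
new_scores = scaler.transform([[27.0, 0.80, 0.55]])
print(fusion.predict(new_scores))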
1.2 Review of Previous Work

Machine learning and multiple-metric based image quality assessment methods have been reported in the literature before. Luo [83] proposed a two-step algorithm to assess image quality. First, a face detection algorithm is used to detect human faces in the image. Second, the spectrum distribution of the detected region is compared with a trained model to determine its quality score. The restriction is that it primarily applies to images that contain human faces. Although the authors claimed it is not difficult to generalize from faces to other objects, they still only provided results that used images containing human faces to prove the feasibility of their algorithm.

Suresh et al. [121, 122] proposed the use of a machine learning method to measure the visual quality of JPEG-coded images. Features are extracted by considering factors related to human visual sensitivity, such as edge length, edge amplitude, background luminance, and background activity. The visual quality of an image is then computed using the predicted class number and the estimated posterior probability. It was shown by experimental results that the approach performs better than other metrics. However, it is only applicable to JPEG-coded images since the above features are calculated based on the DCT blocks.

The machine learning tool has also been used in developing objective image quality metrics. For example, Narwaria and Lin [91] proposed to use singular vectors from singular value decomposition (SVD) as features to quantify the major structural information in images. Then, they applied support vector regression (SVR) for image quality prediction, where the SVR method has the ability to learn complex data patterns and map complicated features into a proper score. The result is better than most of the existing formula-based approaches, as demonstrated in [91]. Moreover, Leontaris et al. [66] collected 15 metrics and evaluated each one of them to see if it satisfies the expectations of a good video quality metric. In the end, they linearly combined two metrics (MCEAM and GBIM) to get a hybrid metric by using simple coefficients as the weights.

In summary, the works in [83, 91, 121, 122] used extracted features for model training and testing. They are related to machine learning. Another work [66] showed the advantage of integrating results from two methods (although it did not use a machine learning approach).

A block-based MMF (BMMF) [59] method was also proposed for image quality assessment. First, an image is decomposed into small blocks. Blocks are then classified into three types (smooth, edge, and texture), while distortions are classified into five groups. Finally, one proper IQA metric is selected for each block based on the block type and the distortion group. Pooling over all blocks leads to the final quality score of a test image. It offers performance competitive with MMF for the TID2008 database.

As compared with the previous work, the proposed multi-method fusion (MMF) and ensemble method are new since they offer a generic framework that enables better performance than existing methods and serves as a reference for future research. Both of them are also developed in a systematic manner to complement the existing (and even future) approaches.

In addition to the FR methods, the existing NR IQA methods can be classified into two types. One is the formula-based approach, and the other is the learning-based approach.
Most of the formula-based approaches are distortion specific, which assumes that the type of image distortion is known. Usually, they can only target one or two distortion types. For example, Wang et al. [137] introduced three features to measure blockiness, activity, and blurriness for JPEG compressed images: the average differences across block boundaries, the average absolute difference between in-block image samples, and the zero-crossing rate. The three features are then combined into a quality assessment model. In addition, Ferzli et al. [45] proposed an objective image sharpness metric, called the Just Noticeable Blur Metric (JNBM). They claimed that the just noticeable blur (JNB) is a function of local contrast and can be used to derive an edge-based sharpness metric with a probability summation model over space. The experimental results showed that this method can successfully predict the relative amount of sharpness/blurriness in images, even across different scenes.

Natural scene statistics (NSS) have been applied to learning-based NR IQA [89, 113, 87]. When images are properly normalized or transformed to another domain (e.g., wavelet or DCT), local descriptors (e.g., wavelet coefficients) can be modeled by some probability distribution. Since the shapes of the distributions are different for a reference (undistorted) image and its corresponding distorted version, this can be used to differentiate image quality. Well-known examples are DIIVINE [89], BLIINDS-II [113], and BRISQUE [87]. We briefly introduce them below.

DIIVINE [89] is a two-stage framework consisting of distortion identification followed by distortion-specific quality assessment. First, the distorted image is decomposed using the wavelet transform, and the obtained subband coefficients are utilized to extract statistical features. The features can be mapped to the quality score by a regression model. Then the algorithm estimates the probability of the presence of five distortions (including JPEG, JPEG2000, white noise, Gaussian blur, and fast fading) in an image. In the next stage, an image quality score is computed for each of these distortions. Finally, the overall image quality is the probability-weighted sum of the scores from each distortion category. The index can be applied to images consisting of multiple distortion types.

Later, an NSS-based NR IQA approach was developed, called BLIINDS-II [113]. First, an image is partitioned into equally sized blocks, and then local DCT coefficients are computed on each of the blocks. The second stage of this approach applies a generalized Gaussian model to each block of DCT coefficients. In the third stage, several parameters are derived based on the generalized Gaussian model. These parameters become the features which can be used to predict quality scores. The final stage employs a Bayesian model to predict the quality of the image.

The most recent NSS approach was proposed by Mittal [87]. This model (BRISQUE) does not compute distortion-specific features, but uses the scene statistics of locally normalized luminance coefficients as features to quantify the potential loss of naturalness in the image due to the presence of distortions. Moreover, this approach does not need to transform images into another domain, which greatly lowers the computational complexity and makes it suitable for real-time applications.

Instead of using the above hand-crafted local descriptors, Ye et al. proposed another type of approach, which is based on feature learning.
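Before turning to those feature-learning methods, the following minimal sketch illustrates the "locally normalized luminance coefficients" that BRISQUE-style features build on, i.e., mean-subtracted contrast-normalized (MSCN) coefficients computed from Gaussian local statistics. The window parameter and stabilizing constant are assumed values for illustration, not necessarily those of the cited papers.

# Sketch of MSCN (mean-subtracted, contrast-normalized) coefficients, the
# locally normalized luminance used by BRISQUE-style NR IQA features.
# The Gaussian window (sigma = 7/6) and constant c are common but assumed
# choices here, not the dissertation's settings.
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray_image, sigma=7.0 / 6.0, c=1.0):
    img = gray_image.astype(np.float64)
    mu = gaussian_filter(img, sigma)                    # local mean
    var = gaussian_filter(img * img, sigma) - mu * mu   # local variance
    sd = np.sqrt(np.maximum(var, 0.0))                  # local standard deviation
    return (img - mu) / (sd + c)                        # normalized coefficients

# Distortions change the distribution of these coefficients, so simple
# statistics of them (e.g., variance, kurtosis) can serve as NR features.
example = np.random.default_rng(0).integers(0, 256, size=(64, 64))
coeffs = mscn_coefficients(example)
print(coeffs.mean(), coeffs.var())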
The first approach they proposed is called CBIQ [145]. In the first step, Gabor filters are used for local feature extraction. Then a visual codebook is created by applying a clustering algorithm to the Gabor feature vectors from all training images. The third step is to encode features via hard-assignment coding and average pooling. Finally, the codewords are used as the input to a regression model for estimating quality scores. However, the use of Gabor-filter-based features and a very large codebook (300,000 codewords) makes this approach highly computationally expensive.

To improve the first approach, they developed a second method, called CORNIA [146]. First, raw-image-patch local descriptors are used, which can be easily computed. Second, they use a codebook-based approach to learn features automatically. The codebook is constructed by K-means clustering on local features extracted from unlabeled training images. Third, they use soft-assignment coding with maximum pooling for encoding. This process is parameter free and also computationally efficient. In the last stage, support vector regression (SVR) with a linear kernel is adopted for quality estimation. Compared to CBIQ, CORNIA does not require labels to construct the codebook. It is also able to estimate quality more accurately while using a smaller codebook (10,000 codewords).

Based on the above discussion, we are aware that the learning-based approach is a better candidate for general-purpose NR IQA algorithms, since the complicated relationship between quality scores and features for images with different distortion levels cannot be easily expressed by a single formula. In addition, NSS-based methods are only well suited for assessing natural scene images rather than artificial images. This also means that the performance of an NSS method is image content dependent. Without training on certain types of image content, an NSS method will not be robust in estimating quality scores for all kinds of images. The feature learning approach uses a large set of codewords to capture different types of distortions, but its performance drops drastically when only a small set of codewords is used. Therefore, to achieve satisfactory performance, this type of method has to consume a large amount of memory.

1.3 Contributions of the Research

Several contributions are made in this research. They are described as follows.

Chapter 2 aims at an overview and discussion of the latest research in the area of objective quality evaluation of visual signals (for both image and video). There have been a few good survey papers in this area before, such as [44, 74, 143]. Our current work has several new contributions:
- An equal emphasis on image and video quality assessment. Video quality assessment is a rapidly growing field and has progressed a lot in the last 3-4 years. The recent developments have not been well covered in the existing survey papers. Here, we provide the most updated results in this field.
- An in-depth discussion on the application of visual quality assessment to perceptual image/video coding, which is one of the most researched areas in applications.
- Benchmarking the performance of several state-of-the-art quality metrics for both images and videos with appropriate databases and experiments.

The major contribution of Chapter 3 is to provide a new perspective on visual quality assessment to complement existing efforts that target developing a single method suitable for certain types of images.
Given the complex and diversifying nature of general visual content and distortion types, it would be challenging to rely solely on a single method. On the other hand, we demonstrate that it is possible to achieve significantly better performance by fusing multiple methods with proper means (e.g., machine learning) at the cost of higher complexity. The performance of the proposed multi-method fusion (MMF) scheme will improve continuously as new methods invented by the research community are incorporated. As compared with the previous work, the proposed multi-method fusion (MMF) idea is new since it offers a generic framework that enables better performance than existing methods and serves as a reference for future research. The contributions of MMF are listed below.
- Provide a detailed discussion on support vector regression (SVR) theory, which is the machine learning tool used for the fusion of multiple methods.
- Offer an elaborated study on the fusion rules. Specifically, we develop a new fused IQA methods selection algorithm called the Biggest Index Ranking Difference
Then the quality score prediction model (also called a scorer) is built for each feature via machine learning. Since each scorer plays a dierent role at judging image quality. Some are based on color, and the other are based on the contrast or texture. Instead of using conventional weighting scheme or the ad-hoc approach, we use ensemble method to combine the opinions from all IQSs intelligently. In addition, SFSS is employed in the system to perform IQS selection. This is a more systematic and reasonable way to achieve better performance and reduce the complexity. 1.4 Organization of the Dissertation The rest of this dissertation is organized as follows. Recent developments, and coding applications of visual quality assessment are addressed in Chapter 2. A review of machine learning methods that can be applied on visual quality assessment is also introduced in the end of Chapter 2. Two multi-method fusion (MMF) approaches are proposed to assess image quality in Chapter 3. In Chapter 4, we design some image quality scorers (IQSs) and use one type of ensemble structure (i.e., ParaBoost) to realize the image quality assessmnt system. The corresponding experimental results are both reported in Chapter 3 and Chapter 4, where extensive performance comparisons are made across 10 multiple image quality databases. In addition to the full-reference (FR) methods pro- posed in Chapter 3 and Chapter 4, we also present a no-reference (NR) learning-based approach to assess image quality in Chapter 5. Finally, concluding remarks and future works are given in Chapter 6. 11 Chapter 2 Background Review 2.1 Introduction During recent years, digital images and videos play more and more important roles in our work and life because of the increasing availability and accessibility. Thanks to the rapid advancement of new technology, people can easily have an imaging device, such as a digital camera, camcorder and cellular phone, to capture what they see and what happen in daily life. In addition, with the development of social network and mobile devices, photo and video sharing on the Internet becomes much more popular than before. Quality assessment and assurance for digital images and videos in an objective manner become an increasingly useful and interesting topic in the research community. Generally speaking, visual quality assessment can be divided into two categories. One is subjective visual quality assessment, and the other is objective visual quality assessment. As the name implies, the former is done by humans. It represents the most realistic opinion of humans towards an image or a video, and also the most reliable measure of visual quality among all available means (if the pool of subjects is suciently large and the nature of the circumstances allows such assessments). For subjective evaluation of visual quality, the tests can be performed with the meth- ods dened in [21, 25]: (a) Pair Comparison (PC) The method of Pair Comparisons implies that the test sequences are presented in pairs, consisting of the same sequence being presented rst through one system under test and then through another system. (b) Absolute Category Rating (ACR) 12 The Absolute Category Rating method is a category judgment where the test sequences are presented one at a time and are rated independently on a discrete ve-level scale from 'bad' to 'excellent'. This method is also called Single Stimulus Method. 
(c) Degradation Category Rating (DCR) (also called the Double-Stimulus Impairment Scale (DSIS)) The reference picture (sequence) and the test picture (sequence) are presented only once or twice. The reference is always shown before the test sequence, and neither is repeated. Subjects rate the amount of impairment in the test sequence on a discrete ve-level scale from 'very annoying' to 'imperceptible'. (d) Double-Stimulus Continuous Quality Scale (DSCQS) The reference and test sequences are presented twice in alternating fashion, in the order of the two chosen randomly for each trial. Subjects are not informed which one is the reference and which one is the test sequence. They rate each of the two separately on a continuous quality scale ranging from 'bad' to 'excellent'. Analysis is based on the dierence in rating for each pair, which is calculated from an equivalent numerical scale from 0 to 100. (e) Single-Stimulus Continuous Quality Evaluation (SSCQE) Instead of seeing separate short sequence pairs, subjects watch a program of 20-30 min- utes duration which has been processed by the system under test. The reference is not shown. The subjects continuously rate the perceived quality on the continuous scale from 'bad' to 'excellent' using a slider. (f) Simultaneous Double-Stimulus for Continuous Evaluation (SDSCE) The subjects watch two sequences at the same time. One is the reference sequence, and the other one is the test sequence. If the format of the sequences is the standard image format (SIF) or smaller, the two sequences can be displayed side by side on the same monitor; otherwise two aligned monitors should be used. Subjects are requested to check the dierences between the two sequences and to judge the delity of the video by moving the slider. When the delity is perfect, the slider should be at the top of the 13 scale range (coded 100); when the delity is the worst, the slider should be at the bottom of the scale (coded 0). Subjects are aware of which one is the reference and they are requested to express their opinion while they view the sequences throughout the whole duration. In general, methods (a)-(c) above can be used in multimedia applications. Television pictures can be evaluated with methods (c)-(f). In all these test methods, the visual quality ratings evaluated by the test subjects are then averaged to obtain the Mean Opinion Score (MOS). In some cases, Dierence Mean Opinion Score (DMOS) is used to represent the mean of dierential subjective score instead of MOS. However, the subjective method is time-consuming, and not applicable for real-time processing since the test has to be performed carefully in order to obtain meaningful results. Moreover, it is not feasible to have human intervention with in-loop and on- service processes (like video encoding, transmission, etc.). Thus, most research has been focused on automatic assessment of quality for an image or a video. This chapter aims at an overview and discussion of the latest research in the area of objective quality evaluation of visual signals (for both image and video). There have been a few good survey papers in this area before, such as [44, 74, 143]. Our current work has several new contributions. First, we put an equal emphasis on image and video quality assessment. The video quality assessment is a rapidly growing eld and it has progressed a lot in the last 3-4 years. The recent developments have not been well covered in the existing survey papers. Here, we provide the most updated results in this eld. 
Second, we have an in-depth discussion on the application of visual qual- ity assessment to perceptual image/video coding, which is one of the most researched areas in applications. Thirdly, we benchmark the performance of several state-of-the-art quality metrics for both images and videos with appropriate databases and experiments. Lastly, we introduce the machine learning methods that can be applied on visual quality assessment. 14 The rest of this chapter is organized as follows. In Section 2.2, the classication of objective quality assessment methods will be presented. Recent developments, applica- tions, and publicly available databases in image quality assessment (IQA) will be exam- ined in Section 2.3, while those in video quality assessment (VQA) are to be introduced in Section 2.4. We follow the similar format of writing for images and videos respectively, for readers' easy reading, reference and comparison. Section 2.5 will present performance comparison for some recent popular visual quality metrics. In Section 2.6, we introduce the machine learning methods that can be applied on visual quality assessment. Finally, the conclusion will be drawn in Section 2.7. 2.2 Classication of Objective Visual Quality Assessment Methods There are several popular ways to classify the visual quality assessment methods [44, 74, 143]. In this section, we present two possibilities of classication to facilitate the presentation and understanding of the related problems, the existing solutions (taking into account of the most recent developments) and the future trends. 2.2.1 Classication Based upon the Availability of Reference The classication depends on the availability of original (reference) image/video. If there is no reference signal available for the distorted (test) one to compare with, then a quality evaluation method is termed as a no-reference (NR) one [84]. The current NR methods [99, 126] do not perform well in general since they judge the quality solely based on the distorted medium and without any reference available. If the information of the reference medium is partially available, e.g., in the form of a set of extracted features, then this is the so-called reduced-reference (RR) method [105]. Since the extracted partial reference information is much sparser than the whole reference, the RR approach can be used in a remote location (e.g., the relay site 15 and receiving end of transmission) with reasonable bandwidth overheads to achieve bet- ter results than the NR method, or in a situation where the reference is available (such as a video encoder) to reduce the computational requirement (especially in repeated manipulation and optimization). The last one is the full-reference (FR) method (e.g., [133]), as the opposite of the NR method. As the name suggests, an FR metric needs the complete reference medium to assess the distorted (test) medium. Since it has the full information about original medium, it is expected to have the best quality prediction performance. Most existing quality assessment schemes belong to this category, and can be usually used in image and video coding. We will discuss more in Sections 2.3 and 2.4. 2.2.2 Classication Based upon Methodology for Assessment The rst type in this classication is image/video delity metrics, which operate based only on the direct accumulation of errors and therefore are usually FR. Mean- squared error (MSE) and peak signal-to-noise ratio (PSNR) are two representatives in this category. 
Although being the simplest and still widely used, such a metric is often not a good re ection of perceived visual quality if the distortion is not additive. The second type is the human visual system (HVS) model based metrics, which typically employ a frequency-based decomposition, and take into account various aspects of the HVS. This can include modeling of contrast and orientation sensitivity, spatial and temporal masking eects, frequency selectivity and color perception. Due to the complexity of the HVS, these metrics can become very complex and computationally expensive. Examples of the work following this framework include the work in [40, 58, 82, 124], Perceptual Distortion Metric (PDM) [142], the continuous video quality metric in [86] and the scalable wavelet based video distortion index [85]. Recently, a new strategy to measure image quality, called most apparent distortion (MAD) [63], also belongs to this category. 16 Signal structure (information or other feature) based metrics are the third type of metrics. Some of them quantify visual delity based on the assumption that a high-quality image or video is the one whose structural content, such as object bound- aries or regions of high entropy, most closely matches that of the original image or video [117, 118, 133]. Other metrics of this type are based on the assumption that the HVS understands an image mainly through its low-level features. Hence, image degra- dations can be perceived by comparing the low-level features between the distorted and the reference images. The latest work is called feature-similarity (FSIM) index [148]. We will discuss more details on this type of metric in Section 2.3. The fourth type in the classication is packet-analysis based metrics. This type of metric focuses on the assessment of the impact caused by network impairments on visual quality. It is usually based on the parameters extracted from the transport stream to measure the quality loss. It also has the advantage of measuring the quality of several image/video streams in parallel. Lately, this type of metric becomes more popular because of the increasing video delivery service over networks, such as IPTV or Internet streaming. One example of such metrics is the V-Factor [143]. The details about this metric will be introduced in Section 2.4. The last type of metrics is the emerging learning-oriented metrics. Some recent works are [75, 77, 78, 83, 91, 93, 121]. Basically, it extracts the specic features from the image or video, and then uses the machine learning approach to obtain a trained model. Finally, the trained model is used to predict the perceived quality of images/videos. The obtained experimental results are quite promising, especially for multi-metric fusion (MMF) approach [75, 78] which uses the major existing metrics as the components for the learnt model. The MMF is expected to outperform all the existing metrics as the fusion-based approach to allow the combination of merits from each metric. For easy reference, we illustrate the metric classication in Fig. 2.1. 17 Figure 2.1: Classications of objective visual quality assessment (QA) metrics. 2.3 Recent Developments in IQA 2.3.1 Image Quality Databases Databases with subjective data facilitate metric development and benchmarking, as the ground truth and source of inspiration. There are a number of publicly available image quality database, including LIVE [9], TID2008 [16], CSIQ [2], IVC [7], IVC-LAR [8], Toy- oma [17], WIQ [20], A57 [1], and MMSP 3D Image [12]. 
We will give a brief introduction for each database below. LIVE Image Quality Database has 29 reference images (also called source refer- ence circuits (SRC)), and 779 test images, including ve distortion types - JPEG2000, JPEG, white noise in the RGB components, Gaussian blur, and transmission errors in the JPEG2000 bit stream using a fast-fading Rayleigh channel model. The subjective quality scores provided in this database are DMOS, ranging from 0 to 100. 18 Tampere Image Database 2008 (TID2008) has 25 reference images, and 1700 distorted images, including 17 types of distortions and 4 dierent levels for each type of distortion. Hence, there are 68 test conditions (also called hypothetical reference circuits (HRC)). MOS is provided in this database, and the scores range from 0 to 9. Categorical Image Quality (CSIQ) Database contains 30 reference images, and each image is distorted using 6 types of distortions - JPEG compression, JPEG2000 com- pression, global contrast decrements, additive Gaussian white noise, additive Gaussian pink noise, and Gaussian blurring - at 4 to 5 dierent levels, resulting in 866 distorted images. The score ratings (0 to 1) are reported in the form of DMOS. IVC Database has 10 original images, and 235 distorted images, including 4 types of distortions - JPEG, JPEG2000, locally adaptive resolution (LAR) coding, and blurring. The subjective quality scores provided in this database are MOS, ranging from 1 to 5. IVC-LAR Database contains 8 original images (4 natural images, and 4 art images), and 120 distorted images, including three distortion types - JPEG, JPEG2000, and LAR coding. The subjective quality scores provided in this database are MOS, ranging from 1 to 5. Toyoma Database has 14 original images, and 168 distorted images, including two types of distortions - JPEG, and JPEG2000. The subjective scores in this database are MOS, ranging from 1 to 5. Wireless Imaging Quality (WIQ) Database has 7 reference images, and 80 distorted images. The subjective quality scores used in this database are DMOS, ranging from 0 to 100. A57 Database has 3 original images, and 54 distorted images, including six dis- tortion types - quantization of the LH subbands of a 5-level DWT of the image using the 9/7 lters, additive Gaussian white noise, JPEG compression, JPEG2000 compres- sion, JPEG2000 compression with the Dynamic Contrast-Based Quantization (DCQ), and Gaussian blurring. The subjective quality scores used for this database are DOMS, ranging from 0 to 1. 19 Table 2.1: Comparison of Image Quality Databases (notes: '-' means no information available; 'proprietary' means the testing method is designed by the authors, not in [21] and [25].) Database Year SRC (no. of reference images) HRC (no. of test conditions) Total no. 
of test images Subjective Testing Method Subjective Score Applications and Merits IVC 2005 10 25 235 DSIS MOS (1 - 5) For testing IQA metrics on images having com- pression distortions LIVE 2006 29 27 779 ACR DMOS (0 - 100) For testing IQA met- rics on images having compression distortions, transmission distortions, and acquisition distor- tions A57 2007 3 18 54 - DMOS (0 - 1) For testing IQA metrics on images having com- pression distortions, and acquisition distortions Toyoma 2008 14 12 168 ACR MOS (1 - 5) For testing IQA metrics on images having com- pression distortions TID2008 2008 25 68 1700 Proprietary MOS (0 - 9) For testing IQA met- rics on images having compression distortions, transmission distortions, and acquisition distor- tions CSIQ 2009 30 29 866 Proprietary DMOS (0 - 1) For testing IQA met- rics on images having compression distortions, transmission distortions, and acquisition distor- tions IVC-LAR 2009 8 15 120 DSIS MOS (1 - 5) For testing IQA metrics on images having com- pression distortions WIQ 2009 7 - 80 DSCQS DMOS (0 - 100) For testing IQA metrics on images having trans- mission distortions MMSP 3D Image 2009 9 6 54 SSCQE MOS (0 - 100) For testing images on 3D Quality of Experience (QoE) MMSP 3D Image Quality Assessment Database contains stereoscopic images with a resolution of 19201080 pixels. Various indoor and outdoor scenes with a large variety of colors, textures, and depth structures have been captured. The database contains 10 scenes. Seventeen subjects participated in the test. For each of the scenes, 6 dierent stimuli have been considered corresponding to dierent camera distances (10, 20, 30, 40, 50 and 60 cm). To make a clear comparison among these databases, we list the important information for each database in Table 2.1. 20 2.3.2 Major IQA Metrics As mentioned earlier, the simplest and most widely used image quality metrics are MSE and PSNR since they are easy to calculate and are also mathematically convenient in the optimization sense. However, they often correlate poorly with subjective visual quality [132]. Hence, researchers have done a lot of work to include the characteristics of the HVS to improve the performance of the quality prediction. The noise quality mea- sure (NQM) [41], PSNR-HVS-M [109], and the visual signal-to-noise ratio (VSNR) [33] are several representatives in this category. NQM (FR, HVS model based metric), which is based on Peli's contrast pyra- mid [104], takes into account the following: 1. variation in contrast sensitivity with distance, image dimensions, and spatial fre- quency; 2. variation in the local luminance mean; 3. contrast interaction between spatial frequencies; 4. contrast masking eects. It has been demonstrated that the nonlinear NQM is a better measure of additive noise than PSNR and other linear quality measures [41]. PSNR-HVS-M (FR, HVS model based metric) is a still image quality metric which takes into account contrast sensitivity function (CSF) and between-coecient contrast masking of DCT basis functions. It has been shown that PSNR-HVS-M outperforms other well-known reference based quality metrics and demonstrated high correlation with the results of subjective experiments [109]. VSNR (FR, HVS model based metric) is a metric computed by a two-stage approach [33]. 
In the first stage, contrast thresholds for detection of distortions in the presence of natural images are computed via wavelet-based models of visual masking and visual summation, in order to determine whether the distortions in the distorted image are visible. If the distortions are below the threshold of detection, the distorted image is claimed to be of perfect visual quality. If the distortions are higher than the threshold, a second stage is applied, which operates based on the visual properties of perceived contrast and global precedence. These two properties are modeled as Euclidean distances in the distortion-contrast space of a multi-scale wavelet decomposition, and the final VSNR is obtained by linearly summing these distances.

However, the HVS is a nonlinear and highly complicated system, and most models so far are only based on quasi-linear or linear operators. Hence, a different framework was introduced, based on the assumption that a measurement of structural information change should provide a good approximation to perceived image distortion. The structural similarity (SSIM) index (FR, signal structure based metric) [133] is the most well-known one in this category. Suppose two image signals x and y, and let \mu_x, \mu_y, \sigma_x^2, \sigma_y^2, and \sigma_{xy} be the mean of x, the mean of y, the variance of x, the variance of y, and the covariance of x and y, respectively. Wang et al. [133] define the luminance, contrast and structure comparison measures as follows:

l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3},   (2.1)

where the constants C_1, C_2, C_3 are included to avoid instabilities when \mu_x^2 + \mu_y^2, \sigma_x^2 + \sigma_y^2, and \sigma_x\sigma_y are very close to zero. Finally, they combine these three comparison measures and name the resulting similarity measure between image signals x and y as

SSIM(x,y) = [l(x,y)]^{\alpha} [c(x,y)]^{\beta} [s(x,y)]^{\gamma},   (2.2)

where \alpha > 0, \beta > 0 and \gamma > 0 are the parameters used to adjust the relative importance of the three components. In order to simplify the expression, they set \alpha = \beta = \gamma = 1 and C_3 = C_2/2. This results in a specific form of the SSIM index between image signals x and y:

SSIM(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}.   (2.3)

However, the standard SSIM defined above is only a single-scale method. To consider image details at different resolutions (we do not know the right object sizes in general), the multi-scale SSIM (MS-SSIM) (FR, signal structure based metric) [139] is adopted. Taking the reference and distorted image signals as the input, the system iteratively applies a low-pass filter and down-samples the filtered image by a factor of two. The original image is labeled as scale 1, and the highest scale as M, which is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structure comparison are calculated and denoted as c_j(x,y) and s_j(x,y), respectively. The luminance comparison is computed only at scale M and denoted as l_M(x,y). The overall SSIM evaluation is obtained by combining the measurements at different scales using

MS-SSIM(x,y) = [l_M(x,y)]^{\alpha_M} \prod_{j=1}^{M} [c_j(x,y)]^{\beta_j} [s_j(x,y)]^{\gamma_j}.   (2.4)

Similarly, the exponents \alpha_M, \beta_j and \gamma_j are used to adjust the relative importance of the different components. As the simplest parameter selection, \alpha_j = \beta_j = \gamma_j for all j. In addition, normalization is performed across scales such that \sum_{j=1}^{M} \gamma_j = 1.
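To make Eq. (2.3) concrete, the following is a minimal Python sketch of a single-scale SSIM computation using Gaussian-weighted local statistics. The function name, the use of SciPy's gaussian_filter, and the constants C_1 = (0.01 L)^2 and C_2 = (0.03 L)^2 with window spread 1.5 are illustrative choices consistent with common SSIM settings, not the authors' exact implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_index(x, y, data_range=255.0, sigma=1.5, K1=0.01, K2=0.03):
    """Single-scale SSIM between two grayscale images, following Eq. (2.3)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1 = (K1 * data_range) ** 2
    C2 = (K2 * data_range) ** 2

    # Gaussian-weighted local means, variances, and covariance.
    mu_x = gaussian_filter(x, sigma)
    mu_y = gaussian_filter(y, sigma)
    sigma_x2 = gaussian_filter(x * x, sigma) - mu_x ** 2
    sigma_y2 = gaussian_filter(y * y, sigma) - mu_y ** 2
    sigma_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y

    # Local SSIM map, then spatial pooling by the mean.
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x2 + sigma_y2 + C2))
    return ssim_map.mean()

The MS-SSIM extension of Eq. (2.4) would wrap this computation in a loop that low-pass filters and downsamples both images by two at each of the M scales before pooling the per-scale terms.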
Since SSIM is sensitive to relative translations, rotations, and scalings of images [132], complex-wavelet SSIM (CW-SSIM) [138] has been developed. The CW-SSIM is locally computed from each subband, and then averaged over space and subbands, yielding an overall CW-SSIM index between the original and the distorted images. The CW- SSIM method is robust with respect to luminance changes, contrast changes and trans- lations [138]. Afterward, some researchers have tried to propose a new metric by modifying SSIM, such as 3-component weighted SSIM (3-SSIM) [67], and information content weighted 23 SSIM (IW-SSIM) [135]. They are all based on the similar strategy to assign dierent weightings to the SSIM scores. Another metric based on the information theory to measure image delity is called information delity criterion (IFC) (FR, Signal information extracted metric) [118]. It was later extended to visual information delity (VIF) metric (FR, Signal infor- mation extracted metric) [117]. The VIF attempts to relate signal delity to the amount of information that is shared between two signals. The shared information is quantied using the concept of mutual information. The reference image is modeled by a wavelet domain Gaussian scale mixture (GSM), which has been shown to model the non-Gaussian marginal distributions of the wavelet coecients of natural images eectively, and also capture the dependencies between the magnitudes of neighboring wavelet coecients. Therefore, it brings good performance to the VIF index over a wide range of distortion types[119]. Reduced-reference image quality assessment (RRIQA) (RR, Signal feature extracted metric) is proposed in [69]. The authors use Gaussian scale mixture (GSM) statistical model of image wavelet coecients to compute a divisive normalization trans- form (DNT) for images. Then they evaluate the image quality based on the comparison between features extracted from the DNT of reference and distorted images. The pro- posed RR approach has improved performance and even works better than FR PSNR in LIVE Image Quality Database. In [49], multi-scale geometric analysis (MGA) is used to decompose images and extract features to model the multi-channel structure of HVS. Moreover, several trans- forms (e.g., wavelet, curvelet, bandelet, and contourlet) are also utilized to capture the dierent kinds of geometric information of images. CSF is used to weight the coecients obtained by the MGA. Next, Just Noticeable Dierence (JND) is applied to produce a noticeable variation. Finally, the quality of the distorted image is obtained by comparing the normalized histogram between the distorted image and the reference one. In addi- tion to the good consistency with human subjective evaluation, this MGA-based IQA 24 (RR, Signal feature extracted metric) also has the advantage of using low data rate to represent features. Ferzli et al. [45] proposed an objective image sharpness metric, called Just Notice- able Blur Metric (JNBM) (NR, HVS model based metric). They claimed the just noticeable blur (JNB) is a function of local contrast and can be used to derive an edge-based sharpness metric with probability summation model over space. The experiment results showed this method can successfully predict the relative amount of sharpness/blurriness in images, even with dierent scenes. In [37], the authors presented a method for IQA by combining the features obtained from the computation of mean and ratio of edge blurriness and noise (MREBN). 
The proposed metric MREBN (NR, Signal feature extracted metric) has high correlation with subjective quality scores. They also claimed the low computational load of the model because of the linear combination of the features obtained. In [63], Larson and Chandler suggested that a single strategy may not be sucient to determine the image quality. They presented a quality assessment method, called most apparent distortion (MAD) (FR, HVS model based metric), which can model two dierent strategies. First, they used local luminance and contrast masking to estimate the detection-based perceived distortions in high quality images. Then changes in the local statistics of spatial-frequency components are used to estimate the appearance- based perceived distortions in low quality images. In the end, the authors showed that combining these two strategies can predict subjective ratings of image quality well. FSIM (FR, Signal feature extracted metric) [148] is a recently developed image quality metric, which compares the low-level feature sets between the reference image and the distorted image based on the fact that the HVS understands an image mainly according to its low-level features. Phase congruency (PC) is the primary feature to be used in computing FSIM. Gradient magnitude (GM) is the second feature to be added in FSIM metric because PC is contrast invariant and contrast information also aects the 25 HVS' perception of image quality. Actually, in the FSIM index, the similarity measures for PC and GM all follow the same formula as in the SSIM metric. More recently, we proposed a multi-metric fusion (MMF) (FR, learning-oriented metrics) approach for visual quality assessment [75, 78]. This method is motivated by the observation that no single metric can give the best performance scores in all situations. To achieve MMF, a regression approach is adopted. First, we collected a large number of image samples, each of which has a score labeled by human observers and scores associated with dierent quality metrics. The new MMF score is set to be the nonlinear combination of scores obtained by multiple existing metrics (including SSIM [133], MS-SSIM [139], VSNR [33], IFC [118], VIF [117], PSNR, PSNR-HVS [43], NQM [41], FSIM [148] and MAD [63]) with suitable weights via a training process. We also term it as context-free MMF (CF-MMF) since it does not depend on image contexts. Furthermore, we divide image distortions into several groups and perform regression within each group, which is called context-dependent MMF (CD-MMF). One task in CD-MMF is to determine the context automatically, which is achieved by a machine learning approach. It is shown by experimental results that the proposed MMF metric outperforms all existing metrics by a signicant margin. Table 2.2 summarizes the IQA models we have mentioned so far and the correspond- ing classications based on reference availability and assessment methodology; we have also commented on the strength and weakness of the models under discussion in the table. 2.3.3 Application in Perceptual Image Coding IQA metrics are widely exploited to image coding. Dierent metrics, such as SSIM [35, 133] and VIF [117] are used to improve the perceptual performance of JPEG and JPEG2000 compression and provide the feedback to rate control algorithms. In other words, the concept of perceptual image coding is to assess the quality of the target image by using IQAs and then apply the index to improve coding eciency. 
Each IQA 26 Table 2.2: Classication of IQA Models Based on Reference Availability and Assessment Methodology IQA Model Reference Availability Assessment Methodology Remarks (strength and weakness) PSNR FR Image Fidelity Simple. Low correlation. NQM FR HVS model Better measure of additive noise than PSNR. 80% correlation to visual results. PSNR-HVS-M FR HVS model Incorporate CSF model. 98% correlation with subjective scores. VSNR FR HVS model Low computation complexity and memory requirements. Accommodate dierent viewing conditions. 88.9% correlation with subjective scores in LIVE database. SSIM FR Signal structure Easy to implement. Good correlation with subjective scores. MS-SSIM FR Signal structure Incorporate image details at dierent resolutions. Better correlation with subjective scores than SSIM. IFC FR Signal structure Use mutual information to quantify signal delity. Better correlation with subjective scores than SSIM. VIF FR Signal structure Use mutual information to quantify signal delity. Better correlation with subjective scores than IFC. RRIQA RR Signal structure Better performance than PSNR. MGA-based IQA RR Signal structure Good consistency with subjective scores. Low data rate to represent features. JNBM NR HVS model Can predict the relative amount of sharpness/blurriness in images. MREBN NR Signal structure Good correlation with subjective scores. Low computation load. FSIM FR Signal structure Use low-level features. Very good correlation with subjective scores. MAD FR HVS model Combine two dierent strategies to predict visual quality. Good correlation with subjective scores. MMF FR Learning-oriented Use machine learning to automatically fuse the scores from multiple quality metrics. Very high correlation with subjective scores. Can incorporate new IQA metrics. 27 re ects the specic features. Thus, choosing the perceptual model is based on the need of specic application or codec. Coding distortion can be approximated from the extracted perceptual features and used to guide an image coder. Yim and Bovik [147] analyzed the blockiness of the compressed JPEG images. The proposed metric index focuses on discrete cosine transformed and quantized images. It has been shown that the blocking eect can be assessed by using the quality metric which detects the dierences of the neighborhoods of the target block. The blocking eect factor (BEF) is dened by the dierence of the mean boundary pixel squared dierence and the mean non-boundary pixel squared dierence. The mean-squared error including the blocking eect (MSE-B) is calculated from the corresponded BEF and MSE and leads to peak signal-to-noise ratio including the blocking eect (PSNR-B). The PSNR-B can quantify the blocking eect in boundary of macroblocks. Moreover, this can help to develop H.264/AVC de-blocking lters. Hontsch and Karam [55] presented a locally adaptive perceptual image coder, which optimizes the bit allocation of the targeted distortion type. The algorithm starts from extracting visual properties adaptively based on the local image features. It decomposes data into discrete cosine transform (DCT) coecients, which are fed to the perceptual model to generate perceptual properties. These properties are used to compute the local distortion adaptively and result in the local distortion sensitivity proles. The thresholds, which are derived from the proles, re ect the characteristics of the local image data. 
Two visual phenomena, contrast sensitivity dependent on background luminance and contrast masking, are modeled to generate the thresholds. For contrast sensitivity, the threshold is dened related to the luminance of the background to verify the sensitivity of the eye under the condition of the background. For contrast masking adjustment, contrast masking pertains to the visual change. The masker signal is in the form of the discrete cosine transform subband coecients of the input image comparing to the quantization error. Thus, the quantization step size is calculated from the threshold in order to achieve the target bitrate. 28 Rehman and Wang [111] addressed the practical use of SSIM. Instead of fully access- ing the original image, reduced reference technique only uses partial information. The rst step of the algorithm is the multi-scale multi-orientation divisive normalization transform which extracts the neural features of the biological human visual system. The divisive normalization transform coecient distribution is parameterized and provides the needed partial information of the reference image. This information can be used to dene the distortion of the compressed image and re ect the SSIM value of the images. The proposed reduced reference version of SSIM shows linear relationship to the full reference version in specic circumstances. The application of the algorithm does not only measure the SSIM but also repair some distortions. Besides VIF, other approaches are taken to JPEG2000. Tan et al. [123] proposed an image coder based on the just-noticeable distortion model which considers a variety of perceptual aspects. The algorithm is developed from a monochromatic vision model to a color image one. The monochromatic contrast gain control (CGC) model includes spatial masking, orientation masking and contrast sensitivity. The luminance and chro- matic parts are modeled by the CGC respectively. The distortion metric is designed to estimate perceptual error and applied to replace MSE which is used in the cost function in embedded block coding with optimal truncation (EBCOT). The 14 parameters in the metric are optimized with a two tiered approach. One calculates the parameter set recursively; the other ne-tunes the parameter set via algorithmic optimization. SSIM is also exploited in JPEG2000. Richter et al. [112] proposed a JPEG encoder based on optimal Multi-scale SSIM (MS-SSIM) [139]. Eorts are made to modify MS- SSIM in order to be embedded to the encoder. The rst step of the algorithm is trying to modify MS-SSIM to the logarithmic form. The contrast and structure part of the index can be expressed by the reconstruction error; the luminance part is ignored due to its minor eect. The nal term of the index can be computed by utilizing the results from EBCOT and wavelet decomposition process. Thus, the implementation integrates MS-SSIM to a JPEG2000 encoder. 29 2.4 Recent Developments in VQA 2.4.1 Video Quality Databases To our knowledge, there are nine public video quality databases available, including VQEG FRTV-I [18], IRCCyN/IVC 1080i [5], IRCCyN/IVC SD RoI [6], EPFL-PoliMI [4], LIVE [10], LIVE Wireless [11], MMSP 3D Video [13], MMSP SVD [14], VQEG HDTV [19]. We will brie y introduce them below. VQEG FR-TV Phase I Database is the oldest public database on video quality applied to MPEG-2 and H.263 video with two formats: 525@60Hz and 625@50Hz in this database. The resolution for video sequence 525@60Hz is 720486 pixels, and 720576 pixels for 625@50Hz. The video format is 4:2:2. 
And the subjective quality scores provided are DMOS, ranging from 0 to 100. IRCCyN/IVC 1080i Database contains 24 contents. For each content, there is one reference and seven dierent compression rates on H.264 video. The resolution is 19201080 pixels, the display mode is interleaving and the eld display frequency is 50Hz. The provided subjective quality scores are MOS, ranging from 1 to 5. IRCCyN/IVC SD RoI Database contains 6 reference videos and 14 HRCs (i.e., 84 videos in total). The HRCs are H.264 coding with or without error transmission simulations. The contents of this database are SD videos. The resolution is 720576 pixels, the display mode is interleaving and the eld display frequency is 50Hz with MOS from 1 to 5. EPFL-PoliMI Video Quality Assessment Database contains 12 reference videos (6 in CIF, and 6 in 4CIF), and 144 distorted videos, which are encoded with H.264/AVC and corrupted by simulating the packet loss due to transmission over an error-prone network. For CIF, the resolution is 352288 pixels, and frame rate 30fps. For 4CIF, the resolution is 704576 pixels, and frame rate are 30fps and 25fps. For each of the 12 original H.264/AVC videos, they have generated a number of corrupted ones by dropping packets according to a given error pattern. To simulate burst errors, the 30 patterns have been generated at six dierent packet loss rates (PLR) and two channel realizations have been selected for each PLR. LIVE Video Quality Database [116] includes 10 reference videos. All videos are 10 seconds long, except for Blue Sky. The Blue Sky sequence is 8.68 seconds long. The rst seven sequences have a frame rate of 25 fps, while the remaining three (Mobile & Calendar, Park Run, and Shields) have a frame rate 50 fps. There are 15 test sequences from each of the reference sequences using four dierent distortion processes - simu- lated transmission of H.264 compressed videos through error-prone wireless networks and through error-prone IP networks, H.264 compression, and MPEG-2 compression. All video les have planar YUV 4:2:0 formats and do not contain any headers. The spatial resolution of all videos is 768432 pixels. LIVE Wireless Video Quality Assessment Database has 10 reference videos, and 160 distorted videos, which focus on H.264/AVC compressed video transmission over wireless networks. The video is YUV 4:2:0 formats with a resolution of 768480 and a frame rate of 30 fps. Four bit-rates and 4 packet-loss rates are performed. However, this database has been taken oine temporarily since it has limited video level contents and a tendency to cluster at 0.95-0.96 correlation for most objective metrics. MMSP 3D Video Quality Assessment Database contains stereoscopic videos with a resolution of 19201080 pixels and a frame rate of 25 fps. Various indoor and outdoor scenes with a large variety of color, texture, motion, and depth structure have been captured. The database contains 6 scenes, and 20 subjects participated in the test. For each of the scenes, 5 dierent stimuli have been considered corresponding to dierent camera distances (10, 20, 30, 40, 50 cm). MMSP Scalable Video Database is related to 2 scalable video codecs (SVC and wavelet-based codec), 3 HD contents, and bit-rates ranging between 300 kbps and 4 Mbps. There are 3 spatial resolutions (320180, 640360, and 1280720), and 4 temporal resolutions (6.25 fps, 12.5 fps, 25 fps and 50 fps). In total, 28 and 44 video 31 Table 2.3: Comparison of Video Quality Databases Database Year SRC (no. of reference videos) HRC (no. 
of test conditions) Total no. of test videos Subjective Testing Method Subjective Score Applications and Merits VQEG FR- TV-I 2000 20 16 320 DSCQS DMOS (0 - 100) For testing VQA metrics on videos having com- pression distortions IRCCyN/IVC 1080i 2008 24 7 192 ACR MOS (1 - 5) For testing VQA metrics on videos having com- pression distortions IRCCyN/IVC SD RoI 2009 6 14 84 ACR MOS (1 - 5) For testing VQA metrics on videos having com- pression distortions, and transmission distortions EPFL- PoliMI 2009 16 9 165 ACR MOS (0 - 5) For testing VQA metrics on videos having com- pression distortions, and transmission distortions LIVE 2009 10 15 150 ACR DMOS (0 - 100) For testing VQA metrics on videos having com- pression distortions, and transmission distortions LIVE Wire- less 2009 10 16 160 SSCQE DMOS (0 - 100) For testing VQA metrics on videos having com- pression distortions, and transmission distortions MMSP 3D Video 2010 6 5 30 SSCQE MOS (0 - 100) For testing videos on 3D quality of experience (QoE) MMSP SVD 2010 3 24 72 PC MOS (0 - 100) For testing VQA metrics on videos having com- pression distortions, and transmission distortions VQEG HDTV 2010 45 15 675 ACR MOS (0 - 5), DMOS (1 - 5) For testing VQA metrics on videos having com- pression distortions, and transmission distortions sequences were considered for each codec, respectively. The video data are in the YUV 4:2:0 formats. VQEG HDTV Database has 4 dierent video formats - 1080p at 25 and 29.97fps, 1080i at 50 and 59.94fps. The impairments are restricted to MPEG-2 and H.264, with both coding-only error and coding-plus-transmission error. The video sequences are released progressively via the Consumer Digital Video Library (CDVL) [3]. We summarize and compare these video quality databases in Table 2.3 for the con- venience of readers. 32 2.4.2 Major VQA Metrics One obvious way to implement video quality metrics is to apply a still image quality assessment metric on a frame-by-frame basis. The quality of each frame is evaluated independently, and the global quality of the video sequence can be obtained by a simple time average. SSIM has been applied in video quality assessment as reported in [136]. The quality of the distorted video is measured in three levels: the local region level, the frame level, and the sequence level. First, the SSIM indexing approach is applied to the Y, Cb and Cr color components independently and combined into a local quality measure using a weighted summation. In the second level of quality evaluation, the local quality values are weighted to obtain a frame level quality index. Finally in the third level, the overall quality of the video sequence is given by the weighted summation of the frame level quality index. This approach is often called as V-SSIM (FR, Signal structure based metric), and has been demonstrated to perform better than KPN/Swisscom CT [127] (the best metric for the Video Quality Experts Group (VQEG) Phase I test data set [18]) in [136]. Wang and Li [134] proposed Speed-SSIM (FR, Signal structure based metric) that incorporated a model of the human visual speed perception by formulating the visual perception process in an information communication framework. Consistent improve- ment over existing VQA algorithms has been observed in the validation with the VQEG Phase I test data set [18]. Watson et al. [140] developed a video quality metric, which they call digital video quality (DVQ) (FR, HVS model based metric). 
The DVQ accepts a pair of video sequences, and computes a measure of the magnitude of the visible dierence between them. The rst step consists of various sampling, cropping, and color transformations that serve to restrict processing to a region of interest and to express the sequence in a perceptual color space. This stage also deals with de-interlacing and de-gamma- correcting the input video. The sequence is then subjected to a blocking and a discrete 33 cosine transform (DCT), and the results are transformed to local contrast. Then the next steps are temporal, spatial ltering, and a contrast masking operation. Finally, the masked dierences are pooled over spatial temporal and chromatic dimensions to compute a quality measure. Video Quality Metric (VQM) (RR, HVS model based metric) [105] is developed by National Telecommunications and Information Administration (NTIA) to provide an objective measurement for perceived video quality. The NTIA VQM provides sev- eral quality models, such as the Television Model, the General Model, and the Video Conferencing Model, based on the video sequence under consideration and with several calibration options prior to feature extraction in order to produce ecient quality ratings. The General Model contains seven independent parameters. Four parameters (si loss, hv loss, hv gain, si gain) are based on the features extracted from spatial gradients of Y luminance component, two parameters (chroma spread, chroma extreme) are based on the features extracted from the vector formed by the two chrominance components (Cb, Cr), and one parameter (ct ati gain) is based on the product of features that mea- sure contrast and motion, both of which are extracted from Y luminance component. The VQM takes the original video and the processed video as inputs and is computed using the linear combination of these seven parameters. Due to its good performance in the VQEG Phase II validation tests, the VQM method was adopted as a national standard by the American National Standards Institute (ANSI) and as International Telecommunications Union Recommendations [23, 24]. By analyzing subjective scores of various video sequences, Lee et al. [64] found out that the HVS is sensitive to degradation around edges. In other words, when edge areas of a video sequence are degraded, human evaluators tend to give low quality scores to the video, even though the overall mean squared error is not large. Based on this observation, they proposed an objective video quality measurement method based on degradation around edges. In the proposed method, they rst applied an edge detection algorithm to videos and locate edge areas. Then, they measured degradation of those 34 edge areas by computing mean squared errors and used it as a video quality metric after some post-processing. Experiments show that this proposed method EPSNR (FR, Video delity metric) outperforms the conventional PSNR. This method was also evaluated by independent laboratory groups in the VQEG Phase II test. As a result, it was included in international recommendations for objective video quality measurement. Kawayoke et al. [61] suggested a new objective VQA method, called continuous video quality (CVQ) (NR, Learning-oriented metric). The metric can provide quality values at a rate of two scores per second according to the data obtained from subjec- tive assessment tests under a Single Stimulus Continuous Quality Evaluation (SSCQE) method. 
It is based on the concept that frame quality value is needed to be adjusted by spatial and temporal information. As a result, the objective quality scores computed by this approach have a higher estimation accuracy than frame quality scores. More recently, an approach integrates both spatial and temporal aspects of distortion assessment, known as MOtion-based Video Integrity Evaluation (MOVIE) index (FR, HVS model based metric) [115]. The MOVIE uses optical ow estimation to adap- tively guide spatial-temporal ltering using three-dimensional (3-D) Gabor lterbanks. The key dierentiation of this method is that a subset of lters are selected adaptively at each location based on the direction and speed of motion, such that the major axis of the lter set is oriented along the direction of motion in the frequency domain. The video quality evaluation process is carried out with coecients computed from these selected lters only. One component of the MOVIE framework, known as the Spatial MOVIE index, uses the output of the multi-scale decomposition of the reference and test videos to measure spatial distortions in the video. The second component of the MOVIE index, known as the Temporal MOVIE index, captures temporal degradations in the video. The Temporal MOVIE index computes and uses motion information from the reference video, and evaluates the quality of the test video along the motion trajectories of the reference video. Finally, the Spatial MOVIE index and the Temporal MOVIE index are 35 combined to obtain a single measure of video quality known as the MOVIE index. The performance of MOVIE on the VQEG FRTV Phase I dataset is summarized in [115]. In addition, TetraVQM (FR, HVS model based metric) [27] has been proposed to utilize motion estimation within a VQA framework, where motion compensated errors are computed between the reference and distorted images. Based on the motion vectors and the motion prediction error, the appearance of new image areas and the display time of objects are evaluated. Additionally, degradations on moving objects can be judged more exactly. And in [95], Ninassi et al. tried to utilize models of visual attention (VA) and human eye movements to improve the VQA performance. The temporal variations of the spatial distortions are evaluated both at eye xation level and on the whole video sequence. These two kinds of temporal variations are assimilated into a short-term temporal pooling and a long-term temporal pooling, respectively. V-Factor (NR, Packet-analysis based metric) [143] is a real-time, packet-based video quality metric, which works without the need of references. In [143], this metric is pri- marily used on MPEG-2 and H.264 video streaming over IP networks. First, it inspects several parts of the video stream, including the transport stream (TS) headers, the pack- etized elementary stream (PES) headers, the video coding layer (VCL), and the decoded video signal. Then it analyzes the bitstream to obtain static parameters, such as frame rate and image size. The dynamic parameters (e.g., variation of quantization steps) are also obtained along with the analysis. The nal video quality is estimated based upon the content characteristics, compression methods, bandwidth constraints, delays, jitter, and packet loss. Among these six factors, the rst three are aected by video impair- ments and the last three are caused by network impairments. 
In addition, this metric also analyzes real-time network impairments to calculate the packet loss probability ratio by using hidden Markov models. The nal V-Factor value (i.e., the estimate of MOS) is obtained by using a codec-specic curve t equation and inputs from the following three models: the bandwidth model, the VCL complexity model, and the loss model. 36 Li et al. [71] proposed to use temporal inconsistency measure (TIM) to describe the visual disparity of the same object in consecutive distortion frames. First, they performed block-based motion estimation on the reference video to obtain the motion vectors. Then the motion vectors can be used to create the motion compensated frames for reference and distorted videos, respectively. The dierence between motion compensated and real frames of the reference video (DoR) is called the inherent dierence. Likewise, there is also a dierence between motion compensated and real frames of the distorted video (DoD). However, DoD consists of two components, including inherent dierence and the temporal inconsistency. Hence, the TIM can be computed by subtracting DoR from DoD. In the end, they incorporated TIM into MSE, called MSE TIM (FR, Video delity metrics) and introduce a weighting parameter to adjust the importance between spatial impairment and temporal inconsistency measure in the quality prediction. The experiment results show that TIM improves the performance of MSE. Moreover, the performance becomes even better when using TIM alone. In [26], the authors proposed a new video quality metric, named spatial-temporal assessment of quality (STAQ) (RR, HVS model based metric). As the name suggests, it includes both spatial and temporal parts. In the rst step, they used a temporal approach to nd the matching regions in adjacent frames. One important change from existing motion estimation methods during this step is to use CW-SSIM instead of the mean absolute dierence to compute the motion vectors. This will increase the precision of nding the matching regions. In the second step, a spatial method is used to compute the quality of the matching regions extracted via the temporal approach. The visual attention map (VAM) is used to weight each sub-block in the luminance channel based on the importance. In the nal step, the video quality is estimated according to the values obtained from both the spatial and temporal domains, and quality of experience (QoE) is introduced as a function related to the motion activity density group of the video to control the pooling function. The results are quite promising in H.264 distorted video case, but are less competitive than MOVIE in either MPEG-2 or IP case. 37 There is also another approach integrating both spatial and temporal domains, called spatiotemporal MAD (ST-MAD) (FR, HVS model based metric) [129], which is extended from the image quality metric MAD [63]. First, a spatiotemporal slice (STS) image is constructed from the time-based slices of the reference and distorted videos. The detailed procedure is as follows: A single column or row of the frame is extracted for each video frame, and these columns (or rows) are stacked from left to right (or top to bottom) to become a STS image. Then ST-MAD estimates motion-based distortions by using MAD's appearance-based model to STS images. Next, it gives larger weights to the fast-moving regions by applying optical- ow algorithm. Finally, it employs a combination rule to add spatial and temporal distortions together. 
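Before turning to its performance, the STS construction just described can be illustrated with a short NumPy sketch. It assumes grayscale frames and extracts one fixed column per frame; the function name and the single-column simplification are our own (ST-MAD builds slices from columns or rows across the whole sequence).

import numpy as np

def spatiotemporal_slice(frames, column):
    """Stack one pixel column per frame into a spatiotemporal slice (STS) image.

    frames: sequence of T grayscale frames, each an H-by-W array.
    column: index of the column extracted from every frame.
    Returns an H-by-T image whose horizontal axis is time.
    """
    return np.stack([f[:, column] for f in frames], axis=1)

# Example with toy data: 60 random CIF-sized frames, slice at the middle column.
frames = [np.random.rand(288, 352) for _ in range(60)]
sts = spatiotemporal_slice(frames, column=176)   # shape (288, 60)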
Experimental results show that ST-MAD performs better than other state-of-the-art quality metrics in LIVE Video Quality Database, especially on H.264 and MPEG-2 distorted videos. However, MOVIE only outperforms ST-MAD for wireless distorted videos. To summarize these VQA models, we present a simple comparison based on reference availability and assessment methodology in Table 2.4, as well as providing the comments on strength and weakness of each metric. 2.4.3 Application in Perceptual Video Coding Since perceptual quality assessment is a hot topic in video coding, we use this as an example for applications. Currently, there are two main approaches of perceptual video coding. One is to use dierent IQA or VQA metrics to measure distortions and develop the perceptual rate-distortion model to achieve the better performance in a perceptual sense. The other one is trying to utilize human visual features to develop a just noticeable distortion (JND) model for quantization step (QP) selection, or a visual attention (VA) model in order to nd the region of interest (ROI) in the target video and optimize the bit allocation corresponding to the ROI information. A JND model may be combined with a VA one for a more comprehensive evaluation (to become a foveated JND model). 38 Table 2.4: Classication of VQA Models Based on Reference Availability and Assessment Methodology VQA Model Reference Availability Assessment Methodology Remarks (strength and weakness) V-SSIM FR Signal structure Utilize dierent weighting strategy for the quality scores in three levels. Perform better than KPN/Swisscom CT in VQEG FR-TV-I database. Speed-SSIM FR Signal structure Incorporated a model of the human visual speed perception. Consistent improvement in the validation with the VQEG Phase I test data set. DVQ FR HVS model Contrast masked dierences are pooled over spatial temporal and chromatic dimensions to compute a quality measure. VQM RR HVS model Provide several quality models. Good performance in the VQEG Phase II validation tests, VQM was adopted as a national standard. EPSNR FR Video Fidelity Video quality measurement based on degradation around edges. Outperform conventional PSNR. CVQ NR Learning-oriented Adjust frame quality value by spatial and temporal informa- tion. Have higher estimation accuracy than frame quality scores. MOVIE FR HVS model Use optical ow estimation to adaptively guide spatial- temporal ltering using 3-D Gabor lterbanks. Perform the best in both LIVE and VQEG FR-TV-I databases. TetraVQM FR HVS model Based on the motion vectors and the motion prediction error, the appearance of new image areas and the display time of objects are evaluated. Degradations on moving objects are judged more exactly. V-Factor NR Packet Analysis Real-time. Primarily used on MPEG-2 and H.264 video streaming. MSE TIM FR Video Fidelity Incorporate TIM into MSE and introduce a weighting param- eter to adjust the importance between spatial impairment and TIM in the quality prediction. Improves the performance of MSE. STAQ RR HVS model QoE is introduced as a function related to the motion activity density group of the video to control the pooling function. The results are quite promising for H.264 distorted videos. ST-MAD FR HVS model A spatiotemporal slice (STS) image is constructed from the time-based slices of the reference and distorted videos. Give larger weights to the fast-moving regions. 
Perform better than other state-of-the-art quality metrics in LIVE Video Quality Database, especially on H.264 and MPEG- 2 distorted videos. 39 For the former approach, not all applications are developed to the whole codec. Some eorts [39, 144] are made to tune the performance of encoding intra frames or made to optimize the coding eciency of inter frames. The others target to the overall rate distortion optimization of video coding. The algorithms are strongly bound to the codec type because the measurement of distortion is replaced in a perceptual fashion. For the latter approach, the JND model is used to analyze the image features. Compared to the former method, it is more independent to the codec type. Use of IQA or VQA metrics Chen et al. [57, 101] proposed rate distortion framework based on the SSIM index. In [57], the mode decision of H.264 intra-frame and inter-frame coding is optimized perceptually by using SSIM index. The SSIM index is applied to replace the SSD to measure the dierence between the reference block and the reconstructed block. Because it is hard to determine the rate-distortion optimization by the SSIM index, the proposed approach to rate-distortion modeling provides a way to determine the Lagrange multiplier which is related to SSIM in the cost function. The rate-distortion curve tting is dened by two parameters and which can be computed from two data points of the key frame. By using the data, the rate-distortion curves of the subsequent frames can be estimated. For given rate-distortion curve, the Lagrange multiplier can be calculated by the gradient or slope of the curve. In [101], the perceptual encoding scheme is based on the rate control algorithm in [57] and extended to bit allocation. The proposed rate control scheme separates the coding methods of key frame and other frames. The algorithm adopts extra quantization parameters for key frames to update the rate-distortion model. More precisely, the Lagrange multiplier is selected adaptively according to the input data from key frames. The perceptual cost function determines the target bit budget in the frame level and the quantization step sizes. Combined [57] and [101], the proposed technique is thor- oughly implemented to improve the perceptual rate control optimization of H.264/AVC. 40 In [130], a model related to the reduced reference SSIM is developed to improve the rate distortion optimization. Instead of divisive normalization transform, the proposed algorithm extracts the frame features from discrete cosine transform (DCT). With less computing complexity than divisive normalization transform, the DCT coecients pro- vide the needed partial information of the reference image and lead to the estimated reduced reference SSIM index, which is an important parameter of the proposed rate- distortion model. The SSIM index is generated by the local SSIM index via sliding windows. The SSIM is provided by overlapped blocks, but the macroblocks are pro- cessed individually in the encoder. Also, the boundaries of the macroblocks are not continuous. To solve these issues, the macroblocks are extended to 2222 and a sliding 44 window is applied to get the SSIM index. The reference reduced SSIM index is derived from the DCT coecients. At rst, the DCT coecients of 44 non-overlap blocks are calculated and then grouped into 16 subbands. The reduced reference distor- tion can be dened from the DCT subbands and MSE to the reference frame. 
Because the measured distortion is linearly equivalent to the SSIM index, the reduced reference SSIM index can be written in the form of the distortion. The proposed algorithm tends to update the parameters of the model in frame level and adjust the Lagrange multiplier in macro-block level. The SSIM index is introduced to video coding to model the perceived distortion. Because SSIM is not a traditional block based distortion measurement, current video compression standard can be optimized perceptually by introducing SSIM as a distortion measurement. In [57, 101], the RD curve is parameterized to t the SSIM RD curve; the complexity of SSIM can be reduced and a more practical method is proposed in [130]. Use of JND and VA models Besides SSIM, JND is also applied to video coding algorithms. The JND is measured based on the sensitivity of the HVS. With the JND, the priority bit-allocation can be determined. In [36], a foveated JND model is proposed to measure distortion. This 41 model combines the spatial JND model and the temporal JND model. For spatial JND, the measurement is based on the luminance of the background. If the luminance of the background is not high enough that human observers cannot recognize the targeted objects, then a larger QP is used to encode the frame. The threshold of the background luminance is not only dened by spatial features but also considered temporal features. In the temporal model, the change of luminance across frames is the key point. In the proposed model, the inter-frame luminance change is considered as larger visibility threshold and separated in two cases, which are high-to-low and low-to-high. The former change results in more signicant visual attention. The foveated JND is integrated to the H.264/AVC encoder. The QP is adjusted by weighting the macroblocks. If the macroblocks are perceived in higher priority, they can tolerate less distortion and preserve more bit budgets. Itti et al. [72] developed a VA model to detect the ROI in the video. The model is based on human visual characteristics including color information, contrast, shape, motion, and etc. The model prediction generates the saliency map which is used in the bit allocation strategy. To improve the saliency map, frame to frame information is considered to update the salient locations of the objects. The relationship of the object across frames are determined by the four criteria: the Euclidean distance between the location in dierent frames, the Euclidean distance between feature vectors corre- sponding to the locations, a penalty term of the dierences between frames to depress permuting pairings, and a tracking priority according to the intensity of the saliency to encourage track of the salient objects. With the criteria, the proposed algorithm can identify the salient objects and track their locations in the map. Combing the information, the more signicant object is assigned to higher priority for bit allocation. More consideration of temporal and textural features Motion and texture are signicant features to the HVS for videos. Video coding with considering texture and motion can achieve good performance in a perceptual way. The 42 approach in [31] is based on texture and motion modeling. The texture model employed in the algorithm is to separate the perceptually relevant and non-relevant regions. The relevant region needs more bits to encode. The temporal (motion) model tries to improve the consistency in textural regions across frames. 
Texture analysis provides information of textural regions to the encoder; the texture synthesis is applied to the decoder to reconstruct the scene. In texture analysis, frames are divided into groups with the same textures and the boundaries of the regions are detected. The features extracted in this stage include gray level co-occurrence matrix, angular second moment, dissimilarity, correlation, entropy, sum of squares, and coecients of Gabor lters. The employed segmentation techniques are split-and-merge method and K-means clustering. In order to track the region from frame to frame, motion vectors are bound to the textural regions. The temporal model is parameterized by the motion vectors to obtain the location of the regions in the consequent frames. In the encoder side, only key frames and non- synthesizable parts are coded by H.264/AVC. At the decoder, the texture synthesis is designed to construct the other parts. With the temporal information, the textures of the synthesizable frames are derived from the key frames and the segmentation information is also passed from the encoder via the channel as the side information to reconstruct the frame at the decoder. To guarantee temporal consistency of texture based video coding, a dierent approach was taken in [96]. The framework is established on cube-based texture growing method [94]. The proposed algorithm utilizes the side information, which is a coded bitstream with a larger quantization step of the source video for two advantages. One is that the side information can be generated by any coding tool so it can be associated to any video coding system. The other one is that the amount of the side information can be adjusted by the quantization step with the result that the algorithm is exible. To achieve the goal, an area-adaptive side information selection scheme that can decide the proper amount of the side information is devised. The scheme determines the rate distortion optimization of the output coded data and the side information bitrate. The results show 43 the gap between the analyzed and synthesized texture regions can be fullled and the perceptual quality of the regions is similar. In [31], the algorithm can signicantly help to save more bits used in the side information. For intra coding, the proposed algorithm in [96] reconstructs the texture by the texture seed from a low quality video, so the side information can be reduced by controlling the mechanism. Naccari and Pereira [90] designed a complete perceptual video coding algorithm cov- ering decoding, encoding, and testing tools. The JND model generates a threshold for each DCT subband coecients. The adopted JND model contains spatial mask- ing and temporal masking components. The spatial masking model is related to three properties: frequency band masking, luminance variations masking, and image pattern masking. Frequency band masking re ects the visual sensitivity of the noise introduced in DCT coecients. Luminance variations masking re ects the change of the luminance part in the dierent image regions. The JND threshold of image pattern masking varies with the threshold of frequency band masking and luminance variations masking. The temporal masking model uses an existing model [141] because of its perfor- mance compared to other solutions. The model is established by using motion vector information. To apply this model, the issues of B-frame and intra frame are considered. Two motion vectors are used in the B-frame, and only the past vector is adopted in the model. 
For intra frames, the skip motion vector is introduced to the JND computation. On the decoder side, the JND model is employed to estimate the average block luminance, the integer DCT coefficients, and the JND thresholds. On the encoder side, the model is integrated into quantization, motion estimation, and rate-distortion optimization. The quantization step for each DCT band of a given macroblock is adjusted by the respective JND threshold, and the motion estimation and rate-distortion optimization processes are weighted by the JND thresholds. The weighting scales the estimation error so that it is expressed in a perceptual fashion. Since the perceptual distortion is employed in motion estimation and rate-distortion optimization, the cost function of rate-distortion optimization becomes a perceptual cost function and the Lagrange multiplier is changed accordingly. The proposed testing procedure assesses the rate-distortion performance; that is, it compares the performance of one codec against another based on a quality metric.

Other attempts

Besides visual quality metrics and perceptual models, audio information can be used to improve the coding efficiency. In practical cases, audio is bound to the video, so the audio is also perceived by human observers synchronously. Lee et al. [65] proposed a video coding algorithm combined with audio information. The proposed scheme utilizes the relation between the sound source and its corresponding spatial location to code scenes containing multiple moving objects more efficiently. The first task is to find the sound source and its region. Based on the assumption that human observers tend to recognize the sounding object as the ROI, the corresponding region is encoded with more bits. The implementation encodes the ROI blocks with a smaller QP relative to the non-ROI ones.

2.5 Performance Comparison

We use the following three indices to measure metric performance [127, 128]. The first index is the Pearson correlation coefficient (PCC) between objective/subjective scores after non-linear regression analysis. It provides an evaluation of prediction accuracy. The second index is the Spearman rank order correlation coefficient (SROCC) between the objective/subjective scores. It is considered as a measure of prediction monotonicity. The third index is the root-mean-squared error (RMSE). Before computing the first and second indices, we need to use the logistic function and the procedure outlined in [127] to fit the objective model scores to the MOS (or DMOS), in order to account for the quality rating compression at the extremes of the test range and to prevent the overfitting problem. The monotonic logistic function used to fit the objective prediction scores to the subjective quality scores [127] is:

f(x) = \frac{\beta_1 - \beta_2}{1 + \exp\left(-\frac{x - \beta_3}{|\beta_4|}\right)} + \beta_2,   (2.5)

where x is the objective prediction score, f(x) is the fitted objective score, and the parameters \beta_j (j = 1, 2, 3, 4) are chosen to minimize the least squares error between the subjective score and the fitted objective score. Initial estimates of the parameters were chosen based on the recommendation in [127]. For an ideal match between the objective prediction scores and the subjective quality scores, PCC = 1, SROCC = 1 and RMSE = 0.
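As an illustration, the sketch below fits the four-parameter logistic of Eq. (2.5) with SciPy and then reports PCC, SROCC, and RMSE. The initial parameter guesses and function names are our own simplifications rather than the exact procedure of [127].

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(x, b1, b2, b3, b4):
    """Monotonic logistic mapping of Eq. (2.5)."""
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4))) + b2

def evaluate_metric(obj_scores, mos):
    obj_scores = np.asarray(obj_scores, dtype=float)
    mos = np.asarray(mos, dtype=float)
    # Rough initial estimates: MOS span, metric mid-point and spread.
    p0 = [mos.max(), mos.min(), obj_scores.mean(), obj_scores.std() + 1e-6]
    params, _ = curve_fit(logistic4, obj_scores, mos, p0=p0, maxfev=20000)
    fitted = logistic4(obj_scores, *params)
    pcc, _ = pearsonr(fitted, mos)
    srocc, _ = spearmanr(obj_scores, mos)   # rank order, no fitting required
    rmse = np.sqrt(np.mean((fitted - mos) ** 2))
    return pcc, srocc, rmse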
2.5.1 Image Quality Metric Benchmarking To examine the performance of existing popular image quality metrics in this work, we choose CSIQ, LIVE, and TID2008 to test image quality metrics since they include the largest number of distorted images and also span more distortion types; these 3 databases cover most image distortion types that other publicly available image quality databases can provide. The performance results are listed in Tables 2.5, 2.6, and 2.7 with the three indices given above. The two best performing metrics are highlighted in bold. Clearly, MMF (both CF-MMF and CD-MMF) [75, 78] have the highest PCCs, SROCCs and the smallest RMSEs among the thirteen image quality metrics under comparison. 2.5.2 Video Quality Metric Benchmarking For the comparison of the state-of-the-art video quality metrics, LIVE Video Quality Database and EPFL-PoliMI Video Quality Assessment Database are adopted. Although most people use VQEG-FRTV Phase I Database (built in 2000) to test their video metric performance previously [115, 136], we use LIVE Video Quality Database (released in 2009) as our test database since it is new and contains distortion types in 46 Table 2.5: Performance Comparison among IQA Models in CSIQ Database Measure ||| PCC SROCC RMSE IQA Model MS-SSIM 0.8666 0.8774 0.1310 SSIM 0.8594 0.8755 0.1342 VIF 0.9253 0.9194 0.0996 VSNR 0.8005 0.8108 0.1573 NQM 0.7422 0.7411 0.1759 PSNR-HVS 0.8231 0.8294 0.1491 IFC 0.8358 0.7671 0.1441 PSNR 0.8001 0.8057 0.1576 FSIM 0.9095 0.9242 0.1091 MAD 0.9502 0.9466 0.0818 IW-SSIM 0.9025 0.9212 0.1131 CF-MMF 0.9797 0.9755 0.0527 CD-MMF 0.9675 0.9668 0.0664 Table 2.6: Performance Comparison among IQA Models in LIVE Image Database Measure ||| PCC SROCC RMSE IQA Model MS-SSIM 0.9402 0.9521 9.3038 SSIM 0.9384 0.9479 9.4439 VIF 0.9597 0.9636 7.6737 VSNR 0.9235 0.9279 10.4816 NQM 0.9128 0.9093 11.1570 PSNR-HVS 0.9134 0.9186 11.1228 IFC 0.9261 0.9259 10.3052 PSNR 0.8701 0.8756 13.4685 FSIM 0.9540 0.9634 8.1938 MAD 0.9672 0.9669 6.9419 IW-SSIM 0.9425 0.9567 9.1301 CF-MMF 0.9734 0.9732 6.2612 CD-MMF 0.9802 0.9805 5.4134 Table 2.7: Performance Comparison among IQA Models in TID2008 Database Measure ||| PCC SROCC RMSE IQA Model MS-SSIM 0.8389 0.8528 0.7303 SSIM 0.7715 0.7749 0.8537 VIF 0.8055 0.7496 0.7953 VSNR 0.6820 0.7046 0.9815 NQM 0.6103 0.6243 1.0631 PSNR-HVS 0.5977 0.5943 1.0759 IFC 0.7186 0.5707 0.9332 PSNR 0.5355 0.5245 1.1333 FSIM 0.8710 0.8805 0.6592 MAD 0.8306 0.8340 0.7474 IW-SSIM 0.8488 0.8559 0.7094 CF-MMF 0.9525 0.9487 0.4087 CD-MMF 0.9538 0.9463 0.4032 47 Table 2.8: Performance Comparison of VQA Models in LIVE Video Database Measure ||| PCC SROCC RMSE VQA Model PSNR 0.5465 0.5205 9.1929 VSNR 0.6880 0.6714 7.9666 SSIM 0.5413 0.5233 9.2301 V-SSIM 0.6058 0.5924 8.7337 VQM 0.7695 0.7529 7.0111 Q SVR [93] 0.7924 0.7820 6.6908 MOVIE 0.8116 0.7890 6.4130 ST-MAD[129] 0.8299 0.8242 - Table 2.9: Performance Comparison of VQA Models in EPFL-PoliMI Database [103] Measure ||| PCC SROCC VQA Model PSNR 0.7951 0.7983 VSNR 0.8955 0.8958 SSIM 0.8341 0.8357 VQM 0.8433 0.8375 MOVIE 0.9302 0.9203 more processes, such as H.264 compression, simulated transmission of H.264 packetized streams through error-prone wireless networks and error-prone IP networks, and MPEG- 2 compression. The comparison results are summarized in Table 2.8. Here, the image quality metrics (i.e., PSNR, VSNR, and SSIM) are used on a frame-by-frame basis for the video sequence, and then time-averaging the frame scores to obtain the video quality score. 
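The frame-by-frame procedure just described can be written compactly as below. The helper names and the use of PSNR as the per-frame metric are illustrative assumptions; any still-image metric, such as the SSIM sketch given earlier, could be passed in instead.

import numpy as np

def video_score(ref_frames, dist_frames, frame_metric):
    """Apply a still-image metric frame by frame and average over time.

    ref_frames, dist_frames: iterables of grayscale frames (2-D arrays).
    frame_metric: any function mapping (ref, dist) to a scalar score.
    """
    scores = [frame_metric(r, d) for r, d in zip(ref_frames, dist_frames)]
    return float(np.mean(scores))

def psnr(ref, dist, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)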
In Table 2.8, the results of ST-MAD are extracted from [129]. From Table 2.8, we can see that ST-MAD and MOVIE are the best metrics (both highlighted in bold) for the LIVE Video Quality Database; VQM ranks third. This means that MOVIE and ST-MAD correlate better with subjective results than the other approaches under comparison. The reason why ST-MAD and MOVIE perform well is that they both consider spatial and temporal features. In general, consideration of temporal information as well as the interaction of spatial and temporal features [92] can improve video quality prediction performance. In addition, we also summarize the performance results from [103] in Table 2.9 to see whether the existing quality metrics can predict quality well for videos distorted with different packet loss rates. We can observe that MOVIE still works the best compared to the other metrics in Table 2.9 under packet loss.

2.6 Applications of Machine Learning Method on Visual Quality Assessment

The Support Vector Machine (SVM) is a popular learning method for many pragmatic applications. We will introduce its related applications to visual quality assessment, including classification and regression.

2.6.1 Classification

A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one "target value" (i.e., the class label) and several "attributes" (i.e., the features or observed variables). The goal of SVM is to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes. Two types of SVM used for classification are described below.

C-Support Vector Classification - Given training vectors x_i \in R^n, i = 1, ..., l, in two classes, and an indicator vector \mathbf{y} \in R^l such that y_i \in \{1, -1\}, C-SVC [32, 38] solves the following primal problem:

\min_{\mathbf{w},b,\boldsymbol{\xi}} \quad \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i \qquad \text{subject to} \quad y_i\left(\mathbf{w}^T\phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1,\dots,l,   (2.6)

where \phi(\mathbf{x}_i) maps \mathbf{x}_i into a higher dimensional space and C is the regularization parameter. Due to the possible high dimensionality of the vector variable \mathbf{w}, we usually solve the following dual problem:

\min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\boldsymbol{\alpha}^T Q\,\boldsymbol{\alpha} - \mathbf{e}^T\boldsymbol{\alpha} \qquad \text{subject to} \quad \mathbf{y}^T\boldsymbol{\alpha} = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1,\dots,l,   (2.7)

where \mathbf{e} = [1,\dots,1]^T is the vector of all ones, Q is an l \times l positive semidefinite matrix with Q_{ij} \equiv y_i y_j K(\mathbf{x}_i, \mathbf{x}_j), and K(\mathbf{x}_i, \mathbf{x}_j) \equiv \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) is the kernel. Once (2.7) is solved, using the primal-dual relationship, the decision function is

\mathrm{sgn}\left(\mathbf{w}^T\phi(\mathbf{x}) + b\right) = \mathrm{sgn}\left(\sum_{i=1}^{l} y_i\,\alpha_i\,K(\mathbf{x}_i, \mathbf{x}) + b\right).   (2.8)

\nu-Support Vector Classification - The \nu-support vector classification [114] introduces a new parameter \nu \in (0, 1]. It is proved that \nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Given training vectors x_i \in R^n, i = 1, ..., l, in two classes, and a vector \mathbf{y} \in R^l such that y_i \in \{1, -1\}, the primal problem is:

\min_{\mathbf{w},b,\boldsymbol{\xi},\rho} \quad \frac{1}{2}\mathbf{w}^T\mathbf{w} - \nu\rho + \frac{1}{l}\sum_{i=1}^{l}\xi_i \qquad \text{subject to} \quad y_i\left(\mathbf{w}^T\phi(\mathbf{x}_i) + b\right) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1,\dots,l, \quad \rho \geq 0.   (2.9)

The dual problem is:

\min_{\boldsymbol{\alpha}} \quad \frac{1}{2}\boldsymbol{\alpha}^T Q\,\boldsymbol{\alpha} \qquad \text{subject to} \quad \mathbf{y}^T\boldsymbol{\alpha} = 0, \quad \mathbf{e}^T\boldsymbol{\alpha} \geq \nu, \quad 0 \leq \alpha_i \leq 1/l, \quad i = 1,\dots,l,   (2.10)

where Q_{ij} \equiv y_i y_j K(\mathbf{x}_i, \mathbf{x}_j). The decision function is:

\mathrm{sgn}\left(\sum_{i=1}^{l} y_i\,\alpha_i\,K(\mathbf{x}_i, \mathbf{x}) + b\right).   (2.11)

2.6.2 Regression

Regression is more important and more suitable than classification in the field of visual quality assessment, since the goal is to predict a continuous quality score. We will give a more detailed description of this type of SVM (also called support vector regression (SVR)) in Section 3.3.2.
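As a usage illustration only (not part of the original formulations), both classification variants above are available in common SVM packages; the sketch below uses scikit-learn, which wraps LIBSVM, on synthetic data with +/-1 labels. The regression counterparts follow the same interface and are discussed in Section 3.3.2.

import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # 200 samples with 5 features
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)    # toy +/-1 class labels

# C-SVC of Eqs. (2.6)-(2.8): C is the regularization parameter.
c_svc = SVC(C=1.0, kernel="rbf", gamma="scale").fit(X, y)

# nu-SVC of Eqs. (2.9)-(2.11): nu in (0, 1] upper-bounds the fraction of
# training errors and lower-bounds the fraction of support vectors.
nu_svc = NuSVC(nu=0.2, kernel="rbf", gamma="scale").fit(X, y)

print(c_svc.predict(X[:5]), nu_svc.predict(X[:5]))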
2.7 Conclusion In this chapter, we have rst reviewed the existing visual quality assessment methods and their classication in a comprehensive perspective. Then we introduced the recent developments in image quality assessment (IQA), including the popular public image quality databases that play an important role in facilitating the relevant research activ- ities in this eld and several well-performed image quality metrics. In a similar format, we also discussed the recent developments for video quality assessment (VQA) in general, the publicly available video quality databases and several state-of-the-art VQA metrics. In addition, we have compared the major existing IQA and VQA metrics, and given some discussion, with using the most comprehensive image and video quality databases respectively. In the end, we introduce the machine learning methods that can be applied on visual quality assessment. One important class of applications of visual quality assessment is perceptual image and video coding. The perceptually driven coding methods have demonstrated their merits, compared with the traditional MSE based coding techniques. Such research takes a dierent path (i.e., removing perceptual signal redundancy apart from the statistical one) to further improve the coding performance and makes it more use-oriented since humans are the ultimate appreciator of almost all processed visual signals. Existing and interesting methods include: utilizing a perceptual quality index to measure distortion; 51 utilizing JND and VA models in coding; integrating motion or texture information to improve the coding eciency in a perceptual sense. We believe that there are still a lot of possibilities for perceptual coding and beyond, which wait for being discovered. 52 Chapter 3 Image Quality Assessment Using Multi-Method Fusion (MMF) 3.1 Introduction Various image quality assessment methods have been developed to re ect human visual quality experience. The mean-square error (MSE) and the peak-signal-to-noise-ratio (PSNR) are two widely used ones. However, they may not correlate with human per- ception well [51, 132]. During the last decade, a number of new quality indices have been proposed as better alternatives. Examples are summarized in Table 3.1, including MS-SSIM [139], SSIM [133], VIF [117], VSNR [33], NQM [41], PSNR-HVS [43], IFC [118], PSNR, FSIM [148], and MAD [63]. So far, there is not a single quality index that signicantly outperforms others. Some method may be superior for one image distor- tion type but inferior for others. Thus, the idea of multi-method fusion (MMF) arises naturally. The major contribution of this research is to provide a new perspective in visual quality assessment to complement existing eorts that target at developing a single method suitable for certain types of images. Given the complex and diversifying nature of general visual content and distortion types, it would be challenging to solely rely on a single method. On the other hand, we demonstrate that it is possible to achieve signicantly better performance by fusing multiple methods with proper means (e.g., machine learning) at the cost of higher complexity. The performance of the proposed MMF scheme will be improved continuously when new methods invented by the research community are incorporated. 
53 Table 3.1: Ten Better-Recognized IQA Methods IQA method index Abbreviation Full name m1 MS-SSIM Multi-Scale Structural Similarity m2 SSIM Structural Similarity m3 VIF Visual Information Fidelity m4 VSNR Visual Signal-to-Noise Ratio m5 NQM Noise Quality Measure m6 PSNR-HVS Peak Signal-to-Noise Ratio - Human Visual System m7 IFC Information Fidelity Criterion m8 PSNR Peak Signal-to-Noise Ratio m9 FSIM Feature Similarity m10 MAD Most Apparent Distortion To achieve MMF, we adopt a regression approach (e.g., support vector regression (SVR)). First, we collect a number of image quality evaluation methods. Then, we set the new MMF score to be the nonlinear combination of scores from multiple methods with suitable weighting coecients. Clearly, these weights can be obtained by the regression approach Although it is possible to perform MMF independently of image distortion types, the assessment result can be improved if we take image distortion types into account. To be more specic, we may divide image distortion types into several major groups (i.e., contexts) and perform regression within each group. In other words, the term context in this work means an image group consisting of similar distortion types. We call the one independent of distortion type as "context-free MMF" (CF-MMF) and the one depending on distortion types as "context-dependent MMF" (CD-MMF), respectively. For CD-MMF, one important task is to determine the context automatically. Here, we use a machine learning approach for context determination. As a result, the proposed CD-MMF system consists of two steps: 1) context determination and 2) MMF for a given context. This work is an extension of our previous work in [75] with a substantial amount of new material. First, we provide a detailed discussion on support vector regression (SVR) theory, which is the machine learning tool used for fusion of multiple methods in Section 3.3.2. Second, we oer an elaborated study on the fusion rules in Section 3.5. All material in Section 3.5 is new. Specically, we develop a new fused IQA methods 54 selection algorithm called the Biggest Index Ranking Dierence (BIRD) that is used to select the most appropriate method for fusion so as to reduce the complexity of the MMF method proposed in [75]. Furthermore, we compare BIRD with another fused IQA methods selection algorithm called the Sequential Forward Method Selection (SFMS) in terms of performance accuracy and complexity. Third, we conduct a more thorough experimental evaluation. Only the TID database was tested in [75]. Here, we test the performance of the MMF method against six publicly available image quality databases in Section 3.6. Finally, we replace two poor-performed methods in [75] with two more recently developed approaches (i.e., FSIM [148] and MAD [63]) in the MMF process so as to achieve a better correlation between predicted objective quality scores and human subjective scores. This also demonstrates that the proposed MMF approach is able to accommodate new methods to produce better results. The Pearson Correlation Coecient (PCC) performance of the proposed MMF method ranges from 0.94 to 0.98 with respect to various image quality databases. The rest of this chapter is organized as follows. Some prior related works are reviewed in Section 3.2. The MMF process based on regression is described in Section 3.3. The MMF types, context denition and determination are investigated in Section 3.4. 
In Section 3.5, we try to reduce the complexity of the proposed MMF approach via two fused IQA methods selection algorithms. Experimental results are reported in Sec- tion 3.6, where extensive performance comparisons are made across multiple image qual- ity databases. Finally, concluding remarks and future works are given in Section 3.7. 3.2 Review of Previous Work Machine learning and multiple-metric based image quality assessment methods were reported in the literature before. Luo [83] proposed a two-step algorithm to assess image quality. First, a face detection algorithm is used to detect human faces from the image. Second, the spectrum distribution of the detected region is compared with a trained 55 model to determine its quality score. The restriction is that it primarily applies to images that contain human faces. Although the authors claimed it's not dicult to generalize faces to other objects, they still only provided the results which used the images containing human faces to prove the feasibility of their algorithm. Suresh et al. [121, 122] proposed the use of a machine learning method to mea- sure the visual quality of JPEG-coded images. Features are extracted by considering factors related with the human visual sensitivity, such as edge length, edge amplitude, background luminance, and background activity. The visual quality of an image is then computed using the predicted class number and their estimated posterior probability. It was shown by experimental results that the approach performs better than other met- rics. However, it is only applicable to JPEG-coded images since the above features are calculated based on the DCT blocks. The machine learning tool has been used in developing an objective image quality metric. For example, Narwaria and Lin [91] proposed to use singular vectors out of singular value decomposition (SVD) as features to quantify the major structural infor- mation in images. Then, they applied support vector regression (SVR) for image quality prediction, where the SVR method has the ability to learn complex data patterns and maps complicated features into a proper score. Moreover, Leontaris et al. [66] collected 15 metrics, and evaluated each one of them to see if they satisfy the expectation of a good video quality metric. In the end, they linearly combined two metrics (MCEAM and GBIM) to get a hybrid metric by using simple coecients as the weights. In summary, the works in [83, 91, 121, 122] used extracted features for model training and test. They are related to machine learning. Another work [66] showed the advantage of integrating results from two methods (although it did not use the machine learning approach). As compared with the previous work, the proposed multi-method fusion (MMF) idea is new since it oers a generic framework that enables better performance than existing 56 Table 3.2: Image Distortion Types in TID2008 Database Type Type of distortion 1 Additive Gaussian Noise 2 Dierent additive noise in color components 3 Spatially correlated noise 4 Masked noise 5 High frequency noise 6 Impulse noise 7 Quantization noise 8 Gaussian blur 9 Image denoising 10 JPEG compression 11 JPEG2000 compression 12 JPEG transmission errors 13 JPEG2000 transmission errors 14 Non eccentricity pattern noise 15 Local block-wise distortions of dierent intensity 16 Mean shift (intensity shift) 17 Contrast change methods and serves as a reference for future research. 
The MMF is developed in a systematic manner to complement the existing (and even future) approaches.

3.3 Multi-Method Fusion (MMF)

3.3.1 Motivation

Many objective quality indices have been developed during the last decade. We consider the ten existing better-recognized methods given in Table 3.1. Furthermore, we consider the 17 image distortion types given in the TID2008 database [16]; they are listed in Table 3.2. In Table 3.3, we list the top three quality indices for each distortion type in terms of the Pearson correlation coefficient (PCC). We observe that different quality indices work well with respect to different image distortion types. For example, the PSNR may not accurately predict quality scores for most distortion types, but it does work well for images corrupted by additive noise as well as quantization noise. Generally speaking, the PSNR and its variant PSNR-HVS work well for image distortion types 1-7, while FSIM works well for image distortion types 8-17.

Table 3.3: Top Three Quality Indices for Image Distortion Types in Table 3.2 (in Terms of the PCC Performance)

Distortion type   Best method (PCC)   2nd best method (PCC)   3rd best method (PCC)
1                 m6 (0.9366)         m8 (0.9333)             m3 (0.8717)
2                 m8 (0.9285)         m6 (0.9137)             m3 (0.9004)
3                 m8 (0.9524)         m6 (0.9510)             m10 (0.8745)
4                 m3 (0.8928)         m8 (0.8737)             m6 (0.8240)
5                 m6 (0.9730)         m8 (0.9708)             m3 (0.9464)
6                 m8 (0.9084)         m6 (0.8651)             m3 (0.8263)
7                 m6 (0.8965)         m8 (0.8911)             m2 (0.8745)
8                 m1 (0.9506)         m2 (0.9452)             m9 (0.9414)
9                 m9 (0.9680)         m2 (0.9664)             m1 (0.9638)
10                m6 (0.9720)         m9 (0.9710)             m2 (0.9608)
11                m9 (0.9801)         m10 (0.9789)            m1 (0.9751)
12                m1 (0.8844)         m9 (0.8823)             m10 (0.8784)
13                m6 (0.9256)         m2 (0.8574)             m9 (0.8491)
14                m7 (0.8394)         m10 (0.8315)            m3 (0.7460)
15                m2 (0.8768)         m9 (0.8531)             m3 (0.8434)
16                m2 (0.7547)         m6 (0.7099)             m8 (0.7076)
17                m3 (0.9047)         m9 (0.7706)             m1 (0.7689)

3.3.2 Support Vector Regression (SVR)

In order to develop an MMF method that handles all image distortion types and extents, we would like to integrate the scores obtained from multiple quality indices into one score. Although many different fusion tools exist, we adopt a support vector regression approach here due to its relatively superior performance.

Suppose we have a set of training data (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_m, y_m), where \mathbf{x}_i \in R^n is a feature vector and y_i \in R is the target output. In \varepsilon-support vector regression (\varepsilon-SVR) [28], we want to find a linear function,

f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b = \mathbf{w}^T\mathbf{x} + b,   (3.1)

which deviates from the actually obtained target outputs y_i by at most \varepsilon for all the training data and, at the same time, is as flat as possible, where \mathbf{w} \in R^n and b \in R. In other words, we want to find \mathbf{w} and b such that

\left| f(\mathbf{x}_i) - y_i \right| \leq \varepsilon, \quad \forall i = 1,\dots,m,   (3.2)

where \varepsilon \geq 0; equivalently, the vector of residuals has l_\infty norm at most \varepsilon. Flatness in (3.1) means we have to seek a small \mathbf{w} [120]. For this reason, it is required to minimize \|\mathbf{w}\|_2^2, where \|\cdot\|_2 is the Euclidean (l_2) norm. Generally, this can be written as a convex optimization problem:

\min \quad \frac{1}{2}\|\mathbf{w}\|_2^2 \qquad \text{subject to} \quad \left| f(\mathbf{x}_i) - y_i \right| \leq \varepsilon, \quad i = 1,\dots,m.   (3.3)

Introducing two slack variables \varepsilon_i \geq 0 and \hat{\varepsilon}_i \geq 0 to cope with otherwise infeasible constraints of (3.3), (3.3) becomes

\min \quad \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{m}(\varepsilon_i + \hat{\varepsilon}_i) \qquad \text{subject to} \quad y_i \leq f(\mathbf{x}_i) + \varepsilon + \varepsilon_i, \quad y_i \geq f(\mathbf{x}_i) - \varepsilon - \hat{\varepsilon}_i, \quad \varepsilon_i \geq 0, \ \hat{\varepsilon}_i \geq 0, \quad i = 1,\dots,m, \quad \varepsilon \geq 0,   (3.4)

where C is a penalty parameter for the error term.
The optimization problem (3.4) can be solved through its Lagrangian dual problem

\max \quad -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(a_i - \hat{a}_i)(a_j - \hat{a}_j)\,\mathbf{x}_i^T\mathbf{x}_j - \varepsilon\sum_{i=1}^{m}(a_i + \hat{a}_i) + \sum_{i=1}^{m}(a_i - \hat{a}_i)\,y_i \qquad \text{subject to} \quad \sum_{i=1}^{m}(a_i - \hat{a}_i) = 0, \quad 0 \leq a_i, \hat{a}_i \leq C, \quad i = 1,\dots,m,   (3.5)

where a_i \geq 0 and \hat{a}_i \geq 0 are Lagrange multipliers. After solving (3.5), we can obtain

\mathbf{w} = \sum_{i=1}^{m}(a_i - \hat{a}_i)\,\mathbf{x}_i,   (3.6)

f(\mathbf{x}) = \sum_{i=1}^{m}(a_i - \hat{a}_i)\,\mathbf{x}_i^T\mathbf{x} + b.   (3.7)

Using the Karush-Kuhn-Tucker (KKT) conditions below,

a_i\left(\varepsilon + \varepsilon_i + \mathbf{w}^T\mathbf{x}_i + b - y_i\right) = 0, \quad \hat{a}_i\left(\varepsilon + \hat{\varepsilon}_i - \mathbf{w}^T\mathbf{x}_i - b + y_i\right) = 0, \quad (C - a_i)\,\varepsilon_i = 0, \quad (C - \hat{a}_i)\,\hat{\varepsilon}_i = 0,   (3.8)

b can be computed as follows:

b = \begin{cases} y_i - \varepsilon - \mathbf{w}^T\mathbf{x}_i, & 0 < a_i < C \\ y_i + \varepsilon - \mathbf{w}^T\mathbf{x}_i, & 0 < \hat{a}_i < C. \end{cases}   (3.9)

The support vectors are defined as those data points that contribute to the predictions given by (3.7), i.e., the \mathbf{x}_i's for which a_i - \hat{a}_i \neq 0. The complexity of f(\mathbf{x}) is related to the number of support vectors.

Similarly, for nonlinear regression, we simply define f(\mathbf{x}) = \mathbf{w}^T\varphi(\mathbf{x}) + b, where \varphi(\mathbf{x}) denotes a fixed feature-space transformation. Then

\mathbf{w} = \sum_{i=1}^{m}(a_i - \hat{a}_i)\,\varphi(\mathbf{x}_i),   (3.10)

f(\mathbf{x}) = \sum_{i=1}^{m}(a_i - \hat{a}_i)\,\varphi^T(\mathbf{x}_i)\,\varphi(\mathbf{x}) + b = \sum_{i=1}^{m}(a_i - \hat{a}_i)\,K(\mathbf{x}_i, \mathbf{x}) + b,   (3.11)

where K(\mathbf{x}_i, \mathbf{x}) is a kernel function. The kernel function K(\mathbf{x}_i, \mathbf{x}_j) can be defined as

K(\mathbf{x}_i, \mathbf{x}_j) = \varphi^T(\mathbf{x}_i)\,\varphi(\mathbf{x}_j).   (3.12)

There are four basic kernels [56]. We list two commonly used ones:

Linear: K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j.   (3.13)

Radial basis function (RBF): K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma\,\|\mathbf{x}_i - \mathbf{x}_j\|_2^2\right), \quad \gamma > 0,   (3.14)

where \gamma is a kernel parameter.

Since the proper value of the parameter \varepsilon is difficult to determine, we resolve this problem by using a different version of the regression algorithm, \nu-support vector regression (\nu-SVR) [28], in which \varepsilon itself is a variable in the optimization process and is controlled by another new parameter \nu \in (0, 1). In fact, \nu is a parameter that can be used to control the number of support vectors and the upper bound on the fraction of error points. Hence, \nu is a more convenient parameter than \varepsilon for adjusting the accuracy level to the data. Therefore, \nu-SVR solves

\min \quad \frac{1}{2}\|\mathbf{w}\|_2^2 + C\left(\nu\varepsilon + \frac{1}{m}\sum_{i=1}^{m}(\varepsilon_i + \hat{\varepsilon}_i)\right) \qquad \text{subject to} \quad y_i \leq f(\mathbf{x}_i) + \varepsilon + \varepsilon_i, \quad y_i \geq f(\mathbf{x}_i) - \varepsilon - \hat{\varepsilon}_i, \quad \varepsilon_i \geq 0, \ \hat{\varepsilon}_i \geq 0, \quad i = 1,\dots,m, \quad \varepsilon \geq 0.   (3.15)

The dual problem is

\max \quad -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}(a_i - \hat{a}_i)(a_j - \hat{a}_j)\,K(\mathbf{x}_i, \mathbf{x}_j) + \sum_{i=1}^{m}(a_i - \hat{a}_i)\,y_i \qquad \text{subject to} \quad \sum_{i=1}^{m}(a_i - \hat{a}_i) = 0, \quad \sum_{i=1}^{m}(a_i + \hat{a}_i) \leq C\nu, \quad 0 \leq a_i, \hat{a}_i \leq C/m, \quad i = 1,\dots,m.   (3.16)

Following the same procedure, we can obtain the same expressions for \mathbf{w} and f(\mathbf{x}) as in (3.10) and (3.11). In this work, we choose \nu-SVR as our tool for all the regressions because of its convenience in parameter selection.

3.3.3 MMF Scores

Consider the fusion of n image quality assessment methods with m training images. For the i-th training image, we can compute its quality score under each method individually; the score is denoted by x_{i,j}, where i = 1, 2, \dots, m is the image index and j = 1, 2, \dots, n is the method index. Also, we define the quality score vector \mathbf{x}_i = (x_{i,1}, \dots, x_{i,n})^T for the i-th image. The new MMF quality score is defined as

\mathrm{mmf}(\mathbf{x}_i) = \mathbf{w}^T\varphi(\mathbf{x}_i) + b,   (3.17)

where \mathbf{w} = (w_1, \dots, w_n)^T is the weighting vector and b is the bias.
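To make the fusion in (3.17) concrete, the sketch below learns the MMF score from the per-image scores of the candidate methods with an RBF ν-SVR and then predicts fused scores for test images (a minimal sketch, assuming scikit-learn's NuSVR as the ν-SVR solver; method_scores and mos are hypothetical arrays holding one row of n method scores per training image and the corresponding subjective scores, and the [0, 1] score scaling described in the next subsection is included):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import NuSVR

def train_mmf(method_scores, mos, nu=0.5, C=1.0):
    # Learn mmf(x) = w^T phi(x) + b of Eq. (3.17) with an RBF nu-SVR.
    scaler = MinMaxScaler()                         # scale each method's scores to [0, 1]
    X = scaler.fit_transform(np.asarray(method_scores, dtype=float))
    model = NuSVR(nu=nu, C=C, kernel="rbf", gamma="scale")
    model.fit(X, np.asarray(mos, dtype=float))
    return scaler, model

def mmf_score(scaler, model, method_scores):
    # Fused quality score(s) for one or more test images.
    X = scaler.transform(np.atleast_2d(np.asarray(method_scores, dtype=float)))
    return model.predict(X)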
3.3.4 Data Scaling and Cross-Validation

Before applying SVR, we need to linearly scale the scores obtained from each quality index to the same range [0, 1] to avoid quality indices with larger numerical ranges (e.g., PSNR) dominating those with smaller numerical ranges (e.g., SSIM). Another advantage is to avoid numerical difficulties during the calculation. The linear scaling operation is performed for both the training and testing data [56].

In all experiments, we use the n-fold (e.g., n equal to 5) cross-validation strategy (which is widely used in machine learning [30]) to select our training and testing sets. First, we divide the image set into n sets. One set is used for testing, and the remaining n - 1 sets are used for training. Then, we repeat this n times, making sure each set is used as the testing set exactly once. The testing results from the n folds are then combined and averaged to compute the overall correlation coefficients and error. This procedure helps prevent the over-fitting problem.

3.3.5 Training Stage

In the training stage, we would like to determine the weight vector \mathbf{w} and the bias b from the training data so as to minimize the difference between \mathrm{mmf}(\mathbf{x}_i) and the (differential) mean opinion score (D)MOS_i obtained from human observers; namely,

\left\| \mathrm{mmf}(\mathbf{x}_i) - \mathrm{(D)MOS}_i \right\|, \quad i = 1, \dots, m,   (3.18)

where \|\cdot\| denotes a certain norm. Several commonly used difference measures include the Euclidean (l_2) norm, the l_1 norm, and the l_\infty norm. The Euclidean norm leads to the standard least-squares curve fitting problem. However, this choice severely penalizes quality metrics that have a few outliers. Similar to (3.2), we demand that the maximum absolute difference in (3.18) be bounded by a certain level (denoted by \varepsilon), and adopt support vector regression (SVR) [34] for its solution (i.e., to determine the weight vector and the bias). In conducting SVR, we choose (3.14) as the kernel function. The main advantage of the RBF kernel is its ability to handle the case where the relation between (D)MOS_i and the quality score vector \mathbf{x}_i is nonlinear. Besides, the number of hyperparameters influences the complexity of model selection, and the RBF kernel has fewer hyperparameters than the polynomial kernel. That is why the RBF kernel is our first choice. In addition, after a series of experiments, we found that the RBF kernel always gives better performance than the other kernels (linear or polynomial) in all cases. Hence, we choose the RBF kernel in all the experiments we conducted. The choice of the RBF kernel is also corroborated in [56].

3.3.6 Testing Stage

In the testing stage, we use the quality score vector \mathbf{x}_k of the k-th test image, where k = 1, 2, \dots, l, with l being the number of test images, and (3.17) to determine the quality score of the MMF method, \mathrm{mmf}(\mathbf{x}_k). Clearly, the test can be done very quickly as long as we have the trained model.

3.4 Contexts for MMF Scheme

To achieve better quality scores with the MMF method, we may cluster image distortion types into several distinct groups and then determine the regression rule for each group individually. We call each group a context, which consists of similar image distortion types. The resulting scheme is called context-dependent MMF (CD-MMF), while the scheme without the context classification stage is called context-free MMF (CF-MMF).

Table 3.4: The Context Definition for Each Database

A57:      Context I - Additive White Gaussian Noise;  Context II - Gaussian Blur;  Context III - JPEG, JPEG2000, JPEG2000 w/ DCQ;  Context IV - Quantization of Subband of DWT;  Context V - (none)
CSIQ:     Context I - White Noise, Pink Noise;  Context II - Gaussian Blur;  Context III - JPEG, JPEG2000;  Context IV - Contrast Decrease;  Context V - (none)
IVC:      Context I - LAR Coding;  Context II - Blur;  Context III - JPEG, JPEG2000;  Contexts IV-V - (none)
LIVE:     Context I - White Noise;  Context II - Gaussian Blur;  Context III - JPEG, JPEG2000;  Context IV - Fast Fading Rayleigh;  Context V - (none)
TID2008:  Context I - distortion types 1-7;  Context II - types 8-9;  Context III - types 10-11;  Context IV - types 12-13;  Context V - types 14-17
Toyoma:   Context I - JPEG, JPEG2000;  Contexts II-V - (none)
This involves two issues: 1) the denition of contexts and 2) automatic determination of contexts. They will be discussed in Sections 3.4.1 and 3.4.2, respectively. 3.4.1 Context Denition in CD-MMF To dene the contexts for six image quality databases, including A57 [1], CSIQ [2], IVC [7], LIVE [9], TID2008 [16], and Toyoma [17], we combine similar image distortion types into one group (i.e., context). The detailed context denition for each database is in Table 3.4. However, there are 17 types of distortions for TID2008 database [16]. Thus, in order to make the classication and cross-database comparison easy, we have to classify distortion types into 5 contexts according to the distortion characteristics described below: Context I: Collection of all kinds of additive noise Context II: Blurring Context III: JPEG+JPEG2000 Context IV: Error caused by transmission Context V: Intensity deviation 64 3.4.2 Automatic Context Determination in CD-MMF To perform CD-MMF, the system should be able to determine the context automatically. We extract the following ve features and apply a machine learning approach to achieve this task. The rst three features [137] are calculated horizontally and vertically, and combined into a single value by averaging. 1. Blockiness (along the horizontal direction): It is dened as the average dierences across block boundaries B h = 1 M(bN=8c 1) M X i=1 bN=8c1 X j=1 jd h (i; 8j)j; (3.19) where d h (i;j) = x(i;j + 1)x(i;j);j2 [1;N 1], is the dierence signal along horizontal line, and x(i;j);i2 [1;M];j2 [1;N] for an image of size MN. 2. Average absolute dierence between in-block image samples (along the horizontal direction): A h = 1 7 2 4 8 M(N 1) M X i=1 N1 X j=1 jd h (i;j)jB h 3 5 : (3.20) 3. Zero-crossing (ZC) rate: Dene z h (m;n) = 8 < : 1 ZC happens at d h (m;n) 0 otherwise; Then, the horizontal ZC rate is estimated as: Z h = 1 M(N 2) M X i=1 N2 X j=1 z h (i;j): (3.21) 65 Table 3.5: The Context Classication Rate for Each Database Database A57 CSIQ IVC LIVE TID2008 Toyoma Context classication accuracy 88.9% 78.1% 81.6% 80.7% 83.6% 100% We only dene the three horizontal components above; one can calculate vertical components B v ;A v ;Z v in a similar fashion. Finally, the desired features are given by B = B h +B v 2 ;A = A h +A v 2 ;Z = Z h +Z v 2 : (3.22) For the justication of these 3 features, the reader is referred to [137]. In addition, we introduce two more features below. 4. Average edge-spread: First, edge detection is applied to the image. For each edge pixel, we search the gradient direction to count the number of pixels with an increasing grey level value in the "+" direction, a decreasing grey level value in the "-" direction, and stop when signicant gradient does not exist. Then, the sum of these two pixel counts is the edge-spread. The average edge-spread is computed by dividing the total amount of edge-spread by the number of edge pixels in the image. The details about this feature are given in [98]. 5. Average block variance in the image: First, we divide the whole image into blocks of size 44 and classify them into "smooth" or "non-smooth" blocks based on the existence of edges. Then, we collect a set of smooth blocks of size 44 and make sure that they do not across the boundary of 88 DCT blocks. Finally, we compute the variance of each block and obtain the average. For each database, we can use the support vector machine (SVM) algorithm and the cross-validation method to classify images into dierent contexts with these ve features. 
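The sketch below shows how the first three of these features can be computed (a minimal sketch of Eqs. (3.19)-(3.22), assuming a grayscale image stored as an M x N NumPy array; the zero-crossing test is implemented here as a sign change between neighboring difference samples, which is an assumption, and the edge-spread and block-variance features of items 4 and 5 are omitted for brevity):

import numpy as np

def horizontal_features(img):
    # Blockiness B_h, in-block difference A_h, and zero-crossing rate Z_h
    # along the horizontal direction, following Eqs. (3.19)-(3.21).
    x = np.asarray(img, dtype=np.float64)
    M, N = x.shape
    d_h = x[:, 1:] - x[:, :-1]                          # difference signal d_h(i, j)
    block_cols = np.arange(8, N, 8)[: N // 8 - 1] - 1   # columns at 8x8 block boundaries (0-based)
    B_h = np.mean(np.abs(d_h[:, block_cols]))           # Eq. (3.19)
    A_h = (8.0 * np.mean(np.abs(d_h)) - B_h) / 7.0      # Eq. (3.20)
    Z_h = np.mean(d_h[:, :-1] * d_h[:, 1:] < 0)         # Eq. (3.21), sign-change definition
    return B_h, A_h, Z_h

def context_features(img):
    # Combine horizontal and vertical components as in Eq. (3.22).
    B_h, A_h, Z_h = horizontal_features(img)
    B_v, A_v, Z_v = horizontal_features(np.asarray(img).T)   # vertical = horizontal on the transpose
    return np.array([(B_h + B_v) / 2.0, (A_h + A_v) / 2.0, (Z_h + Z_v) / 2.0])

The resulting feature values (together with the two remaining features) can then be fed to the SVM classifier described above.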
The correct context classication rate for the selected database is listed in Table 3.5 by 66 Figure 3.1: The block diagram of the proposed CF-MMF (without block A) and CD- MMF (with block A) quality assessment system. using SVM and the 5-fold cross-validation, where the average accuracy of the tests over 5 sets is taken as the performance measure. Although the classication rate is reasonable, it is still lower than 90% for most databases. However, the classication rate of the context determination is not that critical since it is simply an intermediate step. The overall performance of visual quality evaluation (as to be discussed in the next section) is what we care about. Actually, as long as we use the same context classication rule in the training and the testing stages, the fusion rule will be properly selected with respect to the classied group. Once the context of an image is given, we can apply a dierent fusion rule (i.e., the combination of multiple IQA methods, called the fused IQA methods) to a dierent context so as to optimize the assessment performance of the proposed MMF method. The block diagram of the MMF quality assessment system is given in Fig. 3.1. 3.5 Fused IQA Methods Selection In either CF-MMF or CD-MMF, we need to nd out what is the best combination for fused IQA methods, which should not only achieve higher correlation with MOS (DMOS) but also have lower complexity. Given above requirements, the fused IQA methods can be selected by the algorithms as follows. 67 3.5.1 Sequential Forward Method Selection (SFMS) First, given a method set M =fm j jj = 1;:::; 10g, we want to nd a subset M N = fm i1 ;m i2 ; ;m iN g, withN < 10, to optimize an objective functionJ(M N ), which can be one of the following three forms: J(M N ) = PCC(mmf(M N ); (D)MOS) (3.23) J(M N ) = SROCC(mmf(M N ); (D)MOS) (3.24) J(M N ) = RMSE(mmf(M N ); (D)MOS); (3.25) where PCC, SROCC, and RMSE are the Pearson linear correlation coecient, Spear- man rank order correlation coecient, and root-mean-squared error between predicted objective scores and subjective scores, respectively. If (3.23) or (3.24) is used, then max- imization needs to be applied. Otherwise, we manage to minimize (3.25)) instead. Here, we choose (3.23) since PCC represents the prediction accuracy of evaluation [127]. Sequential Forward Method Selection (SFMS) is the simplest greedy search algo- rithm to achieve the above goal. Starting from a method set M k (being empty at the start), we sequentially add one method m that results in the highest objective function J(M k +m ) between MOS (DMOS) and the SVR output mmf(x i ) to the set when combined with the method set M k that have already been selected. The algorithm can be stated below for clarity: Algorithm 1: Sequential Forward Method Selection (SFMS) 1. Start with the empty method set M 0 =fg. 2. Select the next best method. m = arg max m2MM k J(M k +m) 68 3. Update M k+1 = M k +m ; k = k + 1: 4. Go to 2. 3.5.2 Biggest Index Ranking Dierence (BIRD) Since we use n-fold cross-validation, we can obtain n dierent sets of training data for each of the ten candidate methods. Then, we dene a characteristic index I j for thejth method in the candidate method set as I j = 1 n n X i=1 Var[F j;i ] Mean[F j;i ] ;i = 1;:::;n;j = 1;:::; 10; (3.26) where F j;i is the ith fold training data of the jth method. The more diverse between two methods is for the trained model, the higher their characteristic index dierence is. 
Using the index I j , we develop the following algorithm to reduce the number of fused IQA methods. Algorithm 2: Biggest Index Ranking Dierence (BIRD) 1. Find the index of the most correlated method, and denoted as k. k = arg max j J(m j ) 2. Set the threshold value n th;dbs for database s as follows: For CF-MMF, n th;dbs = 1PCC PSNR;dbs 0:1 : (3.27) 69 For CD-MMF, n th;dbs = max 2; 1PCC PSNR;dbs 0:1 2 ; (3.28) wherede denotes the ceiling function, and PCC PSNR;dbs represents the Pearson correlation coecient between PSNR and MOS (DMOS) for database s. 3. Compute index I j ;j = 1;:::; 10. 4. Sort the methods from the smallest to the largest by index I j and denote the ranking of method j as r(m j ). i 1 =k; for n = 1 to n th;dbs G n = fi 1 ;:::;i n g ; i n+1 = arg max i2f1;:::;10gGn jr(m i )r(m in )j ; end 5. Choose the method set n m i 1 ;m i 2 ; ;m i n th;dbs +1 o as the fused IQA methods. Basically, this idea of nding the biggest index ranking dierence (BIRD) is analogous to picking up the most dissimilar feature for the current feature. After choosing the rst fused IQA method, we choose the one which has the biggest index ranking dierence with the rst one as the second fused IQA method since it has the most dierent characteristics comparing to the rst chosen fused IQA method. Following the same idea, we can decide the third, the fourth, and the fth fused IQA methods and so on. Actually, we may think of (3.26) as a representation of the characteristics of the method j. Intuitively, we do not want to combine two methods having similar char- acteristic index value together since this way of fusion cannot give us extra advantage comparing to the original single method. That's because these two methods may share a 70 Table 3.6: Complexity Comparison of the Fused IQA Methods Selection Algorithms in Terms of Required Arithmetic Operations Method Exhaustive search SFMS BIRD N=3 122 33 25 N=4 212 42 26 N=5 254 50 27 N=6 212 57 28 lot of common characteristics. We may only need one of them. For instance, PSNR (m8) and PSNR-HVS (m6) both can predict image quality well on additive noise distortion types. But in order to achieve better prediction results, we never combine them together (See the BIRD case in Tables 3.8 and 3.11). Therefore, instead of doing fusion with two similar methods, we choose the next fused IQA method by picking up the method which has the most complementary characteristics to obtain extra advantage over original one. 3.5.3 Complexity Analysis Since we have ten methods in the candidate method set, exhaustive evaluation of method subsets involves 10 N combinations for a xed value of N, and 210 combinations if the selection of N is to be optimized as well. This scale of combinations is unfeasible, even for moderate values of N. So a search procedure must be used in practice. We compare the complexity of two fused IQA methods selection approaches mentioned in Table 3.6. The complexity of the exhaustive evaluation is also listed together in Table 3.6 for easy comparison. Here, the algorithm complexity is measured by the total number of required arithmetic operations for 1) PCC computation, 2) value sorting in descending order, and 3) the number of methods (features) selected for the fusion. It appears that SFMS and BIRD all have relatively low complexity comparing to the exhaustive evaluation, especially BIRD. In summary, both method selection algorithms can save us more time on nding the best combination of fused IQA methods in a systematic way for the training process. 
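To illustrate how such a selection loop can be organized in practice, the sketch below implements the greedy SFMS search of Algorithm 1 on top of cross-validated ν-SVR fusion (a minimal sketch, assuming scikit-learn; scores is a hypothetical array of size (number of images) x 10 holding the scaled scores of the ten candidate methods, mos holds the subjective scores, and the objective J is the PCC of (3.23)):

import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_predict
from sklearn.svm import NuSVR

def J_pcc(scores, mos, subset, folds=5):
    # Objective J(M_N) of Eq. (3.23): PCC between the cross-validated
    # nu-SVR fusion of the selected methods and the subjective scores.
    X = scores[:, sorted(subset)]
    pred = cross_val_predict(NuSVR(kernel="rbf", gamma="scale"), X, mos, cv=folds)
    return pearsonr(pred, mos)[0]

def sfms(scores, mos, n_methods):
    # Algorithm 1 (SFMS): greedily add the method that maximizes J.
    selected, remaining = [], set(range(scores.shape[1]))
    while len(selected) < n_methods:
        best = max(remaining, key=lambda m: J_pcc(scores, mos, selected + [m]))
        selected.append(best)
        remaining.remove(best)
    return selected

By contrast, BIRD (Algorithm 2) avoids the repeated evaluations of J inside the loop: it ranks the candidate methods once by the characteristic index I_j of (3.26) and then picks methods with the largest ranking differences, which is why its operation counts in Table 3.6 are the lowest.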
We will further demonstrate their performances in Section 3.6. 71 3.5.4 Discussion Besides the algorithms we mentioned above, principal components analysis (PCA) [42] is a well-known method for feature reduction. It transforms the existing features into a lower dimensional space. The reason we did not use it is because we only have ten methods (features). It seems no need to reduce the dimension a lot. After all, we still need 3 to 6 methods (features) for the fusion. Moreover, we do not want to lose the score information of original methods (features). That's why we use the methods which can select the subset of features instead of transforming the features into a lower dimensional space with dierent units. 3.6 Performance Evaluation 3.6.1 Databases For performance evaluation, six image quality databases (A57, CSIQ, IVC, LIVE, TID2008, and Toyoma) are used. We brie y introduce each database as below. The A57 Database [1] has 3 original images, and 54 distorted images, including six distortion types - quantization of the LH subbands of a 5-level DWT of the image using the 9/7 lters, additive Gaussian white noise, JPEG compression, JPEG2000 compres- sion, JPEG2000 compression with the Dynamic Contrast-Based Quantization (DCQ), and Gaussian blurring. The subjective quality scores used for this database are DOMS, ranging from 0 to 1. The Categorical Image Quality (CSIQ) Database [2] contains 30 reference images, and each image is distorted using 6 types of distortions - JPEG compression, JPEG2000 compression, global contrast decrements, additive Gaussian white noise, additive Gaus- sian pink noise, and Gaussian blurring - at 4 to 5 dierent levels, resulting in 866 distorted images. The score ratings (0 to 1) are reported in the form of DMOS. The IVC Database [7] has 10 original images, and 185 distorted images, including 4 types of distortions - JPEG, JPEG2000, locally adaptive resolution (LAR) coding, and 72 blurring. The subjective quality scores provided in this database are MOS, ranging from 1 to 5. The LIVE Image Quality Database [9] has 29 reference images, and 779 test images, including ve distortion types - JPEG2000, JPEG, white noise in the RGB components, Gaussian blur, and transmission errors in the JPEG2000 bit stream using a fast-fading Rayleigh channel model. The subjective quality scores provided in this database are DMOS, ranging from 0 to 100. The Tampere Image Database (TID2008) [16] includes 25 reference images, 17 types of distortions for each reference image (Table 3.2), and 4 dierent levels for each type of distortion. The whole database contains 1700 distortion images. MOS is provided in this database, and the scores range from 0 to 9. The Toyoma Database [17] has 14 original images, and 168 distorted images, including two types of distortions - JPEG, and JPEG2000. The subjective scores in this database are MOS, ranging from 1 to 5. From the above introduction, we know that these six databases contain dierent visual contents, an abundant amount of distorted images, and diversifying image distor- tion types. Therefore, they can provide an excellent ground for performance evaluation. 3.6.2 Test Methodology and Performance Measure We use the following three indices to measure IQA model performance [127, 128]. The rst index is the Pearson correlation coecient (PCC) between objective/subjective scores after non-linear regression analysis. It provides an evaluation of prediction accu- racy. 
The second index is the Spearman rank order correlation coecient (SROCC) between the objective/subjective scores. It is considered as a measure of prediction monotonicity. The third index is the root-mean-squared error (RMSE). Before comput- ing the rst and second indices, we need to use the logistic function and the procedure 73 Table 3.7: Performance Measure of CF-MMF with SFMS in TID2008 Database No. of fused IQA methods Selected fused IQA methods PCC SROCC RMSE 1 m9 0.8726 0.8801 0.6556 2 m9, m3 0.9053 0.8953 0.5700 3 m9, m3, m7 0.9307 0.9275 0.4907 4 m9, m3, m7, m10 0.9432 0.9403 0.4460 5 m9, m3, m7, m10, m6 0.9502 0.9466 0.4183 6 m9, m3, m7, m10, m6, m1 0.9525 0.9487 0.4087 7 m9, m3, m7, m10, m6, m1, m8 0.9543 0.9495 0.4012 8 m9, m3, m7, m10, m6, m1, m8, m5 0.9546 0.9494 0.3999 9 m9, m3, m7, m10, m6, m1, m8, m5, m4 0.9543 0.9493 0.4008 10 m9, m3, m7, m10, m6, m1, m8, m5, m4, m2 0.9532 0.9483 0.4056 outlined in [127] to t the objective model scores to the MOS (or DMOS). The mono- tonic logistic function used to t the objective prediction scores to the subjective quality scores [127] is: f(x) = 1 2 1 + exp ( x 3 j 4 j ) + 2 ; (3.29) where x is the objective prediction score, f(x) is the tted objective score, and the parameters j (j = 1; 2; 3; 4) are chosen to minimize the least squares error between the subjective score and the tted objective score. For an ideal match between the objective prediction scores and the subjective quality scores, we will have PCC = 1, SROCC = 1 and RMSE = 0. One thing needs to be paid attention here. Without the logistic function, the method is still able to obtain reasonable results because RBF does a similar job. However, we observe that the logistic function enhances the performance since it oers better quality rating at the extremes of the test range. Moreover, to have a fair comparison among all IQA methods, it is still better to include the logistic function in the MMF approach. 3.6.3 CF-MMF First, we list the performance results of CF-MMF in Table 3.7 and plot the PCC results in Fig. 3.2 to see the trend better. The methods in Table 3.7 are selected by SFMS 74 Figure 3.2: PCC performance of CF-MMF in TID2008 database. algorithm, as described in Section 3.5. This selection algorithm can achieve better performance with a smaller number of methods reliably. Besides, the change of their application order does not aect the results. For example,fm9, m3, m7g,fm7, m9, m3g andfm3, m7, m9g all give the same performance. As shown in Fig. 3.2 (the method used in each case has been listed in Table 3.7), when the number of methods goes from 1 to 3, the performance improves drastically. There is still a little improvement when the number of methods increases from 3 to 8. Once the method number is over 8, the perfor- mance stays the same and sometimes even worse. Hence, we may have at least required 3 methods and up to 8 for the fusion. This phenomenon is well known as "the curse of the dimensionality" [29, 47]. The main reasons why the performance of high-dimensional data degrades rather than improve are 1) the increased noise and error when adding more features, and 2) the amount of data is not enough to obtain statistically sound and reliable estimates. 
This phenomenon is also called Hughes eect [100] in the eld 75 Table 3.8: Selected Fused IQA Methods for CF-MMF in Six Databases Selection algorithm |||{ SFMS BIRD Database A57 m4, m9, m10, m7, m1 m4, m8, m5, m10, m2 CSIQ m10,m3, m6 m10, m1, m7 IVC m9, m10, m5, m4 m9, m10, m1, m3 LIVE m10, m3, m6 m10, m1, m3 TID2008 m9, m3, m7, m10, m6, m1 m9, m7, m1, m10, m2, m3 Toyoma m10, m5, m6, m3, m2 m10, m1, m7, m2, m3 Table 3.9: Performance Measure for CF-MMF in Six Databases Database Selection algorithm PCC SROCC RMSE A57 (5 methods) SFMS 0.9604 0.9590 0.0685 BIRD 0.9465 0.9475 0.0793 CSIQ (3methods) SFMS 0.9797 0.9755 0.0527 BIRD 0.9698 0.9657 0.0641 IVC (4 methods) SFMS 0.9352 0.9226 0.4313 BIRD 0.9205 0.9096 0.4760 LIVE (3 methods) SFMS 0.9734 0.9732 6.2612 BIRD 0.9712 0.9710 6.5131 TID2008 (6 methods) SFMS 0.9525 0.9487 0.4087 BIRD 0.9482 0.9434 0.4261 Toyoma (5 methods) SFMS 0.9477 0.9419 0.3995 BIRD 0.9451 0.9402 0.4091 of machine learning. Therefore, we need the feature reduction (or selection) techniques introduced in Section 3.5 to reduce the dimension and achieve better results. To balance the performance and complexity, the number of fused IQA methods needed for each database can be computed via (3.27), which is equal to n th;dbs + 1. Hence, 3 methods are used for CSIQ and LIVE databases, 4 methods for IVC database, 5 methods for A57 and Toyoma databases, and 6 methods for TID2008 database, respec- tively. Here, more than 3 methods are determined for IVC, A57, Toyoma and TID2008 databases because they are the databases which are less correlated with PSNR or other image quality indices (See Table 3.8). We need to fuse more methods to achieve the same performance as in the other two databases. The selected fused IQA methods by using two algorithms (SFMS and BIRD) are listed in Table 3.8. The corresponding performances for both algorithms are also compared in Table 3.9. We can see that the performance is better by using the methods selected by SFMS. However, although the 76 Table 3.10: Performance Measure of CD-MMF with SFMS in TID2008 Database No. of fused IQA methods 1 2 3 4 5 PCC 0.9139 0.9265 0.9517 0.9538 0.9539 SROCC 0.9046 0.9174 0.9439 0.9463 0.9464 RMSE 0.5588 0.5049 0.4120 0.4032 0.4026 Figure 3.3: PCC comparison of CD-MMF in TID2008 database. performance degrades a little when using BIRD, the complexity of BIRD is lower than SFMS, as demonstrated in Table 3.6. 3.6.4 CD-MMF For CD-MMF, we increase the number of fused IQA methods from 1 to 5, and the performance stays almost the same (i.e., stop improving) when the number of fused methods is over 4 (Table 3.10). Note that the methods in Table 3.10 are selected by SFMS algorithm, which can help us reliably achieve better performance with a smaller number of methods. The PCC results are also plotted in Fig. 3.3 for comparison. We see that the quality prediction reaches the best correlation when the number of fused methods increases to four. 
Thus, we have similar conclusions as in CF-MMF, except for 77 Table 3.11: Selected Fused IQA Methods for CD-MMF in Six Databases Database Context SFMS BIRD A57 I m10, m6, m7 m10, m7, m8 II m4, m2, m1 m4, m8, m1 III m1, m10, m4 m1, m10, m5 IV m9, m10, m4 m9, m7, m8 CSIQ I m10, m6, m2 m10, m1, m7 II m10, m8, m5 m10, m1, m3 III m10, m2, m8 m10, m1, m3 IV m3, m6, m7 m3, m1, m4 IVC I m10, m6, m1 m10, m1, m3 II m10, m3, m7 m10, m7, m1 III m7, m2, m10 m7, m10, m1 LIVE I m3, m5, m10 m3, m9, m10 II m3, m1, m6 m3, m1, m7 III m10, m2, m4 m10, m1, m3 IV m3, m6, m9 m3, m1, m10 TID2008 I m6, m4, m5, m1 m6, m4, m10, m1 II m10, m6, m1, m5 m10, m4, m3, m7 III m9, m8, m6, m10 m9, m4, m3, m7 IV m3, m8, m9, m4 m3, m4, m5, m8 V m10, m7, m3, m1 m10, m3, m7, m9 Toyoma I m10, m5, m6 m10, m1, m7 Table 3.12: Performance Measure for CD-MMF in Six Databases Database Selection algorithm PCC SROCC RMSE A57 (3 methods) SFMS 0.9411 0.9498 0.0831 BIRD 0.9347 0.9354 0.0874 CSIQ (3 methods) SFMS 0.9675 0.9668 0.0664 BIRD 0.9630 0.9609 0.0707 IVC (3 methods) SFMS 0.9453 0.9382 0.3976 BIRD 0.9374 0.9285 0.4244 LIVE (3 methods) SFMS 0.9802 0.9805 5.4134 BIRD 0.9801 0.9798 5.4239 TID2008 (4 methods) SFMS 0.9538 0.9463 0.4032 BIRD 0.9476 0.9422 0.4289 Toyoma (3 methods) SFMS 0.9456 0.9411 0.4071 BIRD 0.9462 0.9421 0.4051 the upper limit of fused IQA methods decreasing from 8 to 4, and this allows us to lower the complexity of the proposed system for CD-MMF. The reason why CD-MMF works as well as CF-MMF and needs less fused IQA methods is because of the pre-classication of distortions in the rst stage. Then for these specic known distortions, we can use smaller number of fused IQA methods to achieve the same level of performance. As already mentioned earlier in Section 3.5, the optimal combination of fused IQA methods for each context is selected by two algorithms (SFMS and BIRD). The fused IQA methods selection rules for six databases are summarized in Table 3.11 for CD- MMF. And the performance measures for both algorithms are shown in Table 3.12. We 78 can observe that the performance dierence between SFMS and BIRD becomes smaller in CD-MMF. Especially, BIRD works a little better than SFMS in Toyoma database for CD-MMF setting. For the same reason as in CF-MMF, the number of fused IQA methods for each database can be decided via (3.28), which is still equal to n th;dbs + 1. Thus, 3 methods are adopted for all databases, except TID2008. Here, 4 methods are used for TID2008 database instead since it is the most dicult (in term of correlation) database among all. One more thing needs attention here. As shown in Table 3.11, the most frequent combinations of methods arefm10, m1, m3g andfm10, m1, m7g for BIRD algorithm, which are highlighted in bold in Table 3.11. These two combinations work well in 9 cases, especially for Blur and JPEG+JPEG2000 distortions. However, there is no combina- tion of methods that works consistently well for SFMS algorithm among all databases. Therefore, the BIRD algorithm is more general than the SFMS algorithm. 3.6.5 Performance Comparison between MMF and the Existing Meth- ods Finally, we compare the two proposed methods (i.e., CF-MMF, CD-MMF) and other mentioned quality indices with respect to the databases A57, CSIQ, IVC, LIVE, TID2008, and Toyoma. Here, we include one more quality index, information content weighted SSIM (IW-SSIM) [135] into the comparison since it has been proved to work better than other variants of SSIM. 
The results are shown in Tables 3.13, where the top three ranked methods are highlighted in bold. Apparently, top three ranked methods are all our proposed approaches, except VSNR in A57, and FSIM in IVC database. We observe that CD-MMF achieves the same performance as CF-MMF but with a smaller number of methods as shown in Table 3.13. In other words, CD-MMF will perform better than CF-MMF when using the same number of methods (see Table 3.13 (c), (d), (e), (f)). The A57 and CSIQ databases are the only exceptions among these six databases. There is no obvious improvement by using CD-MMF in CSIQ database 79 Table 3.13: Performance Comparison among 15 IQA Models in Six databases (a) A57 Database (54 images) (b) CSIQ Database (866 images) IQA Model PCC SROCC RMSE IQA Model PCC SROCC RMSE MS-SSIM 0.8737 0.8578 0.1196 MS-SSIM 0.8666 0.8774 0.1310 SSIM 0.8019 0.8067 0.1469 SSIM 0.8594 0.8755 0.1342 VIF 0.6160 0.6223 0.1936 VIF 0.9253 0.9194 0.0996 VSNR 0.9502 0.9359 0.0766 VSNR 0.8005 0.8108 0.1573 NQM 0.8027 0.7978 0.1466 NQM 0.7422 0.7411 0.1759 PSNR-HVS 0.8832 0.8502 0.1153 PSNR-HVS 0.8231 0.8294 0.1491 IFC 0.4549 0.3187 0.2189 IFC 0.8358 0.7671 0.1441 PSNR 0.6347 0.6189 0.1899 PSNR 0.8001 0.8057 0.1576 FSIM 0.9253 0.9181 0.2458 FSIM 0.9095 0.9242 0.1091 MAD 0.9059 0.9014 0.1041 MAD 0.9502 0.9466 0.0818 IW-SSIM 0.9024 0.8713 0.1059 IW-SSIM 0.9025 0.9212 0.1131 CF-MMF (SFMS) (5 methods) 0.9604 0.9590 0.0685 CF-MMF (SFMS) (3 methods) 0.9797 0.9755 0.0527 CF-MMF (BIRD) (5 methods) 0.9465 0.9475 0.0793 CF-MMF (BIRD) (3 methods) 0.9698 0.9657 0.0641 CD-MMF (SFMS) (3 methods) 0.9411 0.9498 0.0831 CD-MMF (SFMS) (3 methods) 0.9675 0.9668 0.0664 CD-MMF (BIRD) (3 methods) 0.9347 0.9354 0.0874 CD-MMF (BIRD) (3 methods) 0.9630 0.9609 0.0707 (c) IVC Database (185 images) (d) LIVE Database (779 images) IQA Model PCC SROCC RMSE IQA Model PCC SROCC RMSE MS-SSIM 0.9108 0.8971 0.5031 MS-SSIM 0.9402 0.9521 9.3038 SSIM 0.9117 0.9018 0.5007 SSIM 0.9384 0.9479 9.4439 VIF 0.9026 0.8964 0.5244 VIF 0.9597 0.9636 7.6737 VSNR 0.8027 0.7993 0.7265 VSNR 0.9235 0.9279 10.4816 NQM 0.8489 0.8343 0.6440 NQM 0.9128 0.9093 11.1570 PSNR-HVS 0.8648 0.8590 0.6118 PSNR-HVS 0.9134 0.9186 11.1228 IFC 0.9093 0.8993 0.5069 IFC 0.9261 0.9259 10.3052 PSNR 0.7192 0.6885 0.8465 PSNR 0.8701 0.8756 13.4685 FSIM 0.9376 0.9262 0.4236 FSIM 0.9540 0.9634 8.1938 MAD 0.9210 0.9146 0.4748 MAD 0.9672 0.9669 6.9419 IW-SSIM 0.9228 0.9125 0.4693 IW-SSIM 0.9425 0.9567 9.1301 CF-MMF (SFMS) (4 methods) 0.9352 0.9226 0.4313 CF-MMF (SFMS) (3 methods) 0.9734 0.9732 6.2612 CF-MMF (BIRD) (4 methods) 0.9205 0.9096 0.4760 CF-MMF (BIRD) (3 methods) 0.9712 0.9710 6.5131 CD-MMF (SFMS) (3 methods) 0.9453 0.9382 0.3976 CD-MMF (SFMS) (3 methods) 0.9802 0.9805 5.4134 CD-MMF (BIRD) (3 methods) 0.9374 0.9285 0.4244 CD-MMF (BIRD) (3 methods) 0.9801 0.9798 5.4239 (e) TID2008 Database (1700 images) (f) Toyoma Database (168 images) IQA Model PCC SROCC RMSE IQA Model PCC SROCC RMSE MS-SSIM 0.8389 0.8528 0.7303 MS-SSIM 0.8948 0.8911 0.5588 SSIM 0.7715 0.7749 0.8537 SSIM 0.8877 0.8794 0.5762 VIF 0.8055 0.7496 0.7953 VIF 0.9137 0.9077 0.5087 VSNR 0.6820 0.7046 0.9815 VSNR 0.8705 0.8609 0.6160 NQM 0.6103 0.6243 1.0631 NQM 0.8893 0.8871 0.5724 PSNR-HVS 0.5977 0.5943 1.0759 PSNR-HVS 0.7884 0.7817 0.7700 IFC 0.7186 0.5707 0.9332 IFC 0.8404 0.8355 0.6784 PSNR 0.5355 0.5245 1.1333 PSNR 0.6355 0.6133 0.9663 FSIM 0.8710 0.8805 0.6592 FSIM 0.9077 0.9059 0.5253 MAD 0.8306 0.8340 0.7474 MAD 0.9406 0.9362 0.4248 IW-SSIM 0.8488 0.8559 0.7094 IW-SSIM 0.9244 0.9203 0.4774 CF-MMF (SFMS) (6 
methods) 0.9525 0.9487 0.4087 CF-MMF (SFMS) (5 methods) 0.9477 0.9419 0.3995 CF-MMF (BIRD) (6 methods) 0.9482 0.9434 0.4261 CF-MMF (BIRD) (5 methods) 0.9451 0.9402 0.4091 CD-MMF (SFMS) (4 methods) 0.9538 0.9463 0.4032 CD-MMF (SFMS) (3 methods) 0.9456 0.9411 0.4071 CD-MMF (BIRD) (4 methods) 0.9476 0.9422 0.4289 CD-MMF (BIRD) (3 methods) 0.9462 0.9421 0.4051 as shown in Table 3.13 (b). We think that it may be caused by the lower context classication rate (see Table 3.5). In addition, the performance of CD-MMF is lower than CF-MMF in A57 database (Table 3.13 (a)). This probably results from the small number of training images (less than 10 for each context) we can use in A57 after the context classication. In order to see the trend better, we plot the corresponding bar chart of the PCC for the TID2008 database in Fig. 3.4. Clearly, CD-MMF ranks the 80 Figure 3.4: Comparison of the PCC measure of 15 IQA models in the TID2008 database. rst (with the highest PCC, SROCC and the smallest RMSE), and CF-MMF the second among all 15 IQA models in comparison. The scatter plots of predicted objective scores vs. MOS for all the IQA models along with the best tting logistic functions are shown in Fig. 3.5 to Fig. 3.19. Each point on the plot represents one image in the database, the horizontal axis corresponds to the (scaled) IQA score for that image and the vertical axis corresponds to the subjective MOS for that image. The scatter plots conrm the results shown in Table 3.13 (e). Obviously, CF-MMF and CD-MMF have the best correlation with human judgments (i.e., MOS). 3.6.6 Cross-Database Evaluation To test the generality of our proposed approach, we perform the cross-database evalua- tion since MMF is a learning-oriented method [76]. As we know, each database contains 81 Figure 3.5: Scatter plots and best tting logistic function of objective IQA (MS-SSIM) scores vs. MOS for the TID2008 database. Figure 3.6: Scatter plots and best tting logistic function of objective IQA (SSIM) scores vs. MOS for the TID2008 database. 82 Figure 3.7: Scatter plots and best tting logistic function of objective IQA (VIF) scores vs. MOS for the TID2008 database. Figure 3.8: Scatter plots and best tting logistic function of objective IQA (VSNR) scores vs. MOS for the TID2008 database. 83 Figure 3.9: Scatter plots and best tting logistic function of objective IQA (NQM) scores vs. MOS for the TID2008 database. Figure 3.10: Scatter plots and best tting logistic function of objective IQA (PSNR-HVS) scores vs. MOS for the TID2008 database. 84 Figure 3.11: Scatter plots and best tting logistic function of objective IQA (IFC) scores vs. MOS for the TID2008 database. Figure 3.12: Scatter plots and best tting logistic function of objective IQA (PSNR) scores vs. MOS for the TID2008 database. 85 Figure 3.13: Scatter plots and best tting logistic function of objective IQA (FSIM) scores vs. MOS for the TID2008 database. Figure 3.14: Scatter plots and best tting logistic function of objective IQA (MAD) scores vs. MOS for the TID2008 database. 86 Figure 3.15: Scatter plots and best tting logistic function of objective IQA (IW-SSIM) scores vs. MOS for the TID2008 database. Figure 3.16: Scatter plots and best tting logistic function of objective IQA (CF-MMF (SFMS)) scores vs. MOS for the TID2008 database. 87 Figure 3.17: Scatter plots and best tting logistic function of objective IQA (CF-MMF (BIRD)) scores vs. MOS for the TID2008 database. Figure 3.18: Scatter plots and best tting logistic function of objective IQA (CD-MMF (SFMS)) scores vs. 
MOS for the TID2008 database. 88 Figure 3.19: Scatter plots and best tting logistic function of objective IQA (CD-MMF (BIRD)) scores vs. MOS for the TID2008 database. dierent image distortion types. In order to have a fair performance evaluation, we divide the images into several distortion groups (i.e., contexts) according to Table 3.4. There are four major distortion groups across all databases, namely: 1. Additive Noise: Context I in A57, CSIQ, LIVE and TID2008. 2. Blur: Context II in A57, CSIQ, IVC, LIVE and TID2008. 3. JPEG+JPEG2000: Context III in A57, CSIQ, IVC, LIVE, TID2008, and context I in Toyoma. 4. Transmission Error: Context IV in LIVE and TID2008. We use all the images from one distortion group in one database for the training, and test the images from the same distortion group in the other remaining databases. For example, we build a trained model (for additive noise) from LIVE database. Then, we use this trained model to do the testing on A57, CSIQ, and TID2008. Here, we use LIVE 89 Table 3.14: Cross-Database PCC Performance of MMF (in Terms of Distortion Groups) Test Database Distortion group |||| A57 CSIQ IVC LIVE TID2008 Toyoma Model Additive Noise LIVE 0.9827 0.9662 - - 0.9195 - TID2008 0.9782 0.9586 - 0.9863 - - Blur LIVE 0.8754 0.9467 0.9841 - 0.9255 - TID2008 0.9449 0.9652 0.9841 0.9455 - - JPEG+JPEG2000 LIVE 0.9725 0.9687 0.9191 - 0.9755 0.9361 TID2008 0.9289 0.9685 0.9055 0.9710 - 0.8651 Transmission Error) LIVE - - - - 0.8898 - TID2008 - - - 0.9607 - - Table 3.15: Cross-Database PCC Performance of MMF (in Terms of Whole Database) Test Database |||| A57 CSIQ IVC LIVE TID2008 Toyoma Model LIVE 0.8839 0.9710 0.9300 - 0.8984 0.9374 TID2008 0.9454 0.9127 0.9277 0.9644 - 0.9132 and TID2008 as our training databases since they cover most distortion groups and also have larger sizes of training images. Moreover, we only choose 3 methods for fusion in each case to reduce the complexity of MMF. The performance results are summarized in Table 3.14. For additive noise distortions, we see that using LIVE database as the trained model would give us slightly better performance than TID2008 in A57 and CSIQ. The perfor- mance with TID2008 is acceptable, but not as good as that with these aforementioned two databases. This is probably because TID2008 contains 7 kinds of additive noises, and LIVE only contains 1 type of additive noise. For blur distortions, the TID2008 trained model has better prediction performance than LIVE in A57 and CSIQ. Regarding JPEG and JPEG2000 distortions, the LIVE trained model predicts more accurately than the TID2008 trained one in A57, IVC and Toyoma. For transmission errors, the TID2008 trained model can provide a better performance for LIVE than the LIVE trained model can provide for TID2008. Moreover, we use all the images from one database for the training, and test the images from the other remaining databases. For instance, we build a trained model from LIVE database. 
Then, we use this trained model to do the testing on A57, CSIQ, 90 Table 3.16: n-Fold Cross-Validation Performance of CF-MMF with SFMS in TID2008 Database n 2 3 4 5 PCC 0.9501 0.9515 0.9531 0.9525 SROCC 0.9465 0.9477 0.9497 0.9487 RMSE 0.4184 0.4130 0.4063 0.4087 Table 3.17: Performance Comparison among Fused IQA Methods Selection Algorithms for CF-MMF in LIVE Database Algorithm Fused IQA methods PCC SROCC RMSE Exhaustive Search m10, m3, m6 0.9734 0.9732 6.2612 SFMS m10, m3, m6 0.9734 0.9732 6.2612 BIRD m10, m1, m3 0.9712 0.9710 6.5131 Random Selection 1 m1, m3, m5 0.9541 0.9556 8.1821 Random Selection 2 m2, m4, m6 0.9437 0.9473 9.0346 IVC, TID2008 and Toyoma.. Here, we also use LIVE and TID2008 as our training databases since they cover more distortion groups and also have larger sizes of training images. Similarly, we only choose 3 IQA methods for fusion in each case to reduce the complexity of MMF. The performance results are summarized in Table 3.15. We can observe the PCC for TID2008 is below 0.9 since only 3 IQA methods are fused for this case. However, the PCC is over 0.91 for most cases. This further veries the generality and robustness of our proposed system. 3.6.7 Further Discussion First, we want to investigate how the performance changes when we use dierent selec- tions of n for n-fold cross-validation. We perform n-fold (n = 2-5) cross-validation in TID2008 database. Table 3.16 shows the performance results. There is little dierence for the performance among dierentn values. As long as there are enough training data, the results will be similar for dierent choices of n values. Second, we would like to compare the performances among several fused IQA meth- ods selection algorithms, which include exhaustive search, SFMS, BIRD, and random selection. The results are listed in Table 3.17. Obviously, both SFMS and BIRD have better performance than the random selection of fused IQA methods. Particularly, SFMS has the same performance as the exhaustive 91 Table 3.18: Performance Comparison among Dierent Objective Functions J(Y N ) for CF-MMF (with SFMS) in LIVE Database J(Y N ) Fused IQA methods PCC SROCC RMSE (3.23) m10, m3, m6 0.9734 0.9732 6.2612 (3.24) m10, m3, m6 0.9734 0.9732 6.2612 (3.25) m10, m3, m6 0.9734 0.9732 6.2612 Table 3.19: Performance Comparison in TID2008 Database Model IQA methods Weights PCC SROCC RMSE Linear combination [66] m9, m3 [0.95, 0.05] 0.8751 0.8820 0.6495 [0.85, 0.15] 0.8752 0.8753 0.6491 [0.5, 0.5] 0.8476 0.8196 0.7120 [0.15, 0.85] 0.8197 0.7681 0.7687 [0.05, 0.95] 0.8106 0.7554 0.7859 SVR m9, m3 [6.2830, 7.5598] 0.9053 0.8953 0.5700 search of fused IQA methods since they all select the same methods for fusion. How- ever, the computation complexity is much lower for SFMS and BIRD comparing to the exhaustive search (see N = 3 case in Table 3.6). Hence, we can conclude that a good fused IQA methods selection algorithm (e.g., SFMS or BIRD) is indeed necessary for both performance improvement and complexity reduction. Next, we want to test if dierent objective functions have dierent impacts on the performance. Three objective functions (3.23), (3.24), and (3.25) are applied separately on the fused IQA methods selection algorithm. From Table 3.18, we know that they all lead to choose the same methods for fusion. This result is quite reasonable since PCC becomes higher when SROCC value increases in most of the cases. In the meantime, the RMSE value also decreases accordingly. That's why we obtain the same performance no matter which one is applied. 
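To make the selection procedure concrete, the following Python sketch illustrates the greedy idea behind SFMS-style selection under an objective such as maximizing PCC. It is a simplified illustration rather than the exact implementation used in this work; the score matrix `scores` (one column per candidate IQA method, already scaled) and the subjective scores `mos` are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

def sfms_select(scores, mos, n_select=3, n_folds=5):
    """Greedy forward selection of IQA methods for SVR-based fusion.

    scores : (n_images, n_methods) array of candidate IQA scores,
             linearly scaled to [0, 1] beforehand.
    mos    : (n_images,) array of subjective scores (MOS or DMOS).
    """
    remaining = list(range(scores.shape[1]))
    selected = []
    best_pcc = -np.inf
    while len(selected) < n_select:
        step_best_pcc, step_best_m = -np.inf, None
        for m in remaining:
            cols = selected + [m]
            # cross-validated fusion of the candidate subset
            pred = cross_val_predict(SVR(kernel='rbf'),
                                     scores[:, cols], mos, cv=n_folds)
            pcc, _ = pearsonr(pred, mos)
            if pcc > step_best_pcc:          # objective J(Y_N): maximize PCC
                step_best_pcc, step_best_m = pcc, m
        selected.append(step_best_m)
        remaining.remove(step_best_m)
        best_pcc = step_best_pcc
    return selected, best_pcc
```

Because each greedy step only evaluates the remaining candidates instead of every subset, the cost grows roughly linearly with the number of candidates per added method. This is why SFMS and BIRD are much cheaper than exhaustive search while, as Table 3.17 shows, SFMS can still land on the same three fused methods.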
In Table 3.19 we compare the performance between IQA methods combined by linear weights and SVR. We try several linear combinations of weights mentioned in [66] to combine m9 and m3. The best PCC result is 0.8751, which is still lower than 0.9053, obtained by SVR. The comparison in Table 3.19 clearly shows the advantage of SVR over linear combination. Here, we only choose two methods for fusion in SVR to have a fair comparison with linear combination approach. 92 Table 3.20: Comparison of Computation Time among 15 IQA Models in CSIQ Database IQA Model Time (sec/image) MS-SSIM 0.21 SSIM 0.10 VIF 3.22 VSNR 0.63 NQM 0.49 PSNR-HVS 4.95 IFC 2.00 PSNR 0.07 FSIM 0.79 MAD 53.57 IW-SSIM 0.94 CF-MMF (SFMS) (3 methods fused) 61.74 CF-MMF (BIRD) (3 methods fused) 55.78 CD-MMF (SFMS) (3 methods fused) 64.40 CD-MMF (BIRD) (3 methods fused) 59.63 Finally, the computational complexity among 15 IQA models is compared in Table 3.20. The measurement is in terms of the computation time required to eval- uate an image of size 512512 by using a computer with Intel core i7 processor @1.73 GHz. It can be seen that 10 of the other 11 IQA models require less than 5 seconds nishing the assessment, except MAD. MAD still needs about 54 seconds to complete the job. Since MAD is also one of the IQA methods used in CF-MMF and CD-MMF, the computation time for both MMF-based approaches is denitely greater than 54 seconds. As we can see in Table 3.20, the computation time for MMF methods is approximately 1 minute, while CD-MMF also requires 3 to 4 seconds more than CF-MMF for the assess- ment. However, it would be 5 to 6 seconds shorter when we select methods by using BIRD instead of SFMS. 3.7 Conclusion and Future Work As far as we know, there is no single existing image quality index gives the best perfor- mance in all situations. Thus, in this work, we have proposed an open, inclusive frame- work for better performance with the current level of technology and for easy extension when new technology emerges. To be more specic, we have presented a multi-method fusion (MMF) approach for image quality assessment and proposed two MMF-based 93 quality indices, based upon machine learning. It was shown by experiments with six dierent databases (totally 3752 images) that both of them outperform state-of-the-art quality indices by a signicant margin. As expected, the complexity of the MMF method is higher since it involves the calculation of multiple methods. However, with the help of the algorithms (SFMS and BIRD), we can reduce the number of fused methods and lower the complexity of MMF. As long as we can keep the number of fused IQA methods not more than three, the computational time of MMF is around 1 minute per image. In most cases, we only need 3 methods for the fusion to achieve satisfactory performance (i.e., PCC over 0.93). Even in the most dicult (in terms of correlation) TID2008 database, MMF can also achieve 0.9307 and 0.9517 on PCC (Tables 3.7 and 3.10) by using 3 methods for CF-MMF and CD-MMF, respectively. The performance of MMF outperforms other well-known image quality indices. Another advantage of the proposed MMF methodology is its exibility in including new methods. For example, in this work, we have added two new methods, which are FSIM [148] and MAD [63], into the candidate method set to replace poor-performed methods (VIFP [117] and UQI [131]) that we used in [75]. It turns out that the PCC performance of CD-MMF improves from 0.9438 to 0.9538 for TID2008 database. 
This is a good demonstration of the forward inclusiveness of the proposed methodology.

Although the proposed MMF achieves excellent performance, one issue concerning context classification for CD-MMF remains to be resolved in future work. Since one image may contain multiple distortion types, forcing each image into a single context may assign it to the wrong context category and thereby degrade the subsequent quality prediction. One possible remedy is to use unsupervised classification for context determination. Another alternative is to attach beliefs to the context classification and weight the corresponding regressed quality predictions accordingly (as in [88]). Both approaches should help to further improve the overall performance of MMF.

Finally, we performed tests across databases. The results are quite promising, which also demonstrates the generality of the proposed MMF approach.

Chapter 4: A ParaBoost Method to Image Quality Assessment

4.1 Introduction

For decades, the Peak Signal-to-Noise Ratio (PSNR) has been the most widely known measure of image or video quality. Although researchers have long questioned the credibility of PSNR (and its relative, the MSE) [51], [132], it is still widely used today, mainly because it is easy to compute. Wang et al. [133] proposed the Structural SIMilarity (SSIM) index to measure image quality in 2004, and it has attracted much attention due to its simplicity and good performance. With the increasing demand for image quality assurance and assessment, more and more databases have been made publicly available in recent years, such as LIVE [9], TID2008 [16], CSIQ [2], and TID2013 [15], to facilitate the development of image quality metrics. SSIM and its variant IW-SSIM [135] work well across databases, and the feature similarity (FSIM) index [148] outperforms SSIM on several of them. Recently, the learning-based approach has emerged as a strong competitor in the IQA field, since it is difficult to predict visual quality under various distortion types and rich image contents using a single formula [76, 79]. Examples of learning-based IQA methods can be found in [83, 121, 122, 91, 88, 75, 78, 59], among many others. Simply speaking, they extract features from images and use a machine learning approach to obtain a score prediction model (called a scorer), which is then used to predict the perceived quality of test images. However, there exist many distortion types, and it is difficult to find a single prediction model that covers all of them. Liu et al. [78] proposed a fusion approach called MMF that fuses the scores of several IQA methods into a new score using a machine learning method. These IQA methods include PSNR, SSIM, FSIM, etc. They are called strong (or universal) scorers since they are not designed for specific distortion types. Intuitively, the fusion of strong scorers results in an even stronger scorer, so it is not surprising that MMF outperforms each individual scorer in the ensemble. In this work, we again adopt an ensemble approach for full-reference image quality assessment, but examine the fusion of scores from a larger number of weak scorers. To derive weak scorers, we extract features from existing image quality metrics and train them to form basic image quality scorers (BIQSs).
Next, we select additional features that are useful in characterizing specific distortion types and train them to construct auxiliary image quality scorers (AIQSs). Since BIQSs and AIQSs are only trained on small image subsets with certain distortion types, they are weak scorers with respect to the wide variety of distortions in an IQA database. Finally, we propose a ParaBoost scheme that fuses BIQSs and AIQSs into an ensemble system able to cope with a wide range of distortion types. The main advantage of the ParaBoost method is that we can design an IQS tailored to a specific distortion type and add it to the ensemble, so the resulting IQA scoring system can easily be extended to images with new distortion types. Extensive experiments are conducted to demonstrate the superior performance of the ParaBoost method. Experimental results show that it outperforms existing IQA methods by a significant margin. Specifically, the Spearman rank order correlation coefficients (SROCCs) of the ParaBoost method on the LIVE, CSIQ, TID2008 and TID2013 image quality databases are 0.98, 0.97, 0.98 and 0.96, respectively.

The rest of this chapter is organized as follows. We give a brief review of recent learning-based IQA methods in Section 4.2. BIQSs and AIQSs are presented in Section 4.3. The ParaBoost method is described in Section 4.4, and the process of selecting a suitable subset of scorers is discussed in Section 4.5. Experimental results are reported in Section 4.6, where we conduct extensive performance comparisons on four image quality databases. Finally, concluding remarks are given in Section 4.7.

4.2 Review of Previous Work

The machine learning methodology has been applied to image quality evaluation. Narwaria and Lin [91] used the singular value decomposition (SVD) to quantify the major structural information in images and then adopted support vector regression (SVR) to learn complex data patterns and map the detected features to image quality scores. The results are better than those obtained by formula-based methods.

Moorthy and Bovik [88] also developed a learning-based IQA method. In the first stage, the algorithm estimates the presence of five distortions in an image (JPEG, JPEG2000, white noise, Gaussian blur, and fast fading); the probability of each distortion is obtained via classification. In the second stage, image quality is computed for each of these distortions. The final image quality is the probability-weighted sum of the scores from the second stage. The index can be applied to images containing multiple distortion types.

Liu et al. [75, 78] proposed a multi-method fusion (MMF) approach to image quality assessment. It is motivated by the observation that no single method gives the best performance in all situations. A regression approach is used to combine the scores of multiple IQA methods. First, a large number of image samples are collected, each of which has a score labeled by human observers as well as scores from different IQA methods. The MMF score is then obtained as a nonlinear combination of the scores computed by multiple methods (including SSIM [133], FSIM [148], etc.), with suitable weights obtained through training. To further improve the predicted scores, distorted images are classified into five groups based on distortion types and regression is performed within each group, which is called the "context-dependent MMF" (CD-MMF).
So far, MMF 98 oers one of the best IQA results in several popular databases such as LIVE, CSIQ and TID 2008. A block-based MMF (BMMF) [59] method was also proposed for image quality assess- ment. First, an image is decomposed into small blocks. Blocks are then classied into three types (smooth, edge, and texture) while distortions are classied into ve groups. Finally, one proper IQA metric is selected for each block based on the block type and the distortion group. Pooling over all blocks leads to the nal quality score of a test image. It oers competitive performance with the MMF for the TID2008 database. As compared with previous work, the ParaBoost method proposed in this work has several unique characteristics. It fuses scores from a set of weak IQSs, where each IQS can be designed to predict the quality of some specic image distortion types. The proposed ParaBoost system can perform well in situations where individual IQS cannot perform well. An IQS bank structure is adopted to optimize the overall performance of the IQA system. The structure of the ensemble system is modular so that we can add or discard an IQS easily depending on the application need. Each IQS is built by training images on dierent distortion types to increase the diversity among all scorers which can help optimize the ensemble performance. No need for distortion classication stage which can degrade the overall perfor- mance when the classication rate is low. 4.3 Feature Exraction for Image Quality Scorers In this work, we use the term "image quality scorer (IQS)", to denote a method that can give a quality score to an image. Two IQS types are examined in this section: basic IQSs (BIQSs) and auxiliary IQSs (AIQSs). BIQSs are derived from well-known IQA metrics 99 while AIQSs are designed to tailor to specic distortion types. In this section, we focus on feature extraction for BIQSs and AIQSs. 4.3.1 Features for Basic Image Quality Scorers (BIQSs) We derive features for BIQSs from several well-known image quality metrics by decom- posing their contributing factors. To be more specic, we choose three components (i.e., luminance (L), contrast (C), and structure (S)) of SSIM [133] as features of the rst three BIQSs, respectively. Furthermore, we extract two other components; namely, phase congruency (PC) and gradient magnitude (GM)), from FSIM [148]) and use them as features for the 4th and 5th scorers since PC and GM have totally dierent charac- teristics with L, C, and S. Finally, PSNR is selected as the 6th basic scorer because of its simplicity and superior capability in predicting quality of images with additive noise [75, 78]. These 6 BIQSs are detailed below. The features of the rst three BIQSs are the similarity measures of luminance (L), contrast (C), structure (S) between reference and distorted images, respectively. Suppose x represents both image patches extracted from the same spatial location of reference and distorted images, and r (x), d (x), 2 r (x), 2 d (x), rd (x) are means, variances, the covariance of x from the reference and distorted images, respectively. 
When there are N such image patches for the whole image I, the luminance similarity measure between the two images is selected as the feature for BIQS #1: BIQS #1 :S L = 1 N X x2I 2 r (x) d (x) +C 1 2 r (x) + 2 d (x) +C 1 : (4.1) The contrast similarity measure is selected as the feature for BIQS #2: BIQS #2 :S C = 1 N X x2I 2 r (x) d (x) +C 2 2 r (x) + 2 d (x) +C 2 : (4.2) 100 The structure similarity measure is selected as the feature for BIQS #3: BIQS #3 :S S = 1 N X x2I rd (x) +C 3 r (x) d (x) +C 3 : (4.3) Constants in (4.1)-(4.3) are C 1 = (K 1 D) 2 ;C 2 = (K 2 D) 2 ;C 3 = C 2 =2;K 1 = 0:01;K 2 = 0:03, and D is the dynamic range of pixel values (i.e., D = 255 for the 8-bit pixel representation). The 4th and 5th BIQSs measure the similarity of phase congruency (PC) and gradient magnitude (GM) between reference and distorted images. Assume x represents both image patches extracted from the same spatial location of reference and distorted images, and PC r (x), PC d (x), GM r (x), and GM d (x) are PCs and GMs of x from the reference and the distorted images, respectively. For an image I with N image patches, the PC similarity measure between the two images is selected as the feature for BIQS #4: BIQS #4 :S PC = 1 N X x2I 2PC r (x)PC d (x) +T 1 PC 2 r (x) +PC 2 d (x) +T 1 ; (4.4) and the GM similarity between two images is selected as the feature for BIQS #5 BIQS #5 :S GM = 1 N X x2I 2GM r (x)GM d (x) +T 2 GM 2 r (x) +GM 2 d (x) +T 2 ; (4.5) where T 1 and T 2 are positive constants which are added to avoid instability of S PC and S GM . The 6th BIQS is based on the PSNR value, which is related to the mean-squared error (MSE). For two images I r and I d , of size XY , the MSE can be computed via MSE = 1 XY X x X y [I r (x;y)I d (x;y)] 2 : (4.6) 101 Then, the PSNR value in decibels is used as the feature for BIQS #6 BIQS #6 :PSNR = 10 log D 2 MSE ; (4.7) where D is the maximum value that a pixel can take (e.g., 255 for 8-bit images). 4.3.2 Features of Auxiliary Image Quality Scorers (AIQSs) There are 17 image distortion types in the TID2008 database [16, 108], which are listed in Table 4.1 for easy reference. In Table 4.2, we summarize the Spearman rank order correlation coecient (SROCC) between objective and subjective scores for the perfor- mance of each BIQS with respect to all distortion types. The higher the SROCC is, the better match between these two scores. As indicated in Table 4.2, each BIQS has its respective advantage in predicting image quality scores for certain distortion types. For example, BIQS #5 can predict the image quality quite well for distortion types 8-15, and 17 while BIQS #6 has the best performance for distortion types 1-7 among six BIQSs. As we can see in Table 4.2, several distortion types (e.g., types 14, 16, and 17) cannot be handled well even with all six BIQSs. Hence, we need to nd more features to design new scorers to boost the performance. These scorers are called AIQSs since they are designed to support BIQSs in addressing specic distortion types. The feature of the 1st AIQS is the zero-crossing (ZC) rate [137], which is dened as z h (i;j) = 8 < : 1 ZC happens at d h (i;j) 0 otherwise; (4.8) where d h (i;j) = x(i;j + 1)x(i;j);j 2 [1;N 1], is the dierence signal along the horizontal line, and x(i;j);i 2 [1;M];j 2 [1;N] for an image of size MN. 
The horizontal ZC rate can be written as Z h = 1 M(N 2) M X i=1 N2 X j=1 z h (i;j): (4.9) 102 Table 4.1: Image Distortion Types in TID2008 Database Type Type of distortion 1 Additive Gaussian Noise 2 Dierent additive noise in color components 3 Spatially correlated noise 4 Masked noise 5 High frequency noise 6 Impulse noise 7 Quantization noise 8 Gaussian blur 9 Image denoising 10 JPEG compression 11 JPEG2000 compression 12 JPEG transmission errors 13 JPEG2000 transmission errors 14 Non-eccentricity pattern noise 15 Local block-wise distortions of dierent intensity 16 Mean shift (intensity shift) 17 Contrast change Table 4.2: SROCC Performance of BIQSs vs. Distortion Types in TID2008 Database Distortion type BIQS #1 (S L ) BIQS #2 (S C ) BIQS #3 (S S ) BIQS #4 (S PC ) BIQS #5 (S GM ) BIQS #6 (PSNR) 1 0.0710 0.7052 0.7958 0.4896 0.8737 0.8917 2 0.0520 0.5340 0.7688 0.0431 0.8202 0.8814 3 0.0735 0.7304 0.8112 0.6090 0.8581 0.9089 4 -0.1183 0.0534 0.6877 0.0504 0.2824 0.8274 5 0.1334 0.7885 0.8796 0.4593 0.9036 0.9182 6 0.1077 0.4458 0.6942 0.3724 0.7113 0.9041 7 0.5365 0.8079 0.7788 0.7197 0.8347 0.8615 8 0.1384 0.9265 0.9575 0.8115 0.9460 0.8668 9 0.5729 0.9483 0.9424 0.8310 0.9552 0.9340 10 0.2217 0.8688 0.9097 0.7747 0.9405 0.8945 11 0.1187 0.9451 0.9520 0.9255 0.9720 0.8033 12 -0.0244 0.7597 0.8360 0.5758 0.8570 0.7142 13 0.1143 0.7543 0.8444 0.7117 0.8695 0.7687 14 0.1445 0.4260 0.6587 0.5335 0.6902 0.5551 15 0.0449 0.7749 0.8264 0.6534 0.8365 0.5456 16 0.6778 0.1590 0.0651 0.0443 0.2910 0.6750 17 -0.0509 0.6604 0.0316 0.0509 0.7658 0.6097 We can calculate the vertical component Z v in a similar fashion. Finally, the overall ZC rate is selected as the feature for AIQS #1 as shown below: AIQS #1 :ZC = Z h +Z v 2 : (4.10) AIQS #1 is particularly useful in evaluating distortion type 5 (high frequency noise) and distortion type 11 (JPEG 2000 compression). 103 Figure 4.1: Spatial relationship of pixel of interest in GLCM. The feature of the 2nd AIQS is derived from the gray-level co-occurrence matrix (GLCM), which is also known as the gray-tone spatial-dependence matrix [54]. The GLCM characterizes the texture of an image by calculating how often a pixel with intensity (gray-level) valuel occurs in a specic spatial relationship to a pixel with value m. Here, we are interested in two spatial relationships as shown in Fig. 4.1. Each element at (l;m) in the resultant GLCM is simply the sum of frequencies for one pixel with value l and its neighbor pixel satisfying the desired spatial relationship with value m. Then, the contrast dierence feature of GLCM is selected as the feature of AIQS #2 as given by AIQS #2 : X l;m jlmj 2 p(l;m); (4.11) wherep(l;m) is the joint probability for occurrence of pixel pairs having gray level values l and m with a dened spatial relationship (say, Fig. 4.1) in the image. In essence, the 2nd AIQS in (4.11) returns a measure of the intensity contrast between a pixel and its horizontal and vertical neighbors over the whole image. AIQS #2 is also useful for distortion type 11 (JPEG 2000 compression). 
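Both features are simple to compute. The sketch below is a minimal illustration of AIQS #1 and AIQS #2 for an 8-bit grayscale image, using scikit-image for the GLCM (the functions are named greycomatrix/greycoprops in older scikit-image releases); the exact settings used in the experiments may differ.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def zero_crossing_rate(img):
    """Overall ZC rate (AIQS #1): average of horizontal and vertical rates."""
    x = img.astype(np.float64)
    dh = np.diff(x, axis=1)                     # horizontal difference signal d_h
    dv = np.diff(x, axis=0)                     # vertical difference signal d_v
    zh = np.mean(dh[:, :-1] * dh[:, 1:] < 0)    # sign change -> zero crossing
    zv = np.mean(dv[:-1, :] * dv[1:, :] < 0)
    return 0.5 * (zh + zv)

def glcm_contrast(img):
    """GLCM contrast (AIQS #2): sum of |l - m|^2 p(l, m) over the two offsets of Fig. 4.1."""
    glcm = graycomatrix(img, distances=[1],
                        angles=[0, np.pi / 2],  # horizontal and vertical neighbors
                        levels=256, symmetric=True, normed=True)
    return graycoprops(glcm, 'contrast').mean()
```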
The features of the 3rd AIQS are derived from the rotation-invariant and uniform local binary pattern (LBP) operator [97], which is in form of LBP riu2 P;R = 8 < : P P1 p=0 s (g p g c ) if U(LBP P;R ) 2 P + 1 otherwise; (4.12) 104 where U(LBP P:R ) =js (g p1 g c )s (g 0 g c )j + P1 X p=1 js (g p g c )s (g p1 g c )j; (4.13) and s (g p g c ) = 8 < : 1 if g p g c 0 otherwise: (4.14) withg c corresponding to the gray value of the center pixel of the local neighborhood and g p (p = 0;:::;P 1) corresponding to the gray values of P equally spaced pixels on a circle of radiusR (R> 0) that form a circularly symmetric neighbor set. The superscript riu2 stands for the use of rotation-invariant "uniform" patterns that have a U value of at most 2. Then the features of AIQS #3 can be written as AIQS #3 : p jnhist r (b)nhist d (b)j; (4.15) whereb = 0; 1;:::;P + 1 represents the bin of the histogram, and nhist r ,nhist d denote the normalized histograms of (4.12) for reference and distorted images, respectively. As shown in Fig. 4.2, we choose P = 8 and R = 1 for simplicity of the LBP operator. Therefore, we have 10 (i.e., P + 2) values to represent the scorer in (4.15). If a larger P value is chosen, then the complexity of LBP operator will be higher because the implementation needs a lookup table of 2 P elements [97]. AIQS #3 is useful in evaluating distortion types 6-11. The features of the 4th and the 5th AIQSs are used to characterize the edge structure of an image. An image is rst divided into M non-overlapping 16 16 patches, and the Sobel edge operator [110, 53] is used in each patch to generate horizontal gradient g h and vertical gradient g v . Then, we can obtain the edge magnitude and edge orientation via q g 2 v +g 2 h and tan 1 gv g h , respectively. 105 Figure 4.2: Circularly symmetric neighbor set in LBP. Suppose that h 1;i (b) and h 2;i (b) represent the n-bin histograms of edge magnitude from thei-th image patch of the reference and distorted images, respectively, and we can compute the root-mean-squared error (RMSE) between histogramsh 1;i (b) andh 2;i (b) by RMSE mag;i = v u u t 1 n n X b=1 [h 1;i (b)h 2;i (b)] 2 ; i = 1;:::;M; (4.16) where b denotes the bin of the histogram, and n is the number of bins (n = 10 in this work for low computation). Similarly,h 3;i (b) andh 4;i (b) represent the histograms of edge orientation from thei-th image patch of the reference and distorted images, respectively, and then the RMSE between histogramsh 3;i (b) andh 4;i (b) can be computed in a similar way as (4.16): RMSE ori;i = v u u t 1 n n X b=1 [h 3;i (b)h 4;i (b)] 2 ; i = 1;:::;M: (4.17) Finally, features of the 4th and 5th AIQSs are given by: AIQS #4 : [RMSE mag;1 ; ;RMSE mag;M ] T ; (4.18) and AIQS #5 : [RMSE ori;1 ; ;RMSE ori;M ] T : (4.19) 106 Figure 4.3: Extracting features of the 4th and 5th AIQSs. We show the procedure of extracting features for the 4th and the 5th AIQSs in Fig. 4.3. These two AIQSs are useful in evaluating distortion type 14. Features of other AIQSs are also derived from a local region. 
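Before turning to those block-based features, the patch-wise edge-histogram comparison behind AIQS #4 and AIQS #5 can be sketched as follows. This is a simplified illustration (Sobel gradients via SciPy, fixed histogram ranges, 16x16 patches and 10 bins as in the text), not the exact implementation.

```python
import numpy as np
from scipy.ndimage import sobel

def _edge_maps(img):
    g = img.astype(np.float64)
    gh, gv = sobel(g, axis=1), sobel(g, axis=0)     # horizontal / vertical gradients
    return np.hypot(gv, gh), np.arctan2(gv, gh)     # edge magnitude, edge orientation

def edge_hist_rmse(ref, dist, patch=16, bins=10):
    """Per-patch RMSE between edge histograms of reference and distorted images:
    AIQS #4 uses edge magnitude, AIQS #5 uses edge orientation."""
    mag_r, ori_r = _edge_maps(ref)
    mag_d, ori_d = _edge_maps(dist)
    mag_range = (0.0, max(mag_r.max(), mag_d.max()))
    ori_range = (-np.pi, np.pi)
    rmse_mag, rmse_ori = [], []
    H, W = ref.shape
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            sl = (slice(i, i + patch), slice(j, j + patch))
            h1, _ = np.histogram(mag_r[sl], bins=bins, range=mag_range)
            h2, _ = np.histogram(mag_d[sl], bins=bins, range=mag_range)
            h3, _ = np.histogram(ori_r[sl], bins=bins, range=ori_range)
            h4, _ = np.histogram(ori_d[sl], bins=bins, range=ori_range)
            rmse_mag.append(np.sqrt(np.mean((h1 - h2) ** 2.0)))
            rmse_ori.append(np.sqrt(np.mean((h3 - h4) ** 2.0)))
    return np.array(rmse_mag), np.array(rmse_ori)   # feature vectors of AIQS #4 and #5
```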
We divide an image into N non-overlapping blocks of size 8 8 and compute three quantities, known as the mean-squared error (MSE), the mean dierence ratio (MDR), and the contrast dierence ratio (CDR), respectively, for each block: MSE j = 1 64 8 X x=1 8 X y=1 [IB r (x;y)IB d (x;y)] 2 ; j = 1;:::;N; (4.20) MDR j = m d m r m r ; j = 1;:::;N; (4.21) CDR j = c d c r c r ; j = 1;:::;N; (4.22) where m d = 1 64 P 8 x=1 P 8 y=1 [IB d (x;y)], m r = 1 64 P 8 x=1 P 8 y=1 [IB r (x;y)], c d = maxfIB d (x;y)g minfIB d (x;y)g, c r = maxfIB r (x;y)g minfIB r (x;y)g, and where IB d andIB r represent the 8 8 image blocks of the distorted and the reference images, 107 respectively. Then, features of the 6th to the 8th AIQSs are the 10-bin histograms of (4.20), (4.21) and (4.22): AIQS #6 :Histogram 10bin (MSE j jj = 1;:::;N); (4.23) AIQS #7 :Histogram 10bin (MDR j jj = 1;:::;N); (4.24) AIQS #8 :Histogram 10bin (CDR j jj = 1;:::;N): (4.25) AIQS #6 and #8 are designed to address distortion types 15 and 17, respectively, while AIQS #7 can be used to boost the performance of BIQSs with respect to distortion type 16. In above, AIQS #1 is designed to evaluate global distortions while AIQSs #2-8 are used to evaluate local distortions. We need more AIQSs to evaluate global distortions. For the 9th AIQS, we consider the mean absolute dierence (mAD) between reference (I r ) and distorted (I d ) images. For reference and distorted images of size MN, we use three components of the YIQ color space simultaneously to account for dierences in dierent color dimensions [110, 53]. The mAD in Y, I, and Q components can be described below: mAD i = 1 MN M X m=1 N X n=1 jI r;i (m;n)I d;i (m;n)j; i =Y;I;Q: (4.26) Then, the feature vector of the 9th AIQS is formed by concatenating the three compo- nents of mAD in (4.26): AIQS #9 : [mAD Y ;mAD I ;mAD Q ] T ; (4.27) AIQS #9 is a powerful scorer for distortion types 1-13. Besides, it can be used to boost the performance of BIQSs with respect to distortion type 14. 108 The mean shift distortion (i.e., distortion type 16 in TID2008) for color images is one of the most challenging distortion types for BIQSs as shown in Table 4.2. We design two more AIQSs (i.e. AIQS #10 and AIQS #11) to boost the overall performance. The 10th AIQS measures the histogram range (or the dynamic range (DR)) of distorted images in three components of YIQ, respectively. We use the 256-bin histograms for Y, I, and Q components of distorted image I d , respectively, as shown in Fig. 4.4. The DR of Y, I, and Q components can be written as DR i (I d ) =b 1;i b 0;i ; i =Y;I;Q; (4.28) where b 0;i and b 1;i are, respectively, the rst and the last bins in the histogram with values signicantly larger than zero. Then, the feature vector of the 10th AIQS is AIQS #10 : [DR Y (I d );DR I (I d );DR Q (I d )] T : (4.29) Finally, the feature vector of AIQS #11 is a 15-element vector consisting of the global mean shift and the DR in YIQ, YCbCr and RGB color spaces. It is in form of AIQS #11 : 2 6 6 6 6 4 Y Y (Ir ) ; I I (Ir ) ; Q Q (Ir ) ; R R (Ir ) ; G G (Ir ) ; B B (Ir ) ; DR Y (I r ); DR I (I r ); DR Q (I r ); DR R (I r ); DR G (I r ); DR B (I r ); Y DR Y (Ir ) ; Cb DR Cb (Ir ) ; Cr DR Cr (Ir ) 3 7 7 7 7 5 T (4.30) ,where i = i (I d ) i (I r ), i =Y;I;Q;R;G;B;Cb;Cr, and i (I d ) and i (I r ) denote the global mean of distorted and reference images, respectively, on color component i, andDR i (I r ) has the same denition as that in (4.28) except that it is applied to reference image I r instead of distorted image I d . 
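A minimal sketch of the block-based features of AIQS #6-#8 is given below. The small constant `eps` is an added safeguard against division by zero in flat reference blocks and is not part of the original definitions in (4.20)-(4.22).

```python
import numpy as np

def block_ratio_histograms(ref, dist, block=8, bins=10, eps=1e-8):
    """10-bin histograms of per-block MSE, mean-difference ratio (MDR) and
    contrast-difference ratio (CDR), i.e. the features of AIQS #6-#8."""
    H, W = ref.shape
    mse, mdr, cdr = [], [], []
    for i in range(0, H - block + 1, block):
        for j in range(0, W - block + 1, block):
            br = ref[i:i + block, j:j + block].astype(np.float64)
            bd = dist[i:i + block, j:j + block].astype(np.float64)
            mse.append(np.mean((br - bd) ** 2))                 # (4.20)
            mdr.append((bd.mean() - br.mean()) / (br.mean() + eps))   # (4.21)
            cr = br.max() - br.min()                            # reference contrast
            cd = bd.max() - bd.min()                            # distorted contrast
            cdr.append((cd - cr) / (cr + eps))                  # (4.22)
    hist = lambda v: np.histogram(v, bins=bins)[0].astype(np.float64) / len(v)
    return hist(mse), hist(mdr), hist(cdr)   # AIQS #6, #7, #8 feature vectors
```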
The eleven AIQSs as described in this section are designed to complement BIQSs to account for some distortion types that are dicult to be assessed. We will elaborate this point in the next section. 109 Figure 4.4: Extracting features of the 10th AIQS. 4.4 IQS Evaluation, Training and ParaBoost 4.4.1 Contribution Evaluation of IQSs We list the individual performance of all BIQSs and AIQSs for the TID2008 Database in Table 4.3. Most BIQSs (except for BIQS #1) has better performance than AIQSs. AIQSs are not suitable to work alone because of their poor performance (with SROCC < 0:5). However, they can be used together with BIQSs to boost the overall performance of the entire IQA system. To demonstrate this point, we show the SROCC performance of eleven AIQSs for seventeen distortion types in the TID2008 Database [16] in Table 4.4. We are particularly interested in the use of AIQSs to boost the performance of BIQSs in evaluating images with distortion types 14 (i.e. non-eccentricity pattern noise), 16 (mean shift) and 17 (contrast change). Apparently, AIQS #4 has superior performance against distortion type 14. Also, as compared with Table 4.2, AIQS #8 and AIQS #11 can signicantly improve the 110 Table 4.3: Performance of Each IQS in TID2008 Database IQS PCC SROCC RMSE BIQS #1 0.1444 0.4748 1.3279 BIQS #2 0.7556 0.7798 0.8790 BIQS #3 0.7766 0.7683 0.8454 BIQS #4 0.8126 0.8016 0.7821 BIQS #5 0.8246 0.8353 0.7591 BIQS #6 0.5422 0.5477 1.1276 AIQS #1 0.5057 0.2844 1.1577 AIQS #2 0.4006 0.1626 1.2295 AIQS #3 0.3579 0.3187 1.2530 AIQS #4 0.4394 0.3479 1.2054 AIQS #5 0.5189 0.4886 1.1471 AIQS #6 0.2649 0.2268 1.2940 AIQS #7 0.1578 0.2906 1.3251 AIQS #8 0.4026 0.2534 1.2284 AIQS #9 0.3451 0.2679 1.2595 AIQS #10 0.1714 0.4536 1.3221 AIQS #11 0.0936 0.0705 1.3360 Table 4.4: SROCC Performance of AIQSs vs. Distortion Types in TID2008 Database Distortion type AIQS #1 AIQS #2 AIQS #3 AIQS #4 AIQS #5 AIQS #6 AIQS #7 AIQS #8 AIQS #9 AIQS #10 AIQS #11 1 0.7184 0.3652 0.5019 0.5695 0.4783 0.2539 0.2774 0.3088 0.8870 0.5694 0.2305 2 0.7570 0.5334 0.4340 0.4031 0.4140 -0.1069 0.3155 0.5132 0.8806 0.5764 -0.0042 3 0.3884 0.2313 0.4500 0.2318 0.5161 -0.0138 0.5135 0.0687 0.9073 0.5844 -0.0440 4 0.7084 0.1674 0.6409 0.4535 0.4843 0.2118 -0.0293 0.1851 0.8209 0.6795 -0.0187 5 0.8437 0.7557 0.7080 0.2265 0.4811 0.5020 0.6710 0.3127 0.9182 0.8115 0.2220 6 0.3992 0.5363 0.8767 -0.0353 0.7002 0.8531 0.4903 0.2363 0.8809 0.6653 0.3722 7 0.5954 -0.0935 0.6679 -0.0243 0.0436 0.6057 0.4506 0.6224 0.8593 0.8419 -0.0456 8 0.6956 0.3625 0.8617 0.4958 0.4580 0.3183 0.4996 0.3001 0.8506 0.2191 -0.0504 9 0.4301 0.5636 0.7815 0.1094 0.1301 0.3947 0.6497 0.7247 0.9164 0.7819 0.5321 10 0.7422 0.2103 0.7936 0.3010 0.5044 0.0540 -0.0292 0.7255 0.9234 0.3789 0.1443 11 0.8874 0.7989 0.8728 0.2133 0.1438 0.2869 0.7203 0.7061 0.8697 -0.1221 0.1997 12 -0.0890 0.4201 0.7974 0.4382 0.6974 0.4743 0.5328 0.5672 0.8247 0.3857 0.1152 13 0.3790 0.1668 0.5022 0.5380 0.4248 0.3105 0.1709 0.3213 0.8756 0.1237 -0.1000 14 0.3041 -0.1553 0.7011 0.8737 0.7489 0.0324 0.4853 0.6194 0.5992 -0.0290 -0.0197 15 -0.1225 0.4097 0.7440 0.2972 0.7534 0.8208 0.4836 0.2471 0.7695 0.3776 0.0458 16 0.0942 0.3180 0.3763 0.3345 0.2740 0.2873 0.4294 0.2864 0.6785 0.4506 0.7207 17 0.0215 0.4746 0.7040 0.2956 0.1644 0.1861 0.7749 0.8324 0.5826 0.8259 0.7426 correlation performance for distortion type 17 and 16, respectively. We will give a brief discussion on why these AIQSs work well on these special distortion types below. 
The non-eccentricity pattern noise can be dierentiated more easily when an image is transformed into its edge map as shown in Fig. 4.5, where the non-eccentricity pattern noise occurs in the logo region of the hats and signicant dierences can be observed by comparing Figs. 4.5 (c) and (d). As a result, the dierence between the distorted and the reference images can be captured by comparing the histogram dierence of their gradient magnitudes as done in feature extraction of AIQS #4. The mean shift dierence in images can be quantied via features of AIQS #11, which considers the global mean shift 111 (a) Image (Reference) (b) Image (non-eccentricity pattern noise) (c) Sobel edge map of (a) (d) Sobel edge map of (b) Figure 4.5: The original and distorted hat images and their corresponding Sobel edge maps. and the dynamic range of histograms in several color spaces. The contrast change can be described by computing the contrast dierence between the distorted and reference images as done in deriving the features of AIQS #8. We show the use of AIQSs to boost the performance of BIQSs distortion types 14, 16, and 17 in Table 4.5. We see that the inclusion of AIQS #8 can boost the SROCC performance by over 0.05 on distortion type 17. Furthermore, by including three more AIQSs (#4, 6, 9) and 4 more AIQSs (#3, 7, 10, 11), we are able to boost the performance for distortion types 14, and 16, respectively. 112 Table 4.5: SROCC Performance Boosting by Considering Both BIQS and AIQS in TID2008 Database Distortion type All BIQSs All BIQSs + AIQS #4,6,9 All BIQSs + AIQS #3,7,10,11 All BIQSs + AIQS #8 14 0.9182 0.9345 - - 16 0.9446 - 0.9477 - 17 0.8412 - - 0.8978 Table 4.6: IQS Categorization IQS Global BIQS #6 features AIQS #1,9,10,11 Local BIQS #1,2,3,4,5 features AIQS #2,3,4,5,6,7,8 Generally speaking, we can classify IQSs into two categories based on their feature types as given in Table 4.6; i.e., global features that grasp viewers' quality impression of the whole image and local features that capture ne details in local regions. A good IQA system should contain both of them. 4.4.2 Training of BIQS and AIQS Models We train a model for each BIQS with proper image subsets as shown in Table 4.7. This strategy oers some advantages, including the saving of training time, increased diversity among BIQSs [106], and the enhanced performance for a small number of distortion types since each BIQS is trained for the distortion types where it is supposed to perform well. Similarly, we train a model for each AIQS with only one distortion type as shown in Table 4.8 since each AIQS is designed to target at one dicult distortion type only. For example, we have low SROCC performance for distortion types 14 and 16 as shown in Table 4.2. To boost the performance, we add 3 AIQSs (#4, 6, 9) for distortion type 14 and 4 AIQSs (#3, 7, 10, 11) for distortion type 16. Thus, the training time can be saved and each AIQS can target at one distortion type, becoming an expert for one specic type of image distortion. Moreover, several scorers can cooperate and conquer the extremely dicult ones. 113 Table 4.7: Training Image Sets (Distortion Types) for BIQSs BIQS # 1 2 3 4 5 6 Training image distortion type 16 8-11 8-15 8-11 8-15,17 1-7 Table 4.8: Training Image Sets (Distortion Types) for AIQSs AIQS # 1 2 3 4 5 6 7 8 9 10 11 Training image distortion type 13 15 16 14 12 14 16 17 14 16 16 4.4.3 ParaBoost The nal IQA system, including both basic and auxiliary IQSs, is shown in Fig. 4.6. 
We call it a ParaBoost system since AIQSs are used to boost the performance of BIQSs in a parallel conguration. In general, we consider a ParaBoost system consisting ofn BIQSs andr AIQSs. Given m training images, we can obtain the quality score of the i-th training image for each individual IQS, which is denoted by s i;j , where i = 1; 2;:::;m, and j = 1; 2;:::;n +r. Then, the quality score of the ParaBoost system can be modeled as PB(s i ) = w T '(s i ) +b; (4.31) where s i = (s i;1 ;:::;s i;n+r ) T is the quality score vector for the i-th image, w = (w 1 ;:::;w n+r ) T is the weighting vector, and b is the bias. In the training stage, we determine weight vector w and biasb from the training data that minimize the dierence between PB(s i ) and the (dierential) mean opinion score ((D)MOS i ) obtained by human observers; namely, min w;b kPB(s i ) (D)MOS i k 1 ; i = 1;:::;m; (4.32) wherek:k 1 denotes the l 1 norm. To solve this problem, we demand that the maximum absolute dierence in (4.32) is bounded by a certain level ", and adopt the support vector regression (SVR) [120] for 114 Figure 4.6: The ParaBoost IQA system. its solution. We choose the radial basis function (RBF) as the kernel function in SVR. A linear kernel is also tested, yet its performance is not as good. One explanation is that quality score vector s i andMOS i (orDMOS i ) are not linearly correlated. For this reason, we only show results with the nonlinear RBF kernel for the rest of this work. In the test stage, we dene the quality score vector s k of the k-th test image, where k = 1; 2;:::;l and where l denotes the number of test images, and (4.31) is used to determine the quality score of the ParaBoost method, PB(s k ). In all experiments, we use the 5-fold cross-validation, which is widely used in machine learning [42, 30], to select our training and testing sets. First, we divide the image set into 5 non-overlapping sets. One set is used for test while the remaining 4 sets are used for training. We rotate this assignment 5 times so that each set is only used as the test set once. The test results from the 5 folds are then averaged to compute the overall correlation coecients and the error. This procedure can test if over-tting occurs. 115 Before applying SVR [34], we linearly scale the scores obtained from each IQS to the same range [0, 1] for normalization. The linear scaling process is conducted on both training and test data [56] via y = x min(X) max(X) min(X) ; (4.33) where y is the scaled score, x is the raw score, and max(X), and min(X) specify the maximum and minimum values of the score X, respectively. 4.5 Scorer Selection in ParaBoost IQA System To reach a balance between accurate quality evaluation and low computational complex- ity of the ParaBoost IQA system, it is desirable to develop a process that can add IQSs gradually and systematically. In this section, we propose two scorer selection methods using statistical testing [52] that can select one IQS at one time to add to the ParaBoost system. The rst method is based on the one-way analysis of variance (1-way ANOVA) and the paired t-test, which is a parametric test method. The procedure is described below. Method 1: 1-way ANOVA and the paired t test 1. Divide the N scores obtained by each IQS into m groups, where each score group has n i (i = 1;:::;m) corresponding images and N = P m i=1 n i . 2. 
Compute the F values for each IQS via Compute the mean of each group s 1 ; s 2 ; ; s m s i = 1 n i n i X j=1 s ij ;i = 1;:::;m; where s ij represents the score of the j-th image of the i-th group. 116 Compute the sum of squared deviations for each group SS 1 ;SS 2 ; ;SS m SS i = n i X j=1 (s ij s i ) 2 ;i = 1;:::;m: Compute the within groups sum of squares SS within = m X i=1 SS i : Compute the within group variance 2 within = SS within DF within =MS within ; where DF within =Nm: Compute the between groups sum of squares SS between = m X i=1 n i s 2 i 1 N m X i=1 n i s i ! 2 : Compute the between groups variance 2 between = SS between DF between =MS between ; where DF between =m 1: Then, the F statistic value is F = 2 between 2 within : (4.34) 3. Rank IQSs in the descending order of their F statistic values, and the ParaBoost system can select them one by one according to this order to achieve the best tradeo between performance and complexity. 4. Check if there is a signicant dierence between two consecutive IQSs using the paired t test. If the answer is yes, we proceed normally as described in Step 3. 117 Otherwise, we will skip the 2nd IQS in the pair and proceed to the next one in the ordered list for possible inclusion in the ParaBoost system. The second selection method is to use the Kruskal-Wallis (K-W) statistics and the Wilcoxon signed-rank test, which is a nonparametric statistical test method and can be applied when the score data are not normally distributed. It is described below. Method 2: K-W statistic and the Wilcoxon signed-rank test 1. Divide theN scores obtained by each IQS intom groups and each score group has n i (i = 1;:::;m) corresponding images, where N = P m i=1 n i . 2. Rank all scores in the ascending order regardless of their group. 3. Compute the Kruskal-Wallis statistic value H for each IQS using following equa- tions: Compute the average rank for score group i via R i = 1 n i n i X j=1 R ij ; i = 1;:::;m; where R ij represents the rank of the j-th image of the i-th group. Conpute the average rank for all images via R = 1 + 2 + +N N = N + 1 2 : Then, the Kruskal-Wallis statistic value can be determined via H = 12 N(N + 1) m X i=1 n i R i R 2 : (4.35) 4. Rank IQSs in the descending order by H values and then select them one by one for inclusion in the ParaBoost IQA system according to this order. 118 Table 4.9: 1-Way ANOVA and K-W Statistic for Each IQS in TID2008 Database IQS 1-way ANOVA K-W statistic F value Rank H value Rank BIQS #1 2.4957 17 417.2 7 BIQS #2 133.5516 4 1101.8 3 BIQS #3 176.0039 3 1009.4 4 BIQS #4 263.0011 2 1116.7 2 BIQS #5 270.1121 1 1209.5 1 BIQS #6 40.7945 5 488.2 5 AIQS #1 26.8595 8 303.8 8 AIQS #2 8.8806 13 147.6 12 AIQS #3 15.4905 10 224.4 11 AIQS #4 26.8884 7 267.9 9 AIQS #5 27.5269 6 474.6 6 AIQS #6 10.8617 12 141.4 13 AIQS #7 6.4441 14 90.9 15 AIQS #8 12.9817 11 119.7 14 AIQS #9 19.4745 9 259.5 10 AIQS #10 2.8555 16 48.9 16 AIQS #11 3.2488 15 31.3 17 5. If there is no signicant dierence between two successive IQSs using the Wilcoxon signed-rank test, the 2nd IQS will be skipped and we will proceed to the next one in the ordered list. In the two above methods, ifF >F crit (orH >H crit ), we reject the null hypothesis: H 0 :fNo signicant dierence on the IQS among score groupsg with probability P < , where is the signicance level. Typically, is set to 0.05 or 0.01. As a result, we select scorers with higher F or H values rst to have higher discriminating power. 
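Both statistics are available directly in SciPy, so ranking the scorers can be sketched as follows. How the images are assigned to the m score groups is taken as given here, and SciPy's Kruskal-Wallis implementation applies a tie correction on top of (4.35).

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

def rank_scorers(scorer_outputs, group_ids):
    """Rank IQSs by their discriminating power across score groups.

    scorer_outputs : dict mapping scorer name -> (N,) array of scores
    group_ids      : (N,) array assigning each image to one of m groups
    Returns two lists of (name, statistic), ranked as in Table 4.9.
    """
    groups = np.unique(group_ids)
    f_rank, h_rank = [], []
    for name, s in scorer_outputs.items():
        samples = [s[group_ids == g] for g in groups]
        F, _ = f_oneway(*samples)     # Method 1: 1-way ANOVA F statistic, Eq. (4.34)
        H, _ = kruskal(*samples)      # Method 2: Kruskal-Wallis H statistic, Eq. (4.35)
        f_rank.append((name, F))
        h_rank.append((name, H))
    key = lambda t: -t[1]
    return sorted(f_rank, key=key), sorted(h_rank, key=key)
```

The paired t test and the Wilcoxon signed-rank test used in the final step of each method correspond to scipy.stats.ttest_rel and scipy.stats.wilcoxon, applied to consecutive scorers in the ranked lists.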
TheF andH values and the corresponding rank of each scorer are listed in Table 4.9. We see that there are some dierences for the order of the selected IQSs between these two methods. For example, the 7th IQS is AIQS #4 by Method 1, but BIQS #1 by Method 2. This is the point from which the performance of the ParaBoost system decided by Method 1 is always better than that decided by Method 2 with the same number of IQSs as shown in Table 4.10. Also, as shown in Table 4.10, we only need 16 IQSs to achieve the best performance using Method 1. However, 17 IQSs are needed to achieve the same performance if we 119 Table 4.10: Performance of Selected IQSs by Methods 1 and 2 in TID2008 Database Method 1 (1-way ANOVA + paired t test) Method 2 (K-W statistic + Wilcoxon signed-rank test) Selected IQS PCC SROCC RMSE Selected IQS PCC SROCC RMSE BIQS #5 0.8246 0.8353 0.7591 BIQS #5 0.8246 0.8353 0.7591 BIQS #5,4 0.8685 0.8678 0.6651 BIQS #5,4 0.8685 0.8678 0.6651 BIQS #5,4,3 0.8728 0.8705 0.6549 BIQS #5,4,2 0.8749 0.8718 0.6499 BIQS #5,4,3,2 0.8788 0.8752 0.6404 BIQS #5,4,2,3 0.8788 0.8752 0.6404 BIQS #5,4,3,2,6 0.9022 0.8995 0.5787 BIQS #5,4,2,3,6 0.9022 0.8995 0.5787 BIQS #5,4,3,2,6 AIQS #5 0.9038 0.9012 0.5742 BIQS #5,4,2,3,6 AIQS #5 0.9038 0.9012 0.5742 BIQS #5,4,3,2,6 AIQS #5,4 0.9117 0.9120 0.5513 BIQS #5,4,2,3,6,1 AIQS #5 0.9109 0.9085 0.5538 BIQS #5,4,3,2,6 AIQS #5,4,1 0.9352 0.9316 0.4752 BIQS #5,4,2,3,6,1 AIQS #5,1 0.9320 0.9254 0.4865 BIQS #5,4,3,2,6 AIQS #5,4,1,9 0.9545 0.9544 0.4003 BIQS #5,4,2,3,6,1 AIQS #5,1,4 0.9412 0.9384 0.4535 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3 0.9584 0.9580 0.3832 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9 0.9555 0.9554 0.3958 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8 0.9667 0.9658 0.3436 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3 0.9593 0.9591 0.3791 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6 0.9698 0.9690 0.3271 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2 0.9642 0.9646 0.3560 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6,2 0.9714 0.9721 0.3187 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6 0.9683 0.9695 0.3354 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6,2,7 0.9728 0.9730 0.3107 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6,8 0.9713 0.9719 0.3193 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6,2,7,11 0.9747 0.9752 0.2998 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6,8,7 0.9726 0.9726 0.3119 BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6,2,7,11,10 0.9767 0.9772 0.2879 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6,8,7,10 0.9748 0.9747 0.2993 BIQS #5,4,3,2,6,1 AIQS #5,4,1,9,3,8,6,2,7,11,10 0.9767 0.9772 0.2880 BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6,8,7,10,11 0.9767 0.9772 0.2880 adopt Method 2. Thus, Method 1 is a better choice for TID2008 database, which is probably due to the fact that most of the scorer outputs follow the normal distribution. To verify this conjecture, we calculate the kurtosis of the score distribution for each IQS, and summarize their values in Table 4.11. Based on [22], the scores of each IQS are considered to be normally distributed if the kurtosis value is between 2 and 4. Although BIQS #1, AIQS #8 and AIQS #11 have kurtosis values slightly greater than 4, they are still close to the normal distribution. We see from Table 4.11 that 13 out of 17 IQSs are normally distributed, which explains the better performance of Method 1 for TID2008 database. 
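This kurtosis check is easy to reproduce. The sketch below assumes the same rule of thumb from [22] and uses the Pearson (non-excess) kurtosis, which is why `fisher=False` is passed to SciPy.

```python
from scipy.stats import kurtosis

def approx_normal(scores, low=2.0, high=4.0):
    """Kurtosis test behind Table 4.11: a scorer's outputs are treated as
    approximately normally distributed if the kurtosis lies in (low, high)."""
    k = kurtosis(scores, fisher=False)   # fisher=False -> a normal distribution has k = 3
    return k, low < k < high
```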
120 Table 4.11: Normal Distribution Test for Each IQS in TID2008 Database IQS Kurtosis Normal distribution BIQS #1 4.0066 Yes BIQS #2 5.4900 No BIQS #3 3.8265 Yes BIQS #4 7.3002 No BIQS #5 2.9064 Yes BIQS #6 3.1276 Yes AIQS #1 2.6999 Yes AIQS #2 2.5784 Yes AIQS #3 2.2457 Yes AIQS #4 20.8384 No AIQS #5 7.0692 No AIQS #6 2.7488 Yes AIQS #7 2.4628 Yes AIQS #8 4.2552 Yes AIQS #9 2.5750 Yes AIQS #10 3.1408 Yes AIQS #11 4.1181 Yes 4.6 Experimental Results 4.6.1 Image Quality Databases We evaluate the performance of the proposed ParaBoost method with three commonly used image quality databases (namely; TID2008, LIVE and CSIQ) as well as a new database known as TID2013. They are brie y described below. The Tampere Image Database (TID2008) [16, 108] includes 25 reference images, 17 distortion types for each reference image, and 4 levels for each distortion type. The database contains 1,700 distortion images, and the MOS provided in this database ranges from 0 to 9. The LIVE Image Quality Database [9] has 29 reference images and 779 test images, consisting of ve distortion types (JPEG2000, JPEG, white noise in the RGB compo- nents, Gaussian blur, and transmission errors in the JPEG2000 bit stream using a fast- fading Rayleigh channel model). The subjective quality scores provided in this database are DMOS, ranging from 0 to 100. The Categorical Image Quality (CSIQ) Database [2] contains 30 reference images, and each image contains 6 distortion types (JPEG compression, JPEG2000 compression, global contrast decrements, additive Gaussian white noise, additive Gaussian pink noise, 121 and Gaussian blurring) at 4 to 5 dierent levels, resulting in 866 distorted images. The score ratings (from 0 to 1) are reported in DMOS. Besides the 17 distortion types in TID2008, the Tampere Image Database 2013 (TID2013) [15, 107] introduces seven new distortion types. They are: change of color saturation (#18), multiplicative Gaussian noise (#19), comfort noise (#20), lossy com- pression of noisy images (#21), image color quantization with dither (#22), chromatic aberrations (#23), sparse sampling and reconstruction (#24). Consequently, TID2013 has the richest diversity, consisting of 25 reference images, 24 distortion types for each reference, and 5 levels for each distortion type. The database contains 3,000 distorted images with their subjective scores evaluated in MOS ranging from 0 to 9. 4.6.2 Performance Measures for IQA Methods We use three indices to measure the performance of IQA methods [127, 128]. The rst one is the Pearson correlation coecient (PCC) between the objective and the subjective scores. It is used to evaluate prediction accuracy. The second one is the Spearman rank order correlation coecient (SROCC) between the objective and the subjective scores. It is used to evaluate prediction monotonicity. The third one is the root-mean-squared error (RMSE) between the objective and the subjective scores. In order to compute PCC and SROCC, we use the following monotonic logistic function [127] and the procedure described in [127] to t the objective scores to the subjective quality scores (MOS or DMOS) f(x) = 1 2 1 + exp ( x 3 j 4 j ) + 2 ; (4.36) wherex is the predicted objective score,f(x) is the tted objective score, and parameters j , j = 1; 2; 3; 4, are chosen to minimize the least-squares error between the subjective score and the tted objective score. Initial estimates of the parameters are chosen based on the recommendation in [127]. 
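The fitting step can be carried out with a standard nonlinear least-squares routine. The sketch below assumes the four-parameter form of (4.36) and uses a simplified initial guess rather than the exact initialization recommended in [127]; SROCC is computed on the raw objective scores since a monotonic mapping does not change ranks.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(x, b1, b2, b3, b4):
    # f(x) = (b1 - b2) / (1 + exp(-(x - b3) / |b4|)) + b2, cf. Eq. (4.36)
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4))) + b2

def evaluate_metric(obj, mos):
    """Fit the monotonic logistic mapping, then report PCC, SROCC and RMSE."""
    p0 = [mos.max(), mos.min(), obj.mean(), max(obj.std(), 1e-6)]  # rough initial guess
    params, _ = curve_fit(logistic4, obj, mos, p0=p0, maxfev=10000)
    fitted = logistic4(obj, *params)
    pcc = pearsonr(fitted, mos)[0]
    srocc = spearmanr(obj, mos)[0]
    rmse = np.sqrt(np.mean((fitted - mos) ** 2))
    return pcc, srocc, rmse
```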
122 Table 4.12: Performance of Each IQS in the Other Three Databases IQS LIVE CSIQ TID2013 PCC SROCC RMSE PCC SROCC RMSE PCC SROCC RMSE BIQS #1 0.8327 0.8406 15.1277 0.4340 0.4229 0.2365 -0.5674 -0.6757 1.2397 BIQS #2 0.8929 0.8973 12.3046 0.8146 0.8369 0.1523 0.7570 0.6952 0.8100 BIQS #3 0.9363 0.9385 9.5967 0.8609 0.7809 0.1336 0.7355 0.6622 0.8399 BIQS #4 0.8839 0.8944 12.7795 0.8143 0.7657 0.1524 0.7263 0.5492 0.8521 BIQS #5 0.9503 0.9551 8.5092 0.9071 0.9232 0.1105 0.8117 0.7533 0.7240 BIQS #6 0.8700 0.8726 13.4697 0.7777 0.7850 0.1650 0.6377 0.6229 0.9549 AIQS #1 0.7353 0.7494 18.5185 0.4461 0.2006 0.2350 -0.0229 -0.1257 1.2397 AIQS #2 0.5389 0.4170 23.0150 0.4679 0.3001 0.2320 -0.3387 -0.3870 1.2397 AIQS #3 0.9552 0.9525 8.0867 0.6858 0.5229 0.1911 -0.4936 -0.5373 1.2397 AIQS #4 0.5514 0.5078 22.7924 0.6062 0.5477 0.2088 0.3896 0.2535 1.1417 AIQS #5 0.5945 0.6809 21.9706 0.5985 0.5652 0.2103 0.4255 0.2015 1.1218 AIQS #6 0.1796 0.3194 26.8780 0.3788 0.3264 0.2430 -0.4893 -0.5105 1.2397 AIQS #7 0.4130 0.3628 24.8828 0.1880 0.1771 0.2578 -0.0655 -0.1594 1.2397 AIQS #8 0.5561 0.5202 22.7084 0.3356 0.2933 0.2473 0.0575 0.0645 1.2376 AIQS #9 0.8467 0.8493 14.5389 0.7811 0.7761 0.1639 0.5066 0.3424 1.0688 AIQS #10 0.3894 0.1076 25.1651 0.2259 0.3475 0.2557 -0.4517 -0.5079 1.2397 AIQS #11 0.3452 0.1947 25.6430 0.2923 0.1759 0.2511 -0.3955 -0.4894 1.2397 Table 4.13: Selected IQSs by Methods 1 and 2 in Four Databases Database IQS Selection Method Selected IQSs No. of BIQSs No. of AIQSs TID2008 1-way ANOVA + paired t test BIQS #5,4,3,2,6 AIQS #5,4,1,9,3,8,6,2,7,11,10 5 11 K-W statistic + Wilcoxon signed-rank test BIQS #5,4,2,3,6,1 AIQS #5,1,4,9,3,2,6,8,7,10 6 10 LIVE 1-way ANOVA + paired t test BIQS #5,3,2,6,4,1 AIQS #3,9,5 6 3 K-W statistic + Wilcoxon signed-rank test BIQS #5,3,2,4,6,1 AIQS #3,9,4 6 3 CSIQ 1-way ANOVA + paired t test BIQS #5,3,4,2,6 AIQS #3,5,4,9,2,1,6,10 5 8 K-W statistic + Wilcoxon signed-rank test BIQS #5,2,3,4,6,1 AIQS #9,4,5,3,1,2,6 6 7 TID2013 1-way ANOVA + paired t test BIQS #5,4,2,3,6,1 AIQS #9,5,4,8,7,1,11,2,3,10,6 6 11 K-W statistic + Wilcoxon signed-rank test BIQS #5,2,4,3,6,1 AIQS #4,5,9,8,1,7,2,11,3,10,6 6 11 4.6.3 Performance Comparison To evaluate the performance of each individual IQS, we show three performance indices for both BIQSs and AIQSs against three databases (LIVE, CSIQ, and TID2013) in Table 4.12. Similar to what was shown in Table 4.3, most BIQSs (except BIQS #1) have better correlation performance with MOS (DMOS) and AIQSs do not perform well when working alone. The only exceptions are AIQS #3 and AIQS #9. The former has excellent performance for the LIVE database while the latter performs well for both LIVE and CSIQ. 123 With the two selection methods described in Section 4.5, we show the smallest set of IQSs that achieves the best performance in Table 4.13. As indicated in the table, the numbers of IQSs needed for the ParaBoost system in TID2008 and TID2013 are 16 and 17, and these two databases have over 17 and 24 distortion types, respectively. Due to the large variety of distortion types, they cannot be easily and correctly evaluated using a small number of scorers. On the other hand, we can provide good quality assessment for LIVE and CSIQ with fewer IQSs (9 and 13, respectively) since there are only 5-6 distortion types in these two databases. The ParaBoost performance of the selected IQSs in Table 4.13 is listed in Table 4.14, which includes several benchmarking cases with a dierent combination of BIQSs or AIQSs. 
We observe that the fusion of all IQSs does not oer the best performance while the complexity is the highest. Thus, it is essential to have a scorer selection mechanism. For example, we only need 9 IQSs to give the highest PCC and SROCC for the LIVE database. By comparing Tables 4.14, 4.12 and 4.3, we see that the SROCC gains of the ParaBoost IQA system over the single best-performing IQS with respect to LIVE, CSIQ, TID2008 and TID2013 are 0.03, 0.05 0.14 and 0.20, respectively. The rich diversity of IQSs, special training strategy, the ParaBoost structure and the IQS selection scheme all contribute to the excellent performance of the proposed ParaBoost IQA system. The performance gain is larger if the distortion types in a given database are more diversied. In Table 4.15, we compare the performance of the proposed ParaBoost method with several state-of-the-art image quality metrics such as VSNR [33], VIF [117], SSIM [133], MS-SSIM [139], IW-SSIM [135], FSIM [148], MAD [63], CF-MMF and CD-MMF [75, 78]. The top three IQA models are highlighted in bold. As shown in Table 4.15, the two ParaBoost IQA methods rank the 1st and 2nd in TID2008, LIVE, and TID2013. For CSIQ, they still rank the 2nd and 3rd. Furthermore, the proposed ParaBoost method has an impressive performance on both TID2008 and TID2013. For instance, in TID 2013, the SROCC gains are around 0.16 and 0.04 over the existing best formula-based approach (i.e., FSIM) and learning-based method (i.e., CD-MMF), respectively. 124 Table 4.14: Performance Comparisons for Dierent Combinations of IQSs Database Selected IQSs Total no. of IQSs PCC SROCC RMSE TID2008 All BIQSs 6 0.9093 0.9068 0.5583 All AIQSs 11 0.9331 0.9334 0.4827 All IQSs 17 0.9767 0.9772 0.2880 Table 4.13 (Method 1) 16 0.9767 0.9772 0.2879 Table 4.13 (Method 2) 16 0.9748 0.9747 0.2993 LIVE All BIQSs 6 0.9693 0.9693 6.7202 All AIQSs 11 0.9782 0.9789 5.6802 All IQSs 17 0.9736 0.9757 6.2382 Table 4.13 (Method 1) 9 0.9820 0.9819 5.1556 Table 4.13 (Method 2) 9 0.9835 0.9837 4.9377 CSIQ All BIQSs 6 0.9588 0.9595 0.0746 All AIQSs 11 0.9566 0.9508 0.0765 All IQSs 17 0.9743 0.9710 0.0592 Table 4.13 (Method 1) 13 0.9767 0.9732 0.0564 Table 4.13 (Method 2) 13 0.9763 0.9726 0.0569 TID2013 All BIQSs 6 0.8565 0.8251 0.6398 All AIQSs 11 0.8920 0.8871 0.5604 All IQSs 17 0.9567 0.9575 0.3610 Table 4.13 (Method 1) 17 0.9567 0.9575 0.3610 Table 4.13 (Method 2) 17 0.9567 0.9575 0.3610 Finally, to test the generality of the proposed ParaBoost method, we train the system based on one database and test it on the other three databases. The experiment results are shown in Table 4.16. As shown in Table 4.16, the SROCC values are over 0.95 for most cases except when the system is trained on LIVE or CSIQ but tested on TID2013. This can be explained by the fact that there are 24 distortion types in TID2013, which cannot be well covered by a training process performed on smaller distortion type sets (i.e., 5 and 6 distortion types in LIVE and CSIQ, respectively). We would like to point out that the SROCC values of the resulting ParaBoost IQA system is still greater than 0.95 for 10 out of the 12 cross-database evaluation cases. This shows the robustness of the proposed ParaBoost IQA method. 4.7 Conclusion and Future Work In this work, we have proposed a new concept of IQSs (image quality scorers) consisting of basic and auxiliary IQSs, based upon careful, critical and comprehensive analysis of the existing metrics; and we have then developed a ParaBoost approach for objective image quality assessment. 
Apart from formulation of general IQSs, we have designed 125 Table 4.15: Performance Comparison among 12 IQA Models in Four Databases IQA model TID2008 (1700 images) LIVE (779 images) PCC SROCC RMSE PCC SROCC RMSE PSNR 0.5355 0.5245 1.1333 0.8701 0.8756 13.4685 VSNR 0.6820 0.7046 0.9815 0.9235 0.9279 10.4816 VIF 0.8055 0.7496 0.7953 0.9597 0.9636 7.6737 SSIM 0.7715 0.7749 0.8537 0.9384 0.9479 9.4439 MS-SSIM 0.8389 0.8528 0.7303 0.9402 0.9521 9.3038 IW-SSIM 0.8488 0.8559 0.7094 0.9425 0.9567 9.1301 FSIM 0.8710 0.8805 0.6592 0.9540 0.9634 8.1938 MAD 0.8306 0.8340 0.7474 0.9672 0.9669 6.9419 CF-MMF (3 methods) 0.9307 0.9275 0.4907 0.9734 0.9732 6.2612 CD-MMF (3 methods) 0.9517 0.9439 0.4120 0.9802 0.9805 5.4134 ParaBoost (Method 1) 0.9767 0.9772 0.2879 0.9820 0.9819 5.1556 ParaBoost (Method 2) 0.9748 0.9747 0.2993 0.9835 0.9837 4.9377 IQA model CSIQ (866 images) TID2013 (3000 images) PCC SROCC RMSE PCC SROCC RMSE PSNR 0.8001 0.8057 0.1576 0.6727 0.6394 0.9172 VSNR 0.8005 0.8108 0.1573 - 0.5005 - VIF 0.9253 0.9194 0.0996 0.7711 0.6769 0.7894 SSIM 0.8594 0.8755 0.1342 0.7893 0.7417 0.7612 MS-SSIM 0.8666 0.8774 0.1310 0.8130 0.7698 0.7217 IW-SSIM 0.9025 0.9212 0.1131 0.8296 0.7779 0.6922 FSIM 0.9095 0.9242 0.1091 0.8560 0.8015 0.6408 MAD 0.9502 0.9466 0.0818 0.8221 0.7807 0.7058 CF-MMF (3 methods) 0.9797 0.9755 0.0527 0.9146 0.8973 0.5012 CD-MMF (3 methods) 0.9675 0.9668 0.0664 0.9249 0.9143 0.4714 ParaBoost (Method 1) 0.9767 0.9732 0.0564 0.9567 0.9575 0.3610 ParaBoost (Method 2) 0.9763 0.9726 0.0569 0.9567 0.9575 0.3610 Table 4.16: Cross-Database SROCC Performance of ParaBoost System Test database ||||| TID2008 LIVE CSIQ TID2013 Model TID2008 - 0.9726 0.9650 0.9506 LIVE 0.9597 - 0.9659 0.9313 CSIQ 0.9510 0.9723 - 0.9213 TID2013 0.9703 0.9739 0.9648 - several IQSs to target at some specic image distortion types which are very dicult to be dealt with. Then, dierent training image sets are used to train the devised IQSs to be able to gain high diversity. In addition, we use two statistical testing based methods: 1) 1-way ANOVA + paired t test, and 2) K-W statistic + Wilcoxon signed-rank test, in order to select the optimal combination of IQSs and also be able to achieve the best performance with the minimum number of scorers. In the nal stage, a non-linear SVR score fuser is used to combine the outputs from the selected IQSs instead of the conventional weighting or voting schemes. 126 The experimental results across four well-known public databases (totally 6,345 images) show that the proposed framework outperforms other existing state-of-the-art IQA mod- els (including both formula-based and learning-oriented methods) with clear explanation for its success. The extension of the ParaBoost approach to the problem of video quality assessment (VQA) is under our current investigation. One main challenge along this direction is the lack of large video quality assessment databases. Furthermore, the design of powerful invididual video quality scorers (VQS) is still an open issue. 127 Chapter 5 No-Reference Image Quality Assessment by Ensemble Method 5.1 Introduction With the recent advancement of high technology handheld devices (e.g., digital cameras, camcorders, tablets, and smart phones), people can easily share their lives with pictures through social networks, such as Facebook and Twitter. However, the process of captur- ing, compressing, and publishing images to the website has inevitably caused dierent degrees of degradation to original quality of images. 
To maintain a satisfactory perceptual quality of experience (QoE) for digital images, we first have to determine the distortion level introduced at each stage so that the quality can then be improved. The process of judging the quality (or degree of distortion) of images is called image quality assessment (IQA). Researchers also hope to find an approach that can assess image quality automatically, without human interaction, and in a time-saving manner. Objective IQA has been introduced to meet these needs.

Objective IQA methods can be classified into three categories [44]: full-reference (FR), reduced-reference (RR), and no-reference (NR). An FR IQA method needs a distortion-free image as the reference against which the distorted image is compared. Some well-known FR methods include SSIM [133], FSIM [148], MAD [63], and MMF [75, 78]. They are well established and have been shown to work very well on several databases. If the information of the reference image is only partially available (e.g., in the form of a set of extracted features), the method is called an RR IQA method. If the original image is not available at all for comparison with the distorted one, the approach is called an NR (or blind) IQA method. An NR method does not always perform as well as an FR one because it judges quality solely from the distorted image without any reference. However, it can be used in a wider scope of applications (e.g., multimedia over wired/wireless networks, image/video retargeting, and computer graphics/animation), and its computational requirement is usually lower since there is no need to process the reference. Therefore, more and more researchers have turned to the development of NR IQA methods.

In this work, we present a learning-based approach to NR IQA. Formula-based approaches can only target a very narrow range of distortion types and typically predict visual quality well for only one or two distortion types [76, 79]. Since our goal is a general-purpose NR IQA solution that is robust in predicting quality scores for arbitrary distortion types and image contents, a machine learning approach is a better choice than a single formula. In addition, existing learning-based NR IQA approaches perform well only on some databases [89, 113, 87, 145, 146]. The reason is that these methods cover only one or two types of features, such as distortion-related features and natural scene statistics (NSS) features. Hence, they are effective in predicting scores for databases with fewer distortion types (e.g., LIVE), but fail to estimate the true human opinion scores (i.e., the mean opinion score (MOS)) for databases with more distortion types (e.g., TID2008 and TID2013). To overcome this shortcoming, we propose a method that extracts features from multiple perceptual domains (brightness, contrast, color, distortion, and texture) so as to cover more distortion types and image contents. First, we train each feature to obtain a score prediction model, called a scorer. Then we feed each test image into the scorers to obtain a score from each of them. Finally, an ensemble approach is applied to fuse the scores from the selected scorers. This method is called the multi-perceptual-domain scorer ensemble (MPDSE).

The rest of the chapter is organized as follows. First, the existing NR IQA methods are reviewed in Section 5.2.
Then, we will describe how to extract the features from 129 multiple perceptual domains in Section 5.3. In Sections 5.4, the ensemble method is employed to fuse scores from some of the scorers. Extensive performance comparisons will be conducted and reported in Section 5.5. Finally, concluding remarks will be given in Section 5.6. 5.2 Review of Previous Work The existing NR IQA methods can be classied into two types. One is the formula- based approach, and the other is learning-based approach. Most of the formula-based approaches are distortion specic, which assumes the type of image distortion is known. Usually, they only can target at one to two distortion types. For example, Wang et al. [137] introduced three features to measure blockiness, activity, and blurriness for JPEG compressed images. They are the average dierences across block boundaries, the average absolute dierence between in-block image samples, and the zero-crossing rate. Then three features are combined into a quality assessment model. In addition, Ferzli et al. [45] proposed an objective image sharpness metric, called Just Noticeable Blur Metric (JNBM). They claimed the just noticeable blur (JNB) is a function of local contrast and can be used to derive an edge-based sharpness metric with probability summation model over space. The experiment results showed this method can successfully predict the relative amount of sharpness/blurriness in images, even with dierent scenes. Natural scene statistics (NSS) have been applied to learning-based NR IQA [89, 113, 87]. When images are properly normalized or transformed to another domain (e.g., wavelet or DCT, local descriptors (e.g., wavelet coecients) can be modeled by some probability distribution. Since the shapes of the distributions are dierent for a reference (undistorted) image and its corresponding distorted version, this feature can be used to dierentiate the image quality. Well-known examples are DIIVINE [89], BLIINDS-II [113], and BRISQUE [87]. We will brie y introduce them below. 130 DIIVINE [89] is a two-stage framework, including distortion identication followed by distortion-specic quality assessment. First, the distorted image is decomposed using the wavelet transform and the obtained subband coecients are utilized to extract the statistical features. The features can be mapped to the quality score by a regression model. Then the algorithm estimates the probability of presence of ve distortions (including JPEG, JPEG2000, white noise, Gaussian blur, and fast fading) in an image. In the next stage, image quality score is computed for each of these distortions. Finally, the nal image quality is the probability-weighted sum of scores from each distortion category. The index can be applied to images consisting of multiple distortion types. Later, a NSS-based NR IQA approach has been developed, which was called BLIINDS-II [113]. First, an image is partitioned into equally sized blocks, and then local DCT coecients are computed on each of the blocks. The second stage of this approach applies a generalized Gaussian model to each block of DCT coecients. In the third stage, they derive several parameters based on the generalized Gaussian model. These parameters become the features which can be used to predict quality scores. The nal stage is to employ a Bayesian model to predict the image quality for the image. The most recent NSS approach is proposed by Mittal [87]. 
This model (BRISQUE) does not compute the distortion-specic features, but uses the scene statistics from locally normalized luminance coecients as features to quantify the potential loss of natural- ness in the image because of the presence of the distortions. Moreover, this approach does not need to transform images into another domain and this can greatly lower the computational complexity, making it suitable for real-time applications. Instead of using above hand-craft local descriptors, Ye et al. proposed another type of approaches which are based on feature learning. The rst approach they proposed is called CBIQ [145]. In the rst step, the Gabor lters are used for local feature extraction. Then the visual codebook is created by using a clustering algorithm on Gabor feature vectors from all training images. The third step is to encode features via hard-assignment coding and average pooling. Finally, the codewords are used as the input to a regression 131 model for estimating quality scores. However, the use of Gabor lter based features and a very large codebook (300,000 codewords) makes this approach highly computationally expensive. To improve the rst approach, they developed the second method, called CORNIA [146]. First, the raw-image-patch local descriptors are used, which can be easily com- puted. Second, they use a codebook based approach to learn features automatically. The codebook is constructed by K-means clustering on local features extracted from unla- beled training images. Third, they use soft-assignment coding with maximum pooling for encoding. This process is parameter free and also computationally ecient. In the last stage, support vector regression (SVR) with linear kernel is adopted for quality esti- mation. Comparing to CBIQ, CORNIA does not require labels to construct codebook. It is also able to estimate the quality with more accuracy by using a smaller codebook size (10,000 codewords). Based on above discussions, we are aware that learning-based approach is a better candidate for general purpose NR IQA algorithms since the complicated relationship between quality scores and features for images with dierent distortion levels cannot be easily expressed by a single formula. In addition, NSS-based methods can only be well- suited for assessing natural scene images instead of articial images. This also means that the performance of NSS method is image content dependent. Without the training on some type of image contents, the NSS method will not be robust on estimating quality scores for all kinds of images. The feature learning approach uses a large set of codewords to capture dierent types of distortions. But the performance of this approach drops drastically when only a small set of codewords are used. Therefore, to be able to achieve satisfactory performance, this type of method has to consume a large amount of memories. 132 5.3 Multi-Perceptual-Domain Features As we know, the existing machine learning methods extract features from 1 to 2 domains. In order to have accurate prediction on visual quality for more distortion types and image contents, we extend our features to multiple perceptual domains (i.e., brightness, contrast, color, distortion, and texture). The features for each domain are described in detail as follows. 5.3.1 Brightness Human perceptual quality towards images can be aected by light conditions. 
This can also be corroborated by the fact that the lighting condition has to be kept the same for all participating subjects in an image quality subjective viewing test [21]. In the brightness domain, we extract two features and combine them into one feature vector. The first feature is the arithmetic average brightness

AB_i = \frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} I_i(m,n),   (5.1)

where I_i(m,n), i = 1, 2, 3, 4, are the values of the Y, Y, V, and L* channels of the YIQ, YCbCr, HSV, and L*a*b* color spaces of an image, respectively. The other feature is the logarithmic average brightness, which is defined as

LB_i = \frac{R_i}{MN} \exp\left( \sum_{n=1}^{N} \sum_{m=1}^{M} \log\left( \varepsilon + \frac{I_i(m,n)}{R_i} \right) \right),   (5.2)

where I_i(m,n), i = 1, 2, 3, 4, are the same as above, \varepsilon is a small number that prevents computing \log 0, and R_1 = R_2 = 255, R_3 = R_4 = 100. The difference between the two average brightness features is that the logarithmic average brightness is the conjugate representation of the brightness and of the dynamic range of the brightness [68]. We combine these two features into a feature vector as follows:

f_{brightness} = [AB_1, AB_2, AB_3, AB_4, LB_1, LB_2, LB_3, LB_4].   (5.3)

This feature set is used to train the scorer in the brightness domain.

5.3.2 Contrast

The first feature in the contrast domain is the contrast statistic, which measures the local variations in the gray-level co-occurrence matrix (GLCM), also known as the gray-tone spatial-dependence matrix [54]. The GLCM characterizes the texture of an image by calculating how often a pixel with intensity (gray-level) value l occurs in a specific spatial relationship to a pixel with value m. By default, the spatial relationship is defined between the pixel of interest and the pixel to its right (horizontally adjacent), but here we specify one more spatial relationship, between the pixel of interest and the pixel below it, to account for texture with vertical orientation in addition to horizontal orientation. The spatial relationship is specified in Fig. 5.1. The contrast feature of the GLCM is defined as

GLCM_C = \sum_{l,m} |l - m|^2 \, p(l,m),   (5.4)

where p(l,m) is the joint probability of occurrence of pixel pairs having gray-level values l and m under the defined spatial relationship (i.e., Fig. 5.1) in the image. Essentially, (5.4) returns a measure of the intensity contrast between a pixel and its neighbor over the whole image. Since the specified spatial relationship includes two directions, the GLCM contrast feature also has two values. These two values are concatenated into one feature vector:

f_{contrast \#1} = [GLCM_{C,1}, GLCM_{C,2}].   (5.5)

Figure 5.1: Spatial relationship of the pixel of interest in the GLCM.

The second feature in the contrast domain is designed according to the following procedure. First, we divide the image into N non-overlapping blocks of size 16 x 16. Then we compute two local features (mean and contrast) for each block:

mean = \mu = \frac{1}{256} \sum_{x=1}^{16} \sum_{y=1}^{16} IB(x,y),
contrast = c = \max\{IB(x,y)\} - \min\{IB(x,y)\},

where IB is the 16 x 16 image block. Next, we arrange the contrast and mean of each block as a pair in the following form:

f_{contrast \#2} = [(c,\mu)_1, (c,\mu)_2, \ldots, (c,\mu)_j, \ldots, (c,\mu)_N],   (5.6)

where (c,\mu)_j represents the local contrast and mean computed from the j-th block. We use (5.6) as the second feature in the contrast domain.

5.3.3 Color

Color is also an important factor that can alter personal feelings about visual quality [125, 102]. To address this problem, we propose two features related to the color information.
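To make the preceding definitions concrete, the brightness features (5.1)-(5.3) and the GLCM contrast feature (5.4)-(5.5) can be sketched as follows. The sketch assumes 8-bit RGB input and uses OpenCV and scikit-image for the color conversions and the co-occurrence matrix; the channel scalings, the small constant eps, and the helper names brightness_features and glcm_contrast are illustrative assumptions rather than the exact implementation used in this chapter.

import numpy as np
import cv2
from skimage.feature import graycomatrix, graycoprops   # spelled "greycomatrix" in older scikit-image

def brightness_features(rgb, eps=1e-6):
    # Luminance-like channels: Y of YIQ, Y of YCbCr, V of HSV, L* of L*a*b*.
    rgbf = rgb.astype(np.float64)
    y_yiq = 0.299 * rgbf[..., 0] + 0.587 * rgbf[..., 1] + 0.114 * rgbf[..., 2]
    bgr = cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR)
    y_ycc = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)[..., 0].astype(np.float64)
    v_hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[..., 2].astype(np.float64) * 100.0 / 255.0
    l_lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)[..., 0].astype(np.float64) * 100.0 / 255.0
    chans = [(y_yiq, 255.0), (y_ycc, 255.0), (v_hsv, 100.0), (l_lab, 100.0)]

    ab = [c.mean() for c, _ in chans]                                    # eq. (5.1)
    # Log-average reading of eq. (5.2): the 1/(MN) averaging is applied
    # inside the exponential here (an assumption about the printed formula).
    lb = [r * np.exp(np.log(eps + c / r).mean()) for c, r in chans]
    return np.array(ab + lb)                                             # eq. (5.3)

def glcm_contrast(gray):
    # Two spatial relationships as in Fig. 5.1: horizontal and vertical neighbor pairs.
    glcm = graycomatrix(gray, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=False, normed=True)
    return graycoprops(glcm, 'contrast').ravel()                         # eqs. (5.4)-(5.5)

Each routine returns a small NumPy vector that can be used directly as the input feature of the corresponding scorer in Section 5.4.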
Before calculating the color features, we need to perform color space transformation, such as RGB to YIQ, and RGB to YCbCr. Since dierent color spaces have dierent 135 Table 5.1: Mapping Table for Each Color Channel Color channel Original intensity value Mapped intensity value R [0, 255] - G [0, 255] - B [0, 255] - Y [0, 1] [0, 255] I [-0.523, 0.523] [0, 255] Q [-0.596, 0.596] [0, 255] Y [16, 235] - Cb [16, 240] - Cr [16, 240] - L* [0, 100] - a* [-128, 128] [0, 255] b* [-128, 128] [0, 255] Figure 5.2: Histogram of channel Y and corresponding characteristics. ranges of intensity values [110, 53], we use Table 5.1 to map original intensity values from all color channels to the similar range of values. For example, [-0.523, 0.523] in channel I and [-0.596, 0.596] in channel Q are both mapped to [0, 255]. To extract the rst color feature, we generate the 256-bin (0-255) histogram for channel Y, I, Q, a*, and b* of the image. In addition, the 101-bin (0-100) histogram is generated for channel L*. Then we use the characteristics of the histogram to dene 136 the feature. One example of the histogram for channel Y is plotted in Fig. 5.2. Several characteristics (max, pos, and DR) are also labeled in Fig. 5.2 for easy reference. The rst color feature is extracted from both YIQ and L*a*b* color spaces and summarized as follows. f color #1 = 2 6 6 6 6 6 6 6 4 (I Y ); (I I ); (I Q ); (I L ); (I a ); (I b ); DR(I Y ); DR(I I ); DR(I Q ); DR(I L ); DR(I a ); DR(I b ); max(I Y ); max(I I ); max(I Q ); max(I L ); max(I a ); max(I b ); pos(I Y ); pos(I I ); pos(I Q ); pos(I L ); pos(I a ); pos(I b ) 3 7 7 7 7 7 7 7 5 ; (5.7) where (I i ) and DR(I i ) are the global mean and dynamic range (histogram range) of the imageI on color channeli (i = Y, I, Q, L*, a*, b*), respectively. Also, max(I i ) and pos(I i ) are number and position of the most frequent bin for the histogram of image I on color channel i, respectively. Similarly, the second color feature is extracted from the YCbCr histogram of the image. The feature is described as f color #2 = DR(I Y ) (I Y ) ; DR(I Cb ) (I Cb ) ; DR(I Cr ) (I Cr ) : (5.8) whereDR(I i ) and(I i ) are the dynamic range and global mean of the image I on color channel i (i = Y, Cb, Cr), respectively. 5.3.4 Distortion Distortion dependent features have been frequently used for image quality estimation. The most commonly seen distortions are blockiness, blurriness, and noises. In general, the distortion degrades the image quality. The more severe the distortion is, the worse the image quality would be. Hence, to quantify the degree of distortion is also equivalent to a way to measuring the quality for images. In this distortion domain, we present six distortion related features to help build several scorer models for quality prediction. 137 The rst three features [137] are calculated horizontally and vertically, and then combined into a single value by averaging. 1. Blockiness (along the horizontal direction): It is dened as the average dierences across block boundaries B h = 1 M(bN=8c 1) M X i=1 bN=8c1 X j=1 jd h (i; 8j)j; (5.9) where d h (i;j) = x(i;j + 1)x(i;j);j2 [1;N 1], is the dierence signal along horizontal line, and x(i;j);i2 [1;M];j2 [1;N] for an image of size MN. 2. Average absolute dierence between in-block image samples (along the horizontal direction): A h = 1 7 2 4 8 M(N 1) M X i=1 N1 X j=1 jd h (i;j)jB h 3 5 : (5.10) 3. 
Zero-crossing (ZC) rate (along the horizontal direction): Define

z_h(i,j) = \begin{cases} 1 & \text{if a ZC happens at } d_h(i,j) \\ 0 & \text{otherwise.} \end{cases}

Then, the horizontal ZC rate can be estimated as

Z_h = \frac{1}{M(N-2)} \sum_{i=1}^{M} \sum_{j=1}^{N-2} z_h(i,j).   (5.11)

Similarly, we can compute the vertical components B_v, A_v, Z_v. Finally, the features are given by

f_{distortion \#1} = B = \frac{B_h + B_v}{2},   (5.12)
f_{distortion \#2} = A = \frac{A_h + A_v}{2},   (5.13)
f_{distortion \#3} = Z = \frac{Z_h + Z_v}{2}.   (5.14)

4. Average edge-spread (ES): First, edge detection is applied to the image. Then we follow the procedure outlined in [98] to compute the total edge-spread (ES). The average ES is obtained by dividing the total amount of ES by the number of edge pixels in the image, which can be computed by

f_{distortion \#4} = ES_{avg} = \frac{ES_{total}}{\#\text{ of edge pixels}}.   (5.15)

5. Average block variance (BV) [78]: First, we divide the whole image into 4 x 4 blocks and classify them into "smooth" or "non-smooth" blocks according to the existence of edges. Then, we collect a set of smooth blocks of size 4 x 4 and make sure that they do not cross the boundaries of the 8 x 8 DCT blocks. Finally, the average block variance of the image is calculated by

f_{distortion \#5} = BV_{avg} = \frac{\sum_{\text{smooth blocks}} BV}{\#\text{ of smooth blocks}}.   (5.16)

6. Divide the image into N non-overlapping 16 x 16 blocks. Then estimate the local additive noise power (np) for each block [73]. The feature is then obtained by concatenating the estimated noise power of each block, as given by

f_{distortion \#6} = [np_1, np_2, \ldots, np_j, \ldots, np_N],   (5.17)

where np_j represents the estimated additive noise power in the j-th block.

5.3.5 Texture

The first feature in the texture domain is derived from the rotation-invariant and uniform local binary pattern (LBP) [97], which is of the form

LBP^{riu2}_{P,R} = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c) & \text{if } U(LBP_{P,R}) \le 2 \\ P + 1 & \text{otherwise,} \end{cases}   (5.18)

where

U(LBP_{P,R}) = |s(g_{P-1} - g_c) - s(g_0 - g_c)| + \sum_{p=1}^{P-1} |s(g_p - g_c) - s(g_{p-1} - g_c)|,

and

s(g_p - g_c) = \begin{cases} 1 & \text{if } g_p \ge g_c \\ 0 & \text{otherwise,} \end{cases}

with g_c corresponding to the gray value of the center pixel of the local neighborhood and g_p (p = 0, \ldots, P-1) corresponding to the gray values of P equally spaced pixels on a circle of radius R (R > 0) that form a circularly symmetric neighbor set. The superscript riu2 stands for the use of rotation-invariant "uniform" patterns that have a U value of at most 2. The feature can then be written as

f_{texture \#1} = NHist(b), \quad b = 0, 1, \ldots, P + 1,   (5.19)

where b represents a bin of the histogram, and NHist denotes the normalized histogram of (5.18). As shown in Fig. 5.3, we choose P = 8 and R = 1 for simplicity of the LBP operator. Therefore, we have 10 (i.e., P + 2) values to represent the feature in (5.19). If a larger P value is chosen, the complexity of the LBP operator will be higher because the implementation needs a lookup table of 2^P elements [97].

To be able to capture the texture in more detail, we adopt two commonly used features [148] (i.e., phase congruency (PC) and gradient magnitude (GM)) for images.

Figure 5.3: Circularly symmetric neighbor set in LBP.

Assume x represents the image patch extracted from a spatial location of an image, and PC(x), GM(x) are the PC and GM of x.
For an image I with M image patches, the PC measure is selected as the second feature for texture domain f texture #2 = 1 M X x2I PC(x); (5.20) and the GM measure is selected as the third feature f texture #3 = 1 M X x2I GM(x); (5.21) Since gray-level co-occurrence matrix (GLCM) is eective for texture analysis, we extract another feature, called "Homogeneity" of GLCM [54] to characterize the texture. The "Homogeneity" of GLCM can measure the closeness of the distribution of elements in the GLCM to the GLCM diagonal and is dened as GLCM H = X l;m p(l;m) 1 +jlmj ; (5.22) where p(l;m) has the same denition as in Section 5.3.2. Using the same spatial rela- tionship in Fig. 5.1, we can obtain one set of feature with two values f texture #4 = [GLCM H;1 ;GLCM H;2 ]: (5.23) 141 The 5th and 6th features are used to characterize the edge structure of an image. An image is rst divided into M non-overlapping 16 16 patches, and the Sobel edge operator [110, 53] is used in each patch to generate horizontal gradient g h and vertical gradient g v . Then, we can obtain the edge magnitude and edge orientation via edge magnitude = q g 2 v +g 2 h ; and edge orientation = tan 1 g v g h : Suppose that h 1;i (b) and h 2;i (b) represent the n-bin histograms of edge magnitude and edge orientation from thei-th image patch of the image, respectively, and we can compute MAX mag;i = maxfh 1;i (b)jb = 1; ;ng; i = 1;:::;N; (5.24) MAX ori;i = maxfh 2;i (b)jb = 1; ;ng; i = 1;:::;N; (5.25) whereb denotes the bin of the histogram, andn is the number of bins (we choosen = 10 in this work for low computation). Finally, these features are given by: f texture #5 = [MAX mag;1 ; ;MAX mag;N ]: (5.26) f texture #6 = [MAX ori;1 ; ;MAX ori;N ]: (5.27) The procedure of extracting the 5th and 6th features is shown in Fig. 5.4. 5.4 Multi-Perceptual-Domain Scorer Ensemble 5.4.1 Building Scorer Models After the feature extraction, each feature can be trained to become an independent model and predict quality scores for images. This model is called image quality scorer 142 Figure 5.4: Extraction procedure of the 5th and 6th texture features. Table 5.2: List of Multi-Perceptual-Domain Features Name Domain & # Name Domain & # Name Domain & # f 1 brightness f 6 distortion #1 f 12 texture #1 f 2 contrast #1 f 7 distortion #2 f 13 texture #2 f 3 contrast #2 f 8 distortion #3 f 14 texture #3 f 4 color #1 f 9 distortion #4 f 15 texture #4 f 5 color #2 f 10 distortion #5 f 16 texture #5 - - f 11 distortion #6 f 17 texture #6 (IQS). Since the scorers are trained via the features from multiple perceptual domains, we also call these scorers as multi-perceptual-domain scorers (MPDSs). To facilitate future analysis, we summarize the features obtained in Section 5.3 into Table 5.2. The scorer trained by feature f i (i = 1;:::; 17) is denoted as IQS(f i ). 5.4.2 Ensemble System The implementation of the proposed multi-perceptual-domain scorer ensemble (MPDSE) IQA system is shown in Fig. 5.5. First, we train each feature (f i ) obtained from the training image set to get a score prediction model, called IQS(f i ). Then we feed the features extracted from the testing image set into the IQSs to obtain a score for each scorer. Finally, the ensemble [106] approach (e.g., SVR) is applied to fuse the scores from the selected IQSs to get the nal quality score. Here, we consider an MPDSE system consisting of n IQSs. 
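Before formalizing this fusion, two of the hand-crafted features above are sketched to show how little code each scorer input requires: the rotation-invariant uniform LBP histogram of (5.18)-(5.19) and the horizontal blockiness measure of (5.9). The choices P = 8, R = 1, and the 8-pixel block spacing follow the text; the helper names and the use of scikit-image's 'uniform' LBP as a stand-in for the operator of [97] are assumptions for illustration.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    # method='uniform' is scikit-image's rotation-invariant uniform LBP;
    # its output takes the P + 2 values used in (5.19).
    codes = local_binary_pattern(gray, P, R, method='uniform')
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist                                            # f_texture #1

def horizontal_blockiness(gray, block=8):
    x = gray.astype(np.float64)
    d_h = np.diff(x, axis=1)                               # d_h(i, j) = x(i, j+1) - x(i, j)
    nblocks = x.shape[1] // block
    cols = block * np.arange(1, nblocks) - 1               # 0-based columns at j = 8, 16, ...
    return np.abs(d_h[:, cols]).mean()                     # eq. (5.9); average with the
                                                           # transposed call to obtain eq. (5.12)

Analogous routines for the remaining entries of Table 5.2 supply the inputs of the 17 scorers; the fusion of their outputs is formalized next.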
Figure 5.5: The MPDSE IQA system.

Suppose we have m training images, and for the i-th training image we obtain its quality score from each individual IQS, denoted by s_{i,j}, where i = 1, 2, \ldots, m is the image index and j = 1, 2, \ldots, n is the IQS index. We also define the quality score vector s_i = (s_{i,1}, \ldots, s_{i,n})^T for the i-th image. The MPDSE quality score is defined as

MPDSE(s_i) = w^T \varphi(s_i) + b,   (5.28)

where w = (w_1, \ldots, w_n)^T is the weight vector and b is the bias. In the training stage, we determine the weight vector w and the bias b from the training data so as to minimize the difference between MPDSE(s_i) and the (differential) mean opinion score (D)MOS_i obtained from human viewers; namely,

\min_{w,b} \| MPDSE(s_i) - (D)MOS_i \|_\infty, \quad i = 1, \ldots, m,   (5.29)

where \|\cdot\|_\infty denotes the \ell_\infty norm. To solve this problem, we require that the maximum absolute difference in (5.29) be bounded by a certain level \varepsilon, and we choose support vector regression (SVR) [120] for its solution. In performing SVR, we adopt the radial basis function (RBF) as the kernel function. We also tried a linear kernel in all the experiments conducted in this work, and the results turned out to be not as good as with the RBF kernel. This is probably because the RBF kernel can handle the situation where the quality score vector s_i and (D)MOS_i are not linearly related. Therefore, we only report the results obtained with the nonlinear RBF kernel. In the testing stage, we define the quality score vector s_k of the k-th test image, where k = 1, 2, \ldots, l, with l being the number of test images, and (5.28) is used to determine the quality score of the MPDSE method, MPDSE(s_k).

In all experiments, we use a 5-fold cross-validation strategy (widely used in machine learning [30, 42]) to select the training and testing sets. First, we divide the image set into 5 subsets. To obtain unbiased results, we make sure that the testing set and the training set use different original images. One subset is used for testing and the remaining 4 subsets are used for training; we rotate this 5 times so that each subset is used as the testing set exactly once. The testing results from the 5 folds are then combined to compute the overall correlation coefficients and error. This procedure also reveals whether over-fitting occurs.

Before applying SVR [34], we linearly scale the scores obtained from each IQS to the same range [0, 1] to facilitate numerical calculation and, more importantly, to prevent quality scores with a larger numerical range from dominating those with a smaller range. The linear scaling is performed for both training and testing data [56] via

y = \frac{x - \min(X)}{\max(X) - \min(X)},   (5.30)

where y is the scaled score, x is the raw score, and \max(X) and \min(X) are the maximum and minimum values of the raw scores X, respectively.

5.4.3 IQS Selection

Given the scorer set S = \{s_j \mid j = 1, \ldots, 17\}, we want to find a subset S_N = \{s_{i_1}, s_{i_2}, \ldots, s_{i_N}\}, with N < 17, that optimizes the objective function J(S_N), defined as

J(S_N) = RMSE(MPDSE(S_N), (D)MOS),   (5.31)

where RMSE is the root-mean-squared error between the predicted objective scores and the subjective scores. Sequential Forward Scorer Selection (SFSS) is the simplest greedy search algorithm for this goal. Starting from an empty scorer set S_0, we sequentially add the scorer s^* that yields the smallest objective function J(S_k + s^*) between MOS (DMOS) and the MPDSE output when combined with the scorer set S_k that has already been selected.
The algorithm can be stated below for clarity: Algorithm: Sequential Forward Scorer Selection (SFSS) 1. Start with the empty IQS set S 0 =fg. 2. Select the next best IQS. s = arg min s2SS k J(S k +s) 3. Update S k+1 = S k +s ; k = k + 1: 4. if J(S k+1 )J(S k ) then go to 2. else 146 selected IQS set = S k . end if 5.5 Experiments 5.5.1 Databases In addition to LIVE Image Quality Database [9], Tampere Image Database 2013 (TID2013) [15, 107] is also tested in this work. Since TID2013 is the latest and also contains more distortion types and distorted images than other databases, we use it to test the robustness of our proposed method. These two databases will be brie y introduced as below. The LIVE Image Quality Database [9] has 29 reference images and 779 test images, consisting of ve distortion types (JPEG2000, JPEG, white noise in the RGB compo- nents, Gaussian blur, and transmission errors in the JPEG2000 bit stream using a fast- fading Rayleigh channel model). The subjective quality scores provided in this database are DMOS, ranging from 0 to 100. Comparing to Tampere Image Database (TID2008) [16, 108], the Tampere Image Database 2013 (TID2013) [15, 107] introduced seven new distortion types, including change of color saturation (#18), multiplicative Gaussian noise (#19), comfort noise (#20), lossy compression of noisy images (#21), image color quantization with dither (#22), chromatic aberrations (#23), sparse sampling and reconstruction (#24). This makes TID2013 a more diversifying and challenging database than ever since it has 25 reference images, 24 types of distortions (Table 5.3) for each reference image, and 5 dierent levels for each type of distortion. The whole database contains 3,000 distorted images, with MOS (ranging from 0 to 9) provided in this database.. 147 Table 5.3: Image Distortion Types in TID2013 Type Type of distortion 1 Additive Gaussian Noise 2 Dierent additive noise in color components 3 Spatially correlated noise 4 Masked noise 5 High frequency noise 6 Impulse noise 7 Quantization noise 8 Gaussian blur 9 Image denoising 10 JPEG compression 11 JPEG2000 compression 12 JPEG transmission errors 13 JPEG2000 transmission errors 14 Non eccentricity pattern noise 15 Local block-wise distortions of dierent intensity 16 Mean shift (intensity shift) 17 Contrast change 18 Change of color saturation 19 Multiplicative Gaussian noise 20 Comfort noise 21 Lossy compression of noisy images 22 Image color quantization with dither 23 Chromatic aberrations 24 Sparse sampling and reconstruction 5.5.2 Performance Measure Indices We use the following three indices to measure IQA model performance [127, 128]. The rst index is the Pearson correlation coecient (PCC) between objective and subjec- tive scores after nonlinear regression analysis. It provides an evaluation of prediction accuracy. The second index is the Spearman rank order correlation coecient (SROCC) between the objective and subjective scores. It is considered as a measure of prediction monotonicity. The third index is the RMSE. Before computing the rst and second indices, we need to use the logistic function and the procedure outlined in [127] to t the objective model scores to the MOS (or DMOS). 
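In practice this fitting can be done with an off-the-shelf nonlinear least-squares routine. The sketch below, given only as an illustration, fits the four-parameter logistic stated in (5.32) right after using scipy.optimize.curve_fit and then computes the three indices; the initial guess p0 and the helper name evaluate are assumptions, and the exact initialization recommended in [127] is not reproduced here.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(x, b1, b2, b3, b4):
    # Four-parameter monotonic logistic function of (5.32).
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / np.abs(b4))) + b2

def evaluate(objective, subjective):
    p0 = [subjective.max(), subjective.min(),
          objective.mean(), objective.std() + 1e-6]          # rough initial guess (assumed)
    beta, _ = curve_fit(logistic, objective, subjective, p0=p0, maxfev=20000)
    fitted = logistic(objective, *beta)
    pcc = pearsonr(fitted, subjective)[0]                     # prediction accuracy
    srocc = spearmanr(objective, subjective)[0]               # monotonicity (rank-based, no fitting needed)
    rmse = np.sqrt(np.mean((fitted - subjective) ** 2))       # prediction error
    return pcc, srocc, rmse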
The monotonic logistic function used to t the objective prediction scores to the subjective quality scores [127] is: f(x) = 1 2 1 + exp ( x 3 j 4 j ) + 2 ; (5.32) where x is the objective prediction score, f(x) is the tted objective score, and the parameters j (j = 1; 2; 3; 4) are chosen to minimize the least squares error between the 148 Table 5.4: Performance of Each IQS in LIVE and TID2013 IQS LIVE TID2013 PCC SROCC RMSE PCC SROCC RMSE IQS(f 1 ) 0.3769 0.0945 25.3067 0.0927 0.0881 1.2343 IQS(f 2 ) 0.3794 0.0126 25.2792 0.3533 0.1207 1.1597 IQS(f 3 ) 0.6600 0.5763 20.5273 0.6992 0.7155 0.8863 IQS(f 4 ) 0.6329 0.4956 21.1546 0.4422 0.3917 1.1119 IQS(f 5 ) 0.3994 0.2766 25.0480 0.2362 0.2257 1.2046 IQS(f 6 ) 0.4060 0.0333 24.9686 0.3049 0.1104 1.1806 IQS(f 7 ) 0.3880 -0.0492 25.1822 0.4147 0.1567 1.1281 IQS(f 8 ) 0.6672 0.6862 20.3526 0.5009 0.2445 1.0729 IQS(f 9 ) 0.3444 0.1740 25.6502 0.4261 0.0976 1.1215 IQS(f 10 ) 0.3863 0.0565 25.2015 0.1592 0.0464 1.2239 IQS(f 11 ) 0.6032 0.5879 21.7912 0.6931 0.6675 0.8937 IQS(f 12 ) 0.8463 0.8502 14.5546 0.5886 0.4425 1.0022 IQS(f 13 ) 0.2997 0.0636 26.0666 0.3250 0.1088 1.1724 IQS(f 14 ) 0.2636 0.1752 26.3557 0.2664 0.0634 1.1949 IQS(f 15 ) 0.5732 0.4888 22.3889 0.4244 0.2243 1.1225 IQS(f 16 ) 0.3862 0.4235 25.2028 0.6286 0.6438 0.9641 IQS(f 17 ) 0.4482 0.4361 24.4236 0.6445 0.6808 0.9478 subjective score and the tted objective score. Initial estimates of the parameters were chosen based on the recommendation in [127]. For a perfect match between the objective prediction scores and the subjective quality scores, we have PCC = 1, SROCC = 1 and RMSE = 0. 5.5.3 Performance Comparisons In order to identify the contribution of each IQS, we list the performance of each IQS for two databases (LIVE and TID2013) in Table 5.4. As we can see in Table 5.4, for LIVE database, the top three well-performed scorers are IQS(f 12 ), IQS(f 8 ), and IQS(f 3 ), which are related to the texture, distortion, and contrast, respectively. And IQS(f 3 ), IQS(f 11 ), IQS(f 17 ) perform well with respect to TID2013. The scorers are from contrast, distortion, and texture, respectively. Especially, IQS(f 3 ) has better performance for both databases. To reduce the complexity of the proposed method, we use the SFSS algorithm intro- duced in Section 5.4.3 to be able to achieve the optimal performance with minimum number of IQSs. The selected IQSs are listed in Table 5.5. It is observed that we need to combine the scorers from multiple perceptual domains (e.g., all ve domains for both databases) to have a better performance. In addition, we only need 11 IQSs to achieve 149 Table 5.5: Selected IQS by SFSS in LIVE and TID2013 Database Selected IQSs No. of IQSs LIVE IQS(f 12 ), IQS(f 15 ), IQS(f 3 ), IQS(f 13 ), IQS(f 8 ), IQS(f 5 ), IQS(f 7 ), IQS(f 6 ), IQS(f 11 ), IQS(f 1 ), IQS(f 10 ) 11 TID2013 IQS(f 3 ), IQS(f 17 ), IQS(f 8 ), IQS(f 4 ), IQS(f 11 ), IQS(f 12 ), IQS(f 5 ), IQS(f 16 ), IQS(f 10 ), IQS(f 15 ), IQS(f 6 ), IQS(f 13 ), IQS(f 1 ), IQS(f 14 ), IQS(f 9 ), IQS(f 2 ) 16 Table 5.6: Performance Comparisons for Dierent Combinations of IQSs Database Selected IQSs Total no. of IQSs PCC SROCC RMSE LIVE All IQSs 17 0.9397 0.9388 9.3423 Table 5.5 11 0.9447 0.9423 8.9593 TID2013 All IQSs 17 0.9113 0.9076 0.5102 Table 5.5 16 0.9114 0.9078 0.5101 the best performance in LIVE database. In other words, as shown in Table 5.6, using all IQSs cannot guarantee you the best performance results. 
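The selection algorithm referred to here is the SFSS loop of Section 5.4.3 wrapped around the SVR fusion of Section 5.4.2. A compact, hedged sketch of such a loop is shown below; the SVR hyper-parameters, the plain 5-fold split (the chapter additionally keeps images of the same reference out of both sides), and the helper names fused_rmse and sfss are illustrative assumptions rather than the tuned setup used for Tables 5.5 and 5.6.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

def fused_rmse(S, mos, idx):
    # S: (n_images, n_scorers) raw IQS outputs; idx: scorer subset to fuse.
    X = S[:, idx]
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)   # eq. (5.30), done once for brevity
    pred = cross_val_predict(SVR(kernel='rbf', C=10.0, epsilon=0.1), X, mos, cv=5)
    return np.sqrt(np.mean((pred - mos) ** 2))                          # objective J of (5.31)

def sfss(S, mos):
    remaining, selected, best = list(range(S.shape[1])), [], np.inf
    while remaining:
        scores = {j: fused_rmse(S, mos, selected + [j]) for j in remaining}
        cand = min(scores, key=scores.get)
        if scores[cand] >= best:              # stop once J no longer decreases
            break
        best = scores[cand]
        selected.append(cand)
        remaining.remove(cand)
    return selected, best

With S holding the 17 scorer outputs for a database and mos the subjective scores, sfss(S, mos) returns the kept scorer indices in the order they were added, in the spirit of the listings in Table 5.5.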
Contrarily, choosing the IQSs wisely (with some selection algorithm) can help lower the complexity and also achieve the best performance. In Table 5.7, we compare the proposed MPDSE approach with other existing state- of-the-art methods, including DIIVINE, BLIINDS-II, BRISQUE, and CORNIA, which all belong to NR ones. The well-known FR methods (PSNR, SSIM) and corresponding performances are also summarized in Table 5.7 for easy comparison. In brief, our pro- posed MPDSE framework outperforms all existing well-performed IQA models with a very signicant margin in TID2013. Moreover, the MPDSE approach also has a competi- tive performance with BRISQUE and CORNIA in LIVE database. From the observation from Table 5.7, it can be concluded that extracting features from a diversity of domains can help cope with the situation having more than 20 distortion types (e.g., TID2013). However, using one to two types of features only can work well for the environment where several distortion types exist (e.g., LIVE). Therefore, multi-perceptual-domain features (scorers) are essential for a robust IQA model. Furthermore, as shown in Table 5.8, our proposed method (MPDSE) performs the best for JPEG and fast fading (FF) distortions in LIVE database. But for JPEG2000 and Blur, CORNIA ranks the number one. DIIVINE also beats the others on white 150 Table 5.7: Performance Comparison among 7 IQA Models in LIVE and TID2013 IQA model LIVE (779 images) TID2013 (3000 images) PCC SROCC RMSE PCC SROCC RMSE PSNR (FR) 0.8701 0.8756 13.4685 0.6727 0.6394 0.9172 SSIM (FR) 0.9384 0.9479 9.4439 0.7893 0.7417 0.7612 DIIVINE (NR) 0.8443 0.8560 14.6397 0.5408 0.3552 1.0427 BLIINDS-II (NR) 0.9144 0.9115 11.0587 0.4692 0.3936 1.0948 BRISQUE (NR) 0.9424 0.9395 9.2164 0.4749 0.3674 1.0910 CORNIA (NR) 0.9350 0.9420 9.7015 0.5750 0.4280 1.0142 MPDSE (NR) 0.9447 0.9423 8.9593 0.9114 0.9078 0.5101 Table 5.8: SROCC Performance of 7 IQA Models w.r.t Distortion Types in LIVE Database Distortion type PSNR (FR) SSIM (FR) DIIVINE (NR) BLIINDS-II (NR) BRISQUE (NR) CORNIA (NR) MPDSE (NR) JP2K 0.8954 0.9614 0.9185 0.9299 0.9139 0.9430 0.9422 JPEG 0.8809 0.9764 0.8141 0.9471 0.9647 0.9550 0.9682 WN 0.9854 0.9694 0.9878 0.9597 0.9786 0.9760 0.9745 Blur 0.7823 0.9517 0.9581 0.9103 0.9511 0.9690 0.9370 FF 0.8907 0.9556 0.8586 0.8348 0.8768 0.9060 0.9111 noise (WN). However, in TID2013, as shown in Table 5.9, our approach (MPDSE) has superior performance than other NR IQA models in all distortion types. 5.6 Conclusion In this work, we have proposed a new NR IQA model based on a dierent perspective. We design and utilize a rich diversity of features instead of just using distortion related or NSS features. These features are extracted from multiple perceptual domains, such as brightness, contrast, color, distortion, and texture. Then the quality score prediction model (also called a scorer) is built for each feature via machine learning. Since each scorer plays a dierent role at judging image quality. Some are based on color, and the other are based on the contrast or texture. Hence, instead of using conventional weighting scheme or the ad-hoc approach, we use ensemble method to combine the opinions from all IQSs intelligently. In addition, SFSS is employed in the system to perform IQS selection. This is a more systematic and reasonable way to achieve better performance and reduce the complexity. 
In the end, comprehensive experiments on both databases (LIVE and TID2013) show that the proposed MPDSE framework outperforms 151 Table 5.9: SROCC Performance of 7 IQA Models w.r.t Distortion Types in TID2013 Database Distortion type PSNR (FR) SSIM (FR) DIIVINE (NR) BLIINDS-II (NR) BRISQUE (NR) CORNIA (NR) MPDSE (NR) 1 0.9291 0.8671 0.8553 0.7226 0.8523 0.7564 0.9422 2 0.8981 0.7727 0.7120 0.6498 0.7090 0.7146 0.8492 3 0.9200 0.8515 0.4626 0.7674 0.4908 0.7047 0.9480 4 0.8323 0.7767 0.6752 0.5128 0.5748 0.7197 0.8492 5 0.9140 0.8634 0.8788 0.8246 0.7528 0.8117 0.9362 6 0.8968 0.7503 0.8063 0.6502 0.6299 0.7651 0.8636 7 0.8808 0.8657 0.1650 0.7816 0.7984 -0.0451 0.8858 8 0.9149 0.9668 0.8344 0.8557 0.8134 0.9207 0.9282 9 0.9480 0.9254 0.7231 0.7116 0.5864 0.8447 0.9443 10 0.9189 0.9200 0.6288 0.8643 0.8521 0.8809 0.9220 11 0.8840 0.9468 0.8534 0.8984 0.8925 0.9065 0.9578 12 0.7685 0.8493 0.2387 0.3055 0.3170 0.6685 0.8636 13 0.8883 0.8828 0.3178 0.6211 0.3594 0.6681 0.8847 14 0.6863 0.7822 0.2912 0.1109 0.1454 0.4158 0.7731 15 0.1553 0.5721 0.2498 0.3618 0.3535 0.3816 0.6990 16 0.7671 0.7752 0.0105 0.1368 0.1269 0.2729 0.6153 17 0.4400 0.5853 0.4601 0.3401 0.2172 0.1530 0.7094 18 0.0766 -0.4137 0.2773 0.4251 0.4677 0.5042 0.7327 19 0.8905 0.7803 0.7873 0.7166 0.7242 0.6553 0.9229 20 0.8411 0.8566 0.4524 0.4124 0.4424 0.4758 0.9209 21 0.9145 0.9057 0.6327 0.7194 0.6852 0.8729 0.9571 22 0.9269 0.8542 0.4362 0.7359 0.7640 0.2366 0.9043 23 0.8872 0.8775 0.6608 0.5444 0.6160 0.8154 0.8608 24 0.9042 0.9461 0.8334 0.8181 0.7841 0.8704 0.9479 other existing state-of-the-art learning-based NR IQA models with clear explanation for its success. 152 Chapter 6 Conclusion and Future Work 6.1 Summary of the Research In Chapter 2, we have rst reviewed the existing visual quality assessment methods and their classication in a comprehensive perspective. Then we introduced the recent devel- opments in image quality assessment (IQA), including the popular public image quality databases that play an important role in facilitating the relevant research activities in this eld and several well-performed image quality metrics. In a similar format, we also discussed the recent developments for video quality assessment (VQA) in general, the publicly available video quality databases and several state-of-the-art VQA metrics. In addition, we have compared the major existing IQA and VQA metrics, and given some discussion, with using the most comprehensive image and video quality databases respec- tively. In the end, we introduce the machine learning methods that can be applied on visual quality assessment. One important class of applications of visual quality assessment is perceptual image and video coding. The perceptually driven coding methods have demonstrated their merits, compared with the traditional MSE based coding techniques. Such research takes a dierent path (i.e., removing perceptual signal redundancy apart from the statistical one) to further improve the coding performance and makes it more use-oriented since humans are the ultimate appreciator of almost all processed visual signals. Existing and interesting methods include: utilizing a perceptual quality index to measure distortion; utilizing JND and VA models in coding; integrating motion or texture information to improve the coding eciency in a perceptual sense. We believe that there are still a lot of possibilities for perceptual coding and beyond, which wait for being discovered. 
153 As far as we know, there is no single existing image quality index gives the best performance in all situations. Thus, in Chapter 3, we have proposed an open, inclusive framework for better performance with the current level of technology and for easy extension when new technology emerges. To be more specic, we have presented a multi-method fusion (MMF) approach for image quality assessment and proposed two MMF-based quality indices, based upon machine learning. It was shown by experiments with six dierent databases (totally 3752 images) that both of them outperform state- of-the-art quality indices by a signicant margin. As expected, the complexity of the MMF method is higher since it involves the calculation of multiple methods. However, with the help of the algorithms (SFMS and BIRD), we can reduce the number of fused methods and lower the complexity of MMF. As long as we can keep the number of fused IQA methods not more than three, the computational time of MMF is around 1 minute per image. In most cases, we only need 3 methods for the fusion to achieve satisfactory performance (i.e., PCC over 0.93). Even in the most dicult (in terms of correlation) TID2008 database, MMF can also achieve 0.9307 and 0.9517 on PCC by using 3 methods for CF-MMF and CD-MMF, respectively. The performance of MMF outperforms other well-known image quality indices. Another advantage of the proposed MMF methodology is its exibility in including new methods. For example, in this work, we have added two new methods, which are FSIM [148] and MAD [63], into the candidate method set to replace poor-performed methods (VIFP [117] and UQI [131]) that we used in [75]. It turns out that the PCC performance of CD-MMF improves from 0.9438 to 0.9538 for TID2008 database. This is a good demonstration of the forward inclusiveness of the proposed methodology. In the end, we perform the tests across databases. The results are quite promising. Therefore, the generality of our proposed MMF approach is also demonstrated. In Chapter 4, we have proposed a new concept of IQSs (image quality scorers) consist- ing of basic and auxiliary IQSs, based upon careful, critical and comprehensive analysis of the existing metrics; and we have then developed a ParaBoost approach for objective 154 image quality assessment. Apart from formulation of general IQSs, we have designed several IQSs to target at some specic image distortion types which are very dicult to be dealt with. Then, dierent training image sets are used to train the devised IQSs to be able to gain high diversity. In addition, we use two statistical testing based methods: 1) 1-way ANOVA + paired t test, and 2) K-W statistic + Wilcoxon signed-rank test, in order to select the optimal combination of IQSs and also be able to achieve the best performance with the minimum number of scorers. In the nal stage, a non-linear SVR score fuser is used to combine the outputs from all the IQSs instead of the conventional weighting or voting schemes. The experimental results across four well-known public databases (totally 6,345 images) show that the proposed framework outperforms other existing state-of-the-art IQA models (including both formula-based and learning-oriented methods) with clear explanation for its success. The extension of the ParaBoost approach to the problem of video quality assessment (VQA) is under our current investigation. One main challenge along this direction is the lack of large video quality assessment databases. 
Furthermore, the design of powerful invididual video quality scorers (VQS) is still an open issue. For Chapter 5, we have proposed a new NR IQA model based on a dierent perspec- tive. We design and utilize a rich diversity of features instead of just using distortion related or NSS features. These features are extracted from multiple perceptual domains, such as brightness, contrast, color, distortion, and texture. Then the quality score pre- diction model (also called a scorer) is built for each feature via machine learning. Since each scorer plays a dierent role at judging image quality. Some are based on color, and the other are based on the contrast or texture. Hence, instead of using conven- tional weighting scheme or the ad-hoc approach, we use ensemble method to combine the opinions from all IQSs intelligently. In addition, SFSS is employed in the system to perform IQS selection. This is a more systematic and reasonable way to achieve better performance and reduce the complexity. In the end, comprehensive experiments on both 155 databases (LIVE and TID2013) show that the proposed MPDSE framework outperforms other existing state-of-the-art learning-based NR IQA models with clear explanation for its success. 6.2 Future Research Directions Although many visual quality assessment metrics have been developed for both image and video during the past decade, there are still great technological challenges ahead and much space for improvement, toward eective, reliable, ecient and widely accepted replacement for MSE/PSNR, for both standalone and embedded applications. We will discuss the possible directions in this section. 6.2.1 PSNR or SSIM-modied Metrics PSNR has always been criticized its poor correlation with human subjective evaluations. However, according to our observations [75, 78], PSNR sometimes still can work very well on some specic distortion types, such as additive and quantization noise. Hence, a lot of metrics have been developed or derived from PSNR, such as PSNR-HVS [43], EPSNR [64], and SPHVSM [60]. They either incorporate some related HVS characteris- tics into PSNR or include some experimental observations to modify PSNR to improve the correlation. Promising results can be achieved in this way of modication. Among the quality metrics we just mentioned above, only the EPSNR is developed to use on video quality assessment. As a single metric, the SSIM is considered the well-performed metric among all visual quality evaluation metrics, in terms of consistency. Thus, researchers in the eld have managed to transform it by changing its pooling method or using other image features. Several examples of the former are V-SSIM [136], Speed-SSIM [134], 3-SSIM [67], and IW-SSIM [135], while FSIM index [148] is an example of the latter. They are all proven quite useful in improving the quality prediction performance, especially 156 FSIM, which shows superior performance in several image quality databases, including TID2008, CSIQ, LIVE, and IVC. Building new metrics based upon more mature metrics (like PSNR and SSIM) is expected to continue, especially in new application scenarios (e.g., for 3D scenes, mobile media, medical imaging, image/video retargeting, computer graphics, and so on). 6.2.2 Multiple Strategies or Multi-Metric Fusion Approaches MAD [63] and MMF [75, 78] are the representatives for multiple strategies and multi- metric fusion, respectively. 
Especially for the latter one, appropriate fusion of existing metrics opens the chances to build on the strength of each participating metric and the resultant framework can be even used when new, good metrics emerge. More careful and in-depth investigation is needed for this topic. Most recently, a block-based MMF (BMMF) [59] approach is proposed on coping with image quality assessment. The authors rst decomposed images into smaller block size. Then they classify the blocks into three types (smooth, edge, and texture). And they also divided all the images into ve dierent distortion groups, like in [75, 78]. Finally, only one appropriate quality metric is selected for each block based on the distortion group and the block type. Fusion through all the blocks leads to the nal quality score for each image. It oers competitive performance with the MMF for the TID2008 database. 6.2.3 Migration from IQA to VQA Up to now, more research has been performed for IQA. As mentioned before, video quality evaluation can be done by using image quality metrics on a frame-by-frame basis, and then averaging to obtain a nal video quality score. However, this only works well when video contents do not have large motion in temporal domain. When there exists a large motion, we need to nd the temporal structure and temporal features. The most common method is to use motion estimation to nd out the motion vectors and measure the variations in temporal domain. One simple realization of this idea is 157 in [80]. The authors extended one existing image quality assessment metric to a video quality metric by considering temporal information and converted it into a compensation factor to correct the video quality score obtained in the spatial domain. There are also other video quality metrics that utilize motion estimation to detect the temporal varia- tions, such as Speed-SSIM [134], MOVIE [115], TetraVQM [27], MSE TIM [71], STAQ [26], and ST-MAD [129]. All the above approaches improve the correlation between pre- dictions and subjective quality scores more or less. This demonstrates that the temporal variation is indeed an important factor we need to consider for VQA. Another feasible method is to extend original image quality metric into a video quality metric by considering three additional processing steps: temporal channel decomposition, temporal masking, and temporal pooling. One example of this is recently proposed in [70]. Their resultant video quality metric shows a quite good performance in matching subjective scores for LIVE Video Quality Database. Similarly, we can also use the MMF strategy on video quality assessment, via fusing the scores obtained from all available video quality metrics. A possible problem of this approach is the high complexity since multiple metrics and video data are involved. One solution to realize ecient MMF for video is to pick up the best features used in all metrics, including both spatial and temporal features, instead of using all participating metrics as they are. Moreover, this solution gives a chance to eliminate the repetition in feature detection among dierent metrics, and proper machine learning techniques will be customized for this purpose. In addition, VA modeling [81] may play a more active role in VQA than IQA. 6.2.4 Audiovisual Quality Assessment for 4G Networks During the recent years, the term Quality of Experience (QoE) has been used and dened as the users' perceived Quality of Service (QoS). 
More often than not in multimedia applications, the quality assessment has to be performed with audio and video (images) 158 being presented together. It is an important but less investigated research topic, in spite of some early work in this area [46, 48, 50]. It has been proposed that a better QoE can be achieved when the QoS is considered both in the network and application layers as a whole [62]. In the application layer, QoS is aected by the factors such as resolution, frame rate, sampling rate, number of channels, color, video codec type, audio codec type, and layering strategy. The network layer introduces impairment parameters such as packet loss, jitter, network delay, burstiness, and decreased throughput, etc. These are all the key factors that aect the overall audiovisual QoE. Hence, the investigation into the quality assessment methods for both audio and video is also important and meaningful since video chats and video conferences over 4G networks may be frequently used by the general public in the near future. We believe this is a signicant extension of the current research work and very meaningful in total multimedia experience evaluation. Currently there is no public database for joint audiovisual quality and experience evaluation. The establishment of such databases will facilitate the research and promote the advancement in this eld. 6.2.5 Perceptual Image/Video Coding The accuracy of IQA is becoming better and better. The performance of perceptual image coding could be further improved under some specic conditions. The perceptual considerations can help the performance to be enhanced compared to the traditional image coding. As the introduced applications above, IQA metrics have been associated to video coding for some time. More and more related research is in progress. In general, VQA related video compression is less investigated. Bovik et al. [115] addressed motion-based video integrity evaluation (MOVIE) index to evaluate video quality. The MOVIE index based on Gabor decomposition is calculated from two com- ponents, which are Spatial MOVIE map and Temporal MOVIE map. The spatial part is established as combination of SSIM and VIF; the temporal part is brought by using 159 motion information. The performance of MOVIE shows the potential to be employed to video coding. Nevertheless, it is challenging to be handled in video coding because it needs to parse the whole video to give the index. Hence, modifying VQA to low complexity and real-time processing would be a possible goal to integrate VQA to video coding. These are issues to apply VQA to perceptual video coding. 6.2.6 No-Reference (NR) Quality Metrics As we know, the NR method does not perform as well as the FR one in general since it judges the quality solely based on the distorted medium and without any reference available. However, it can be used in wider scope of applications because of its suitability in both situations with and without reference information. Moreover, the computational requirement is usually less since there is no need to process the reference. In addition to the traditional NR cases (like the relay site and receiving end of transmission), there are emerging NR applications (e.g., super-resolution construction, image and video retarget- ing/adaption, and computer graphics/animation). That is the reason why several NR quality metrics have been proposed recently, including MREBN [37] and JNBM [45] in images, and CVQ [61] and V-Factor [143] in videos. 
Bibliography
[1] A57 Database. [Online]. Available: http://foulard.ece.cornell.edu/dmc27/vsnr/vsnr.html.
[2] Categorical Image Quality (CSIQ) Database. [Online]. Available: http://vision.okstate.edu/csiq.
[3] Digital Video Library. [Online]. Available: http://www.cdvl.org/.
[4] EPFL-PoliMI Video Quality Assessment Database. [Online]. Available: http://vqa.como.polimi.it/.
[5] IRCCyN/IVC 1080i Database. [Online]. Available: http://www.irccyn.ec-nantes.fr/spip.php?article541.
[6] IRCCyN/IVC SD RoI Database. [Online]. Available: http://www.irccyn.ec-nantes.fr/spip.php?article551.
[7] IVC Image Quality Database. [Online]. Available: http://www2.irccyn.ec-nantes.fr/ivcdb.
[8] IVC-LAR Database. [Online]. Available: http://www.irccyn.ec-nantes.fr/~autrusse/Databases/LAR.
[9] LIVE Image Quality Assessment Database. [Online]. Available: http://live.ece.utexas.edu/research/quality/subjective.htm.
[10] LIVE Video Quality Database. [Online]. Available: http://live.ece.utexas.edu/research/quality/live_video.html.
[11] LIVE Wireless Video Quality Assessment Database. [Online]. Available: http://live.ece.utexas.edu/research/quality/live_wireless_video.html.
[12] MMSP 3D Image Quality Assessment Database. [Online]. Available: http://mmspg.epfl.ch/cms/page-58394.html.
[13] MMSP 3D Video Quality Assessment Database. [Online]. Available: http://mmspg.epfl.ch/3dvqa.
[14] MMSP Scalable Video Database. [Online]. Available: http://mmspg.epfl.ch/svd.
[15] Tampere Image Database 2013. [Online]. Available: http://www.ponomarenko.info/tid2013.htm.
[16] Tampere Image Database. [Online]. Available: http://www.ponomarenko.info/tid2008.htm.
[17] Toyoma Database. [Online]. Available: http://mict.eng.u-toyama.ac.jp/mictdb.html.
[18] VQEG FRTV Phase I Database, 2000. [Online]. Available: ftp://ftp.crc.ca/crc/vqeg/TestSequences/.
[19] VQEG HDTV Database. [Online]. Available: http://www.its.bldrdoc.gov/vqeg/projects/hdtv/.
[20] Wireless Imaging Quality (WIQ) Database. [Online]. Available: http://www.bth.se/tek/rcg.nsf/pages/wiq-db.
[21] Methodology for the Subjective Assessment of the Quality of Television Pictures. Recommendation ITU-R BT.500-11, 2002.
[22] Methodology for the Subjective Assessment of the Quality of Television Pictures. Recommendation ITU-R BT.500-13, 2012.
[23] Objective perceptual video quality measurement techniques for digital cable television in the presence of a full reference. Recommendation ITU-T J.144, Feb. 2004.
[24] Objective perceptual video quality measurement techniques for standard definition digital broadcast television in the presence of a full reference. Recommendation ITU-R BT.1683, Jan. 2004.
[25] Subjective Video Quality Assessment Methods for Multimedia Applications. Recommendation ITU-T P.910, Sep. 1999.
[26] S. A. Amirshahi and M. Larabi. Spatial-temporal video quality metric based on an estimation of QoE. In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pages 84–89. IEEE, 2011.
[27] M. Barkowsky, J. Bialkowski, B. Eskofier, R. Bitto, and A. Kaup. Temporal trajectory aware video quality measure. Selected Topics in Signal Processing, IEEE Journal of, 3(2):266–279, 2009.
[28] D. Basak, S. Pal, and D. C. Patranabis. Support vector regression. Neural Information Processing-Letters and Reviews, 11(10):203–224, 2007.
[29] R. Bellman. Adaptive control processes: a guided tour, volume 4. Princeton University Press, Princeton, 1961.
[30] C. M. Bishop and N. M. Nasrabadi. Pattern recognition and machine learning, volume 1. Springer, New York, 2006.
[31] M. Bosch, F. Zhu, and E. J. Delp. Segmentation-based video compression using texture and motion models. Selected Topics in Signal Processing, IEEE Journal of, 5(7):1366–1377, 2011.
[32] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.
[33] D. M. Chandler and S. S. Hemami. VSNR: A wavelet-based visual signal-to-noise ratio for natural images. Image Processing, IEEE Transactions on, 16(9):2284–2298, 2007.
[34] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[35] S. S. Channappayya, A. C. Bovik, and R. W. Heath. Rate bounds on SSIM index of quantized images. Image Processing, IEEE Transactions on, 17(9):1624–1639, 2008.
[36] Z. Chen and C. Guillemot. Perceptually-friendly H.264/AVC video coding based on foveated just-noticeable-distortion model. Circuits and Systems for Video Technology, IEEE Transactions on, 20(6):806–819, 2010.
[37] M. Choi, J. Jung, and J. Jeon. No-reference image quality assessment using blur and noise, 2009.
[38] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[39] Z. Cui and X. Zhu. Subjective quality optimized intra mode selection for H.264 I frame coding based on SSIM. In Image and Graphics (ICIG), 2011 Sixth International Conference on, pages 157–162. IEEE, 2011.
[40] S. J. Daly. Visible differences predictor: an algorithm for the assessment of image fidelity. In SPIE/IS&T 1992 Symposium on Electronic Imaging: Science and Technology, pages 2–15. International Society for Optics and Photonics, 1992.
[41] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C. Bovik. Image quality assessment based on a degradation model. Image Processing, IEEE Transactions on, 9(4):636–650, 2000.
[42] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons, 2012.
[43] K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, and M. Carli. New full-reference quality metrics based on HVS. In CD-ROM Proceedings of the Second International Workshop on Video Processing and Quality Metrics, 2006.
[44] U. Engelke and H.-J. Zepernick. Perceptual-based quality metrics for image and video services: A survey. In Next Generation Internet Networks, 3rd EuroNGI Conference on, pages 190–197. IEEE, 2007.
[45] R. Ferzli and L. J. Karam. A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB). Image Processing, IEEE Transactions on, 18(4):717–728, 2009.
[46] M. R. Frater, J. F. Arnold, and A. Vahedian. Impact of audio on subjective assessment of video quality in videoconferencing applications. Circuits and Systems for Video Technology, IEEE Transactions on, 11(9):1059–1062, 2001.
[47] J. H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.
[48] M. Furini and V. Ghini. A video frame dropping mechanism based on audio perception. In Global Telecommunications Conference Workshops, 2004. GlobeCom Workshops 2004. IEEE, pages 211–216. IEEE, 2004.
[49] X. Gao, W. Lu, D. Tao, and X. Li. Image quality assessment based on multiscale geometric analysis. Image Processing, IEEE Transactions on, 18(7):1409–1423, 2009.
[50] G. Ghinea and J. P. Thomas. Quality of perception: user quality of service in multimedia presentations. Multimedia, IEEE Transactions on, 7(4):786–789, 2005.
[51] B. Girod. What's wrong with mean-squared error? In Digital Images and Human Vision, pages 207–220. MIT Press, 1993.
[52] S. A. Glantz. Primer of biostatistics. 2005.
[53] R. C. Gonzalez and R. E. Woods. Digital image processing. 2007.
[54] R. M. Haralick, K. Shanmugam, and I. H. Dinstein. Textural features for image classification. Systems, Man and Cybernetics, IEEE Transactions on, (6):610–621, 1973.
[55] I. Hontsch and L. J. Karam. Adaptive image coding with perceptual distortion control. Image Processing, IEEE Transactions on, 11(3):213–222, 2002.
[56] C.-W. Hsu, C.-C. Chang, C.-J. Lin, et al. A practical guide to support vector classification, 2003.
[57] Y.-H. Huang, T.-S. Ou, P.-Y. Su, and H. H. Chen. Perceptual rate-distortion optimization using structural similarity index as quality metric. Circuits and Systems for Video Technology, IEEE Transactions on, 20(11):1614–1624, 2010.
[58] N. Jayant, J. Johnston, and R. Safranek. Signal compression based on models of human perception. Proceedings of the IEEE, 81(10):1385–1422, 1993.
[59] L. Jin, K. Egiazarian, and C.-C. J. Kuo. Perceptual image quality assessment using block-based multi-metric fusion (BMMF). In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 1145–1148. IEEE, 2012.
[60] L. Jin, N. Ponomarenko, and K. Egiazarian. Novel image quality metric based on similarity. In Signals, Circuits and Systems (ISSCS), 2011 10th International Symposium on, pages 1–4. IEEE, 2011.
[61] Y. Kawayoke and Y. Horita. NR objective continuous video quality assessment model based on frame quality measure. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, pages 385–388. IEEE, 2008.
[62] A. Khan, Z. Li, L. Sun, and E. Ifeachor. Audiovisual quality assessment for 3G networks in support of e-healthcare services. In Proceedings of the 3rd International Conference on Computational Intelligence in Medicine and Healthcare. Citeseer, 2007.
[63] E. C. Larson and D. M. Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.
[64] C. Lee, S. Cho, J. Choe, T. Jeong, W. Ahn, and E. Lee. Objective video quality assessment. Optical Engineering, 45(1):017004, 2006.
[65] J.-S. Lee and T. Ebrahimi. Efficient video coding in H.264/AVC by using audio-visual information. In Multimedia Signal Processing, 2009. MMSP'09. IEEE International Workshop on, pages 1–6. IEEE, 2009.
[66] A. Leontaris, P. C. Cosman, and A. R. Reibman. Quality evaluation of motion-compensated edge artifacts in compressed video. Image Processing, IEEE Transactions on, 16(4):943–956, 2007.
[67] C. Li and A. C. Bovik. Three-component weighted structural similarity index. In IS&T/SPIE Electronic Imaging, pages 72420Q. International Society for Optics and Photonics, 2009.
[68] C. Li and T. Chen. Aesthetic visual quality assessment of paintings. Selected Topics in Signal Processing, IEEE Journal of, 3(2):236–252, 2009.
[69] Q. Li and Z. Wang. Reduced-reference image quality assessment using divisive normalization-based image representation. Selected Topics in Signal Processing, IEEE Journal of, 3(2):202–211, 2009.
[70] S. Li, L. Ma, and K. N. Ngan. Video quality assessment by decoupling additive impairments and detail losses. In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pages 90–95. IEEE, 2011.
[71] S. Li, L. Ma, F. Zhang, and K. N. Ngan. Temporal inconsistency measure for video quality assessment. In Picture Coding Symposium (PCS), 2010, pages 590–593. IEEE, 2010.
[72] Z. Li, S. Qin, and L. Itti. Visual attention guided bit allocation in video compression. Image and Vision Computing, 29(1):1–14, 2011.
[73] J. S. Lim. Two-dimensional signal and image processing. Prentice Hall, Englewood Cliffs, NJ, 1990.
[74] W. Lin and C.-C. J. Kuo. Perceptual visual quality metrics: A survey. Journal of Visual Communication and Image Representation, 22(4):297–312, 2011.
[75] T.-J. Liu, W. Lin, and C.-C. J. Kuo. A multi-metric fusion approach to visual quality assessment. In Quality of Multimedia Experience (QoMEX), 2011 Third International Workshop on, pages 72–77. IEEE, 2011.
[76] T.-J. Liu, W. Lin, and C.-C. J. Kuo. Recent developments and future trends in visual quality assessment. In Proceedings of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pages 18–21, 2011.
[77] T.-J. Liu, W. Lin, and C.-C. J. Kuo. A fusion approach to video quality assessment based on temporal decomposition. In Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, pages 1–5. IEEE, 2012.
[78] T.-J. Liu, W. Lin, and C.-C. J. Kuo. Image quality assessment using multi-method fusion. Image Processing, IEEE Transactions on, 22(5):1793–1807, 2013.
[79] T.-J. Liu, Y.-C. Lin, W. Lin, and C.-C. J. Kuo. Visual quality assessment: recent developments, coding applications and future trends. APSIPA Transactions on Signal and Information Processing, 2, e4, 2013.
[80] T.-J. Liu, K.-H. Liu, and H.-H. Liu. Temporal information assisted video quality metric for multimedia. In Multimedia and Expo (ICME), 2010 IEEE International Conference on, pages 697–701. IEEE, 2010.
[81] Z. Lu, W. Lin, X. Yang, E. Ong, and S. Yao. Modeling visual attention's modulatory aftereffects on visual sensitivity and quality evaluation. Image Processing, IEEE Transactions on, 14(11):1928–1942, 2005.
[82] J. Lubin. A visual discrimination model for imaging system design and evaluation. Vision Models for Target Detection and Recognition, 2:245–357, 1995.
[83] H. Luo. A training-based no-reference image quality assessment algorithm. In Image Processing, 2004. ICIP'04. 2004 International Conference on, volume 5, pages 2973–2976. IEEE, 2004.
[84] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi. A no-reference perceptual blur metric. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 3, pages III-57. IEEE, 2002.
[85] M. Masry, S. S. Hemami, and Y. Sermadevi. A scalable wavelet-based video distortion metric and applications. Circuits and Systems for Video Technology, IEEE Transactions on, 16(2):260–273, 2006.
[86] M. A. Masry and S. S. Hemami. A metric for continuous quality evaluation of compressed video with severe distortions. Signal Processing: Image Communication, 19(2):133–146, 2004.
[87] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. Image Processing, IEEE Transactions on, 21(12):4695–4708, 2012.
[88] A. K. Moorthy and A. C. Bovik. A two-step framework for constructing blind image quality indices. Signal Processing Letters, IEEE, 17(5):513–516, 2010.
[89] A. K. Moorthy and A. C. Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. Image Processing, IEEE Transactions on, 20(12):3350–3364, 2011.
[90] M. Naccari and F. Pereira. Advanced H.264/AVC-based perceptual video coding: Architecture, tools, and assessment. Circuits and Systems for Video Technology, IEEE Transactions on, 21(6):766–782, 2011.
[91] M. Narwaria and W. Lin. Objective image quality assessment based on support vector regression. Neural Networks, IEEE Transactions on, 21(3):515–519, 2010.
[92] M. Narwaria and W. Lin. Machine learning based modeling of spatial and temporal factors for video quality assessment. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2513–2516. IEEE, 2011.
[93] M. Narwaria and W. Lin. Video quality assessment using temporal quality variations and machine learning. In Multimedia and Expo (ICME), 2011 IEEE International Conference on, pages 1–6. IEEE, 2011.
[94] P. Ndjiki-Nya, C. Stuber, and T. Wiegand. Texture synthesis method for generic video sequences. In Image Processing, 2007. ICIP 2007. IEEE International Conference on, volume 3, pages III-397. IEEE, 2007.
[95] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba. Considering temporal variations of spatial visual distortions in video quality assessment. Selected Topics in Signal Processing, IEEE Journal of, 3(2):253–265, 2009.
[96] B. T. Oh, Y. Su, C. Segall, and C.-C. Kuo. Synthesis-based texture video coding with side information. Circuits and Systems for Video Technology, IEEE Transactions on, 21(5):647–659, 2011.
[97] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7):971–987, 2002.
[98] E. Ong, W. Lin, Z. Lu, X. Yang, S. Yao, F. Pan, L. Jiang, and F. Moschetti. A no-reference quality metric for measuring image blur. In Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on, volume 1, pages 469–472. IEEE, 2003.
[99] E. Ong, W. Lin, Z. Lu, S. Yao, X. Yang, and L. Jiang. No-reference JPEG-2000 image quality metric. In Multimedia and Expo, 2003. ICME'03. Proceedings. 2003 International Conference on, volume 1, pages I-545. IEEE, 2003.
[100] T. Oommen, D. Misra, N. K. Twarakavi, A. Prakash, B. Sahoo, and S. Bandopadhyay. An objective analysis of support vector machine based classification for remote sensing. Mathematical Geosciences, 40(4):409–424, 2008.
[101] T.-S. Ou, Y.-H. Huang, and H. H. Chen. SSIM-based perceptual rate control for video coding. Circuits and Systems for Video Technology, IEEE Transactions on, 21(5):682–691, 2011.
[102] S. Ouni, E. Zagrouba, and M. Chambah. A new no-reference method for color image quality assessment (NR-IQA). 40(17), 2012.
[103] J. Park, K. Seshadrinathan, A. Bovik, et al. Video quality pooling adaptive to perceptual distortion severity. Image Processing, IEEE Transactions on, 22(2):610–620, 2013.
[104] E. Peli. Contrast in complex images. JOSA A, 7(10):2032–2040, 1990.
[105] M. H. Pinson and S. Wolf. A new standardized method for objectively measuring video quality. Broadcasting, IEEE Transactions on, 50(3):312–322, 2004.
[106] R. Polikar. Ensemble based systems in decision making. Circuits and Systems Magazine, IEEE, 6(3):21–45, 2006.
[107] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo. Color image database TID2013: Peculiarities and preliminary results. 2013.
[108] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10(4):30–45, 2009.
[109] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V. Lukin. On between-coefficient contrast masking of DCT basis functions. In Proceedings of the Third International Workshop on Video Processing and Quality Metrics, volume 4, 2007.
[110] W. K. Pratt. Digital image processing. 2007.
[111] A. Rehman and Z. Wang. Reduced-reference SSIM estimation. In Image Processing (ICIP), 2010 17th IEEE International Conference on, pages 289–292. IEEE, 2010.
[112] T. Richter and K. J. Kim. A MS-SSIM optimal JPEG 2000 encoder. In Data Compression Conference, 2009. DCC'09., pages 401–410. IEEE, 2009.
[113] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. Image Processing, IEEE Transactions on, 21(8):3339–3352, 2012.
[114] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[115] K. Seshadrinathan and A. C. Bovik. Motion tuned spatio-temporal quality assessment of natural videos. Image Processing, IEEE Transactions on, 19(2):335–350, 2010.
[116] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack. Study of subjective and objective quality assessment of video. Image Processing, IEEE Transactions on, 19(6):1427–1441, 2010.
[117] H. R. Sheikh and A. C. Bovik. Image information and visual quality. Image Processing, IEEE Transactions on, 15(2):430–444, 2006.
[118] H. R. Sheikh, A. C. Bovik, and G. De Veciana. An information fidelity criterion for image quality assessment using natural scene statistics. Image Processing, IEEE Transactions on, 14(12):2117–2128, 2005.
[119] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. Image Processing, IEEE Transactions on, 15(11):3440–3451, 2006.
[120] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[121] S. Suresh, V. Babu, and N. Sundararajan. Image quality measurement using sparse extreme learning machine classifier. In Control, Automation, Robotics and Vision, 2006. ICARCV'06. 9th International Conference on, pages 1–6. IEEE, 2006.
[122] S. Suresh, R. Venkatesh Babu, and H. Kim. No-reference image quality assessment using modified extreme learning machine classifier. Applied Soft Computing, 9(2):541–552, 2009.
[123] D. Tan, C. Tan, and H. Wu. Perceptual color image coding with JPEG2000. Image Processing, IEEE Transactions on, 19(2):374–383, 2010.
[124] P. C. Teo and D. J. Heeger. Perceptual image distortion. In Image Processing, 1994. Proceedings. ICIP-94., IEEE International Conference, volume 2, pages 982–986. IEEE, 1994.
[125] N. Thakur and S. Devi. A new method for color image quality assessment. International Journal of Computer Applications, 15(2):10–17, 2011.
[126] H. Tong, M. Li, H.-J. Zhang, and C. Zhang. No-reference quality assessment for JPEG2000 compressed images. In Image Processing, 2004. ICIP'04. 2004 International Conference on, volume 5, pages 3539–3542. IEEE, 2004.
[127] VQEG. Final report from the video quality experts group on the validation of objective models of video quality assessment, phase I. Mar. 2000. [Online]. Available: http://www.its.bldrdoc.gov/vqeg/projects/frtv_phaseI.
[128] VQEG. Final report from the video quality experts group on the validation of objective models of video quality assessment, phase II. Aug. 2003. [Online]. Available: http://www.its.bldrdoc.gov/vqeg/projects/frtv_phaseII.
[129] P. V. Vu, C. T. Vu, and D. M. Chandler. A spatiotemporal most-apparent-distortion model for video quality assessment. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 2505–2508. IEEE, 2011.
[130] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao. SSIM-motivated rate-distortion optimization for video coding. Circuits and Systems for Video Technology, IEEE Transactions on, 22(4):516–529, 2012.
[131] Z. Wang and A. C. Bovik. A universal image quality index. Signal Processing Letters, IEEE, 9(3):81–84, 2002.
[132] Z. Wang and A. C. Bovik. Mean squared error: love it or leave it? A new look at signal fidelity measures. Signal Processing Magazine, IEEE, 26(1):98–117, 2009.
[133] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. Image Processing, IEEE Transactions on, 13(4):600–612, 2004.
[134] Z. Wang and Q. Li. Video quality assessment using a statistical model of human visual speed perception. JOSA A, 24(12):B61–B69, 2007.
[135] Z. Wang and Q. Li. Information content weighting for perceptual image quality assessment. Image Processing, IEEE Transactions on, 20(5):1185–1198, 2011.
[136] Z. Wang, L. Lu, and A. C. Bovik. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2):121–132, 2004.
[137] Z. Wang, H. R. Sheikh, and A. C. Bovik. No-reference perceptual quality assessment of JPEG compressed images. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I-477. IEEE, 2002.
[138] Z. Wang and E. P. Simoncelli. Translation insensitive image similarity in complex wavelet domain. In Acoustics, Speech, and Signal Processing, 2005. Proceedings (ICASSP'05). IEEE International Conference on. Citeseer, 2005.
[139] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398–1402. IEEE, 2003.
[140] A. B. Watson, J. Hu, and J. F. McGowan. Digital video quality metric based on human vision. Journal of Electronic Imaging, 10(1):20–29, 2001.
[141] Z. Wei and K. N. Ngan. Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain. Circuits and Systems for Video Technology, IEEE Transactions on, 19(3):337–346, 2009.
[142] S. Winkler. Digital video quality: vision models and metrics. Wiley, 2005.
[143] S. Winkler and P. Mohandas. The evolution of video quality measurement: from PSNR to hybrid metrics. Broadcasting, IEEE Transactions on, 54(3):660–668, 2008.
[144] C.-L. Yang, R.-K. Leung, L.-M. Po, and Z.-Y. Mai. An SSIM-optimal H.264/AVC inter frame encoder. In Intelligent Computing and Intelligent Systems, 2009. ICIS 2009. IEEE International Conference on, volume 4, pages 291–295. IEEE, 2009.
[145] P. Ye and D. Doermann. No-reference image quality assessment using visual codebooks. Image Processing, IEEE Transactions on, 21(7):3129–3138, 2012.
[146] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1098–1105. IEEE, 2012.
[147] C. Yim and A. C. Bovik. Quality assessment of deblocked images. Image Processing, IEEE Transactions on, 20(1):88–98, 2011.
[148] L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: a feature similarity index for image quality assessment. Image Processing, IEEE Transactions on, 20(8):2378–2386, 2011.
Abstract
Research on visual quality assessment has been active during the last decade. This dissertation consists of six parts centered on this subject. In Chapter 1, we highlight the significance and contributions of our research work and thoroughly review the previous work in this area. ❧ In Chapter 2, we provide an in-depth review of recent developments in the field. Compared with other surveys, ours makes several contributions. First, besides image quality databases and metrics, we put emphasis on video quality databases and metrics, since this is a less investigated area. Second, we discuss the application of visual quality evaluation to perceptual coding as an application example. Third, we compare the performance of state-of-the-art visual quality metrics through experiments. Finally, we introduce the machine learning methods that can be applied to visual quality assessment. ❧ In Chapter 3, a new methodology for objective image quality assessment (IQA) with multi-method fusion (MMF) is proposed. The research is motivated by the observation that no single method gives the best performance in all situations. To achieve MMF, we adopt a regression approach: the MMF score is a nonlinear combination of scores from multiple methods, with suitable weights obtained by a training process. To improve the regression results further, we divide distorted images into three to five groups based on distortion type and perform regression within each group, which is called "context-dependent MMF" (CD-MMF). One task in CD-MMF is to determine the context automatically, which is achieved by a machine learning approach. To further reduce the complexity of MMF, we apply selection algorithms to choose a small subset from the candidate method set. The result remains very good even when only three quality assessment methods are included in the fusion process. The proposed MMF method using support vector regression (SVR) is shown to outperform a large number of existing IQA methods by a significant margin when tested on six representative databases. ❧ In Chapter 4, an ensemble method for full-reference IQA based on the parallel boosting (ParaBoost for short) idea is proposed. We first extract features from existing image quality metrics and train them to form basic image quality scorers (BIQSs). Then, we select additional features to address specific distortion types and train them to construct auxiliary image quality scorers (AIQSs). Both BIQSs and AIQSs are trained on small image subsets of certain distortion types and, as a result, they are weak performers with respect to a wide variety of distortions. Finally, we adopt the ParaBoost framework to fuse the scores of BIQSs and AIQSs to evaluate images containing a wide range of distortion types. This ParaBoost methodology can be easily extended to images with new distortion types. Extensive experiments demonstrate the superior performance of the ParaBoost method, which outperforms existing IQA methods by a significant margin. Specifically, the Spearman rank order correlation coefficients (SROCCs) of the ParaBoost method on the LIVE, CSIQ, TID2008 and TID2013 image quality databases are 0.98, 0.97, 0.98 and 0.96, respectively. ❧ In Chapter 5, a no-reference learning-based approach to assess image quality is presented. The features are extracted from multiple perceptual domains, including brightness, contrast, color, distortion, and texture. These features are then used to train models (scorers) that predict quality scores. A scorer selection algorithm is utilized to simplify the proposed system. In the final stage, an ensemble method combines the prediction results from all scorers. Unlike other existing IQA methods based on natural scene statistics (NSS) or distortion-dependent features, the proposed quality prediction model is robust with respect to more than 24 image distortion types. Extensive experiments on two well-known databases confirm the performance robustness of the proposed model. ❧ Chapter 6 summarizes the work presented in the dissertation. In addition, we point out and discuss several possible directions for future visual signal quality assessment, namely PSNR- or SSIM-modified metrics, multiple-strategy and multi-metric fusion approaches, migration of IQA to VQA, joint audiovisual assessment, perceptual image/video coding, and NR quality assessment, with reasoning based upon our experience and understanding of the related research.
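The multi-method fusion idea summarized above can be sketched in a few lines: given scores from several existing metrics for a set of training images with known subjective ratings, a support vector regressor learns their nonlinear combination. The sketch below is a minimal illustration only; the use of scikit-learn, the RBF kernel, and the hyperparameter values are our assumptions and not necessarily the configuration used in the dissertation.

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_mmf_fuser(metric_scores, mos):
    # metric_scores: array of shape (n_images, n_metrics), one column per
    #   participating metric (e.g., SSIM, VIF, FSIM scores).
    # mos: array of shape (n_images,), subjective mean opinion scores.
    fuser = make_pipeline(StandardScaler(),
                          SVR(kernel="rbf", C=10.0, epsilon=0.1))
    fuser.fit(np.asarray(metric_scores), np.asarray(mos))
    return fuser

# Usage: predicted = train_mmf_fuser(train_scores, train_mos).predict(test_scores)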
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Advanced techniques for stereoscopic image rectification and quality assessment
Experimental design and evaluation methodology for human-centric visual quality assessment
Facial age grouping and estimation via ensemble learning
Techniques for compressed visual data quality assessment and advanced video coding
A data-driven approach to compressed video quality assessment using just noticeable difference
Explainable and lightweight techniques for blind visual quality assessment and saliency detection
Multimodal image retrieval and object classification using deep learning features
Machine learning techniques for perceptual quality enhancement and semantic image segmentation
Block-based image steganalysis: algorithm and performance evaluation
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Labeling cost reduction techniques for deep learning: methodologies and applications
Efficient template representation for face recognition: image sampling from face collections
Advanced techniques for object classification: methodologies and performance evaluation
A data-driven approach to image splicing localization
Advanced technologies for learning-based image/video enhancement, image generation and attribute editing
Classification and retrieval of environmental sounds
Machine learning techniques for outdoor and indoor layout estimation
Novel algorithms for large scale supervised and one class learning
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Machine learning methods for 2D/3D shape retrieval and classification
Asset Metadata
Creator
Liu, Tsung-Jung
(author)
Core Title
A learning‐based approach to image quality assessment
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
07/23/2016
Defense Date
06/11/2014
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
ensemble, fusion, image quality assessment, image quality scorer, machine learning, OAI-PMH Harvest, ParaBoost
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Georgiou, Panayiotis G. (committee member), Nakano, Aiichiro (committee member)
Creator Email
liut@usc.edu,tjliu0412@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-446378
Unique identifier
UC11287037
Identifier
etd-LiuTsungJu-2727.pdf (filename), usctheses-c3-446378 (legacy record id)
Legacy Identifier
etd-LiuTsungJu-2727.pdf
Dmrecord
446378
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Liu, Tsung-Jung
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
ensemble
fusion
image quality assessment
image quality scorer
machine learning
ParaBoost