Explainable and Lightweight Techniques for Blind Visual Quality
Assessment and Saliency Detection
by
Zhanxuan Mei
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2024
Copyright 2024 Zhanxuan Mei
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Prior and Current Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 GreenBIQA: A Lightweight and High-performance Blind Image Quality Assessment
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.2 GreenBVQA: Blind Video Quality Assessment at the Edge . . . . . . . . . . . . . . . 9
1.3.3 GreenSaliency: A Lightweight and Efficient Image Saliency Detection Method . . . . . 10
1.3.4 GSBIQA: Green Saliency-guided Blind Image Quality Assessment Method . . . . . . . 11
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2: Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Blind Image Quality Assessment (BIQA) Methods . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Conventional BIQA Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1.1 Natural Scene Statistics (NSS) Methods . . . . . . . . . . . . . . . . . . . . . 13
2.1.1.2 Codebook-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Deep-Learning-based (DL-based) BIQA Methods . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 Saliency-guided BIQA Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Blind Video Quality Assessment (BVQA) Methods . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Conventional BVQA Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Deep-Learning-based (DL-based) BVQA Methods . . . . . . . . . . . . . . . . . . . . 18
2.3 Image Saliency Detection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Conventional Image Saliency Detection Methods . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 DL-based Image Saliency Detection Methods . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Green Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Video Quality Assessment at the Edge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 3: GreenBIQA: A Lightweight High-Performance Blind Image Quality Assessment . . . . . . 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Image Cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 Unsupervised Representation Generation . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2.1 Spatial Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2.2 Joint Spatio-Color Representations . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Supervised Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Distortion-specific Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4.1 Synthetic Distortions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4.2 Authentic Distortions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.5 Regression and Decision Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1.2 Benchmarking Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2.1 Comparison among Benchmarking Methods . . . . . . . . . . . . . . . . . . . 37
3.3.2.2 Synthetic-Distortion Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2.3 Authentic-Distortion Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3 Cross-Domain Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.4 Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.4.1 Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3.4.2 Inference Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.4.3 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.4.4 Memory/Latency Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.6 Weak Supervision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.7 Active Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Chapter 4: GreenBVQA: Blind Video Quality Assessment at the Edge . . . . . . . . . . . . . . . . . . 48
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Video Data Cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.2 Unsupervised Representation Generation . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.2.1 Spatial Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2.2 Spatio-Color Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2.3 Temporal Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.2.4 Spatio-Temporal Representations . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.3 Supervised Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.4 MOS Regression and Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1.2 Benchmarking Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1.4 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.2 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2.1 Same-Domain Training Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2.2 Cross-Domain Training Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3 Comparison of Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.3.1 Model sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.3.2 Inference time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.3.3 Computational complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 An Edge Computing System with BVQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 5: GreenSaliency: A Lightweight and Efficient Image Saliency Detection Method . . . . . . . 72
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Multi-layer Hybrid Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1.1 Multi-layer Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1.2 Spatial Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.1.3 Subspace Approximation with Adjusted Bias (Saab) Transform . . . . . . . . 78
5.2.1.4 Relevant Feature Test (RFT) . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.2 Multi-path Saliency Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2.1 Saliency Map Prediction and Saliency Residual Prediction . . . . . . . . . . 83
5.2.2.2 Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.2.3 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.1.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2.1 Benchmarking Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3.3 Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.3.1 Model Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.3.2 Inference Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.3.3 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 6: GSBIQA: Green Saliency-guided Blind Image Quality Assessment Method . . . . . . . . . 94
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Green Saliency Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Saliency-guided Data Cropping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Green BIQA Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3.1 Spatial Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.3.2 Spatio-color Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.3.3 Saliency Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.4 Local Patch Score Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.5 Saliency-guided Global Quality Score Prediction . . . . . . . . . . . . . . . . . . . . . 100
6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.1.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.2.1 Benchmarking Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3.2.2 Comparison among Benchmarking Methods . . . . . . . . . . . . . . . . . . . 102
6.3.2.3 Qualitative Analysis on Exemplary Images . . . . . . . . . . . . . . . . . . . 103
6.3.3 Model Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3.1 Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3.2 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3.3 Inference Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Chapter 7: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 Future Work on GreenBIQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.3 Future Work on GreenBVQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.4 Future Work on GreenSaliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
List of Tables
3.1 Four benchmarking IQA datasets, where the number of distorted images, the number of
reference images, the number of distortion types and collection methods of each dataset are
listed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Performance comparison in PLCC and SROCC metrics between our GreenBIQA method and
nine benchmarking methods on four IQA databases, where the nine benchmarking methods
are categorized into four groups as discussed in Sec. 3.3.1 and the best performance numbers
are shown in boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Comparison of the SROCC performance for each of six individual distortion types in the CSIQ
dataset, where WN, JPEG, JP2K, FN, GB, and CC denote white Gaussian noise, JPEG
compression, JPEG-2000 compression, pink Gaussian noise, Gaussian blur, and contrast
decrements, respectively. The last column shows the weighted average of the SROCC metrics. 38
3.4 Comparison of the SROCC performance under the cross-domain learning scenario. . . . . . . 39
3.5 Comparison of SROCC/PLCC performance, no. of model parameters, model sizes (memory
usage), no. of GigaFlops, and no. of KiloFlops per pixel of several BIQA methods tested on
the LIVE-C dataset, where “X” denotes the multiplier relative to our method. . . . . . . . . . . . 42
3.6 Ablation Study for GreenBIQA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1 The selected hyper-parameters for spatial representation generation in our experiments,
where the transform dimensions are denoted by {(H × W), C} for spatial and channel
dimensions, the stride parameter is denoted as {Spatial Stride2}, and L, M, and H represent
low-frequency, mid-frequency, and high-frequency representations, respectively. . . . . . . . . 54
4.2 The selected hyper-parameters for spatial-color representations generation, where the
transform dimensions are denoted by {(H × W) × T, C} for spatial, color, and channel
dimensions and L and H represent low-frequency and high-frequency representations,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Summary of 14-D raw temporal representations. . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 The selected hyper-parameters for spatial-temporal representations generation, where the
transform dimensions are denoted by {(H × W) × T, C} for spatial, temporal, and channel
dimensions and L and H represent low-frequency and high-frequency representations,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Dimensions of unsupervised representation and selected supervised features on the KoNViD-1k dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Statistics of three VQA datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Comparison of the PLCC and SROCC performance of 10 benchmarking methods against
three VQA datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.8 Comparison of the SROCC performance under the cross-domain training scenario. . . . . . . 66
4.9 Model complexity comparison, where the reported SROCC and PLCC performance numbers
are against the KoNViD-1k dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.10 Inference time comparison in seconds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 Ablation Study for GreenBVQA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 Performance comparison in five metrics between our GreenSaliency method and eleven
benchmarking methods on the MIT300 dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Performance comparison in five metrics between our GreenSaliency method and ten
benchmarking methods on the SALICON dataset. . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Comparison of no. of model parameters, model sizes (memory usage), no. of GigaFlops, and
latency time of several saliency detection methods tested on the SALICON dataset, where
“X” denotes the multiplier compared to our proposed method. . . . . . . . . . . . . . . . . . 89
5.4 Ablation study for GreenSaliency on SALICON dataset. . . . . . . . . . . . . . . . . . . . . . 92
6.1 Performance comparison in PLCC and SROCC metrics between our GSBIQA method and
nine benchmarking methods on two IQA databases. The best performance numbers are
shown in boldface, and “X” denotes the multiplier relative to our method. . . . . . . . . . . . . . 102
6.2 Ablation study for GSBIQA on KonIQ-10K dataset. . . . . . . . . . . . . . . . . . . . . . . . 105
List of Figures
1.1 The ground truth MOS of four exemplary images from KonIQ-10K [29] dataset. . . . . . . . 1
1.2 An example pipeline of full-reference, reduced-reference, and no-reference image quality
assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3.1 An overview of the proposed GreenBIQA method. . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 An example of five cropped sub-images for authentic-distortion datasets. . . . . . . . . . . . . 27
3.3 An example of nine cropped sub-images for synthetic-distortion datasets. . . . . . . . . . . . 27
3.4 Unsupervised representation generation: spatial representations and joint spatio-color
representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 RFT results of spatial and spatio-color representations. . . . . . . . . . . . . . . . . . . . . . 29
3.6 Distortion-specific classifier and distortion clustering for synthetic and authentic datasets,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.7 Six synthetic distortion types in CSIQ: (a) Gaussian blur, (b) Gaussian noise, (c) Contrast
decrements, (d) Pink Gaussian noise, (e) JPEG, and (f) JPEG-2000. . . . . . . . . . . . . . . 31
3.8 Three distorted images in KonIQ-10k: (a) Dark environment, (b) Underwater, and (c)
Smeared light. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.9 Performance curve on the validation dataset of LIVE-C with different crop numbers and sizes. 36
3.10 Illustration of the tradeoff between (a) the SROCC performance and model sizes and (b)
the SROCC performance and running time with respect to CSIQ and KonIQ-10K datasets
among several BIQA methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.11 Tradeoff between memory usage and latency for three BIQA methods (NIMA, DBCNN, and
GreenBIQA), where latency can be reduced by the usage of a larger memory. . . . . . . . . . 43
3.12 The PLCC performance curves of GreenBIQA and WaDIQaM are plotted as functions of the
percentages of the full training dataset of KonIQ-10K, where the solid line and the banded
structure indicate the mean value and the range of mean plus/minus one standard deviation,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.13 Comparison of the PLCC performance of GreenBIQA using active learning (in green) and
random selection (in red) on the KonIQ-10k dataset. . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 The system diagram of the proposed GreenBVQA method. . . . . . . . . . . . . . . . . . . . 52
4.2 The video data cropping module of GreenBVQA. . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 The block diagram of unsupervised spatial representation generation. . . . . . . . . . . . . . . 56
4.4 The block diagram of unsupervised spatio-color and spatio-temporal representations generation. 57
4.5 RFT results of spatial, spatio-color, temporal, and spatio-temporal representations. . . . . . . 60
4.6 Comparison of the floating point operation (FLOP) numbers. . . . . . . . . . . . . . . . . . . 68
4.7 Illustration of an edge computing system employing GreenBVQA. . . . . . . . . . . . . . . . 69
5.1 An overview of the proposed GreenSaliency method. . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Multi-layer hybrid feature extraction pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 RFT results of 3 × 3 Saab coefficients from three layers: (a) d8, (b) d16, and (c) d32. . . . . . 79
5.4 Multi-path saliency prediction pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5 Prediction results from different layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Distribution of RFT results in: (a) d16, and (b) d8 layers. . . . . . . . . . . . . . . . . . . . . 83
5.7 Successful cases in GreenSaliency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.8 Failed cases in GreenSaliency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.1 An overview of the proposed GSBIQA method. . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Structure of spatial features extraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 Successful cases in GSBIQA. (a) MOS(G)=3.80, MOS(P)=3.80, (b) MOS(G)=3.76,
MOS(P)=3.76, (c) MOS(G)=3.70, MOS(P)=3.70. . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4 Failed cases in GSBIQA. (a) MOS(G)=2.61, MOS(P)=2.41, (b) MOS(G)=3.23, MOS(P)=3.00,
(c) MOS(G)=2.30, MOS(P)=2.07. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Abstract
The domain of blind visual quality assessment has garnered increasing attention in recent times, driven
by the escalating influx of user-generated data captured by diverse end devices and users. These data
are subsequently disseminated over the Internet without access to the reference, thereby accentuating the
significance of perceptual quality evaluation. Such user-generated content, spanning images and videos,
constitutes a substantial portion of the data transmitted over the Internet. Consequently, their integration
within many image- and video-related processing domains, such as compression, transmission, and preprocessing, necessitates the aid of perceptual quality assessment. However, the absence of references restricts
the application of reliable full-reference visual quality assessment methodologies. Therefore, no-reference
visual quality assessment, commonly denoted as blind visual quality assessment, emerges as the quintessential
approach to automatically and efficiently predict perceptual quality. In this dissertation, we mainly focus
on two research problems in the field of blind visual quality assessment: 1) blind image quality assessment
(BIQA) and 2) blind video quality assessment (BVQA).
Blind image quality assessment (BIQA) is a task that predicts the perceptual quality of an image without its reference. The burgeoning significance of BIQA research is underscored by the rapid proliferation of
user-generated images and the concurrent surge in mobile applications that often lack professional reference
images. The complexity of the challenge arises from the diversity encompassed within image content and
the varied array of potential distortions. Many existing BIQA methods use deep neural networks (DNNs)
to achieve high BIQA performance. Nonetheless, the limitations posed by the considerable model sizes associated with DNNs hinder their applicability to edge or mobile devices. To bridge this gap, a novel BIQA
method with a small model, low computational complexity, and high performance is proposed and named
“GreenBIQA” in this work. GreenBIQA includes five steps: 1) image cropping, 2) unsupervised representation generation, 3) supervised feature selection, 4) distortion-specific prediction, and 5) regression and
decision ensemble. Experimental results show that the performance of GreenBIQA is comparable with that
of state-of-the-art deep-learning (DL) solutions while demanding a much smaller model size and significantly
lower computational complexity.
An emerging direction in existing BIQA methods focuses on enhancing dataset variability by randomly
sampling patches from quality-annotated images and incorporating saliency-guided techniques to improve accuracy. These methods highlight the intrinsic connection between perceptual quality prediction and saliency
prediction. However, conventional saliency predictors employed in these approaches often fail to significantly
improve perceptual quality predictions, and DL-based saliency predictors typically require substantial computational resources. To address these challenges, we develop a novel and lightweight image saliency detector
named GreenSaliency, which offers competitive performance compared to DL-based methods while demanding minimal computational resources. Building upon this innovation, we introduce a novel, efficient, and
saliency-guided BIQA method called Green Saliency-guided Blind Image Quality Assessment (GSBIQA).
This method integrates GreenSaliency with GreenBIQA to form a comprehensive and advanced BIQA approach. This work represents a significant advancement by combining and extending these two lightweight
models into a robust and sophisticated BIQA method.
Transitioning to the field of blind video quality assessment (BVQA), the challenges of dealing with significantly larger video sizes and the increasing prevalence of user-generated videos on the Internet gain
prominence. Similar to its BIQA counterpart, the adoption of deep-learning-based BVQA methods encounters challenges due to their large model sizes and high computational complexities. The intrinsic costs
associated with video quality prediction, notably surpassing those of images, underscore the economic constraints, particularly on edge devices. In light of this, a novel lightweight BVQA method called GreenBVQA
is proposed in this work. GreenBVQA features a small model size, low computational complexity, and high
performance. Its processing pipeline includes the following: video data cropping, unsupervised representation generation, supervised feature selection, and mean-opinion-score (MOS) regression and ensembles. We
conduct experimental evaluations on three BVQA datasets and show that GreenBVQA can offer state-of-the-art performance in PLCC and SROCC metrics while demanding significantly smaller model sizes and lower
computational complexity. Additionally, an illustration of GreenBVQA’s application within a video-based
edge computing system augments the practical utility of this methodology.
Chapter 1
Introduction
1.1 Significance of the Research
Objective visual quality assessment refers to the process of quantitatively measuring and evaluating the
quality of visual content, such as images or videos, using computational algorithms or models. Unlike
subjective quality assessment, which relies on human observers’ subjective opinions, objective assessment
methods aim to provide automated and reproducible quality metrics, utilizing datasets provided by subjective
quality assessment. Fig 1.1 depicts four images along with their respective subjective scores, denoted as Mean
Opinion Scores (MOS), obtained by aggregating the subjective opinions of human observers. The MOS serves
as a metric for assessing the perceptual quality of the images, with higher scores indicating superior visual
quality.
Figure 1.1: The ground truth MOS of four exemplary images from the KonIQ-10K [29] dataset (MOS = 3.63, 3.72, 4.21, and 1.78, respectively).
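For concreteness, the MOS of an image is obtained by averaging the ratings collected from individual observers (typically after outlier screening, which is omitted here). A minimal sketch with made-up ratings:

```python
import numpy as np

def mean_opinion_score(ratings):
    """Aggregate individual subjective ratings into a MOS (simple average)."""
    ratings = np.asarray(ratings, dtype=float)
    return ratings.mean()

# Hypothetical ratings from ten observers on a 1-5 scale.
ratings = [4, 3, 4, 4, 5, 3, 4, 4, 3, 4]
print(f"MOS = {mean_opinion_score(ratings):.2f}")  # MOS = 3.80
```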
Objective visual quality assessment methods are commonly classified into three categories. The first one is full-reference visual quality assessment, which assesses perceptual quality by measuring the difference
between distorted data and their references. The second one is reduced-reference visual quality assessment,
which offers greater flexibility by utilizing a part of the information from distorted data and references.
The third category is no-reference visual quality assessment, also called blind visual quality assessment,
which only has access to distorted data. Fig 1.2 shows the pipeline of three categories using image quality
assessment as an example.
Figure 1.2: An example pipeline of full-reference, reduced-reference, and no-reference image quality assessment.
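The operational difference between the three categories lies in which inputs each method may use. The sketch below contrasts their interfaces, using PSNR as a stand-in full-reference metric; the reduced-reference and no-reference bodies are placeholders for illustration, not actual quality models.

```python
import numpy as np

def fr_quality(reference, distorted):
    """Full-reference: both the pristine reference and the distorted image are available (PSNR here)."""
    mse = np.mean((reference.astype(float) - distorted.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def rr_quality(reference_features, distorted):
    """Reduced-reference: only partial side information about the reference is available."""
    distorted_features = distorted.mean(), distorted.std()  # same descriptors as computed at the sender
    return -np.linalg.norm(np.subtract(reference_features, distorted_features))

def nr_quality(distorted):
    """No-reference (blind): the score must be predicted from the distorted image alone."""
    return float(distorted.std())  # placeholder quality-aware statistic, not a real BIQA model

img = np.random.randint(0, 256, (64, 64))
noisy = np.clip(img + np.random.normal(0, 10, img.shape), 0, 255)
print(fr_quality(img, noisy), rr_quality((img.mean(), img.std()), noisy), nr_quality(noisy))
```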
Given the prevalent absence of reference data for user-generated content, particularly in cases where
users lack access to professional-grade equipment capable of producing high-fidelity reference content, the
landscape of blind visual quality assessment is fraught with challenges. The enormity of content diversity
and the absence of reference data render the task significantly intricate. This complexity is further compounded by the contemporary landscape, where social media platforms are teeming with diverse and prolific
user-generated content, and the heightened prevalence of multi-party video conferencing has engendered a
renewed surge of interest in this domain. As a result, the field of blind visual quality assessment has garnered escalating attention. Blind visual quality assessment includes two essential parts: blind image quality
assessment (BIQA) and blind video quality assessment (BVQA). BIQA and BVQA are techniques used
to evaluate the perceived quality of images and videos, respectively, without having access to the original
reference or knowing the specific distortions applied. These methods aim to mimic human perception and
provide objective measures of image and video quality. BIQA methods analyze various visual features and
characteristics of an image to estimate its quality. These features include sharpness, contrast, color accuracy, noise, texture, etc. The methods compare these features to a model of human perception and assign
a quality score to the image. The scores are typically numerical values that indicate the perceived quality,
with higher scores representing better quality. BVQA extends the concept of BIQA to evaluate the quality
of videos. In addition to analyzing individual frames, BVQA methods consider temporal aspects such as
motion, flickering, and compression artifacts that can affect video quality. These methods may also account
for factors like frame rate, video resolution, and visual smoothness to estimate the overall perceived quality
of a video sequence. BIQA and BVQA techniques are essential in various applications, including image and
video processing, compression algorithms, video streaming, and multimedia systems. By providing objective
quality metrics, these methods help optimize and enhance image and video processing pipelines, ensuring
that the output meets the desired quality standards.
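As a toy illustration of how BVQA builds on BIQA, the sketch below applies a hypothetical frame-level quality predictor to the frames of a video and then pools the per-frame scores over time. Real BVQA methods replace both the frame predictor and the simple average with models that account for motion, flicker, and other temporal effects.

```python
import numpy as np

def frame_quality(frame):
    """Hypothetical per-frame (image) quality predictor; stands in for any BIQA model."""
    # Local contrast as a crude quality-aware statistic (illustration only).
    return float(np.std(frame))

def video_quality(frames, temporal_pool=np.mean):
    """Score each frame, then pool over time; the pooling choice is one place
    where temporal effects (e.g., recency, flicker) can be modeled."""
    scores = [frame_quality(f) for f in frames]
    return float(temporal_pool(scores))

# A synthetic 30-frame grayscale "video".
video = [np.random.rand(120, 160) for _ in range(30)]
print(f"pooled video quality: {video_quality(video):.3f}")
```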
BIQA and BVQA have significant practical importance and are technically challenging in various domains.
Here are some practical applications where these techniques are valuable:
• Image and Video Compression. BIQA and BVQA methods play a crucial role in developing and
optimizing image and video compression algorithms. By assessing the quality of compressed images or
videos, these techniques help researchers and engineers evaluate and fine-tune compression algorithms
to achieve better compression efficiency while maintaining acceptable perceptual quality.
• Quality Control. BIQA and BVQA algorithms are utilized in quality control processes to ensure
that images and videos meet the desired quality standards before they are distributed or released. By
automatically evaluating the quality, these methods can assist in identifying and flagging content with
perceptual impairments or artifacts, enabling content providers to perform necessary corrections or
reject subpar content.
• Multimedia Systems and Services. In multimedia systems, such as video streaming platforms,
BIQA and BVQA techniques are employed to monitor and optimize the quality of transmitted content.
These methods help in adaptive streaming, where the video quality can be dynamically adjusted based
on network conditions and device capabilities, ensuring a consistent and satisfactory viewing experience
for users.
• Video Surveillance. BIQA and BVQA can be applied in video surveillance systems to assess the
quality of surveillance footage. By automatically detecting and evaluating visual impairments in the
recorded videos, such as blurriness or compression artifacts, these methods assist in maintaining high-quality surveillance recordings that are crucial for accurate analysis and identification.
• Content Creation and Enhancement. BIQA and BVQA techniques can be used in content creation
processes to assess the quality of generated images or videos. Content creators, such as photographers
or video editors, can leverage these methods to objectively evaluate their work and make necessary
adjustments or enhancements to improve the overall quality and impact of their content.
• Edge Computing Systems. BIQA and BVQA play a vital role in edge computing systems by providing real-time, context-aware, and resource-efficient quality assessment capabilities. These techniques
enable edge devices to optimize bandwidth usage, enhance content quality, and deliver a seamless user
experience while maintaining data privacy and security.
Overall, blind image quality assessment (BIQA) and blind video quality assessment (BVQA) techniques
provide objective and automated measures of image and video quality, enabling efficient optimization, quality control, and enhancement in various applications related to image and video processing, compression,
multimedia systems, and content creation.
The technical challenges inherent in blind image and video quality assessment are multifaceted and pose
significant complexities for researchers and practitioners alike. Here are some key challenges:
• Modeling Human Perception. Developing accurate models mimicking human perception is a complex challenge. Various factors, including individual differences, contextual information, and cognitive
processes, influence human perception. Creating models that can capture these intricacies and effectively predict perceived quality is a significant technical challenge.
• Subjectivity and Variability. Assessing quality objectively is difficult due to the subjective nature
of human perception. Individuals may have varying opinions on the quality of the same image or
video. Moreover, the perceived quality can also depend on the context and the observer’s expectations.
Developing objective assessment methods that can handle this inherent subjectivity and variability is
a considerable challenge.
• Reference-Free Assessment. In the absence of reference images or videos, as is often the case in
blind assessment, developing robust quality metrics becomes daunting. The lack of direct comparison
against pristine references demands the formulation of novel methods that rely solely on the intrinsic
characteristics of the content.
• Generalization to Diverse Content. BIQA and BVQA algorithms need to generalize well across
diverse types of images and videos. The models should be robust enough to handle variations in content
genres, resolutions, lighting conditions, and other factors that can impact visual quality. Achieving
generalization across a wide range of content is a significant technical challenge.
• Dataset Limitations. Acquiring high-quality subjective annotations for large-scale datasets is expensive and time-consuming. Developing methods that can achieve meaningful training with limited
ground truth annotations remains a challenge.
• Resource Efficiency. As blind assessment often finds application in resource-constrained environments, such as edge devices, ensuring computational efficiency while maintaining accurate quality
predictions becomes a delicate balance. As edge computing garners increasing attention, BIQA and
BVQA assume a critical role within the entirety of edge computing systems. To cater to the constraints
of limited computational resources and memory space characteristic of edge or mobile devices, there is
a pressing need for lightweight and efficient BIQA and BVQA methods.
• Machine Learning Complexity. Deep learning approaches, while promising, require substantial
computational resources and extensive labeled datasets for training. Adapting such methods to operate
effectively in blind quality assessment scenarios with limited data and computational capabilities is a
significant challenge.
• Real-Time Processing. Providing real-time or near-real-time quality assessment in dynamic scenarios, such as video streaming, necessitates algorithms capable of swift and accurate predictions,
often within stringent computational constraints. Given the exponential growth of visual content, the
demand for scalable BIQA and BVQA techniques becomes imperative. Real-time or near-real-time
processing of large-scale image and video datasets necessitates computationally efficient algorithms
that accommodate the escalating computational requirements.
• Incorporating Temporal Aspects. For BVQA, incorporating temporal aspects, such as motion
and dynamics, is essential. Analyzing the temporal properties of videos and accurately assessing their
quality introduces additional complexities compared to static images. Developing models that can
effectively capture and evaluate temporal aspects of videos is a technical challenge.
Tackling these technical challenges necessitates interdisciplinary collaboration, innovative algorithmic approaches, and a comprehensive understanding of both human perception and computational techniques.
1.2 Prior and Current Status
Research on BIQA has received a lot of attention in recent years. Existing BIQA methods can be categorized
into two types: conventional methods and deep-learning-based (DL-based) methods. Most conventional
methods adopt a standard pipeline: a quality-aware feature extraction followed by a regressor that maps
from the feature space to the quality score space. To give an example, methods based on natural scene
statistics (NSS) analyze the statistical properties of distorted images and compute the distortion degree
as quality-aware features. These quality-aware features can be represented by discrete wavelet transform
(DWT) coefficients [75], discrete cosine transform (DCT) coefficients [86], luminance coefficients in the
spatial domain [72], and so on. Codebook-based methods [109, 114, 115, 124] generate features by extracting
representative codewords from distorted images. After that, a regressor is trained to project from the
feature domain to the quality score domain. Inspired by the success of deep neural networks (DNNs)
in computer vision, researchers have developed DL-based methods to solve the BIQA problem. On the
one hand, the DL-based methods achieve high performance because of their strong feature representation
capability and efficient regression fitting. On the other hand, existing annotated IQA datasets may not have
sufficient content to train large DNN models. Given that collecting large-scale annotated IQA datasets is
expensive and time-consuming and that DL-based BIQA methods tend to overfit the training data from
IQA datasets of limited sizes, it is critical to address the overfitting problem caused by small-scale annotated
IQA datasets. Effective DL-based solutions adopt a large pre-trained model that was trained on other
datasets, e.g. ImageNet [17]. The transferred prior information from a pre-trained model improves the test
performance. Based on that, a further direction [14] is enhancing dataset variability by sampling patches
randomly from quality-annotated images and incorporating saliency-guided approaches to improve accuracy,
reflecting the intrinsic link between perceptual quality prediction and saliency prediction. Nevertheless, it is
challenging to implement a large pre-trained model of high complexity on mobile or edge devices. Moreover,
the conventional saliency predictors often used in saliency-guided DL-based methods do not significantly
enhance perceptual quality predictions.
In terms of BVQA methods, one straightforward BVQA solution is to build it upon BIQA methods. That
is, the application of BIQA methods to a set of keyframes of distorted videos individually. However, directly
applying BIQA followed by frame-score aggregation does not produce satisfactory outcomes because of the
absence of temporal information. Thus, it is essential to incorporate temporal or spatio-temporal information.
Other BVQA methods with handcrafted features [95, 103] were evaluated on synthetic-distortion datasets
with simulated distortions such as transmission and compression. Recently, they were evaluated on authentic-distortion datasets and reported in [73, 87]. Their performance on authentic-distortion datasets is somewhat
limited. Authentic-distortion VQA datasets arise from the user-generated content (UGC) in the real-world
environment. They contain complicated and mixed distortions with highly diverse contents, devices, and
capture conditions. Inspired by the success of DL-based BIQA methods, DL-based BVQA methods have
been developed [130]. To further enhance performance and reduce distributional shifts [65], pre-trained
models on large-scale image datasets, such as the ImageNet [17], are adopted in [56, 101]. Nonetheless, a
parallel challenge faced in BVQA is akin to that encountered in BIQA, where the adoption of extensive
pre-trained models on mobile or edge devices proves to be prohibitively costly.
Overall, the current status of blind image and video quality assessment is characterized by a convergence
of computational techniques, an improved understanding of human perception, and a focus on real-world applicability. The field continues to evolve, addressing challenges posed by new content types, user preferences,
and emerging technologies.
1.3 Contributions of the Research
In the context of BIQA and BVQA, this dissertation presents novel contributions by introducing our proposed
GreenBIQA, GreenBVQA, GreenSaliency, and GSBIQA methods in Chapter 3, Chapter 4, Chapter 5, and
Chapter 6, respectively.
1.3.1 GreenBIQA: A Lightweight and High-performance Blind Image Quality
Assessment Method
As social media content is widely accessed via mobile terminals, it is desirable to conduct BIQA with limited model sizes and computational complexity. A lightweight and high-performance BIQA solution is greatly needed. To address this void, we study the BIQA problem in depth and propose a new solution called
“GreenBIQA”. This work has the following three main contributions.
• Innovative and Lightweight BIQA Methodology. A novel GreenBIQA approach is introduced to
effectively address the intricacies of blind image quality assessment, spanning images marred by both
synthetic and authentic distortions. Characterized by a transparent and modularized architecture, this
methodology establishes a coherent feedforward training pipeline. The pipeline includes unsupervised
representation generation, supervised feature selection, distortion-specific prediction, and regression,
as well as ensembles of prediction scores. A block-level sketch of this pipeline is given after this list.
• Empirical Substantiation via Comprehensive Experiments. To empirically validate the efficacy
of GreenBIQA, a comprehensive set of experiments is conducted across four distinct Image Quality
Assessment (IQA) datasets. It outperforms all conventional BIQA methods and DL-based BIQA methods without pre-trained models in prediction accuracy. As compared to state-of-the-art BIQA methods with pre-trained networks, the prediction performance of GreenBIQA is still quite
competitive while demanding a much smaller model size and significantly lower inference complexity.
• Demonstration of Robustness via Weakly-Supervised Learning. To exemplify the robust
nature inherent in GreenBIQA, a dedicated exploration within the framework of weakly-supervised
learning is undertaken. This line of inquiry subjects the methodology to a series of experiments
wherein the available training sample count is purposefully reduced. The outcomes of these experiments
compellingly underscore the sustained efficacy of GreenBIQA in scenarios characterized by diminished
training data. Furthermore, this study illuminates the strategic utilization of active learning in selecting
images for labeling, further reinforcing the adaptive and robust nature of GreenBIQA.
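The block-level sketch referenced in the first contribution is given below. It is a schematic illustration of the feedforward processing order only: the module names, the toy transform used as a stand-in for unsupervised representation generation, and the averaging ensemble are placeholders rather than the actual GreenBIQA implementation described in Chapter 3, and the distortion-specific prediction step is folded into a single regressor for brevity.

```python
import numpy as np

# Placeholder stubs, each standing in for one GreenBIQA module described in Chapter 3.
def crop_image(image, n_crops=5, size=32):
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    ys = rng.integers(0, h - size, n_crops)
    xs = rng.integers(0, w - size, n_crops)
    return [image[y:y + size, x:x + size] for y, x in zip(ys, xs)]

def unsupervised_representation(crop):            # stand-in: magnitude spectrum as a toy representation
    return np.abs(np.fft.rfft2(crop)).ravel()

def select_features(rep, idx):                    # supervised feature selection (indices chosen offline)
    return rep[idx]

def predict_crop_score(feat, regressor):          # per-crop regression (distortion-specific routing omitted)
    return float(feat @ regressor)

def green_biqa_score(image, idx, regressor):
    crops = crop_image(image)                                    # 1) image cropping
    reps = [unsupervised_representation(c) for c in crops]       # 2) unsupervised representation generation
    feats = [select_features(r, idx) for r in reps]              # 3) supervised feature selection
    scores = [predict_crop_score(f, regressor) for f in feats]   # 4) prediction per crop
    return float(np.mean(scores))                                # 5) decision ensemble over crops

img = np.random.rand(256, 256)
print(green_biqa_score(img, idx=np.arange(16), regressor=np.ones(16) / 16))
```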
1.3.2 GreenBVQA: Blind Video Quality Assessment at the Edge
GreenBIQA does not perform well on VQA datasets since it does not incorporate temporal information.
To address its shortcomings, we propose a lightweight BVQA method and call it GreenBVQA in this work.
GreenBVQA features a smaller model size and lower computational complexity while achieving highly competitive performance against state-of-the-art DL methods. Three main contributions of our proposed GreenBVQA method are shown below.
• Innovative and Lightweight BVQA Methodology. A novel and innovative lightweight BVQA
methodology is introduced. This approach orchestrates the integration of four distinct representation
types, namely spatial, spatio-color, temporal, and spatio-temporal representations. These representations are synergistically harnessed, with each one subsequently channeled through a supervised feature
selection module for dimensionality reduction. Following this reduction process, the chosen features
from each representation are amalgamated to compose the final comprehensive feature set; a minimal sketch of this fuse-after-selection step is given after this list.
• Empirical Validation through Extensive Experiments. A series of rigorous experiments are
executed across three widely adopted VQA datasets, underscoring the merits intrinsic to the proposed
GreenBVQA methodology. Through these empirical trials, our methodology shines by surpassing
conventional BVQA methods in terms of MOS prediction accuracy. Remarkably, GreenBVQA’s performance holds a highly competitive stance against DL-based methodologies, all while offering an
extensively diminished model size, expedited inference times, and reduced computational complexity.
• Illustration of GreenBVQA’s Role within Video-Based Edge Computing. This study introduces a video-based edge computing framework, meticulously illustrating the instrumental role of
GreenBVQA in enhancing a spectrum of video processing tasks at the edge of the network. This
substantial contribution underscores the practical utility of GreenBVQA within a broader context,
elucidating its pivotal contribution to the enhancement of video-related operations in real-world edge
computing scenarios.
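The sketch referenced in the first contribution above illustrates only the fuse-after-selection idea: features are pruned within each of the four representation types and the survivors are concatenated. The correlation-with-MOS criterion, the representation widths, and the synthetic data are assumptions for illustration, not the actual supervised feature selection module of GreenBVQA.

```python
import numpy as np

def select_by_correlation(X, y, k):
    """Keep the k feature dimensions most correlated with the training MOS (stand-in criterion)."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

def fuse_representations(reps, y, k_per_rep=8):
    """Select features within each representation type, then concatenate the survivors."""
    kept = [select_by_correlation(X, y, k_per_rep) for X in reps]
    fused = np.concatenate([X[:, idx] for X, idx in zip(reps, kept)], axis=1)
    return fused, kept

# Synthetic example: 100 training videos, four representation types of different widths.
rng = np.random.default_rng(1)
mos = rng.uniform(1, 5, 100)
reps = [rng.normal(size=(100, d)) for d in (64, 48, 14, 32)]  # spatial, spatio-color, temporal, spatio-temporal
fused, _ = fuse_representations(reps, mos)
print(fused.shape)  # (100, 32)
```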
1.3.3 GreenSaliency: A Lightweight and Efficient Image Saliency Detection
Method
Many existing image saliency detection methods rely on DNNs to achieve good performance. However,
the high computational complexity associated with these approaches impedes their integration with other
modules or deployment on resource-constrained platforms, such as mobile devices. To address this need, we
propose a novel image saliency detection method named GreenSaliency, which has a small model size, minimal
carbon footprint, and low computational complexity. This work has the following two main contributions.
• Innovative and Lightweight Image Saliency Detection Methodology. Introduction of a novel
method for saliency detection termed GreenSaliency, which is characterized by a transparent and
modularized design. This method features a feedforward training pipeline distinct from the utilization
of DNNs. The pipeline encompasses multi-layer hybrid feature extraction and multi-path saliency
prediction.
• Empirical Substantiation via Comprehensive Experiments. Execution of experiments on two
distinct datasets to assess the predictive capabilities of GreenSaliency. The findings illustrate its superior performance compared to conventional methods and its competitive standing against early-stage
DL-based methods. Furthermore, GreenSaliency exhibits efficient prediction performance compared
to state-of-the-art methods employing pre-trained networks on external datasets while necessitating a
significantly smaller model size and reduced inference complexity.
1.3.4 GSBIQA: Green Saliency-guided Blind Image Quality Assessment Method
Blind Image Quality Assessment (BIQA) is an essential task that estimates the perceptual quality of images without reference. While many BIQA methods employ deep neural networks (DNNs) and incorporate
saliency detectors to enhance performance, their large model sizes limit deployment on resource-constrained
devices. To address this challenge, we introduce a novel and non-deep-learning BIQA method with a
lightweight saliency detection module called Green Saliency-guided Blind Image Quality Assessment (GSBIQA). This work represents a significant advancement by integrating and expanding upon the GreenSaliency
and GreenBIQA approaches into a robust and sophisticated BIQA methodology, with three primary contributions:
• Innovative and Lightweight BIQA Methodology. We propose GSBIQA, a novel non-deep-learning BIQA method that features a comprehensive pipeline including steps such as green image
saliency prediction, saliency-guided data cropping, green BIQA feature extraction, local patch prediction, and saliency-guided decision ensemble.
• Empirical Substantiation via Comprehensive Experiments. We validate GSBIQA’s performance through experiments on two authentic IQA datasets. GSBIQA can surpass conventional BIQA
methods and DL-based BIQA methods without pre-trained models in terms of prediction accuracy.
In addition, GSBIQA can compete favorably against state-of-the-art DL-based BIQA methods that
incorporate large pre-trained networks while requiring significantly fewer computational resources and
less model complexity.
• Efficient Integration of GreenSaliency and GreenBIQA. The integration of a learning-based
saliency predictor, whose features are reused in the perceptual quality prediction pipeline, allows for a
thorough merger of these two critical components, enhancing the efficiency and accuracy of the BIQA
process.
1.4 Organization of the Dissertation
The subsequent sections of this dissertation are structured as follows. In Chapter 2, we undertake a comprehensive review of the existing literature on both Blind Image Quality Assessment (BIQA) and Blind Video
Quality Assessment (BVQA) tasks, encompassing conventional methods as well as those employing deep neural networks. Furthermore, we explore relevant research in the domains of image saliency detection, green
learning, and edge computing. Chapter 3 presents our novel and lightweight BIQA approach, aptly named
GreenBIQA, which is designed without recourse to deep neural networks. Chapter 4 delves into the extension of GreenBIQA to GreenBVQA, wherein we introduce hierarchical data cropping and leverage temporal
information for improved BVQA performance. Additionally, this chapter presents a detailed exploration
of an edge computing system incorporating BVQA. In Chapter 5, we introduce and detail a novel image
saliency detection method, termed GreenSaliency, characterized by its compact model size, minimal carbon
footprint, and low computational complexity. Subsequently, Chapter 6 presents a novel non-deep-learning
BIQA method, incorporating a lightweight saliency detection module known as Green Saliency-guided Blind
Image Quality Assessment (GSBIQA). Ultimately, in Chapter 7, we provide concluding remarks and outline
potential avenues for future research directions.
Chapter 2
Research Background
2.1 Blind Image Quality Assessment (BIQA) Methods
Existing BIQA methods can be categorized into two types: conventional methods and deep-learning-based
(DL-based) methods.
2.1.1 Conventional BIQA Methods
Conventional BIQA methods adopt a two-step processing pipeline: 1) extracting quality-aware features from
input images, and 2) using a regression model to predict the quality score based on extracted features. The
support vector regressor (SVR) [3] or the XGBoost regressor [11] is often employed in the second step. Based
on the differences in the first step, we categorize conventional BIQA methods into two main types.
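The two-step pipeline can be written down compactly. The sketch below uses a few simple luminance statistics as stand-in quality-aware features and scikit-learn's SVR as the regressor; the feature extractor is purely illustrative and does not correspond to any of the methods cited in this section.

```python
import numpy as np
from sklearn.svm import SVR

def quality_aware_features(img):
    """Stand-in feature extractor: a handful of simple luminance statistics."""
    img = img.astype(float)
    grad_y, grad_x = np.gradient(img)
    return np.array([img.mean(), img.std(),
                     np.abs(grad_x).mean(), np.abs(grad_y).mean()])

# Step 1: extract features from (synthetic) training images with known MOS labels.
rng = np.random.default_rng(0)
train_imgs = [rng.integers(0, 256, (64, 64)) for _ in range(50)]
train_mos = rng.uniform(1, 5, 50)
X_train = np.stack([quality_aware_features(im) for im in train_imgs])

# Step 2: fit a regressor that maps the feature space to the quality-score space.
regressor = SVR(kernel="rbf", C=1.0).fit(X_train, train_mos)

test_img = rng.integers(0, 256, (64, 64))
print(regressor.predict([quality_aware_features(test_img)]))
```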
2.1.1.1 Natural Scene Statistics (NSS) Methods
The first type relies on natural scene statistics (NSS). These methods predict image quality by evaluating the
distortion of the NSS information. For example, DIIVINE [76] proposed a two-stage framework, including a
classifier to identify different distortion types, which is followed by a distortion-specific quality assessment.
Instead of computing distortion-specific features, NIQE [74] evaluated the quality of distorted images by
computing the distance between the model statistics and those of distorted images. BRISQUE [72] used
NSS to quantify the loss of naturalness caused by distortions, operating in the spatial domain with low
complexity. BLINDS-II [86] proposed an NSS model using the discrete cosine transform (DCT) coefficients
and then adopted the Bayesian inference approach to predict image quality using features extracted from
the model. NBIQA [78] developed a refined NSS model by collecting competitive features from existing
NSS models in both spatial and transform domains. Histogram counting and the Weibull distribution were
employed in [111] and [125], respectively, to analyze the statistical information and build the distribution
models. Although the above-mentioned methods utilized NSS information in a wide variety of ways, they are still not powerful enough to handle a broad range of distortion types, especially for datasets with authentic distortions.
2.1.1.2 Codebook-based Methods
The second type extracts representative codewords from distorted images. The common framework of
codebook-based methods includes local feature extraction, codebook construction, feature encoding, spatial
pooling, and quality regression. CBIQ [114] constructed visual codebooks from training images by quantizing
features, computed the codeword histogram, and fed the histogram data to the regressor. Following the same
framework, CORNIA [115] extracted image patches from unlabeled images as features, built a codebook (or
a dictionary) based on clustering, converted an image into a set of non-linear features, and trained a linear
support vector machine to map the encoded quality-aware features to quality scores. Non-linear features in
this pipeline were obtained from the dictionary using soft-assignment coding with spatial pooling. However,
the codebook needs a large number of codewords to achieve good performance. The high order statistics
aggregation (HOSA) was exploited in [109] to design a codebook of a smaller size. That is, besides the mean
of each cluster, the high-order statistical information (e.g., dimension-wise variance and skewness) inside
each cluster can be aggregated to reduce the codebook size. Generally speaking, codebook-based methods
rely on high-dimensional handcrafted feature vectors, which are ineffective in handling diversified distortion
types.
2.1.2 Deep-Learning-based (DL-based) BIQA Methods
DL-based methods have been intensively studied to solve the BIQA problem. A solution based on the
convolutional neural network (CNN) was first proposed in [40]. It includes one convolutional layer with max
and min pooling and two fully connected layers. To alleviate the accuracy discrepancy between FR-IQA and
NR-IQA, a local quality map was derived using CNN to imitate the behaviors of FR-IQA in BIECON [42].
Then, a statistical pooling strategy is adopted to capture the holistic properties and generate fixed-size
feature vectors. A DNN model was proposed in WaDIQaM [6] by including ten convolutional layers as well
as five pooling layers for feature extraction and two fully connected layers for regression. MEON [67] proposed
two sub-networks to achieve better performance on synthetic datasets. The first sub-network classifies the
distortion types, while the second sub-network predicts the final quality. By sharing their earlier layers, the
two sub-networks can solve their sub-tasks jointly for better performance.
Quality assessment of images with authentic (i.e., real-world) distortions is challenging due to mixed
distortion types and high content variety. Recent DL-based methods all adopt advanced DNNs. Feature
extraction using a pre-trained ResNet [27] was adopted in [119]. A probabilistic quality representation was
proposed in PQR [120], which employed a more robust and optimal loss function to describe the score distribution generated by different subjects. It improved the accuracy of quality prediction and sped up
the training process. A self-adaptive hyper network architecture was utilized by HyperIQA [97] to adjust
the quality prediction parameters. It can handle a broad range of distortions with a local distortion-aware
module and deal with wide content variety with perceptual quality patterns based on recognized content
adaptively. DBCNN [130] adopted DNN models pre-trained by large datasets to facilitate quality prediction
on both synthetic and authentic datasets. A network pre-trained by synthetic-distortion datasets was used
to classify distortion types and levels. Another network, pre-trained on ImageNet [17], was used as the second feature extractor. The two feature sets from the two models were integrated into one representation for final quality prediction through bilinear pooling. The absence of the ground-truth reference was compensated
in Hallucinated-IQA [61], which generated a hallucinated reference using generative adversarial networks
(GANs) [25].
Instead of predicting the mean opinion score (MOS) generated by subjects, NIMA [99] predicted the
MOS distribution using a CNN. To balance the trade-off between prediction accuracy and the number of model parameters, NIMA had three models with different architectures, namely, VGG16 [93], Inception-v2 [98], and MobileNet [31]. NIMA (VGG16) gave the best performance but with the longest inference time
and the largest model size. NIMA (MobileNet) was the smallest one with the fewest model parameters but
the worst accuracy. Although NIMA (MobileNet) has a small model size, it is still difficult to deploy it on
mobile/edge devices.
2.1.3 Saliency-guided BIQA Methods
In the early stage of image saliency detection research, conventional approaches primarily employed a bottom-up methodology. For instance, the study referenced in [34] developed a visual attention system inspired
by the behavioral and neuronal paradigms observed in early primate visual systems. Another significant
contribution in this domain is the Graph-Based Visual Saliency (GBVS) model [26], which operates through
a two-stage process: initially, it generates activation maps across various feature channels, which are then
normalized to highlight salient features, followed by their integration with other maps. The SDSP model [123]
integrates three basic priors to identify salient regions within images. The first prior suggests that band-pass
filtering can emulate the human visual system’s response to salient objects. The second prior draws on the finding that human attention is typically drawn toward the center of an image. The third prior indicates a
preference for warmer colors over cooler ones in attracting human attention. In recent years, the advent
of DL-based methods for saliency detection [62] has resulted in substantial improvements attributed to the
sophisticated feature representation capabilities of CNNs. However, the substantial model sizes of DL-based
methods often make them impractical for integration into resource-constrained systems. To mitigate this,
GreenSaliency [70] offers a lightweight and efficient solution for image saliency detection that adheres to
green learning principles, aiming to reduce the environmental impact of computing technologies.
Image saliency detectors aim to elucidate the mechanisms underlying human observation, a goal that
aligns closely with the objectives of IQA, particularly in the context of BIQA. Consequently, some researchers have incorporated saliency detection techniques within BIQA frameworks to improve their efficacy.
In [30], a saliency-guided deep learning framework for IQA is introduced, consisting of three main components: saliency-guided feature learning and extraction, a deep learning architecture for classification, and
quality pooling to accommodate multiple quality descriptions. Within DL-based frameworks, saliency models enhance data pre-processing, such as in [14], where these models guide data cropping by identifying
and selecting patches containing salient objects from the original images. Furthermore, saliency detectors
are integrated into the data cropping and decision fusion processes, as demonstrated in [92]. Rather than
merely applying saliency models as either a pre-processing or post-processing tool, [112] introduces SGDNet.
This network is built on an end-to-end multi-task learning framework that addresses image saliency prediction and quality assessment. These tasks share a common feature extractor, optimizing both concurrently.
SCVS [35] employs a two-stage approach using two similar CNNs, enhancing the BIQA process. The first
CNN calculates the weighted average of the FR-IQA scores for individual patches and the differential mean
opinion scores for the entire image as the target output. The second CNN assesses the interactions between
adjacent patches using a Gaussian function, incorporating visual saliency and spatial interactions to improve
the assessment accuracy.
2.2 Blind Video Quality Assessment (BVQA) Methods
Quite a few BVQA methods have been proposed in the last two decades. Existing work can be classified
into conventional and DL-based methods.
2.2.1 Conventional BVQA Methods
Conventional BVQA methods extract quality-related features from input images using an ad hoc approach.
Then, a regression model (e.g., Support Vector Regression (SVR) [3] or XGBoost [11]) is trained to predict
the quality score using these handcrafted features. One family of methods is built upon the Natural Scene
Statistics (NSS). NSS-based BIQA methods [72, 76] can be extended to NSS-based BVQA methods since
videos are formed by multiple image frames. For example, V-BLIINDS [85] is an extension of a BIQA method
by incorporating a temporal model with motion coherency. Spatio-temporal NSS can be derived from a joint
spatio-temporal domain. For instance, the method in [57] conducts the 3D discrete cosine transform (DCT)
and captures the spatio-temporal NSS of 3D-DCT coefficients. The spatio-temporal statistics of mean-subtracted and contrast-normalized (MSCN) coefficients of natural videos are investigated in [16], where an asymmetric generalized Gaussian distribution (AGGD) is adopted to model the statistics of both 3D-MSCN coefficients and bandpass filter coefficients of natural videos. Another family of BVQA methods, called
codebook-based BVQA, is inspired by CORNIA [110], [115]. They first obtain frame-level quality scores using
unsupervised frame-based feature extraction and supervised regression. Then, they adopt temporal pooling
to derive the target video quality. A two-level feature extraction mechanism is employed by TLVQM [43],
where high- and low-level features are extracted from the whole sequence and a subset of representative
sequences, respectively.
2.2.2 Deep-Learning-based (DL-based) BVQA Methods
DL-based BVQA methods have been investigated recently. They offer state-of-the-art prediction performance. Inspired by MEON [67], V-MEON [64] provides an end-to-end learning framework by combining
feature extraction and regression into a single stage. It adopts a two-step training strategy: one for codec
classification and the other for quality score prediction. COME [105] adopts CNNs to extract spatial features
and uses motion statistics as temporal features. Then, it exploits a multi-regression model, including two
types of SVR, to predict the final score of videos. A mixed neural network is derived in [118]. It uses a 3D
convolutional neural network (3D-CNN) and a Long-Short-Term Memory (LSTM) network as the feature
extractor and the quality predictor, respectively. Following the VSFA method [55], MDTVSFA [56] adopts
a unified BVQA framework with a mixed-dataset training strategy to achieve better prediction performance on authentic-distortion video datasets. PVQ [116] adopts a local-to-global region-based BVQA architecture, which is trained with different kinds of patches. Also, a large authentic-distortion video dataset
is built and reported in [116]. QSA-VQM [1] uses two CNNs to extract quality attributes and semantic
content of each video frame, respectively, and one Recurrent Neural Network (RNN) to estimate the quality
score of the whole video by incorporating temporal information. To address the diverse range of natural
temporal and spatial distortions commonly observed in user-generated-content datasets, CNN-TLVQM [44]
integrates spatial features obtained from a CNN model and handcrafted statistical temporal features obtained via TLVQM. The CNN model was originally trained for image quality assessment using a transfer
learning technique.
2.3 Image Saliency Detection Methods
2.3.1 Conventional Image Saliency Detection Methods
In the early stage of image saliency detection research, conventional methodologies predominantly adhered to
a bottom-up approach. For example, in [34], a visual attention system was inspired by early primate visual
systems’ behavioral and neuronal paradigms. This system combines multi-scale image features into a unified
topographical saliency map, subsequently utilized by a dynamical neural network to sequentially prioritize
attended locations based on decreasing saliency levels. This process, informed by inputs from early visual
processes, effectively simulates bottom-up attention mechanisms observed in primates. Another prominent
bottom-up model is the Graph-Based Visual Saliency (GBVS) framework [26], comprising two sequential steps: 1) the formation of activation maps on specific feature channels and their normalization to accentuate salient features, and 2) their combination with other maps. Conversely, SUN [127] introduces
a Bayesian framework where bottom-up saliency naturally emerges as the self-information of visual features.
In contrast, overall saliency, incorporating top-down and bottom-up influences, arises as the pointwise mutual
information between features and the target during target search tasks. Furthermore, [41] proposes an image
saliency model derived directly from human eye movement data, characterized by a nonlinear mapping from
image patches to real values. This model is trained to produce positive outputs for fixated regions and
negative outputs for randomly selected image patches. In conclusion, while conventional saliency detection
methods are noted for their efficiency, they predominantly focus on evaluating contrast based on low-level
features. These methods often lack supervision from human feedback or subjective datasets, which limits
their effectiveness. This lack of sophisticated feature analysis and the absence of adaptive learning from
large-scale annotated data restrict their applicability, especially in scenarios requiring an understanding of
visual content.
2.3.2 DL-based Image Saliency Detection Methods
In recent years, deep-learning-based methodologies for saliency detection [15, 79] have achieved notable
performance owing to the advanced feature representation capabilities inherent in CNNs. An early work in
this domain was eDN [104], which introduced hierarchical feature learning principles to
visual saliency. Additionally, [79] introduced the concept of training a shallow CNN architecture from scratch,
marking the inception of end-to-end CNN models tailored explicitly for image saliency detection tasks.
However, a significant impediment encountered in leveraging deep learning frameworks for image saliency
detection lies in the scarcity of available data, primarily stemming from the laborious and cost-intensive
nature of collecting fixation data. Furthermore, the sensitivity of image saliency to image transformations
poses another challenge, significantly constraining potential data augmentations [8]. Consequently, transfer
learning has emerged as a pivotal strategy for refining image saliency detection performance. Inspired by
the remarkable success of deep convolutional networks in classification tasks, particularly exemplified by
benchmarks like ImageNet, most high-performing saliency models have embraced transfer learning, typically
relying on pre-trained models on ImageNet as the foundation. Noteworthy milestones in this trajectory
include the pioneering work of DeepGaze I [45], which has since evolved into DeepGaze II [47], leveraging
the VGG19 architecture. Following this paradigm, DeepGaze IIE [62] was proposed, demonstrating the
attainment of robust confidence calibration on unseen datasets through the principled amalgamation of
multiple backbone architectures. Additionally, the Saliency Attentive Model (SAM) [15] introduced a novel
Attentive Convolutional Long Short-Term Memory (LSTM) mechanism, sequentially directing attention to
diverse spatial regions within a feature stack to enhance image saliency detection. EML-NET [36] introduced
a scalable approach for integrating multiple deep convolutional networks of varying complexities as encoders
for saliency-relevant features. Meanwhile, UNISAL [20] unified saliency prediction across both image and
video modalities, leveraging the entirety of available saliency prediction datasets to enrich analysis outcomes.
UNISAL adopted MobileNet-V2 [89], a model with lightweight and efficient architecture, as its backbone
encoder for simplicity and reduced computational demands.
Various models have been developed employing intricate, deep architectures or extending existing ones
that have demonstrated efficacy in other domains. Nonetheless, all these models have uniformly embraced
transfer learning, initializing their architectures by pre-training them on outside datasets. For instance,
SalGAN [80] introduces a generative adversarial model tailored for image saliency detection, comprising two
interconnected networks: one responsible for generating saliency maps from the raw pixel data of input images, while the other discriminates between predicted saliency maps and ground truth. Similarly,
GazeGAN [8] leverages a modified U-Net architecture as its generator, amalgamating classic “skip connections” with a novel “center-surround connection” (CSC) module. Departing from conventional feedforward
architectures for saliency prediction, FBNet [18] integrates feedback convolutional connections to establish
links between high-level blocks and low-level layers, thereby enhancing feature representation. Subsequently,
SalFBNet [19] introduces a lightweight feedback-recursive convolutional framework that enhances saliency
detection by integrating feature pathways from high-level blocks to low-level layers. Additionally, [19] pioneers the creation of a large-scale Pseudo-Saliency dataset, addressing concerns regarding data scarcity in image
saliency detection.
Despite the high performance achieved by DL-based methods in saliency detection, many of them rely
on models pre-trained on extensive external datasets. Integrating these large pre-trained models into mobile
or edge devices presents significant economic challenges due to their considerable computational demands
and substantial model sizes. Additionally, the methods that require training or fine-tuning of models with a
large number of parameters entail considerable offline costs. These factors can severely limit the deployment
and scalability of DL-based saliency detection methods in resource-constrained environments.
2.4 Green Machine Learning
Green learning [50] has been proposed recently as an alternative machine learning paradigm that targets
efficient models of low carbon footprint. It represents an emerging paradigm within the realm of machine
learning and artificial intelligence. This approach places a strong emphasis on developing algorithms, models,
and techniques that are not only proficient in their respective tasks but also considerate of the ecological
impact associated with their operations. Green learning is characterized by small model sizes and low training
and inference computational complexities. An additional advantage is its mathematical transparency through
a modularized design principle. Green learning originated through efforts in understanding the functions
of various components of CNNs, such as nonlinear activation [48], convolutional layers and fully-connected
layers [51]. Since 2020, its development path has deviated from neural networks by giving up the basic neuron unit and the network architecture. Examples of green learning models include PixelHop [12]
and PixelHop++ [13] for object classification and PointHop [129] and PointHop++ [128] for 3D point
cloud classification. Green learning techniques have been developed for many applications, such as deepfake
detection [9], anomaly detection [122], image generation [53], etc. Incorporating principles of green learning
can lead to the development of more environmentally conscious artificial intelligence systems. By reducing
energy consumption, optimizing resources, and advancing sustainable practices, green learning seeks to
harness the power of machine learning while minimizing its ecological footprint. The GreenBIQA and GreenBVQA methods introduced in this dissertation are developed by following the path of green learning.
2.5 Video Quality Assessment at the Edge
Machine learning models have been extensively deployed on edge devices [82]. Several video analytics tasks
are implemented in the edge computing platform [108]. Besides prediction accuracy, important metrics to be
considered at the edge include latency, computational complexity, memory size, etc. The remarkable success
achieved by DL in various domains, such as computer vision and natural language processing, has inspired the
application of DL to edge devices (say, smartphones and IoT sensors) [10]. They often generate a significant
amount of data that demands local processing due to the high data communication cost. Edge computing
has emerged in video analytics in recent years. One objective is to optimize the tradeoff between accuracy
and cost [108]. For example, VideoEdge [32] is an edge-based video analysis tool that is implemented in a
distributed cloud-edge architecture comprising edge nodes and the cloud.
Given heavy video traffic on mobile or edge networks, there is a growing demand for higher transmission
rates and lower network latency. In order to tackle these challenges, adaptive bitrate (ABR) technologies
[4, 117] are commonly used in video distribution. The advantage of ABR lies in its ability to reduce the
occurrence of choppy videos while enhancing the user’s quality of experience (QoE). Despite the application of
the ABR algorithm in some scenarios involving full-reference assessments, one scenario that can benefit from
BVQA is bitrate adaptation for HTTP adaptive streaming (HAS) [4]. As a pull-based delivery system, the
bitrate adaptation algorithms in HAS are chiefly executed at each client (i.e., in a distributed manner). Acting
as an objective QoE metric for bitrate guidance, a BVQA algorithm can serve either after a video session
ends to evaluate the performance of the ABR algorithm or during playback to guide the ABR algorithm
in selecting the most suitable bitrate demand for the next video segments. A blind QoE assessment metric
was proposed in [54] to provide QoE feedback of high consistency with the human visual system (HVS)
at low latency and without reference, indicating that the metric can provide real-time guidance to ABR
algorithms and improve the overall HAS performance. As stated in [21], a better understanding of human
perceptual experience and behavior is the most dominating factor in improving the performance of ABR
algorithms. Since reference videos are not available on edge devices, BVQA methods become the only option.
State-of-the-art BVQA methods rely on large pre-trained models and exhibit high computational complexity,
making them impractical for deployment on edge devices. Lightweight BVQA methods are urgently needed
to address this void. Along this line, Agarla et al. [2] propose an efficient BVQA method built on two lightweight pre-trained MobileNets [31], albeit with certain limitations, such as degraded prediction accuracy.
Chapter 3
GreenBIQA: A Lightweight High-Performance Blind Image
Quality Assessment
3.1 Introduction
Objective image quality assessment (IQA) can be classified into three categories: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA). FR-IQA methods evaluate the
quality of images by comparing distorted images with their reference images. Quite a few image quality
metrics, such as PSNR, SSIM [106], FSIM [126], and MMF [63] have been proposed in the last two decades.
RR-IQA methods (e.g., RR-SSIM [83]) utilize part of the information from reference images to evaluate
the quality of underlying images. RR-IQA is more flexible than FR-IQA. NR-IQA, also called blind image
quality assessment (BIQA), is needed in two scenarios. First, reference images may not be available to users
(e.g., at the receiver). Second, most user-generated images do not have references. The need for BIQA grows
rapidly due to the popularity of social media platforms and multi-party video conferencing.
Research in the field of Blind Image Quality Assessment (BIQA) has garnered substantial attention in
recent years. The existing landscape of BIQA methods can be broadly categorized into two distinct classes:
conventional methodologies and deep learning-based (DL-based) approaches. Conventional techniques adhere
to a standard pipeline involving quality-aware feature extraction followed by a regressor responsible for
mapping the feature space to the quality score space. For instance, methods based on natural scene statistics
(NSS) analyze statistical attributes of distorted images, subsequently deriving quality-aware features. These
features can manifest through discrete wavelet transform (DWT) coefficients [75], discrete cosine transform
(DCT) coefficients [86], or luminance coefficients in the spatial domain [72], and so on. Codebook-based
methods [109, 114, 115, 124] generate features by extracting representative codewords from distorted images.
These features are then subjected to a regressor that establishes the mapping to quality scores.
Capitalizing on the transformative success of deep neural networks (DNNs) within the realm of computer
vision, researchers have endeavored to harness DL-based methodologies to address the BIQA challenge. These
DL-based methods wield the advantages of robust feature representation capabilities and effective regression
fitting, culminating in superior performance. However, the limitation arises from the scarcity of annotated
Image Quality Assessment (IQA) datasets that are often insufficient for training expansive DNN models.
Given the resource-intensive nature of collecting substantial annotated IQA datasets and the propensity of
DL-based BIQA approaches to overfit such limited data, counteracting the overfitting phenomenon becomes
paramount. To mitigate these challenges, DL-based solutions harness pre-trained models that have been
cultivated on diverse datasets, like ImageNet [17]. While transferring prior knowledge from these models
enhances performance, implementing intricate large-scale pre-trained models on mobile or edge devices proves
challenging. Considering the widespread consumption of social media content through mobile terminals, the
imperative to conduct BIQA efficiently within the confines of constrained model sizes and computational
complexity emerges.
In light of this pressing need, we embark on an in-depth exploration of the BIQA domain, culminating
in the formulation of a novel method christened “GreenBIQA”. This innovative approach aims to bridge
the gap by delivering a lightweight yet high-performance solution to the BIQA problem, catering to the
demands of efficient quality assessment within the paradigm of mobile and edge computing. This work has
the following three main contributions.
• A novel GreenBIQA method is proposed for images with synthetic and real-world (or authentic) distortions. It offers a transparent and modularized design with a feedforward training pipeline. The pipeline
includes unsupervised representation generation, supervised feature selection, distortion-specific prediction, regression, and ensembles of prediction scores.
• We conduct experiments on four IQA datasets to demonstrate the prediction performance of GreenBIQA. It outperforms all conventional BIQA methods and DL-based BIQA methods without pre-trained models in prediction accuracy. As compared to state-of-the-art BIQA methods with pre-trained networks, GreenBIQA remains quite competitive in prediction accuracy while demanding a much smaller model size and significantly lower inference complexity.
• We carry out experiments under the weakly-supervised learning setting to demonstrate the robust
performance of GreenBIQA as the number of training samples decreases. Also, we show how to exploit
active learning in selecting images for labeling.
3.2 Methodology
An overview of the proposed GreenBIQA method is depicted in Fig. 3.1. As shown in the figure, GreenBIQA
has a modularized solution that consists of five modules: (1) image cropping, (2) unsupervised representation
generation, (3) supervised feature selection, (4) distortion-specific prediction, and (5) regression and decision
ensemble. They are elaborated below.
Figure 3.1: An overview of the proposed GreenBIQA method.
3.2.1 Image Cropping
Image cropping is implemented to standardize the input size and enlarge the number of training samples.
It is achieved by cropping sub-images of fixed size from raw images in datasets. All cropped sub-images
are assigned the same mean opinion score (MOS) as their source image. To ensure the high correlation
Figure 3.2: An example of five cropped sub-images for authentic-distortion datasets.
Figure 3.3: An example of nine cropped sub-images for synthetic-distortion datasets.
between sub-images and their assigned MOS, we adopt different cropping strategies for synthetic-distortion
and authentic-distortion datasets, as shown in Fig. 3.2 and Fig. 3.3, respectively.
For images in authentic-distortion datasets such as KonIQ-10K [29], they contain distortions in unknown
regions. Thus, we crop a smaller number of sub-images of a larger size (e.g., 256 × 256 out of 384 × 512) to
ensure the assigned MOS for each sub-image is reasonable. The cropped sub-images can overlap with one
another. Fig. 3.2 shows five randomly cropped sub-images from one source image.
For images in synthetic-distortion datasets such as KADID-10K [59], all distortions are applied to the
reference images uniformly with few exceptions (e.g., color distortion in localized regions in KADID-10K).
(a) Generation of spatial representations (b) Generation of joint spatio-color representations
Figure 3.4: Unsupervised representation generation: spatial representations and joint spatio-color representations.
Only one distortion type is added to one image at a time. Therefore, cropping sub-images of a smaller size
is sufficient to capture distortion characteristics. Furthermore, we can crop more sub-images to enlarge the
number of training samples and conduct decision ensembles in the inference stage. An example of image
cropping from the KADID-10K dataset is shown in Fig. 3.3, where nine sub-images of the size of 64 × 64
are randomly selected.
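The cropping step can be summarized with a short sketch. This is a minimal illustration, assuming sub-images are sampled uniformly at random from a NumPy image array; the function name and the use of NumPy's default random generator are our own illustrative choices rather than part of a released implementation.

```python
import numpy as np

def crop_subimages(img, num_crops, crop_size, rng=None):
    """Randomly crop fixed-size sub-images; each inherits the MOS of its source image."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    crops = []
    for _ in range(num_crops):
        top = int(rng.integers(0, h - crop_size + 1))
        left = int(rng.integers(0, w - crop_size + 1))
        crops.append(img[top:top + crop_size, left:left + crop_size])
    return crops

# Authentic data: a few large crops (e.g., 5 crops of 256x256 from a 384x512 image).
# Synthetic data: more, smaller crops (e.g., 9 crops of 64x64), since distortions are uniform.
```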
3.2.2 Unsupervised Representation Generation
Given sub-images from the image cropping module, we extract a set of representations from sub-images in
an unsupervised manner. We consider two types of representations.
1. Spatial representations. They are extracted from the Y, U, and V channels of sub-images individually.
2. Joint spatio-color representations. They are extracted from a 3D cuboid of size H × W × C, where H
and W are the height and width of a sub-image and C = 3 is the number of color channels, respectively.
3.2.2.1 Spatial Representations
Fig. 3.4 (a) shows the procedure of spatial representation generation. The representations are derived from
8 × 8 block DCT coefficients since they are often available in compressed images. The input sub-images
are first partitioned into non-overlapping blocks of size 8 × 8, and DCT coefficients are generated by the
Figure 3.5: RFT results of spatial and spatio-color representations.
block DCT transform. DCT coefficients of each block are scanned in the zigzag order, leading to one DC
coefficient and 63 AC coefficients, denoted by AC1-AC63. We split them into 64 channels. Generally, the
amount of energy decreases from the DC channel to the AC63 channel. There are correlations among DC
coefficients of spatially adjacent blocks. We apply the Saab transform [51] to them. The Saab transform
uses a constant-element kernel to compute the patch mean, which is called the DC component of the Saab
transform. Then, it applies the principal component analysis (PCA) to mean-removed patches in deriving
data-driven kernels, called AC kernels. The application of AC kernels to each patch yields AC coefficients
of the Saab transform. Here, we decorrelate DC coefficients in two stages, i.e., Hop1 and Hop2.
• Hop1 Processing: We partition 32 × 32 DC coefficients into non-overlapping blocks of size 4 × 4 and
conduct the Saab transform on each block, leading to one DC channel and 15 AC channels in Hop1.
We feed the 8 × 8 DC coefficients to the next hop.
• Hop2 Processing: We apply another 4 × 4 Saab transform on each of the non-overlapping blocks of size 4 × 4, leading to one DC channel and 15 AC channels in Hop2. We collect all the representations from Hop2 and
append them to the final representation set to preserve low-frequency details.
Other Saab coefficients in Hop1 and other DCT coefficients at the top layer contain mid- and high-frequency
information. We need to aggregate them spatially to reduce the representation number. First, we take their
absolute values and apply maximum pooling to lower their dimension, as indicated by the downward gray arrow in Fig. 3.4(a). Next, we adopt the following operations to yield two sets of values:
• Compute the maximum value, the mean value, and the standard deviation of the same coefficients
across the spatial domain.
• Conduct the PCA transform on spatially adjacent regions for further dimension reduction (except the coefficients in Hop2).
These values are concatenated to form spatial representations of interest. The same process is applied to
the Y, U, and V channels of all sub-images.
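To make the Saab operation concrete, a minimal single-stage sketch is given below. It is only a toy illustration: the function name is ours, and, unlike the actual pipeline in which the AC kernels are learned once from training patches and then reused, this version derives the kernels from the input map itself for brevity.

```python
import numpy as np

def saab_2d(coeff_map, block=4):
    """One-stage 2D Saab transform on non-overlapping block x block patches.

    The DC kernel is the (constant) patch mean; AC kernels are PCA components
    of the mean-removed patches.  In GreenBIQA the kernels would be learned
    from training data; here they come from the input map for brevity.
    """
    h, w = coeff_map.shape
    patches = coeff_map.reshape(h // block, block, w // block, block)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, block * block)

    dc = patches.mean(axis=1, keepdims=True)            # DC channel (patch mean)
    residual = patches - dc                             # mean-removed patches
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    ac = residual @ vt[: block * block - 1].T           # AC1..AC15 channels
    return np.hstack([dc, ac]).reshape(h // block, w // block, block * block)

# Hop1: saab_2d on the 32x32 DC map -> 8x8x16; Hop2: saab_2d on its 8x8 DC map.
```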
Figure 3.6: Distortion-specific classifier and distortion clustering for synthetic and authentic datasets, respectively.
3.2.2.2 Joint Spatio-Color Representations
We first convert sub-images from the YUV to RGB color space. The corresponding spatio-color cuboids
have a size of H × W × C, where H and W are the height and width of the sub-image, respectively, and
C = 3 is the number of color channels. They serve as input cuboids to a two-hop hierarchical structure, as
Figure 3.7: Six synthetic distortion types in CSIQ: (a) Gaussian blur, (b) Gaussian noise, (c) Contrast
decrements, (d) Pink Gaussian noise, (e) JPEG, and (f) JPEG-2000.
shown in Fig. 3.4 (b). In Hop1, we split the input cuboids into non-overlapping cuboids of size 4 × 4 × 3 and apply the 3D Saab transform to them individually, leading to one DC channel and 47 AC channels, denoted
by AC1-AC47. Each channel has a spatial dimension of 64 × 64. Since the DC coefficients are spatially
correlated, we apply the 2D Saab transform in Hop2, where the DC channel of size 64 × 64 is decomposed
into 16×16 non-overlapping blocks of size 4×4. For other 47 AC coefficients in the output of Hop1, we take
their absolute values and conduct the 4x4 max pooling, leading to 47 channels of spatial dimension 16 × 16.
In total, we obtain 16 + 47 = 63 channels of the same spatial size 16 × 16. We use the following two steps
to extract joint spatio-color features.
• Flatten blocks to vectors, conduct PCA, and select coefficients from the first N principal components.
• Compute the standard deviation of the coefficients in the same channel.
The above two sets of representations are concatenated to form the joint spatio-color representations.
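The two aggregation steps above can be sketched as follows. This is one plausible reading under simplifying assumptions: the PCA basis is assumed to be learned offline from flattened training channel stacks, and the function names and the component count are illustrative rather than taken from a released implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_channel_pca(train_stacks, n_components=8):
    """Learn a PCA basis offline from flattened training channel stacks."""
    flat = np.stack(train_stacks).reshape(len(train_stacks), -1)
    return PCA(n_components=n_components).fit(flat)

def spatio_color_features(channel_stack, pca):
    """channel_stack: (63, 16, 16) Saab channels of one sub-image.

    Combines (1) the leading principal-component coefficients of the flattened
    maps and (2) the per-channel standard deviations, mirroring the two
    aggregation steps described above.
    """
    pcs = pca.transform(channel_stack.reshape(1, -1)).ravel()
    stds = channel_stack.reshape(channel_stack.shape[0], -1).std(axis=1)
    return np.concatenate([pcs, stds])
```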
Figure 3.8: Three distorted images in KonIQ-10k: (a) Dark environment, (b) Underwater, and (c) Smeared
light.
3.2.3 Supervised Feature Selection
It is desired to select more discriminant features from a large number of representations obtained from the
second module. A powerful tool, called the relevant feature test (RFT) [113], is adopted to achieve this
objective. It computes the loss of each representation independently. A lower loss value indicates a better
representation. To conduct RFT, we split the dynamic range of a representation into two sub-intervals at each candidate point drawn from a set of partition points. For a given partition, we first calculate the means of training samples in
their left and right regions, respectively, as the representative values and compute their mean-squared errors
(MSE) accordingly. Then, by combining the MSE values of both regions together, we get the weighted
MSE for the partition. Next, we search for the smallest weighted MSE over the set of partition points,
and this minimum value defines the cost function of this representation. Note that RFT is a supervised
feature selection algorithm since it exploits the label of the training samples. We sort representation indices
according to their root-MSE (RMSE) values from the smallest to the largest in Fig. 3.5. There are two
curves, one for the spatial representations and the other for the spatio-color representations. We can use the
elbow point on each curve to select a subset of representations. In the experiment, we use RFT to select
2048D spatial features and 2000D spatio-color features. The former is a concatenation of spatial features
from Y, U, and V channels.
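A compact sketch of the RFT cost for a single feature dimension is shown below; the function name and the use of uniformly spaced candidate partition points (and their number) are our own simplifications.

```python
import numpy as np

def rft_cost(feature, target, num_bins=16):
    """Relevant Feature Test cost for one 1-D feature (lower is better).

    Scans candidate partition points over the feature's dynamic range; for each
    point, approximates the targets in the left/right regions by their means and
    records the sample-weighted MSE.  The minimum over all partitions is returned
    as an RMSE, matching the curves in Fig. 3.5.
    """
    thresholds = np.linspace(feature.min(), feature.max(), num_bins + 1)[1:-1]
    best = np.inf
    for t in thresholds:
        left, right = target[feature <= t], target[feature > t]
        if len(left) == 0 or len(right) == 0:
            continue
        wmse = (len(left) * left.var() + len(right) * right.var()) / len(target)
        best = min(best, wmse)
    return np.sqrt(best)

# Feature selection: compute rft_cost per representation dimension, sort the
# costs in ascending order, and keep the dimensions before the elbow point.
```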
3.2.4 Distortion-specific Prediction
Usually, we get better prediction scores if we can classify distorted images into several classes based on their
distortion types. We examine synthetic-distortion and authentic-distortion datasets separately as shown in
Fig. 3.6 due to their different properties.
3.2.4.1 Synthetic Distortions
Images in synthetic-distortion datasets are usually associated with one specific distortion type with multiple
severity levels. For example, CSIQ [52] has 6 distortion types with 4 to 5 different levels, as shown in Fig.
3.7. We can leverage the known distortion types by first training a distortion classifier to separate images
accordingly. Then, we design an individual pipeline to handle each distortion type. We can use distortion
labels of training images to train a multi-class distortion classifier based on the selected features in Sec. 3.2.3.
There are multiple sub-images from one image, and each of them may have a different predicted distortion
type. We adopt majority voting to determine the image-level distortion type. Note that some distortion
types are easily confused with each other (e.g., JPEG and JPEG2000). We can simply merge them into a
single type. As a result, the class number can be reduced.
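The image-level decision reduces to a simple majority vote over sub-image predictions, as in the minimal sketch below (the function name is illustrative).

```python
from collections import Counter

def image_distortion_type(subimage_labels):
    """Majority vote over the predicted distortion types of all sub-images."""
    return Counter(subimage_labels).most_common(1)[0][0]

# e.g., image_distortion_type(["JPEG", "JPEG", "JP2K", "JPEG", "GB"]) -> "JPEG"
```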
3.2.4.2 Authentic Distortions
Images from authentic-distortion datasets may contain mixed distortion types introduced in image capture
or transmission. Three distorted images from KonIQ-10K are shown in Fig. 3.8. It is difficult to define each
as one specific type. For example, the underwater image contains blurriness, noise, and color distortion.
Thus, instead of training a distortion-specific classifier, we cluster images into multiple groups using some
low-level features in an unsupervised manner (e.g., the K-means algorithm). The low-level features include
statistical information in the spatial and color domains. For spatial features, we apply the Laplacian and
Sobel edge filters to all pixels in each sub-image, take their absolute values, and compute the mean, variance,
and maximum. For color features, we compute the variance of each color channel (such as Y, U, and V).
In addition, higher-order statistics can also be collected as color features. All these extracted features are
concatenated into a feature vector for unsupervised clustering. Although unsupervised clustering does not
assign a distortion type to a cluster, it reduces the content diversity of sub-images in the same cluster.
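A minimal sketch of this unsupervised grouping is given below. It assumes YUV sub-images stored as NumPy arrays and applies the edge filters to the luminance channel only; the exact filters, statistics, and cluster count used in practice may differ.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def low_level_features(yuv):
    """yuv: (H, W, 3) sub-image. Returns edge statistics (mean, variance, max of
    absolute Laplacian/Sobel responses on Y) plus per-channel variances."""
    y = yuv[..., 0].astype(float)
    feats = []
    for resp in (ndimage.laplace(y), ndimage.sobel(y, axis=0), ndimage.sobel(y, axis=1)):
        a = np.abs(resp)
        feats += [a.mean(), a.var(), a.max()]
    feats += [yuv[..., c].var() for c in range(3)]      # color-channel variances
    return np.array(feats)

# Unsupervised grouping of authentic-distortion sub-images (e.g., 4 clusters):
# X = np.stack([low_level_features(s) for s in subimages])
# groups = KMeans(n_clusters=4, n_init=10).fit_predict(X)
```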
3.2.5 Regression and Decision Ensemble
For each of the 6 distortion types in CSIQ, the 19 distortion classes in KADID-10K, and the 4 clusters in the authentic-distortion datasets, we train an XGBoost regressor [11] that maps the feature space to the MOS score.
In the experiment, we set hyper-parameters of the XGBoost regressor to the following: 1) the max depth of
each tree is 5, 2) the subsampling ratio is 0.6, 3) the maximum tree number is 2000, and 4) the early stop
is adopted. Given the predicted MOS scores of all sub-images from the same source image, a median filter is
applied to generate the ultimate predicted MOS score of the input image.
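A sketch of one such regressor, configured with the hyper-parameters above, and of the median-based decision ensemble is given below; the early-stopping patience and the xgboost API version (the constructor argument requires xgboost >= 1.6) are assumptions.

```python
import numpy as np
from xgboost import XGBRegressor

def train_quality_regressor(X_train, y_train, X_val, y_val):
    """One regressor per distortion type / cluster, using the hyper-parameters
    listed above (depth 5, subsample 0.6, up to 2000 trees, early stopping)."""
    model = XGBRegressor(
        max_depth=5,
        subsample=0.6,
        n_estimators=2000,
        early_stopping_rounds=50,   # patience value is an assumption
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model

def image_score(model, subimage_features):
    """Median over sub-image predictions gives the image-level MOS estimate."""
    return float(np.median(model.predict(np.asarray(subimage_features))))
```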
Table 3.1: Four benchmarking IQA datasets, where the number of distorted images, the number of reference
images, the number of distortion types and collection methods of each dataset are listed.
Datasets Dist. Ref. Dist. Types Scenario
CSIQ 866 30 6 Synthetic
KADID-10K 10,125 81 25 Synthetic
LIVE-C 1,169 N/A N/A Authentic
KonIQ-10K 10,073 N/A N/A Authentic
Table 3.2: Performance comparison in PLCC and SROCC metrics between our GreenBIQA method and
nine benchmarking methods on four IQA databases, where the nine benchmarking methods are categorized
into four groups as discussed in Sec. 3.3.1 and the best performance numbers are shown in boldface.
CSIQ LIVE-C KADID-10K KonIQ-10K
Model SROCC PLCC SROCC PLCC SROCC PLCC SROCC PLCC Model size (MB)
NIQE 0.627 0.712 0.455 0.483 0.374 0.428 0.531 0.538 -
BRISQUE 0.746 0.829 0.608 0.629 0.528 0.567 0.665 0.681 -
CORNIA 0.678 0.776 0.632 0.661 0.516 0.558 0.780 0.795 7.4
HOSA 0.741 0.823 0.661 0.675 0.618 0.653 0.805 0.813 0.23
BIECON 0.815 0.823 0.595 0.613 - - 0.618 0.651 35.2
WaDIQaM 0.844 0.852 0.671 0.680 - - 0.797 0.805 25.2
PQR 0.872 0.901 0.857 0.882 - - 0.880 0.884 235.9
DBCNN 0.946 0.959 0.851 0.869 0.851 0.856 0.875 0.884 54.6
HyperIQA 0.923 0.942 0.859 0.882 0.852 0.845 0.906 0.917 104.7
GreenBIQA 0.952 0.959 0.801 0.809 0.886 0.893 0.858 0.870 1.82
3.3 Experiments
3.3.1 Experimental Setup
3.3.1.1 Datasets
We evaluate GreenBIQA on two synthetic IQA datasets and two authentic IQA datasets. Their statistics
are given in Table 3.1. The two synthetic-distortion datasets are CSIQ [52] and KADID-10K [59]. Multiple
distortions of various levels are applied to a set of reference images to yield distorted images. CSIQ has
six distortion types with four to five distortion levels. KADID-10K contains 25 distortion types with five
levels for each distortion type. LIVE-C [23] and KonIQ-10K [29] are two authentic-distortion datasets. They
contain a broad range of distorted real-world images captured by users. No reference image and specific
distortion type are available for each image. LIVE-C and KonIQ-10K have 1,169 and 10,073 distorted
images, respectively.
3.3.1.2 Benchmarking Methods
We compare the performance of GreenBIQA with nine benchmarking methods in Table 3.2. They include
four conventional and five DL-based BIQA methods. We divide them into four categories.
• NIQE [74] and BRISQUE [72]. They are conventional BIQA methods using NSS features.
• CORNIA [115] and HOSA [109]. They are conventional BIQA methods using codebooks.
• BIECON [42] and WaDIQaM [6]. They are DL-based BIQA methods without pre-trained models (or
simple DL methods).
• PQR [120], DBCNN [130], and HyperIQA [97]. They are DL-based BIQA methods with pre-trained
models (or advanced DL methods).
3.3.1.3 Evaluation Metrics
The performance is measured by two popular metrics: the Pearson Linear Correlation Coefficient (PLCC)
and the Spearman Rank Order Correlation Coefficient (SROCC). PLCC evaluates the correlation between
predicted scores from an objective method and users' subjective scores (e.g., MOS) in the form of

\mathrm{PLCC} = \frac{\sum_i (p_i - p_m)(\hat{p}_i - \hat{p}_m)}{\sqrt{\sum_i (p_i - p_m)^2}\,\sqrt{\sum_i (\hat{p}_i - \hat{p}_m)^2}}, \quad (3.1)

where p_i and \hat{p}_i represent predicted and subjective scores, while p_m and \hat{p}_m are their means, respectively.
SROCC measures the monotonicity between predicted scores from an objective method and the user’s
subjective scores via
\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{L} (m_i - n_i)^2}{L(L^2 - 1)}, \quad (3.2)

where m_i and n_i denote the ranks of the prediction and the ground truth label, respectively, and L denotes the total number of samples or the number of images in our current case.
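In practice, both metrics can be computed with SciPy, e.g., as in the sketch below; spearmanr handles tied ranks, which reduces to Eq. (3.2) when there are no ties.

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(predicted, subjective):
    """Return (PLCC, SROCC) between predicted and subjective (MOS) scores."""
    plcc, _ = pearsonr(predicted, subjective)      # Eq. (3.1)
    srocc, _ = spearmanr(predicted, subjective)    # Eq. (3.2), tie-aware
    return plcc, srocc
```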
Figure 3.9: Performance curve on the validation dataset of LIVE-C with different crop numbers and sizes.
3.3.1.4 Implementation Details
In the training stage, we crop 15 sub-images of size 224 × 224 for each image in the two authentic datasets.
This design choice is based on the SROCC performance of validation sets, as shown in Fig. 3.9, where the
best performance under different crop sizes is highlighted. Similarly, we crop 25 sub-images of size 32 × 32
for each image in the two synthetic datasets. In the testing (or inference) stage, we crop 25 sub-images of
size 224 × 224 and 32 × 32 for images in authentic and synthetic datasets, respectively.
We adopt the standard evaluation procedure by splitting each dataset into 80% for training and 20% for
testing. Furthermore, 10% of training data is used for validation. We run experiments ten times and report
median PLCC and SROCC values. For synthetic-distortion datasets, splitting is implemented on reference
images to avoid content overlap.
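For the synthetic datasets, the content-disjoint split can be implemented by grouping distorted images by their reference image, e.g., with scikit-learn's GroupShuffleSplit as in the hypothetical sketch below.

```python
from sklearn.model_selection import GroupShuffleSplit

def split_by_reference(image_ids, reference_ids, seed=0):
    """80/20 train/test split that keeps all distorted versions of a reference
    image on the same side, avoiding content overlap between train and test."""
    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(gss.split(image_ids, groups=reference_ids))
    return train_idx, test_idx
```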
3.3.2 Performance Evaluation
We compare the performance of GreenBIQA and nine benchmarking BIQA methods on four IQA datasets
in Table 3.2.
3.3.2.1 Comparison among Benchmarking Methods
We first compare the performance among the nine benchmarks. Although some conventional BIQA methods have performance comparable to that of simple DL methods (without pre-trained models), there is a clear performance gap between conventional BIQA methods and advanced DL methods (with pre-trained models).
On the other hand, the model size of advanced DL methods is significantly larger. We comment on the
performance of GreenBIQA against other benchmarking methods below.
3.3.2.2 Synthetic-Distortion Datasets
For the two synthetic-distortion datasets, CSIQ and KADID-10K, GreenBIQA achieves the best performance
among all. This is attributed to its two characteristics: 1) classification of synthetic distortions into multiple
types followed by different processing pipelines, and 2) effective usage of ensemble decisions. For the first
point, there are six distortion types in CSIQ, as shown in Fig. 3.7. We show the SROCC performance of
the best BIQA method in each of the four categories against each of the six distortion types in the CSIQ
dataset in Table 3.3. GreenBIQA outperforms all others in four distortion types. It performs especially well
for JPEG distortion because it adopts the DCT spatial features, which match the underlying compression
distortion well. GreenBIQA is also effective against white Gaussian noise (WN), pink Gaussian noise (FN),
and contrast decrements (CC) through the use of joint spatial and spatio-color features. GreenBIQA still
works well for Gaussian blur (GB), although no blur detector is employed. For the second point, since the
number of reference images is limited and the distortion is uniformly spread out across the whole image,
ensemble decision works well in such a setting.
Table 3.3: Comparison of the SROCC performance for each of six individual distortion types in the CSIQ
dataset, where WN, JPEG, JP2K, FN, GB, and CC denote white Gaussian noise, JPEG compression,
JPEG-2000 compression, pink Gaussian noise, Gaussian blur, and contrast decrements, respectively. The
last column shows the weighted average of the SROCC metrics.
WN JPEG JP2K FN GB CC Average
BRISQUE 0.723 0.806 0.840 0.378 0.820 0.804 0.728
HOSA 0.604 0.733 0.818 0.500 0.841 0.716 0.702
BIECON 0.902 0.942 0.954 0.884 0.946 0.523 0.858
HyperIQA 0.927 0.934 0.960 0.931 0.915 0.874 0.923
GreenBIQA (Ours) 0.943 0.980 0.969 0.965 0.894 0.857 0.934
3.3.2.3 Authentic-Distortion Datasets
For the two authentic-distortion datasets, LIVE-C and KonIQ-10K, GreenBIQA outperforms conventional
BIQA methods and simple DL methods. This demonstrates the effectiveness of its extracted quality-aware
features and decision pipeline in handling diversified distortions and contents. There is, however, a performance gap between GreenBIQA and advanced DL methods with pre-trained models. The authentic-distortion datasets are more challenging because of non-uniform distortions across images and a wide variety
of content without duplication. Since pre-trained models are trained by a much larger image database, they
have advantages in extracting features for non-uniform distortions and unseen contents. Yet, they demand
much larger model sizes as a tradeoff.
3.3.3 Cross-Domain Learning
To evaluate the transferability of BIQA methods, we train models on one dataset and test them on another
dataset. Due to the huge differences in synthetic-distortion and authentic-distortion datasets, we focus on
authentic-distortion datasets and conduct experiments on LIVE-C and KonIQ-10K only. We consider two
experimental settings: I) trained with LIVE-C and tested on KonIQ-10K, and II) trained with KonIQ-10K
and tested on LIVE-C. The SROCC performance of GreenBIQA and five benchmarking methods under
the two settings are compared in Table 3.4, where benchmarks include the three best BIQA methods in
Table 3.2 (i.e., PQR, DBCNN, and HyperIQA) and two conventional BIQA methods (i.e., BRISQUE and
HOSA). By comparing the performance numbers in Tables 3.2 and 3.4, we see a performance drop in the
cross-domain condition for all methods. We see that GreenBIQA has a performance gap of 0.019 and 0.053
against the best one, HyperIQA, for Experimental Settings I and II, respectively. As shown in Table 3.1,
KonIQ-10K is much larger than LIVE-C. Experimental Setting I provides a more proper environment to
demonstrate the robustness (or generalizability) of a learning model. We compare the performance gaps in
Table 3.4 under Setting I with those in the KonIQ-10K/SROCC column in Table 3.2. The gaps between GreenBIQA and PQR, DBCNN, and HyperIQA narrow from 0.022, 0.017, and 0.048 to 0.004, 0.001, and 0.019, respectively. We see a greater potential for GreenBIQA in this direction.
Table 3.4: Comparison of the SROCC performance under the cross-domain learning scenario.
Settings I II
Train Dataset LIVE-C KonIQ-10K
Test Dataset KonIQ-10K LIVE-C
BRISQUE 0.425 0.526
HOSA 0.651 0.648
PQR 0.757 0.770
DBCNN 0.754 0.755
HyperIQA 0.772 0.785
GreenBIQA(Ours) 0.753 0.732
3.3.4 Model Complexity
A lightweight model is critical to applications on mobile and edge devices. We analyze the model complexity
of BIQA methods in four aspects below: model sizes, inference time, computational complexity in terms of
floating-point operations (FLOPs), and memory/latency tradeoff.
3.3.4.1 Model Size
There are two ways to measure the size of a learning model: 1) the number of model parameters, and
2) the actual memory usage. Floating-point and integer model parameters are typically represented by 4
bytes and 2 bytes, respectively. Since a great majority of model parameters are in floating point, the actual
(a) The SROCC performance versus the model size
(b) The SROCC performance versus the running time
Figure 3.10: Illustration of the tradeoff between (a) the SROCC performance and model sizes and (b) the
SROCC performance and running time with respect to CSIQ and KonIQ-10K datasets among several BIQA
methods.
memory usage is roughly equal to 4 × (no. of model parameters) bytes (see Table 3.5). To avoid confusion,
we use the “model size” to refer to actual memory usage below. Fig. 3.10(a) plots the SROCC performance
(in linear scale along the vertical axis) versus model sizes (in log scale along the horizontal axis) on a
synthetic-distortion dataset (i.e., CSIQ) and an authentic-distortion dataset (i.e., KonIQ-10K) with respect
to a few benchmarking BIQA methods. The size of the GreenBIQA model includes the feature extractor
(600KB), the distortion-specific classifier (50KB), and several regressors (1.17MB), leading to a total of
1.82 MB. As compared with the two conventional methods (CORNIA and HOSA), GreenBIQA achieves
much better performance with comparable model sizes. GreenBIQA outperforms two simple DL methods
(BIECON and WaDIQaM), with a smaller model size. As compared with the three advanced DL methods
(PQR, DBCNN, and HyperIQA), GreenBIQA achieves the best performance on CSIQ and competitive
performance on KonIQ-10K at a significantly smaller model size. Note that advanced DL methods have a
huge pre-trained network of size larger than 100MB as their backbones.
3.3.4.2 Inference Time
Another important factor to consider is the inference running time, which is especially important for mobile/edge
clients. Fig. 3.10(b) shows the SROCC performance versus the inference time (measured in milliseconds
per image) for several benchmarking methods on CSIQ and KonIQ-10K. All methods are tested in the same
environment with a single CPU. We compare GreenBIQA with four conventional methods (NIQE, BRISQUE,
CORNIA, and HOSA) and two DL methods (NIMA and DBCNN). GreenBIQA has clear advantages over
all benchmarking methods by jointly considering performance and inference time. It is worthwhile to point
out that GreenBIQA can process around 43 images per second with a single CPU. In other words, it can
meet the real-time requirement by processing videos of 30 fps on a frame-by-frame basis. The inference time
of GreenBIQA can be further reduced by code optimization and/or with the support of mature packages.
3.3.4.3 Computational Complexity
We compare the SROCC and PLCC performance, the numbers of model parameters, model sizes (in terms
of memory usage), the numbers of Flops, and Flops per pixel of several BIQA methods tested on the LIVE-C
Table 3.5: Comparison of SROCC/PLCC performance, no. of model parameters, model sizes (memory
usage), no. of GigaFlops, and no. of KiloFlops per pixel of several BIQA methods tested on the LIVE-C
dataset, where “X” denotes the multiple no.
Model SROCC PLCC Model Parameters (M) Model Size (MB) GFLOPs KFLOPs/pixel
NIMA(Inception-v2) 0.637 0.698 10.16 (22.6X) 37.4 (20.5X) 4.37 (128.5X) 87.10 (128.5X)
BIECON 0.595 0.613 7.03 (15.6X) 35.2 (19.3X) 0.088 (2.6X) 85.94 (126.8X)
WaDIQaM 0.671 0.680 5.2 (11.6X) 25.2 (13.8X) 0.137 (4X) 133.82 (197.4X)
DBCNN 0.851 0.869 14.6 (32.4X) 54.6 (30X) 16.5 (485.3X) 328.84 (485.1X)
HyperIQA 0.859 0.882 28.3 (62.9X) 104.7 (57.5X) 12.8 (376.5X) 255.10 (376.3X)
GreenBIQA (Ours) 0.801 0.809 0.45(1X) 1.82(1X) 0.034 (1X) 0.678(1X)
dataset in Table 3.5. FLOPs is a common metric to measure the computational complexity of a model. For
a given hardware configuration, the number of FLOPs is linearly proportional to energy consumption or
carbon footprint. Column “GFLOPs” in Table 3.5 gives the number of GFLOPs needed to run a model
once without considering the patch number and size used in a method. For a fair comparison of FLOPs, we
compute the number of FLOPs per pixel, defined as

\mathrm{FLOPs/pixel} = \frac{\mathrm{FLOPs/patch}}{H \times W}, \quad (3.3)

where H and W are the height and width of an input patch to a model, respectively. NIMA with the pre-trained Inception-v2 network has low performance, a large model size, and high complexity. Although simple
DL methods (e.g., WaDIQaM and BIECON) use smaller networks with lower FLOPs, their performance
is still inferior to GreenBIQA. Finally, advanced DL methods (e.g., DBCNN and HyperIQA) outperform
GreenBIQA in SROCC and PLCC performance. However, their model sizes are much larger and their
computational complexities are much higher. The numbers of FLOPs of DBCNN and HyperIQA are 485
and 376 multiples of that of GreenBIQA, respectively.
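As a consistency check of Eq. (3.3) against the last row of Table 3.5, GreenBIQA operates on 224 × 224 sub-images for authentic datasets (Sec. 3.3.1.4), so 0.034 GFLOPs per patch corresponds to 0.034 × 10^9 / (224 × 224) ≈ 678 FLOPs, i.e., about 0.678 KFLOPs per pixel.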
3.3.4.4 Memory/Latency Tradeoff
There is a tradeoff between memory usage and latency in the image quality inference stage. That is, latency
can be reduced when given more computing resources. To observe the tradeoff, we control the memory usage
using different test image numbers in each run (i.e., the batch size). Fig. 3.11 shows the latency (in linear
scale along the vertical axis) and memory usage (in log scale along the horizontal axis) of GreenBIQA and
two advanced DL methods, where we set the batch size equal to 1, 4, 16, and 64 in four experiments. We see
Figure 3.11: Tradeoff between memory usage and latency for three BIQA methods (NIMA, DBCNN, and
GreenBIQA), where latency can be reduced by the usage of a larger memory.
from the figure that the latency of GreenBIQA is much smaller than NIMA and DBCNN under the same
memory size (say, $10^3$ MB). Along this line, the memory requirement of GreenBIQA is much lower than that
of NIMA and DBCNN at the same level of latency. Again, the memory/latency tradeoff curve of GreenBIQA
can be further improved through code optimization.
3.3.5 Ablation Study
To understand the impact of individual components on the overall performance of GreenBIQA, we conduct
an ablation study in Table 3.6, where fs, fsc, and DP denote spatial features, spatio-color features, and
distortion-specific prediction, respectively. We first examine the effectiveness of the spatial features and then
add spatio-color features in the first two rows. Both SROCC and PLCC improve on the two authentic-distortion datasets. Similarly, adding distortion-specific prediction to the spatial features can improve SROCC and
PLCC for all datasets in the third row. Finally, we use all the components in the fourth row and see that
SROCC and PLCC can be further improved to reach the highest value. Note that we do not report the
performance of joint spatial and spatio-color features for synthetic datasets since spatial features are powerful
enough. The distortion-specific prediction benefits the performance significantly on synthetic datasets by
leveraging the distortion label.
Table 3.6: Ablation Study for GreenBIQA.
CSIQ LIVE-C KADID-1K KonIQ-10k
Components SROCC PLCC SROCC PLCC SROCC PLCC SROCC PLCC
fs 0.925 0.936 0.774 0.778 0.847 0.848 0.822 0.838
fs + fsc - - 0.782 0.783 - - 0.835 0.850
fs + DP 0.952 0.959 0.786 0.788 0.886 0.893 0.839 0.856
fs + fsc + DP - - 0.801 0.809 - - 0.858 0.870
Figure 3.12: The PLCC performance curves of GreenBIQA and WaDIQaM are plotted as functions of the
percentages of the full training dataset of KonIQ-10K, where the solid line and the banded structure indicate
the mean value and the range of mean plus/minus one standard deviation, respectively.
3.3.6 Weak Supervision
We train BIQA models using different percentages of the KonIQ-10K training dataset (e.g., from 1% to 90%),
as shown in Fig. 3.12 and show the PLCC performance against the full test dataset. For a fair comparison,
we only compare GreenBIQA with WaDIQaM, which is a simple DL method. Note that we do not choose
Figure 3.13: Comparison of the PLCC performance of GreenBIQA using active learning (in green) and
random selection (in red) on the KonIQ-10k dataset.
advanced DL methods with pre-trained networks for performance benchmarking since pre-trained networks
have been trained by other larger datasets. We show the mean and the plus/minus one standard deviation.
We see that GreenBIQA performs robustly under the weak supervision setting. Even if it is only trained
on 1% of training samples, GreenBIQA can achieve a PLCC value higher than 0.67. In contrast, WaDIQaM does not perform well when the percentage is low, since a small number of samples is insufficient for training a large neural network.
3.3.7 Active Learning
To further investigate the potential of GreenBIQA, we implement an active learning scheme [84, 91] below.
1. Keep the initial training set as 10% of the full training dataset and obtain an initial model denoted by
M1.
2. Predict the quality scores of the remaining samples in the training dataset using Mi, i = 1, 2, · · · , 8. Compute the standard deviation of the predicted scores of all sub-images associated with the same image, which indicates prediction uncertainty.
3. Select the set of images with the highest standard deviations in Step 2, where its size is 10% of
the full training dataset. Merge them into the current training image set; namely, their ground truth
labels are leveraged to train Model Mi+1.
We repeat the above process in sequence to obtain models M1, · · · , M9. Model M10 is the same as the one
that uses all training samples. We compare the PLCC performance of GreenBIQA with active learning and
with random sampling in Fig. 3.13. We see that the active learning strategy can improve the performance
of the random selection scheme in the range from 20% to 70% of full training samples.
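A minimal Python sketch of the uncertainty-based selection loop described above is given below. The array layout, the `train_fn` callable, and the random initial selection are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def active_learning_rounds(X, y, groups, train_fn, n_rounds=9, init_frac=0.10):
    """Uncertainty-based active learning sketch.

    X, y     : sub-image features and their image-level MOS labels.
    groups   : image index of each sub-image (sub-images of one image share an id).
    train_fn : callable that fits a regressor on (X, y) and returns a model with .predict().
    """
    rng = np.random.default_rng(0)
    all_images = np.unique(groups)
    n_step = max(1, int(init_frac * len(all_images)))        # 10% of the images per round
    labeled = set(rng.choice(all_images, size=n_step, replace=False))
    models = []
    for _ in range(n_rounds):
        mask = np.isin(groups, list(labeled))
        model = train_fn(X[mask], y[mask])
        models.append(model)
        # Uncertainty: std of predicted scores over sub-images of the same image.
        rest = [g for g in all_images if g not in labeled]
        uncert = {g: np.std(model.predict(X[groups == g])) for g in rest}
        ranked = sorted(uncert, key=uncert.get, reverse=True)
        labeled.update(ranked[:n_step])                       # merge most uncertain images
    return models
```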
3.4 Conclusion
This chapter introduces a lightweight yet high-performance approach for blind image quality assessment,
denoted as GreenBIQA. The efficacy of GreenBIQA is evaluated in terms of its PLCC and SROCC performances on both synthetic-distortion and authentic-distortion datasets. In comparison to conventional
BIQA methodologies and basic DL-based methods, GreenBIQA demonstrates superior performance across
all four datasets. Furthermore, when benchmarked against state-of-the-art advanced DL-based methods
equipped with pre-trained models, GreenBIQA maintains its dominance in synthetic datasets while also
delivering near-optimal performance in authentic datasets. Notably, GreenBIQA distinguishes itself through
its compact model size, rapid inference speed, and low computational complexity in terms of Floating Point
Operations (FLOPs). These attributes position GreenBIQA as a compelling choice for BIQA applications
within the domain of mobile and edge devices.
Chapter 4
GreenBVQA: Blind Video Quality Assessment at the Edge
4.1 Introduction
Objective video quality assessment methods are often classified into three categories: full-reference video
quality assessment (FR-VQA), reduced-reference video quality assessment (RR-VQA), and no-reference video
quality assessment (NR-VQA). NR-VQA is also known as blind video quality assessment (BVQA). FR-VQA
methods assess video quality by measuring the difference between distorted videos and their reference videos.
One well-known example is VMAF [58,60]. RR-VQA [96] methods evaluate video quality by utilizing a part
of the information from reference videos, which offers greater flexibility than FR-VQA. Finally, BVQA is the
only choice if no reference video is available. With the rise of social media and the popularity of multi-party
video conferencing, there has been an explosion of user-generated content (UGC). A significant portion of the
UGC lacks the availability of reference videos, necessitating the need for BVQA methods to automatically
and efficiently evaluate perceptual video quality. For example, UGC captured on mobile devices or other
edge devices may exhibit various quality issues due to factors like low lighting, shaky camera movements,
or out-of-focus shots. BVQA can be applied to assess the quality of these videos and provide feedback to
users. Furthermore, the adoption of edge computing is on the rise, primarily attributable to its capacity
to reduce latency, conserve network bandwidth, enhance privacy, and enable real-time data processing. In
edge computing environments, BVQA plays a pivotal role in ensuring that video quality remains high while
optimizing resources and responsiveness. Thus, BVQA has attracted growing attention in recent years, as it
addresses the pressing need to evaluate video quality in diverse contexts, spanning UGC and edge computing
scenarios.
Edge computing is a rapidly growing field in recent years due to the popularity of smartphones and the
Internet of Things (IoT). It involves processing and analyzing data near their source, typically at the “edge”
of the network (rather than transmitting them to a centralized location such as a cloud data center). Given
that a high volume of videos triggers heavy Internet traffic, video processing at the edge reduces the video
transmission burden and saves the network bandwidth. The VQA task can enhance many video processing
modules, such as video bitrate adaptation [4], video quality assurance [88], and video pre-processing within
wireless surveillance systems, leveraging lightweight AI and IoT collaboration [66]. The demand for high-quality videos is increasing on edge devices, while most UGC lacks reference videos, necessitating the use
of BVQA methods. For instance, in a mobile video capture scenario, environmental factors like network
instability, background noise, and lighting challenges can degrade video quality, often unnoticed during
recording. BVQA becomes crucial when users decide to share the video, as it evaluates and highlights
quality issues stemming from earlier recording conditions, aiding in quality assurance. Additionally, BVQA
guides video pre-processing on edge devices, ensuring optimal quality without excessive compression. By
adjusting content/bitrate based on network conditions and viewer preferences, BVQA facilitates optimized
video transmission, balancing file size and quality. This results in faster loading, reduced buffering, and an
enhanced viewing experience, making the content more professional and engaging.
One straightforward BVQA solution is to build it upon blind image quality assessment (BIQA) methods.
That is, the application of BIQA methods to a set of key frames of distorted videos individually. BIQA
methods can be classified into three categories: natural scene statistic (NSS) based methods [72, 74, 76],
codebook-based methods [109, 115] and deep-learning-based (DL-based) methods [6, 130]. However, directly
applying BIQA followed by frame-score aggregation does not produce satisfactory outcomes because of the
absence of temporal information. Thus, it is essential to incorporate temporal or spatio-temporal information. Other BVQA methods with handcrafted features [95, 103] were evaluated on synthetic-distortion
datasets with simulated distortions such as transmission and compression. Recently, they were evaluated
on authentic-distortion datasets as well as reported in [73, 87]. Their performance on authentic-distortion
datasets is somewhat limited. Authentic-distortion VQA datasets arise from UGC captured in real-world environments. They contain complicated and mixed distortions with highly diverse contents, devices, and capture
conditions.
Deep learning (DL) methods have been developed for BIQA and BVQA [130]. To further enhance
performance and reduce distributional shifts [65], pre-trained models on large-scale image datasets, such as
the ImageNet [17], are adopted in [56, 101]. However, it is expensive to adopt large pre-trained models on
mobile or edge devices due to their high computational complexity and large model sizes. Furthermore, in the
context of BVQA, if users transmit videos directly to remote cloud servers without prior BVQA assessment,
several concerns arise. Firstly, there exists a potential time delay in acquiring BVQA feedback from the
cloud, leading to an additional round of processing for users. Secondly, from a privacy perspective, certain
scenarios may deter users from uploading their content to the cloud, given concerns over data privacy and
security. These considerations encourage the exploration of BVQA methodologies that are resource-efficient
and lightweight.
Therefore, a lightweight BVQA method is demanded at the edge. Based on the green learning principle
[48, 49], a lightweight BIQA method, called GreenBIQA, was proposed in [68] recently. It is worth noting
that GreenBIQA exhibits limited performance when applied to VQA datasets, primarily due to its lack of
temporal information integration. Moreover, direct deployment of GreenBIQA in video quality evaluation is
computationally expensive because of a huge difference in image and video data sizes. To address this void, we
propose a lightweight BVQA method and call it GreenBVQA in this work. GreenBVQA features a smaller
model size and lower computational complexity while achieving highly competitive performance against
state-of-the-art DL methods. The processing pipeline of GreenBVQA contains four modules: 1) video data
cropping, 2) unsupervised representation generation, 3) supervised feature selection, and 4) mean-opinionscore (MOS) regression and ensembles. The video data cropping operation in Module 1 is a pre-processing
step. Then, we extract spatial, spatio-color, temporal, and spatio-temporal representations from cropped
data in an unsupervised manner to obtain a rich set of representations at low complexity in Module 2 and
select a subset of the most relevant features using the relevant feature test (RFT) [113] in Module 3. Finally,
all selected features are concatenated and fed to a trained MOS regressor to predict multiple video quality
scores and then, an ensemble scheme is used to aggregate multiple regression scores into one ultimate score.
We conduct experimental evaluations on three VQA datasets and show that GreenBVQA can offer state-of-the-art performance in PLCC and SROCC metrics while demanding a significantly smaller model size, shorter inference time, and lower computational complexity.
There are three main contributions of this work.
• A novel lightweight BVQA method, named GreenBVQA, is proposed. Four different types of representation (i.e., spatial, spatio-color, temporal, and spatio-temporal representations) are considered
jointly. Each type of representation is passed to the supervised feature selection module for dimension
reduction. Then, all selected features are concatenated to form the final feature set.
• Experiments are conducted on three commonly used VQA datasets to demonstrate the advantages
of the proposed GreenBVQA method. Our method outperforms all conventional BVQA methods in
terms of MOS prediction accuracy. Its performance is highly competitive against DL-based methods
while featuring a significantly smaller model size, shorter inference time, and lower computational
complexity.
• A video-based edge computing system is presented to illustrate the role of GreenBVQA in facilitating
various video processing tasks at the edge. The inherent characteristics of GreenBVQA, notably its
lightweight model and low computational complexity, serve as compelling evidence of its prospective
utility in the realm of edge computing.
4.2 Methodology
The system diagram of the proposed GreenBVQA method is shown in Fig. 4.1. It has a modularized system
consisting of four modules: 1) video data cropping, 2) unsupervised representation generation, 3) supervised
feature selection, and 4) MOS regression and ensemble. We will introduce the operations in each module
below.
Figure 4.1: The system diagram of the proposed GreenBVQA method.
Figure 4.2: The video data cropping module of GreenBVQA.
4.2.1 Video Data Cropping
GreenBVQA adopts a hierarchical data cropping approach as illustrated in Fig. 4.2. This serves as a preprocessing step for later modules. Upon receiving an input video clip, we first split it into multiple sub-videos
in the time domain. For instance, a ten-second video can be partitioned into ten non-overlapping sub-videos,
each of one-second duration. The sub-video serves as the basic unit for future processing. Given a sub-video,
we consider the following three cropping schemes.
1. Frame Cropping
One representative frame is selected from each sub-video. It can be the first frame, an I-frame, or an
arbitrary frame. Here, we aim to get spatial-domain information. For example, several sub-images
can be cropped from a representative frame. Both spatial and spatio-color representations will be
computed from each sub-image, which will be discussed in the next subsection.
2. Cube Cropping
We collect co-located sub-images from all frames in one sub-video, as shown in Fig. 4.2. This process
is referred to as “cube cropping” since it contains both spatial and temporal information. The purpose
of cube cropping is to reduce the amount of data to be processed in the later modules. This is needed
as the data size of videos is substantially larger than that of images.
3. Sub-cube Cropping
We crop out a sub-cube from a cube that has a shorter length in the time domain and a smaller size in
the spatial domain (see Fig. 4.2). It is used to extract spatio-temporal representations. The rationale
for sub-cube cropping is akin to that of cube cropping - reducing the amount of data to be processed
later.
After the above cropping steps, we obtain sub-images, cubes, and sub-cubes as shown in Fig. 4.2. Spatial
and spatio-color representations will be derived from sub-images cropped from representative frames, while
temporal and spatio-temporal representations will be obtained from cubes and sub-cubes, respectively.
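To make the hierarchical cropping concrete, a minimal NumPy sketch is given below. The sub-video length (30 frames), sub-image size (320 × 320), six crops per representative frame, and sub-cube size ((96 × 96) × 15) follow the implementation details in Sec. 4.3.1.4, while the random crop locations and the use of the first frame as the representative frame are assumptions for illustration.

```python
import numpy as np

def crop_video(video, sub_len=30, sub_img=320, n_crops=6, cube_T=15, cube_hw=96, seed=0):
    """Hierarchical cropping sketch: video -> sub-videos -> sub-images / cubes / sub-cubes.

    video: array of shape (T, H, W, C) with H, W >= sub_img (assumed).
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = video.shape
    samples = []
    for t0 in range(0, T - sub_len + 1, sub_len):            # non-overlapping sub-videos
        sub_video = video[t0:t0 + sub_len]
        rep_frame = sub_video[0]                              # representative frame (assumed first)
        for _ in range(n_crops):
            y = rng.integers(0, H - sub_img + 1)
            x = rng.integers(0, W - sub_img + 1)
            sub_image = rep_frame[y:y + sub_img, x:x + sub_img]        # spatial / spatio-color
            cube = sub_video[:, y:y + sub_img, x:x + sub_img]          # temporal
            cy = y + (sub_img - cube_hw) // 2                          # one co-located sub-cube
            cx = x + (sub_img - cube_hw) // 2
            sub_cube = sub_video[:cube_T, cy:cy + cube_hw, cx:cx + cube_hw]  # spatio-temporal
            samples.append((sub_image, cube, sub_cube))
    return samples
```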
Table 4.1: The selected hyper-parameters for spatial representation generation in our experiments, where the
transform dimensions are denoted by {(H ×W), C} for spatial and channel dimensions, the stride parameter
is denoted as {Spatial Stride2}, and L, M, and H represent low-frequency, mid-frequency, and high-frequency
representations, respectively.
Layer Spatial Output Size
Input 320 × 320 320 × 320
DCT (8 × 8), 1 (40 × 40), 64
Split Low-freq (L) High-freq (H) L: (40 × 40), 1
H: (40 × 40), 63
Hop1 (4 × 4), 1
stride 22
-
L: (19 × 19), 16
H: (40 × 40), 63
Split Low-freq (L) Mid-freq (M) High-freq (H)
L: (19 × 19), 3
M: (19 × 19), 13
H: (40 × 40), 63
Pooling (1 × 1), 3 (2 × 2), 13 (4 × 4), 63
L: (19 × 19), 3
M: (9 × 9), 13
H: (10 × 10), 63
Hop2 (3 × 3), 3
stride 22
- -
L: (9 × 9), 27
M: (19 × 19), 13
H: (10 × 10), 63
4.2.2 Unsupervised Representation Generation
We consider the following four representations in GreenBVQA.
1. Spatial representations. They are extracted from the Y channel of sub-images cropped from representative frames.
2. Spatio-Color representations. They are extracted from spatio-color cuboids of size (H × W) × C, where H and W
are the height and width of sub-images and C = 3 is the number of color channels, respectively.
3. Temporal representations. They are the concatenation of statistical temporal information of cubes.
4. Spatio-Temporal representations. They are extracted from sub-cubes of size (H × W) × T, where H
and W are the height and width of sub-images and T is the length of sub-cubes in the time domain,
respectively.
GreenBVQA employs all four types of representations collectively to predict perceptual video quality scores.
On the other hand, spatial and spatio-color representations can be utilized to predict the quality scores of
individual images or sub-images.
Table 4.2: The selected hyper-parameters for spatio-color representation generation, where the transform
dimensions are denoted by {(H × W) × T, C} for spatial, color, and channel dimensions and L and H
represent low-frequency and high-frequency representations, respectively.
Layer Spatio-color Output Size
Input (320 × 320) × 3 (320 × 320) × 3
Pooling (2 × 2) × 1 (160 × 160) × 3
Hop1 (4 × 4) × 3 (40 × 40) × 1, 48
Split Low-freq (L) High-freq (H) L: (40 × 40) × 1, 3
H: (40 × 40) × 1, 45
Pooling - (2 × 2) × 1, 45 L: (40 × 40) × 1, 3
H: (20 × 20) × 1, 45
Hop2 (4 × 4) × 1, 3 -
L: (10 × 10) × 1, 48
H: (20 × 20) × 1, 45
4.2.2.1 Spatial Representations
As discussed in deriving the spatial representation for GreenBIQA [69], a three-layer structure is adopted
to extract local and global spatial representations from the sub-images. This is summarized in Table 4.1
and depicted in Fig. 4.3. Input sub-images are partitioned into non-overlapping blocks of size 8 × 8,
and the Discrete Cosine Transform (DCT) coefficients are computed through the block DCT transform.
These coefficients, consisting of one DC coefficient and 63 AC coefficients (AC1-AC63), are organized into
64 channels. The DC coefficients exhibit correlations among spatially adjacent blocks, which are further
processed by using the Saab transform [51]. The Saab transform computes the patch mean, referred to as
the DC component, using a constant-element kernel. Principal Component Analysis (PCA) is then applied to
the mean-removed patches to derive data-driven kernels, known as AC kernels. The AC kernels are applied
to each patch, resulting in AC coefficients of the Saab transform. To decorrelate the DC coefficients, a
two-stage process, namely Hop1 and Hop2, is employed. The coefficients obtained from each channel, either
with or without down-sampling at different Hops and the DCT layer, are utilized to calculate standard
deviations, PCA coefficients, or are left unchanged. According to the spectral frequency in DCT and Saab
domain, the coefficients from Hop2, Hop1, and the DCT layers are denoted as low-frequency, mid-frequency,
and high-frequency representations, respectively. Low- and mid-frequency representations contain global
information from large receptive fields, while high-frequency representations contain information of details
from a small receptive field. Then, all representations are concatenated to form spatial representations.
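A minimal sketch of the 8 × 8 block DCT stage described above is shown below, assuming a 320 × 320 luminance sub-image and SciPy's `dctn`; the coefficients are returned in raster order here, and any zigzag reordering or subsequent Saab stages are omitted for brevity.

```python
import numpy as np
from scipy.fft import dctn

def block_dct(y_channel, block=8):
    """8x8 block DCT of an (H, W) luminance sub-image.

    Returns an array of shape (H//block, W//block, block*block), i.e., one 64-D
    coefficient vector (DC followed by AC coefficients) per non-overlapping block.
    """
    H, W = y_channel.shape
    blocks = y_channel[:H - H % block, :W - W % block].reshape(
        H // block, block, W // block, block).transpose(0, 2, 1, 3)
    coeffs = dctn(blocks, type=2, norm='ortho', axes=(2, 3))
    return coeffs.reshape(H // block, W // block, block * block)

# Example: a 320 x 320 sub-image yields a 40 x 40 x 64 coefficient tensor.
print(block_dct(np.random.rand(320, 320)).shape)
```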
Figure 4.3: The block diagram of unsupervised spatial representation generation.
4.2.2.2 Spatio-Color Representations
The representations for spatio-color cuboids are derived using 3D Saab and PCA methods. The hyper-parameters are given in Table 4.2, and the data processing block diagram is depicted in Fig. 4.4 (a).
A spatio-color cuboid has dimensions of H × W × C, where H and W represent the height and width of the
sub-image, respectively, and C = 3 denotes the number of color channels. It is fed to a two-hop structure.
In Hop1, it is divided into non-overlapping cuboids of size 4 × 4 × 3, and the 3D Saab transform is applied
individually, resulting in one DC channel and 47 AC channels (AC1-AC47). Each channel has a spatial
dimension of 40 × 40. Since the DC, AC1, and AC2 coefficients exhibit high spatial correlation, a 2D Saab transform is applied in Hop2 to decompose these three channels of size 40 × 40 into non-overlapping blocks of size
4 × 4. For the other 45 AC coefficients obtained from Hop1, their absolute values are taken and a 2 × 2 max
pooling operation is performed, yielding 45 channels with a spatial dimension of 20 × 20. In total, we obtain
93 channels, comprising 48 low-frequency channels from Hop2 and 45 high-frequency channels from Hop1,
with spatial size of 10×10 and 20×20, respectively. The coefficients obtained from each channel are utilized
to calculate standard deviations and PCA coefficients. These computed coefficients are then concatenated
to form spatio-color representations.
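The Saab transform used in these hops can be sketched as a PCA-based kernel learning step, as shown below. This is a simplified one-stage version that assumes flattened 4 × 4 × 3 cuboids (48-D vectors) as input; the canonical Saab transform additionally normalizes the DC channel by a constant-element kernel and adds a bias term, which are omitted here.

```python
import numpy as np
from sklearn.decomposition import PCA

class SaabTransform:
    """One-stage Saab transform sketch: DC = per-patch mean, AC kernels = PCA of
    mean-removed patches. `patches` has shape (n_patches, patch_dim), e.g., 48-D
    vectors obtained by flattening 4x4x3 spatio-color cuboids in Hop1."""

    def fit(self, patches):
        dc = patches.mean(axis=1, keepdims=True)          # per-patch mean (DC)
        self.pca = PCA(n_components=patches.shape[1] - 1)
        self.pca.fit(patches - dc)                        # data-driven AC kernels
        return self

    def transform(self, patches):
        dc = patches.mean(axis=1, keepdims=True)
        ac = self.pca.transform(patches - dc)
        return np.concatenate([dc, ac], axis=1)           # [DC, AC1, ..., AC_{d-1}]

# Example: fit on 10,000 flattened 4x4x3 cuboids and transform them.
cuboids = np.random.rand(10000, 48)
coeffs = SaabTransform().fit(cuboids).transform(cuboids)  # shape (10000, 48)
```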
Figure 4.4: The block diagram of unsupervised spatio-color and spatio-temporal representations generation: (a) generation of spatio-color representations; (b) generation of spatio-temporal representations.
Table 4.3: Summary of 14-D raw temporal representations.
Index Computation Procedure
f1 − f2 Compute the mean of x-mvs and y-mvs
f3 − f4 Compute the standard deviation of x-mvs and y-mvs
f5 − f6 Compute the ratio of significant x-mvs and y-mvs
f7 − f8 Collect the maximum of x-mvs and y-mvs
f9 − f10 Collect the minimum of x-mvs and y-mvs
f11 Compute the mean of magnitude of mvs
f12 Compute the standard deviation of magnitude of mvs
f13 Compute the ratio of significant magnitude of mvs
f14 Collect the maximum of magnitude of mvs
4.2.2.3 Temporal Representations
The spatial and spatio-color representations are extracted from the sub-images on representative frames.
Both of them represent the information within individual frames while disregarding the temporal information
across frames. Here, a temporal representation generation is proposed to capture the temporal information
from motion vectors (mvs).
Consider a cube of dimensions H × W × T, where H and W represent the height and width of sub-images, respectively, and T is the number of frames in the time domain. For each H × W sub-image within a cube, motion vectors of small blocks are computed (or collected from the compressed video stream). They are denoted as $V = ((x_1, y_1), \cdots, (x_n, y_n))^T$, where $(x_n, y_n)$ represents the motion vector of the $n$-th block. Specifically, $x_n$ and $y_n$ are the horizontal and vertical magnitudes of the motion vector and are named x-mv and y-mv, respectively. The magnitude of the motion vector is computed as $\sqrt{x_n^2 + y_n^2}$.
Table 4.4: The selected hyper-parameters for spatio-temporal representation generation, where the transform dimensions are denoted by {(H × W) × T, C} for spatial, temporal, and channel dimensions and L and
H represent low-frequency and high-frequency representations, respectively.
Layer Spatio-temporal Output Size
Input (96 × 96) × 15 (96 × 96) × 15
Pooling - (96 × 96) × 15
Hop1 (8 × 8) × 3 (12 × 12) × 5, 192
Split Low-freq (L) High-freq (H) L: (12 × 12) × 5, 4
H: (12 × 12) × 5, 188
Pooling - (2 × 2) × 1, 188 L: (12 × 12) × 5, 4
H: (6 × 6) × 5, 188
Hop2 (2 × 2) × 5, 4 -
L: (6 × 6) × 1, 80
H: (6 × 6) × 5, 188
The motion representation of a cube is computed based on its motion vectors. This statistical analysis
yields a 14-D temporal representation of each sub-image as shown in Table 4.3. They are arranged in
chronological order to form raw temporal representations. Furthermore, PCA is applied to them to derive
spectral temporal representations. Finally, the raw and spectral temporal representations are concatenated
to form the final temporal representations.
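A minimal sketch of the 14-D raw temporal representation of Table 4.3 is given below. The threshold that defines a "significant" motion vector is not specified in the text and is an assumption here.

```python
import numpy as np

def temporal_features(mvs, thresh=1.0):
    """14-D raw temporal representation of one sub-image (sketch of Table 4.3).

    mvs: (N, 2) array of block motion vectors (x-mv, y-mv). `thresh` for the
    'significant' ratio is an assumption.
    """
    x, y = mvs[:, 0], mvs[:, 1]
    mag = np.sqrt(x ** 2 + y ** 2)
    return np.array([
        x.mean(), y.mean(),                                        # f1-f2: means
        x.std(), y.std(),                                          # f3-f4: standard deviations
        (np.abs(x) > thresh).mean(), (np.abs(y) > thresh).mean(),  # f5-f6: significant ratios
        x.max(), y.max(),                                          # f7-f8: maxima
        x.min(), y.min(),                                          # f9-f10: minima
        mag.mean(),                                                # f11: mean magnitude
        mag.std(),                                                 # f12: std of magnitude
        (mag > thresh).mean(),                                     # f13: significant magnitude ratio
        mag.max(),                                                 # f14: maximum magnitude
    ])
```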
4.2.2.4 Spatio-Temporal Representations
Both spatial representations from sub-images of representative frames and temporal representations from
cubes are extracted individually from a single domain. It is also important to consider the correlation between
spatial information and temporal information in subjective score prediction, as subjective assessments often
take both aspects into account when providing scores. To extract spatio-temporal features from both spatial
and temporal domains, a two-hop architecture is adopted, where the 3D Saab transform is conducted, as
depicted in Fig. 4.4 (b). The hyper-parameters are summarized in Table 4.4.
The dimension of the spatio-temporal cube, which is the same as sub-cubes in Fig. 4.2, is H × W × T,
where H, W and T represent the height, width, and the frame number of the sub-cube, respectively. These
sub-cubes are fed into a two-hop architecture, where the first and the second hops are used to capture local
and global representations, respectively. The procedure used to generate spatio-temporal representation is
similar to that for the spatio-color representation generation, except the 3D channel-wise Saab transform
is applied in both hops. In Hop 1, we split the input sub-cubes into non-overlapping 3-D cuboids of size
8 × 8 × 3. They are converted to one-dimensional vectors for Saab coefficient computation, leading to one
DC channel and 191 AC channels, denoted by AC1-AC191. The size of each channel is 12 × 12 × 5. The
coefficients in DC and low-frequency AC (e.g., AC1-AC3) channels are spatially and temporally correlated
because the adjacent 8 × 8 × 3 cuboids are strongly correlated. Therefore, another 3-D Saab transform is
applied in Hop 2 to decorrelate the DC and low-frequency AC channels from Hop 1. Similarly, we split
these channels into several non-overlapping 3D cuboids of size 2 × 2 × 5. Coefficients in each cuboid are
flattened into a 20-D vector denoted by $\mathbf{y} = (y_1, \cdots, y_{20})^T$, and their Saab coefficients are computed. The DC coefficient in Hop 2 is computed as the mean of the 20-D vector, $\bar{y} = \left(\sum_{i=1}^{20} y_i\right)/20$. The remaining 19 AC coefficients, denoted by AC1 to AC19, are generated by principal component analysis (PCA) on the mean-removed 20-D vector.
To lower the number of coefficients that need to be processed, blocks in Hop 1 are downsampled to
cuboids of size 6 × 6 × 5 by using 2 × 2 max pooling in the spatial domain. Given low-frequency channels of
size 6 × 6 × 1 from Hop 2 and downsampled 6 × 6 × 5 high-frequency channels from Hop 1, we generate two
sets of representations as follows.
• The coefficients in each channel are first flattened to 1-D vectors. Next, we conduct PCA and select
the first N PCA coefficients of each channel to form the spectral features.
• We compute the standard deviation of coefficients from the same channel across the spatio-temporal
domain.
Finally, we concatenate the two sets of representations to form the spatio-temporal representations.
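The two sets of spatio-temporal representations (per-channel standard deviations and the first N PCA coefficients per channel) can be sketched as follows; the number of retained PCA coefficients and the practice of fitting PCA on the same batch of samples are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def channel_features(channels, n_pca=8):
    """Per-channel spectral (PCA) and standard-deviation features (sketch).

    channels: (n_samples, n_channels, *dims) Saab coefficients, e.g., 6x6x1
    low-frequency or 6x6x5 high-frequency channels. n_pca is assumed.
    """
    n, c = channels.shape[:2]
    flat = channels.reshape(n, c, -1)
    stds = flat.std(axis=2)                                     # (n, c) std per channel
    spectra = []
    for k in range(c):                                          # PCA per channel across samples
        pca = PCA(n_components=min(n_pca, flat.shape[2], n))
        spectra.append(pca.fit_transform(flat[:, k, :]))
    return np.concatenate([stds] + spectra, axis=1)             # concatenated representation
```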
4.2.3 Supervised Feature Selection
The number of unsupervised representations is large. To reduce their dimension, we select quality-relevant features from the four sets of representations obtained in the unsupervised representation generation module by adopting the relevant feature test (RFT) [113]. RFT computes an independent loss for each representation dimension, with lower loss values indicating more quality-relevant representations.
The RFT procedure involves splitting the dynamic range of a representation into two sub-intervals using a
set of partition points. For a given partition, the means of the training samples in the left and right regions
are computed as representative values, and their respective mean-squared errors (MSE) are calculated. By
combining the MSE values of both regions, a weighted MSE for the partition is obtained. The search for the
minimum weighted MSE across the set of partition points determines the cost function for the representation.
It is important to note that RFT is a supervised feature selection algorithm that utilizes the labels of the
training samples. We compute the RFT results for spatial, spatio-color, temporal, and spatio-temporal representations individually. Fig. 4.5 illustrates the sorting of representation indices based on their MSE values
with separate curves. The elbow point on each curve can be used to select a subset of representations. Given
the four types of unsupervised representations, the top n dimensions for each representation are selected
to be concatenated as the supervised quality-relevant features for cubes. The dimensions of unsupervised
representations and selected supervised features on the KoNViD-1k [28] dataset are shown in Table 4.5.
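A minimal sketch of the RFT cost computation and feature selection described above is given below; the number of candidate partition points is an assumption.

```python
import numpy as np

def rft_cost(feature, labels, n_bins=16):
    """Relevant Feature Test (RFT) cost of one feature dimension (sketch).

    Candidate partition points are uniformly spaced over the feature's dynamic
    range; a lower cost indicates a more quality-relevant feature.
    """
    lo, hi = feature.min(), feature.max()
    cuts = np.linspace(lo, hi, n_bins + 1)[1:-1]
    best = np.inf
    for c in cuts:
        left, right = labels[feature <= c], labels[feature > c]
        if len(left) == 0 or len(right) == 0:
            continue
        # Weighted MSE of the two regions about their respective means.
        wmse = (len(left) * np.var(left) + len(right) * np.var(right)) / len(labels)
        best = min(best, wmse)
    return best

def select_features(X, y, n_keep):
    """Keep the n_keep dimensions with the smallest RFT cost."""
    costs = np.array([rft_cost(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(costs)[:n_keep]
```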
Figure 4.5: RFT results of spatial, spatio-color, temporal, and spatio-temporal representations.
Table 4.5: Dimensions of unsupervised representation and selected supervised features on the KoNViD-1k
dataset.
Representation Type Representation Dimension Feature Dimension
Spatial 6,637 220
Spatio-color 6,793 200
Temporal 420 140
Spatio-temporal 8,878 240
Sum 22,728 800
Table 4.6: Statistics of three VQA datasets.
Dataset Ref. Scenes. Resolution Time Duration MOS range
CVD2014 [77] 234 5 480p, 720p 10-25s [-6.50, 93.38]
KoNViD-1k [28] 1,200 1,200 540p 8s [1.22, 4.64]
LIVE-VQC [94] 585 585 240p-1080p 10s [6.2237, 94.2865]
4.2.4 MOS Regression and Ensembles
Once the quality-relevant features are selected, we employ the XGBoost [11] regressor as the quality score
prediction model that maps d-dimensional quality-relevant features to a single quality score. After the
regressor’s prediction, each cube is assigned a predicted score. The scores of cubes belonging to the same
sub-video are then ensembled using a median filter, resulting in the score of the sub-video, which predicts
the Mean Opinion Score (MOS) of a short interval of frames from the input video. To obtain the final MOS
for the entire input video, a mean filter is applied to aggregate the scores from all sub-videos belonging to
the same input video.
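A minimal sketch of the regression and ensemble stage is given below, using the XGBoost hyper-parameters listed later in Sec. 4.3.1.4 (maximum tree depth 5, subsampling rate 0.6, up to 2,000 trees with early termination); the early-stopping patience and the grouping array are assumptions.

```python
import numpy as np
import xgboost as xgb

def train_regressor(X_train, y_train, X_val, y_val):
    """XGBoost MOS regressor with the hyper-parameters of Sec. 4.3.1.4."""
    # Early-stopping patience (50) is an assumption; older xgboost versions pass
    # early_stopping_rounds to fit() instead of the constructor.
    model = xgb.XGBRegressor(max_depth=5, subsample=0.6, n_estimators=2000,
                             early_stopping_rounds=50)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return model

def predict_video_mos(model, cube_feats, cube_to_subvideo):
    """Median over cubes of a sub-video, then mean over all sub-videos."""
    scores = model.predict(cube_feats)
    sub_scores = [np.median(scores[cube_to_subvideo == s])
                  for s in np.unique(cube_to_subvideo)]
    return float(np.mean(sub_scores))
```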
4.3 Experiments
4.3.1 Experiments Setup
We discuss VQA datasets, performance benchmarking methods, evaluation metrics, and some implementation details below.
4.3.1.1 Datasets
We evaluate GreenBVQA on three VQA datasets: CVD2014 [77], KoNViD-1k [28], and LIVE-VQC [94].
Their statistics are summarized in Table 4.6. CVD2014 is captured in a controlled laboratory environment.
Thus, it is also called the lab-generated dataset. It comprises 234 video sequences of resolution 640 × 480
or 1280 × 720. They are acquired with 78 cameras ranging from low-quality mobile phones to high-quality
digital single-lens reflex cameras. Each video displays one of five scenes with distortions associated with
the video acquisition process. KoNViD-1k and LIVE-VQC are authentic-distortion datasets, also known as
user-generated content (UGC) datasets. KoNViD-1k comprises 1200 video sequences, each of which lasts for
8 seconds with a fixed resolution. LIVE-VQC consists of a collection of video sequences of a fixed duration
in multiple resolutions. Both of them contain diverse content and a wide range of distortions.
4.3.1.2 Benchmarking Methods
We compare the performance of GreenBVQA with eleven benchmarking methods in Table 4.7. These
methods can be classified into three categories.
• Three conventional BIQA methods: NIQE [74], BRISQUE [72], and CORNIA [115]. They are applied
to frames of distorted videos. Then, the predicted scores are ensembled to yield the ultimate BVQA
score.
• Three conventional BVQA methods without neural networks: V-BLIINDS [87], TLVQM [43], and
VIDEVAL [100].
• Five state-of-the-art DL-based methods with pre-trained models: VSFA [55], RAPIQUE [101], QSA-VQM [1], Mirko et al. [2], and CNN-TLVQM [44]. They are also called advanced DL methods.
4.3.1.3 Evaluation Metrics
The MOS prediction performance is measured by two well-known metrics: the Pearson Linear Correlation
Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC). PLCC is employed to
assess the linear correlation between the predicted scores and the subjective quality scores. It is defined as
$$\mathrm{PLCC} = \frac{\sum_i (p_i - p_m)(\hat{p}_i - \hat{p}_m)}{\sqrt{\sum_i (p_i - p_m)^2}\,\sqrt{\sum_i (\hat{p}_i - \hat{p}_m)^2}}, \qquad (4.1)$$

where $p_i$ and $\hat{p}_i$ denote the predicted score and the corresponding subjective quality score, respectively, for a test video sample. Additionally, $p_m$ and $\hat{p}_m$ represent the means of the predicted scores and subjective quality scores, respectively. SROCC is used to measure the monotonic relationship between the predicted scores and the subjective quality scores, considering the relative ranking of the samples. It is defined as

$$\mathrm{SROCC} = 1 - \frac{6 \sum_{i=1}^{L} (m_i - n_i)^2}{L(L^2 - 1)}, \qquad (4.2)$$

where $m_i$ and $n_i$ represent the ranks of the predicted score $p_i$ and the corresponding subjective quality score $\hat{p}_i$, respectively, within their respective sets of scores. The variable $L$ represents the total sample number.
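Both metrics can be computed directly with SciPy, as sketched below.

```python
from scipy import stats

def plcc(pred, mos):
    """Pearson linear correlation coefficient, Eq. (4.1)."""
    return stats.pearsonr(pred, mos)[0]

def srocc(pred, mos):
    """Spearman rank-order correlation coefficient, Eq. (4.2)."""
    return stats.spearmanr(pred, mos)[0]
```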
Table 4.7: Comparison of the PLCC and SROCC performance of eleven benchmarking methods and the proposed GreenBVQA on three VQA datasets.
CVD2014 LIVE-VQC KoNViD-1k Average
Model SROCC↑ PLCC↑ SROCC↑ PLCC↑ SROCC↑ PLCC↑ SROCC↑ PLCC↑
NIQE [74] 0.475 0.607 0.593 0.631 0.539 0.551 0.535 0.596
BRISQUE [72] 0.790 0.804 0.593 0.624 0.649 0.651 0.677 0.654
CORNIA [115] 0.627 0.663 0.681 0.723 0.735 0.735 0.681 0.707
V-BLIINDS [87] 0.795 0.806 0.681 0.699 0.706 0.701 0.727 0.735
TLVQM [43] 0.802 0.823 0.783 0.785 0.763 0.765 0.782 0.791
VIDEVAL [100] 0.814 0.832 0.744 0.748 0.770 0.771 0.776 0.783
VSFA [55] 0.850 0.859 0.717 0.770 0.794 0.798 0.787 0.809
RAPIQUE [101] 0.807 0.823 0.741 0.761 0.788 0.805 0.778 0.796
QSA-VQM [1] 0.850 0.859 0.742 0.778 0.801 0.802 0.797 0.813
Mirko et al. [2] 0.834 0.848 0.742 0.780 0.772 0.784 0.782 0.804
CNN-TLVQM [44] 0.852 0.868 0.811 0.828 0.814 0.817 0.825 0.837
GreenBVQA(Ours) 0.835 0.854 0.785 0.789 0.776 0.779 0.798 0.807
4.3.1.4 Implementation Details
Video Data Cropping. Each sub-video has a length of 30 frames. Six sub-images of size 320 × 320 are
cropped from each representative frame. A cube consists of (320 × 320) × 30 pixels. Then, one sub-video
has 6 cubes. The size of the sub-cube, which is used to generate the spatio-temporal representation, is
(96 × 96) × 15. Only one sub-cube is cropped from each cube.
Unsupervised Representation Generation. For spatial representation generation, the 8 × 8 DCT
transform and the 4 × 4 Saab transform are used to generate spatial representations of 6,637 dimensions.
The dimensions of temporal, spatio-temporal, and spatio-color representations are 420, 8878, and 6793,
respectively.
Supervised Feature Selection. Following the application of RFT to each type of representation
generated from the KoNViD-1k dataset, independently, the resulting selected features exhibit dimensions of
220 for spatial features, 200 for spatio-color features, 140 for temporal features, and 240 for spatio-temporal
features. It is important to note that the dimensions of the selected features may vary across different
datasets, as the distribution of data and content can differ significantly among various datasets.
MOS Regression and Ensembles. The XGBoost regressor is used to train and predict the MOS score
of each cube. The max depth of each tree is 5, and the subsampling rate is 0.6. The maximum number of
trees is 2,000 with early termination. Given the score of each cube, a median filter is used to obtain the
score of each sub-video. Next, we take the average of all sub-videos’ scores to obtain the final score of the
input video.
Performance Evaluation. To ensure reliable evaluation, we partition a VQA dataset into two disjoint
sets: the training set (80%) and the testing set (20%). We set 10% aside in the training set for validation
purposes. We conduct experiments in 10 runs and report the median values of PLCC and SROCC.
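A minimal sketch of this evaluation protocol is given below; `run_once` is a hypothetical callable that trains GreenBVQA on the given split and returns its (PLCC, SROCC) on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(videos, labels, run_once, n_runs=10, seed0=0):
    """80/20 train/test split with 10% of the training set held out for validation;
    the median PLCC/SROCC over 10 runs is reported."""
    results = []
    for r in range(n_runs):
        tr, te, ytr, yte = train_test_split(videos, labels, test_size=0.2,
                                            random_state=seed0 + r)
        tr, va, ytr, yva = train_test_split(tr, ytr, test_size=0.1,
                                            random_state=seed0 + r)
        results.append(run_once(tr, ytr, va, yva, te, yte))   # -> (plcc, srocc)
    return np.median(np.array(results), axis=0)
```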
4.3.2 Performance Comparison
4.3.2.1 Same-Domain Training Scenario
We compare the PLCC and SROCC performance of GreenBVQA with that of the other eleven benchmarking methods in Table 4.7. GreenBVQA outperforms all three conventional BIQA methods (i.e., NIQE,
BRISQUE, and CORNIA) and all three conventional BVQA methods (i.e., V-BLIINDS, TLVQM, and
VIDEVAL) by a substantial margin in all three datasets. This shows the effectiveness of GreenBVQA in
extracting quality-relevant features to cover diverse distortions and content variations. GreenBVQA is also
competitive with the five DL-based BVQA methods. Specifically, GreenBVQA achieves the second-best
performance for the LIVE-VQC dataset. It also ranks second in the average performance of SROCC across
all three datasets.
As to the five DL-based BVQA methods, the performance of GreenBVQA is comparable with that of
QSA-VQM. However, there exists a performance gap between GreenBVQA and CNN-TLVQM, which is
a state-of-the-art DL-based method employing pre-trained models. The VQA datasets, particularly user-generated content datasets, pose significant challenges due to non-uniform distortions across videos and a
wide variety of content without duplication. Pre-trained models trained on large external datasets have
an advantage in extracting features for non-uniform distortions and unseen content. Nonetheless, these
advanced DL-based methods come with significantly larger model sizes and inference complexity as analyzed
in Sec. 4.3.3.
4.3.2.2 Cross-Domain Training Scenario
To evaluate the generalizability of BVQA methods, we investigate the setting where training and testing
data come from different datasets. Here, we focus on the two UGC datasets (i.e., KoNViD-1k and LIVE-VQC) due to their practical significance. Two settings are considered: I) trained with KoNViD-1k and
tested on LIVE-VQC, and II) trained with LIVE-VQC and tested on KoNViD-1k. We compare the SROCC
performance of GreenBVQA and five benchmarking methods under these two settings in Table 4.8. The
five benchmarking methods include two conventional BVQA methods (TLVQM and VIDEVAL) and three
DL-based BVQA methods (VSFA, QSA-VQM, and Mirko et al.).
We see a clear performance drop for all methods in the cross-domain condition by comparing Tables
4.7 and 4.8. We argue that setting II provides a more suitable scenario to demonstrate the robustness (or
generalizability) of a learning model. This is because KoNViD-1k has a larger video number and scene
number, as shown in Table 4.6. Thus, we compare the performance gaps in Table 4.8 under Setting II
with those in the KoNViD-1k/SROCC column in Table 4.7. The gaps of VSFA, QSA-VQM, and CNN-TLVQM over GreenBVQA become narrower for KoNViD-1k. They are down from 0.019, 0.023, and 0.038 (trained on the same dataset) to 0.015, -0.066, and 0.024 (trained on LIVE-VQC), respectively.
This suggests a high potential for GreenBVQA in the cross-domain training setting.
4.3.3 Comparison of Model Complexity
We evaluate the model complexity of various BVQA methods in three aspects: model size, inference time,
and computational complexity.
Table 4.8: Comparison of the SROCC performance under the cross-domain training scenario.
Settings I II
Training KoNViD-1k LIVE-VQC
Testing LIVE-VQC KoNViD-1k
TLVQM [43] 0.572 0.639
VIDEVAL [100] 0.591 0.656
VSFA [55] 0.593 0.671
QSA-VQM [1] 0.660 0.590
CNN-TLVQM [44] 0.720 0.680
GreenBVQA(Ours) 0.631 0.656
Table 4.9: Model complexity comparison, where the reported SROCC and PLCC performance numbers are
against the KoNViD-1k dataset.
Model SROCC↑ PLCC↑ Model Size (MB)↓ FLOPs↓
VSFA [55] 0.794 0.798 100.2 (15.8×) 20T (1250×)
QSA-VQM [1] 0.801 0.802 196 (30.8×) 40T (2500×)
Mirko et al. [2] 0.772 0.784 42.3 (6.6×) 1.5T (94×)
CNN-TLVQM [44] 0.814 0.817 98 (15.4×) 21T (1312×)
GreenBVQA(Ours) 0.776 0.779 6.36 (1×) 16G (1×)
4.3.3.1 Model sizes
There are two ways to measure the size of a learning model: 1) the number of model parameters, and 2)
the actual memory usage. Floating-point and integer model parameters are typically represented by 4 bytes
and 2 bytes, respectively. Since a great majority of model parameters are in the floating point format, the
actual memory usage is roughly equal to 4×(no. of model parameters) bytes. Here, we use the “model size”
to refer to actual memory usage below. The model sizes of GreenBVQA and four benchmarking methods
are compared in Table 4.9. The size of the GreenBVQA model includes the following: the representation
generator (4.28MB) and a regressor (2.08MB), leading to a total of 6.36 MB. As compared with four DL-based benchmarking methods, GreenBVQA achieves comparable SROCC and PLCC performance with a
much smaller model size.
Table 4.10: Inference time comparison in seconds.
Model 240frs@540p 364frs@480p 467frs@720p
V-BLIINDS [87] 382.06 361.39 1391.00
QSA-VQM [1] 281.21 256.13 900.72
VSFA [55] 269.84 249.21 936.84
TLVQM [43] 50.73 46.32 136.89
NIQE [74] 45.65 41.97 155.90
BRISQUE [72] 12.69 12.34 41.22
Mirko et al. [2] 8.43 6.24 16.29
GreenBVQA 3.22 4.88 6.26
4.3.3.2 Inference time
One measure of computational efficiency is inference time. We compare the inference time of various BVQA
methods on a desktop with an Intel Core i7-7700 CPU@3.60GHz, 16 GB DDR4 RAM 2400 MHz. The
benchmarking methods include NIQE, BRISQUE, TLVQM, Mirko et al., V-BLIINDS, VSFA, and QSA-VQM. We run their original codes with the default settings using the CPU only. As shown in Table 4.10, we
conduct experiments on three test videos of various lengths and resolutions: a 240-frame video of resolution 960 × 540, a 346-frame video of resolution 640 × 480, and a 467-frame video of resolution 1280 × 720.
We repeat the test for each method ten times and report the average inference time (in seconds) in Table
4.10.
GreenBVQA has a significantly shorter inference time than other methods across all resolutions. The
efficiency gap widens as the video resolution increases. GreenBVQA is approximately 2.1x faster on average than Mirko et al., the second most efficient method. Furthermore, GreenBVQA provides comparable
performance with Mirko et al. in prediction accuracy as shown in Table 4.7, while demanding a smaller
model size. GreenBVQA can process videos in real-time, achieving an approximate speed of 75 frames per
second, solely relying on a CPU.
It is worthwhile to mention that, as an emerging trend, edge computing devices will contain heterogeneous
computing units such as CPUs, GPUs, and APUs (AI processing units). Several DL-based methods support
GPU acceleration, benefiting from mature coding libraries and environments. Since GreenBVQA can be
easily performed with parallel processing, we expect GreenBVQA to benefit from these accelerators as well.
4.3.3.3 Computational complexity
The number of floating point operations (FLOPs) provides another way to assess the complexity of a BVQA
model. We estimate the FLOPs number of several BVQA methods and compare them with that of GreenBVQA. The FLOPs required by one 240frs@540p test video in the KoNViD-1k dataset are shown in the last
column of Table 4.9. QSA-VQM, CNN-TLVQM, and VSFA demand remarkably higher FLOPs, ranging from 1,250 to 2,500 times those of GreenBVQA. Mirko et al., an efficient BVQA method
Figure 4.6: Comparison of the floating point operation (FLOP) numbers.
specifically designed to reduce inference time and computational complexity, still requires about 100 times the FLOPs of GreenBVQA.
To be consistent with the inference time analysis, three test videos of different lengths and resolutions are
selected for the FLOPs comparison in Fig. 4.6. TLVQM and NIQE have the lowest FLOPs, which are in the
order of $10^8$ to $10^9$. The FLOPs of GreenBVQA are in the order of $10^{10}$. However, this discrepancy is not
reflected by the inference time comparison in Table 4.10 since GreenBVQA can be parallelized more easily.
GreenBVQA can exploit SIMD instructions, which are commonly available on multi-threaded CPUs. Furthermore,
GreenBVQA attains higher prediction accuracy than TLVQM and NIQE. All DL-based BVQA methods
have significantly higher FLOPs.
Table 4.11: Ablation Study for GreenBVQA.
CVD2014 LIVE-VQC KoNViD-1k
Model SROCC PLCC SROCC PLCC SROCC PLCC
S features 0.809 0.844 0.728 0.758 0.720 0.724
S+T features 0.835 0.854 0.762 0.771 0.742 0.749
S+T+ST features - - 0.776 0.781 0.765 0.766
S+T+ST+SC features - - 0.785 0.789 0.776 0.779
4.3.4 Ablation Study
We conduct an ablation study on the choice of selected features in GreenBVQA. The results are reported
in Table 4.11. The examined features include spatial features (S-features), temporal features (T-features),
spatio-temporal features (ST-features), and spatio-color features (SC-features). Our study begins with the
assessment of the effectiveness of spatial features (the first row), followed by adding temporal features (the
second row). We see that both SROCC and PLCC improve on all three datasets. The addition of ST-features
(the third row) can improve SROCC and PLCC for all datasets as well. Finally, we use all four feature types
and observe further improvement in SROCC and PLCC (the last row). Note that the performance of ST
and SC features is not reported for the CVD2014 dataset since their improvement is marginal. A combination
of S and T features has already achieved high performance for this dataset.
Figure 4.7: Illustration of an edge computing system employing GreenBVQA.
4.4 An Edge Computing System with BVQA
In this section, we introduce a video-based edge computing system to illustrate the role of GreenBVQA
in facilitating various video processing tasks at the edge. Existing bitrate-adaptive video augmentation
methods [90] primarily consider the tradeoff between video bitrate and bandwidth consumption to improve
the quality of experience. Yet, most of them ignore the perceptual quality of streaming videos. Perceptual
video quality is more relevant to the human visual experience. A video with a higher bitrate does not guarantee better perceptual quality due to the presence of perceptual distortions such as blurriness, noise,
blockiness, etc. Although the FR-VQA technique can account for the perceptual quality of streaming video,
the resulting methods rely on reference videos, which are not available on edge or mobile devices. As a
blind video quality assessment method, GreenBVQA can operate without any reference. Its small model
size and low computational complexity make it well-suited for deployment on edge devices. Furthermore,
its energy efficiency and cost-effectiveness, evidenced by short inference time, support its applicability in an
edge computing system.
GreenBVQA can be used as a perceptual quality monitor on edge devices. An edge computing system
that employs GreenBVQA is shown in Fig. 4.7, where GreenBVQA is used to enhance users’ experience
in watching videos. As shown in the figure, the system involves predicting the perceptual quality of video
streams with no reference. Other modules in the system can utilize the predicted quality score.
1. The predicted score can be used as feedback to the phone camera in video capturing. In certain extreme
situations, such as dark or blurred video capturing conditions, a low predicted video quality score can
serve as an alert so that the user can change the camera setting to get improved video quality.
2. In the context of video streaming over the network, it can assist the adaptive bitrate module in adjusting
the bitrate of subsequent video streams, specifically through the bitrate adaptation algorithm for HTTP
Adaptive Streaming (HAS). GreenBVQA demonstrates strong alignment with the human visual system
while satisfying both latency and blind assessment requirements, thereby enhancing the overall resource
utilization of HAS services.
3. Several video pre-processing modules (e.g., video enhancement [107] and video denoising [22]) are
commonly implemented on edge devices to alleviate the computational burden of the server. By
leveraging the predicted video quality scores, unnecessary pre-processing operations can be saved. For
instance, when a sequence of video frames is predicted to have a good visual quality, there is no need
to denoise or deblur the frame sequence. GreenBVQA can also be used to evaluate the performance
of video pre-processing tasks.
4.5 Conclusion
As the demand for high-quality videos, both in terms of capture and consumption at the edge, continues
its rapid expansion, a critical requirement arises for an efficient and effective model capable of predicting
perceptual video quality. Addressing this imperative, the present work introduces GreenBVQA, a lightweight
blind video quality assessment methodology. Rigorous evaluation of GreenBVQA’s predictive performance,
as gauged by its SROCC and PLCC, is conducted across three widely recognized video quality assessment
datasets. In the pursuit of a comprehensive validation, GreenBVQA’s performance is juxtaposed against
both conventional and state-of-the-art DL-based BVQA methodologies. Notably, GreenBVQA emerges as
a frontrunner, surpassing conventional BVQA techniques while concurrently achieving performance that
stands in proximity to the state-of-the-art DL-based counterparts. Significantly, the modest model size and
notably low computational complexity inherent in GreenBVQA render it a particularly fitting candidate for
seamless integration within edge-based video systems. Moreover, GreenBVQA’s rapid inference capabilities
empower real-time prediction of perceptual video quality scores solely utilizing the central processing unit
(CPU), further affirming its practical utility.
Chapter 5
GreenSaliency: A Lightweight and Efficient Image Saliency
Detection Method
5.1 Introduction
The attention mechanism within the human visual system indicates the regions humans are interested in
within the observed scenes [34]. Such an attention mechanism is often studied through the analysis of
human eye movements recorded via gaze-tracking technology during the presence of visual stimuli, such
as images. These fixations, collected from eye-tracker data, indicate the most compelling locations within
a scene. Typically, a saliency map is derived from fixation maps through convolution with a Gaussian
kernel, enhancing the representation of salient locations. This saliency map, constructed at the pixel level,
represents regions of attention within the stimuli. Image saliency detection is used to detect the most
informative and conspicuous fixations within visual scenes to emulate the attention mechanisms exhibited
by human eyes. Fixation maps, obtained through human subject studies, and their corresponding saliency
maps generated from these fixation maps are commonly regarded as ground truths (GTs), utilized in the
training and evaluating of image saliency detection models. Image saliency detection research predominantly
falls within two categories: 1) human eye fixation prediction, which involves the prediction of human gaze
locations on images where attention is most concentrated [8], and 2) salient object detection (SOD) [121],
which aims to identify salient object regions within an image. This chapter focuses on the former, which
predicts the human gaze from visual stimuli.
Comprehending human gaze patterns within visual stimuli is essential in modeling visual attention. Image
saliency detection, serving as either a preparatory step or a guiding principle in image processing, facilitates
the identification of regions likely to command initial human attention. This insight finds applications across
diverse domains. In saliency-driven perceptual image compression methodologies [81], saliency cues inform
the allocation of coding resources, directing more bits towards salient regions while economizing on less crucial
areas. Recognizing the non-uniform distribution of visual attention across image regions, local image saliency
detection has been integrated into no-reference image quality assessment techniques [112] to capture spatial
attention discrepancies, thereby enhancing quality prediction accuracy. Furthermore, investigations into
saliency within visual stimuli yield valuable insights into the cognitive mechanisms governing human visual
processing and attentional allocation, thus informing strategies for data augmentation [102] and enhancing
model interpretability through saliency-guided training procedures [33]. Overall, image saliency detection
plays a fundamental role in various visual processing applications and human perception applications. It is
an essential area of research and development in computer vision and related fields.
During earlier epochs, conventional methodologies for image saliency detection primarily followed a
bottom-up approach [26,34], where the features are not required to be pre-computed, thus offering versatility
across a wide range of applications. Recent advancements have witnessed the rise of DL-based saliency detection methods [15,79], which have demonstrated the remarkable feature representation capabilities
of Convolutional Neural Networks (CNNs), driven by large-scale image saliency datasets [5,37,39]. Given the
limited volume of data within the saliency domain compared to some of the more prominent computer vision
tasks, transfer learning emerges as a pivotal mechanism for enhancing image saliency detection. Drawing
inspiration from the massive success of deep convolutional models such as VGGNet [93] and ResNet [27] in
classification tasks, particularly on benchmarks like ImageNet [17], researchers have leveraged transfer learning to introduce pre-trained features from large-scale and external datasets into the image saliency detection
methods. As a result, these methods achieve superior performance due to the expressive image features.
Integrating these large pre-trained models into mobile or edge devices is economically burdensome due to
their substantial computational demands and expansive model sizes. Given that image saliency detection
methods typically serve as preliminary stages in image processing pipelines, allocating such considerable
computational resources may be deemed unwarranted.
To address these challenges, we propose a lightweight image saliency detection method called GreenSaliency in this work. GreenSaliency features a smaller model size and lower computational complexity
while achieving comparable performance against DL-based methods. Compared to other green-learning-based methods presented before [50], the novelty of GreenSaliency lies in two completely new modules: 1)
multi-layer hybrid feature extraction and 2) multi-path saliency prediction. Without backpropagation as
done in neural networks, the proposed method employs an one-pass pipeline to reduce the model size (i.e.,
the number of model parameters). Compared with conventional image saliency detection methods, the multilayer hybrid feature extraction module relies on something other than human prior knowledge to extract
features. It combines unsupervised feature generation (via hierarchical Saab transforms [51]) with supervised
feature selection (using the Relevant Feature Test, RFT [113]) to efficiently extract task-relevant features.
The supervised feature selection module is a data-driven approach, which is achieved automatically without
human intervention. Additionally, the data-driven multi-path saliency prediction module leverages features
from diverse layers that encapsulate information from disparate receptive fields. This diversity enhances the
model’s ability to accommodate the unique characteristics of different input images. Consequently, the onepass pipeline of GreenSaliency establishes a balance between conventional and DL-based methods, offering
an efficient compromise between prediction accuracy and model complexity. We conduct experiments on two
popular image saliency datasets and show that GreenSaliency can offer satisfactory performance in five evaluation metrics while demanding significantly small model sizes, short inference time, and low computational
complexity. This work has the following two main contributions.
• Introduction of a novel method for saliency detection termed GreenSaliency, which is characterized by
a transparent and modularized design. This method features a feedforward training pipeline distinct
from the utilization of DNNs. The pipeline encompasses multi-layer hybrid feature extraction and
multi-path saliency prediction.
• Execution of experiments on two distinct datasets to assess the predictive capabilities of GreenSaliency.
The findings illustrate its superior performance compared to conventional methods and its competitive
standing against early-stage DL-based methods. Furthermore, GreenSaliency offers an efficient trade-off relative to state-of-the-art methods employing pre-trained networks on external
datasets while necessitating a significantly smaller model size and reduced inference complexity.
Figure 5.1: An overview of the proposed GreenSaliency method.
5.2 Methodology
An overview of the proposed GreenSaliency method is depicted in Fig. 5.1. As shown in the figure, GreenSaliency has a modularized solution that consists of two modules: 1) multi-layer hybrid feature extraction
and 2) multi-path saliency prediction. They are elaborated below.
5.2.1 Multi-layer Hybrid Feature Extraction
Figure 5.2 shows the multi-layer hybrid feature extraction pipeline. In this section, we first delineate the multi-layer structure. Each layer comprises analogous components, such as a spatial feature extraction module and two Saab transforms, which together extract hybrid features for that layer. Subsequently, the details of the spatial feature extraction module and the Saab transform are introduced.
5.2.1.1 Multi-layer Structure
A hierarchical multi-layer structure is employed to capture features spanning from local to global contexts
across different receptive fields. This structure comprises five layers denoted as d4, d8, d16, d32, and d64,
respectively. Within each layer, except d4 and d64, a spatial feature extraction module accompanied by
two Saab transform modules exists. The spatial feature extraction module facilitates the direct extraction
of spatial features from input images that have undergone downsampling. Notably, the prefix “d” within
Figure 5.2: Multi-layer hybrid feature extraction pipeline.
the naming of each layer signifies downsampling, wherein, for instance, d4 indicates that spatial features
at this layer are computed based on input images downsampled by a factor of 4. Specifically, we use the
Lanczos downsampling method. The two Saab transform modules, employing kernel sizes of 3 × 3 and
5×5, respectively, compute Saab coefficients derived from those of the previous layer. This recursive process
enables extracting high-level features as the layers progress deeper. Further elaboration on the spatial
feature extraction module and Saab transform mechanisms is provided in Section 5.2.1.2 and Section 5.2.1.3,
respectively.
The Relevant Feature Test (RFT) is employed to discern the most powerful coefficients or channels
meriting propagation to the subsequent layer after obtaining Saab coefficients from the two Saab transform
modules within the current layer. Section 5.2.1.4 provides a detailed exposition of the RFT methodology.
Additionally, the selected coefficients undergo downsampling at a ratio of 2 to mitigate the computational
burden associated with processing coefficients in the subsequent layer. Within each layer, the computed
spatial features and Saab coefficients derived from the two Saab transform modules are concatenated to form
hybrid features. Subsequently, these hybrid features are combined with hybrid features from other layers to
compose sets of features utilized for image saliency prediction. Illustrated in Figure 5.2, four sets of features
are discernible, denoted as d8 features, d16 features, d32 features, and d64 features, respectively. Each set
encompasses extracted features from two layers. For instance, d8 features comprise hybrid features derived
from layers d4 and d8. These four sets of features are transmitted to the multi-path saliency prediction
module, as elaborated in Section 5.2.2, to predict the corresponding saliency map or residual.
5.2.1.2 Spatial Feature Extraction
In contrast to features computed by the Saab transform, which are propagated from shallow layers to deeper
layers, the spatial features extracted in each layer are calculated directly from down-sampled input images
and are not transmitted to deeper layers. The spatial features encapsulate local and spatial information
inherent in the input images, while features computed by the Saab transform primarily capture spectral
information. The rationale for not propagating spatial features to deeper layers is threefold. First, the edge
and location features are low-level features. No further propagation is needed as their utility stays in the
initial layers. Second, the local Saab features complement high-level Saab features recursively computed
from two Saab transform modules, providing a more nuanced feature set at each layer. Third, propagating
spatial features to deeper layers would introduce additional computational complexity, contradicting the
design goal of maintaining a lightweight and low-complexity architecture.
Spatial features consist of three primary components: 1) local Saab features, 2) edge features, and
3) location features. The local Saab features are derived by implementing two sequential cascade Saab
transforms. Distinguished from the two Saab transforms within the same layer, these cascade Saab transforms
are characterized by smaller kernel sizes in the spatial domain, specifically 2×2 and 3×3. This distinction in
kernel sizes is attributed to their primary focus on capturing and emphasizing local information within the
directly downsampled input images. Utilizing smaller kernel sizes enables these cascade Saab transforms to
effectively highlight and extract intricate details locally, contributing to the overall feature representation and
analysis. Concerning edge features, they are derived utilizing the Canny edge detector [7], a widely employed
method for detecting edges in images. Numerous prior investigations have established that objects near an
image’s center garner greater attention from observers [39]. This observation suggests that locations proximal
to the image center are more likely to exhibit saliency than those farther away. A Gaussian distribution
can effectively model this empirical observation. Let (cx, cy) represent the center coordinates of an image;
subsequently, the location feature f(x, y) at coordinates (x, y) can be expressed as a Gaussian map defined
by
f(x, y) = \exp\left(-\frac{(c_x - x)^2 + (c_y - y)^2}{\sigma^2}\right), (5.1)
where σ is a pre-defined parameter.
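To make the location-prior computation concrete, the following is a minimal Python sketch of Eq. (5.1); the function name, the map resolution, and the value of σ are illustrative assumptions, since the thesis only states that σ is pre-defined.

import numpy as np

def center_prior_map(height, width, sigma):
    # Gaussian center-prior location feature per Eq. (5.1).
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((cx - xs) ** 2 + (cy - ys) ** 2) / sigma ** 2)

# Example: a 120 x 160 location-feature map with a hypothetical sigma.
prior = center_prior_map(120, 160, sigma=40.0)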
5.2.1.3 Subspace Approximation with Adjusted Bias (Saab) Transform
The initial step in the processing pipeline involves partitioning the input images into overlapping blocks,
typically of sizes 3 × 3 or 5 × 5, followed by applying the Saab transform [51]. The Saab transform, a
principal component analysis (PCA) method with mean-removal, distinguishes itself by incorporating an
additional bias vector. Within the framework of the Saab transform, a constant-element kernel is utilized
to compute the average value of image patches, commonly known as the DC (Direct Current) component.
Figure 5.3: RFT results of 3 × 3 Saab coefficients from three layers: (a) d8, (b) d16, and (c) d32.
Subsequently, PCA is applied to these patches post-removal of the computed mean, yielding data-driven
AC (Alternating Current) kernels. Applying these AC kernels on individual patches leads to extracting AC
coefficients associated with the Saab transform. For instance, an input cuboid has dimensions (H ×W)×C,
where H, W, and C denote the height, width, and number of channels, respectively. To execute a 3 × 3
Saab transform, the input cuboid is initially subdivided into overlapping cuboids of size (3 × 3) × C, with a
stride of 1 and padding applied as necessary. Then, these cuboids are flattened into 1-dimensional vectors,
each possessing a length of 9C. The mean values of these vectors are computed to yield the DC channel,
while the mean-removed vectors undergo PCA transform, resulting in the generation of 9C −1 AC channels.
Following the computation of the 3 × 3 Saab transform, the input cuboid, originally of size (H × W) × C,
transforms to dimensions (H × W) × 9C, preserving the spatial resolution and featuring 9C Saab coefficient
channels.
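As a rough illustration of the single-stage 3 × 3 Saab transform described above, the following sketch gathers overlapping patches, separates the DC (patch mean) component, and derives data-driven AC kernels with PCA; the function name, reflect padding, and the use of scikit-learn's PCA are our assumptions, and the bias term of the full Saab transform [51] is omitted for brevity.

import numpy as np
from sklearn.decomposition import PCA

def saab_3x3(x, num_ac=None):
    # One-stage 3x3 Saab transform on an (H, W, C) input.
    # Returns an (H, W, 9C) array: channel 0 is the DC (patch mean);
    # the remaining channels are data-driven AC channels from PCA on
    # mean-removed patches.
    h, w, c = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="reflect")
    # Gather overlapping 3x3xC patches with stride 1 and flatten them.
    patches = np.stack(
        [xp[i:i + h, j:j + w, :] for i in range(3) for j in range(3)], axis=-1
    ).reshape(h * w, 9 * c)
    dc = patches.mean(axis=1, keepdims=True)      # DC component
    ac_input = patches - dc                       # mean-removed patches
    k = num_ac if num_ac is not None else 9 * c - 1
    ac = PCA(n_components=k).fit_transform(ac_input)  # data-driven AC kernels
    return np.concatenate([dc, ac], axis=1).reshape(h, w, 1 + k)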
Within the feature extraction pipeline, a cascade of four successive Saab transforms, located in d8, d16, d32, and d64, is adopted to decorrelate the DC coefficients and generate higher-level
representations. This multi-layer pipeline facilitates the transformation of input data into a more informative
and discriminative feature space, thereby enhancing the effectiveness of subsequent processing steps.
5.2.1.4 Relevant Feature Test (RFT)
The objective of feature selection is to identify highly discriminative features from a diverse candidate set
of extracted features. To accomplish this task, a robust technique known as the Relevant Feature Test
(RFT) [113] is employed. RFT entails partitioning a feature dimension into left and right segments and evaluating the total mean-squared error (MSE) of the two segments. The resultant approximation error serves as the RFT loss function, with a smaller RFT loss indicating a more informative feature dimension. Given a dataset comprising N data samples and P features, each feature dimension, denoted by f^i where 1 ≤ i ≤ P, possesses a minimum value f^i_min and a maximum value f^i_max, respectively. The deployment of RFT involves three distinct steps, delineated as follows.
• Training Sample Partitioning. The primary objective entails identifying the optimal threshold, f^i_op, within the range [f^i_min, f^i_max], facilitating the partitioning of training samples into two subsets: S^i_L and S^i_R. If the value of the i-th feature, x^i_n, for the n-th training sample x_n is lower than f^i_op, then x_n is assigned to S^i_L; otherwise, x_n is allocated to S^i_R. To refine the search space for f^i_op, the entire feature range, [f^i_min, f^i_max], is divided into B uniform segments, and the optimal threshold is sought among the B − 1 candidates.
• RFT Loss Measured by Estimated Regression MSE. Denoting the regression target value as y, y^i_L and y^i_R represent the mean target values in S^i_L and S^i_R, respectively. These mean values are the estimated regression values for all samples in S^i_L and S^i_R. The RFT loss is defined as the summation of the estimated regression MSEs of S^i_L and S^i_R, given by

R^i_t = \frac{N^i_{L,t} R^i_{L,t} + N^i_{R,t} R^i_{R,t}}{N}, (5.2)

where N^i_{L,t}, N^i_{R,t}, R^i_{L,t}, and R^i_{R,t} represent the sample numbers and estimated regression MSEs in subsets S^i_L and S^i_R, respectively. Each feature f^i is characterized by its optimized estimated regression MSE over a set, T, of candidate partition points:

R^i_{op} = \min_{t \in T} R^i_t. (5.3)
• Feature Selection based on the Optimized Loss. The optimized estimated regression MSE value, R^i_op, is computed for each feature dimension, f^i. These values are subsequently arranged in ascending order, reflecting the relevance of each feature dimension. A lower R^i_op value denotes a higher relevance of the i-th feature dimension, f^i.
After computing the R^i_op value for each feature dimension, f^i, the feature indices i are arranged based on their MSE values in ascending order. Within the multi-layer hybrid feature extraction pipeline
based on their MSE values in ascending order. Within the multi-layer hybrid feature extraction pipeline
context, RFT is employed between two consecutive layers to diminish the feature dimensionality from the
shallow layer onwards. The RFT outcomes of the 3 × 3 Saab coefficients computed in layers d8, d16,
and d32 are shown in Figure 5.3. In these figures, discernible elbow points are observed in the curves
of the RFT outcomes, especially for the results from d16 and d32. The dotted lines indicate the number of selected features. Consequently, we select the top-ranking features, representing the most influential ones, for
propagation to deeper layers. The primary rationale behind incorporating RFT between successive layers lies
in the recognition that not all channels of coefficients necessitate processing at deeper layers, primarily due
to the substantial computational complexity entailed. Hence, RFT identifies the most pertinent coefficients
or channels that warrant deeper layer-level processing.
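A minimal sketch of the RFT scoring in Eqs. (5.2)-(5.3) and the subsequent ranking is given below; the function names, the number of bins B, and the use of per-subset variance as the estimated regression MSE are illustrative assumptions.

import numpy as np

def rft_loss(feature, target, num_bins=16):
    # RFT loss of one feature dimension: search B-1 uniform thresholds and
    # return the minimum weighted MSE of the left/right partitions (Eqs. 5.2-5.3).
    f_min, f_max = feature.min(), feature.max()
    thresholds = np.linspace(f_min, f_max, num_bins + 1)[1:-1]  # B-1 candidates
    best, n = np.inf, len(target)
    for t in thresholds:
        left, right = target[feature < t], target[feature >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        loss = (len(left) * left.var() + len(right) * right.var()) / n
        best = min(best, loss)
    return best

def select_features(X, y, num_selected):
    # Rank feature dimensions by RFT loss (ascending) and keep the best ones.
    losses = np.array([rft_loss(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(losses)[:num_selected]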
Figure 5.4: Multi-path saliency prediction pipeline.
Figure 5.5: Prediction results from different layers.
5.2.2 Multi-path Saliency Prediction
Following the extraction of features from the multi-layer hybrid feature extraction module, these features
are utilized for multi-path saliency prediction, as delineated in Figure 5.4. Four distinct paths are initiated,
corresponding to four layers: d8, d16, d32, and d64, each tasked with predicting its corresponding saliency
map. Within each path, two conditions are considered: saliency map prediction and saliency residual
prediction, whose details are expounded upon in Section 5.2.2.1. Upon obtaining predicted saliency maps
from the four paths, they are aggregated within the ensemble module and subjected to post-processing to
yield the final predicted saliency map. Detailed discussions on the ensembles and post-processing modules
are provided in Section 5.2.2.2 and Section 5.2.2.3, respectively.
The rationale for adopting multi-path saliency prediction instead of a singular path employing all features
stems from the recognition that features extracted from diverse layers encapsulate information from disparate
receptive fields. Such diversity in receptive fields proves advantageous, catering to the varying characteristics
of different input images. In contrast, a single-path approach that amalgamates features from all layers
into one predictive model tends to produce a singular predicted saliency map. It constrains the model’s
adaptability and versatility, limiting its ability to tailor its response to specific input images. The experiments
that compare the performance of multi-path versus single-path predictions are given in Section 5.3.4.
Figure 5.6: Distribution of RFT results in: (a) d16, and (b) d8 layers.
In the context of the multi-layer Saab transform, the features utilized for predicting the saliency map in
layers d64 and d32 are relatively high-level features, while those in layers d16 and d8 are low-level features.
High-level features, characterized by a large receptive field, concentrate on capturing the overall structure
or shape of the entire saliency map rather than focusing on small objects or details. Conversely, low-level
features exhibit greater sensitivity towards smaller objects while potentially overlooking the broader context
of the scene. Figure 5.5 demonstrates the efficacy of utilizing distinct feature sets, where predicted saliency
maps derived solely from specific feature sets are showcased. In the first two rows, predictions based on d8
and d16 features outperform those based on d32 and d64 features, attributed to the high-level features in d32
and d64 inadequately capturing the relatively small human subjects in these images. However, in the third
row, high-level features excel in delineating the shape of the cat in the center. In contrast, predictions based
on d8 and d16 features primarily focus on the top-right corner, illustrating the nuanced interplay between
feature levels and their corresponding receptive fields. Section 5.3.4 illustrates more experimental results of
different layers.
5.2.2.1 Saliency Map Prediction and Saliency Residual Prediction
In each path, leveraging the set of hybrid features extracted from the multi-layer hybrid feature extraction
module, an initial step involves utilizing these features to predict their respective saliency maps. For instance,
in the path utilizing d64 features, the labels correspond to ground truth saliency maps down-sampled by a
factor of 64. Subsequently, an XGBoost regressor is trained to perform pixel-wise prediction, mapping the
feature domain to the saliency map domain. Following the direct utilization of extracted hybrid features
for saliency map prediction, the predicted saliency maps from d8 and d16 are transmitted to the ensemble
module. In contrast, those from d32 and d64 undergo saliency residual prediction. The saliency residual
computation determines the disparity between the up-sampled predicted saliency map in the current layer
and the ground truth within a shallower layer. For instance, given the predicted saliency map in d64, the
residual is calculated between the up-sampled d64 saliency map and the ground truth in d16. Subsequently,
d16 features train and predict this saliency residual using another regressor. The resultant predicted saliency
residual in d16 is then added to the up-sampled predicted saliency map in d64 to form the final predicted
saliency map originating from d64.
In summary, d64 and d32 features exclusively contribute to predicting their corresponding saliency maps,
while d16 and d8 features predict both saliency maps and saliency residuals. The rationale underlying the
saliency residual prediction lies in the observation that d64 and d32 features primarily capture high-level
features, resulting in the comparatively lower resolution of their predicted saliency maps. Consequently, d8 and d16 features are leveraged to refine and calibrate the predicted saliency maps derived from d32 and d64, respectively. This strategy ensures enhanced precision and fidelity across multiple layers of predicted
saliency maps.
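The following sketch illustrates one residual path (a coarse d64 prediction refined by a d16 residual) under simplifying assumptions; the per-pixel feature layout, the XGBoost hyperparameters, and the nearest-neighbor up-sampling are ours, not taken from the thesis.

import numpy as np
import xgboost as xgb

def train_residual_path(feat_d64, feat_d16, gt_d64, gt_d16):
    # Sketch of one residual path: predict at d64, then refine with a d16 residual.
    # feat_*: (num_pixels, dim) per-pixel features; gt_*: downsampled ground-truth
    # maps, where gt_d16 has 4x the resolution of gt_d64.
    coarse_reg = xgb.XGBRegressor(n_estimators=300, max_depth=6)
    coarse_reg.fit(feat_d64, gt_d64.reshape(-1))

    # Up-sample the coarse prediction by 4x (nearest neighbor, for simplicity).
    coarse_pred = coarse_reg.predict(feat_d64).reshape(gt_d64.shape)
    up_pred = np.kron(coarse_pred, np.ones((4, 4)))

    # Train a second regressor on d16 features to predict the residual, i.e.,
    # the gap between the up-sampled coarse map and the d16 ground truth.
    residual = gt_d16 - up_pred
    res_reg = xgb.XGBRegressor(n_estimators=300, max_depth=6)
    res_reg.fit(feat_d16, residual.reshape(-1))
    return coarse_reg, res_reg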
Within the pipeline of multi-path saliency prediction, RFTs are employed to identify the most influential
features for saliency map prediction and residual prediction. It is noteworthy that, for a given set of features,
distinct RFTs are utilized to select disparate subsets of features for saliency map prediction and saliency
residual prediction. This distinction arises due to the differing labels associated with these two predictions.
The label for the former prediction corresponds to the downsampled ground truth. In contrast, the label
for the latter prediction represents the residual between the predicted saliency map and the ground truth.
The statistical distributions of the selected features employed for predicting saliency maps and saliency
residuals on layers d16 and d8 are depicted in Figure 5.6. Despite some overlapping observed between the
two distributions, several distinct features are selected for the two types of predictions. Given the label
disparity between these two prediction tasks, selecting different features validates the necessity of employing
separate RFTs for the same set of features.
5.2.2.2 Ensembles
An ensemble module is deployed to integrate the four predicted saliency maps spanning from d64 to d8.
Within this module, the first step involves upsampling the four predicted saliency maps to match the resolution size of d4 and concatenating them. Subsequently, for each pixel location, the neighboring saliency
values from the four saliency maps within a 5 × 5 block centered at this location are collated, yielding a
100-dimensional vector. Ultimately, an XGBoost regressor is trained using labels derived from downsampled
ground truth by a factor of 4, facilitating the prediction of the saliency map in the d4 layer.
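A minimal sketch of the neighborhood collation used by the ensemble module is shown below; the function name and the edge padding are assumptions, while the 5 × 5 window over four maps (yielding 100-dimensional vectors) follows the description above.

import numpy as np

def ensemble_features(maps_d4):
    # Collate 5x5 neighborhoods from the four up-sampled maps into 100-dim vectors.
    # maps_d4: array of shape (4, H, W) holding the four predictions at d4 resolution.
    k, h, w = maps_d4.shape
    padded = np.pad(maps_d4, ((0, 0), (2, 2), (2, 2)), mode="edge")
    feats = np.stack(
        [padded[:, i:i + h, j:j + w] for i in range(5) for j in range(5)], axis=-1
    )                                              # (4, H, W, 25)
    return feats.transpose(1, 2, 0, 3).reshape(h * w, k * 25)

# An XGBoost regressor is then trained on these vectors against the ground truth
# downsampled by a factor of 4 (not shown here).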
5.2.2.3 Post-processing
Upon obtaining the fused predicted saliency maps in d4 and upsampling them to align with the resolution of
the input images, several post-processing operations are implemented to furnish the final predicted saliency
maps. These operations entail three sequential steps. Initially, a portion of small saliency values is filtered
out, as their presence after ensembling signifies a lack of confidence in prediction. Eliminating these small
predicted saliency values aids in reducing false positive predictions. Subsequently, a small Gaussian filter,
with dimensions of 10×10, is applied to the entire image to enhance its smoothness. Finally, normalization is
conducted on the entire image to align its distribution with the ground truth, thereby refining the accuracy
and reliability of the predicted saliency maps.
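The three post-processing steps can be sketched as follows; the threshold quantile, the Gaussian σ, and the mean/standard-deviation matching used for normalization are illustrative assumptions (the thesis specifies a 10 × 10 Gaussian window but not the exact threshold or normalization constants).

import numpy as np
from scipy.ndimage import gaussian_filter

def post_process(fused_map, gt_mean, gt_std, low_quantile=0.2, sigma=2.0):
    out = fused_map.copy()
    # 1) Suppress small, low-confidence saliency values to reduce false positives.
    out[out < np.quantile(out, low_quantile)] = 0.0
    # 2) Smooth the map with a small Gaussian filter.
    out = gaussian_filter(out, sigma=sigma)
    # 3) Align the map's distribution with ground-truth statistics.
    out = (out - out.mean()) / (out.std() + 1e-8)
    return out * gt_std + gt_mean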
5.3 Experiments
5.3.1 Experimental Setup
5.3.1.1 Datasets
Experiments were conducted utilizing the SALICON [37] and MIT300 [38] datasets, which are benchmarks in the domain of image saliency detection. The SALICON dataset, widely recognized within the research
community, comprises a substantial collection of images, including 10,000 images in the training set, 5,000
in the validation set, and an additional 5,000 in the testing set. On the other hand, the MIT300 dataset
comprises 300 testing images, while the ground truth for this dataset is not publicly available. To mitigate
this limitation, the model was trained on the MIT1003 [39] dataset and subsequently evaluated on the
MIT300 dataset, a practice commonly adopted by various benchmarks. The MIT1003 dataset, consisting of
1,003 images, features ground truth annotations obtained through eye-tracking devices from 15 observers.
Similarly, the MIT300 dataset was constructed following comparable procedures, drawing from the same
image repositories and annotation methodologies.
5.3.1.2 Evaluation Metrics
To evaluate the performance of our saliency model comprehensively, we employ widely recognized metrics,
including the linear correlation coefficient (CC), the area under the receiver operating characteristic curve (AUC-J), shuffled AUC (s-AUC), normalized scanpath saliency (NSS), and similarity (SIM). These metrics offer multifaceted insights into model efficacy in predicting eye fixation patterns. Notably, higher values of CC, AUC-J, s-AUC, NSS, and SIM signify superior performance of the saliency model. For a more detailed explanation of these metrics and their relevance to saliency prediction, see the comprehensive study in [46].
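For reference, the following sketch computes three of these metrics using their standard definitions; it is not taken from the thesis, and the binary fixation-map convention used for NSS is an assumption.

import numpy as np

def cc(pred, gt):
    # Linear correlation coefficient between predicted and ground-truth maps.
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def sim(pred, gt):
    # Similarity: sum of pixel-wise minima after normalizing both maps to sum 1.
    p = pred / (pred.sum() + 1e-8)
    g = gt / (gt.sum() + 1e-8)
    return np.minimum(p, g).sum()

def nss(pred, fixations):
    # Normalized scanpath saliency: mean normalized saliency at fixated pixels.
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return p[fixations > 0].mean()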
5.3.1.3 Implementation Details
The initial step in our experimental setup involves resizing the input images to dimensions of (480 ×640)×3,
utilizing the YUV color format. Subsequently, we implement the multi-layer hybrid feature extraction
process. Specifically, after computing the Saab coefficients from Saab 3 × 3 in layers d8, d16, and d32,
the RFTs are employed to select 20, 50, and 100 coefficients, respectively, from each layer, which are then
forwarded to the subsequent layer. A similar approach is applied to the Saab 5 × 5 coefficients. In the multi-path saliency prediction phase, RFTs select various features from different layers: 500 from d8 features,
500 from d16 features, 1,000 from d32 features, and 1,000 from d64 features, respectively. Regarding the
training and testing partitioning, we adhere to the official splitting protocol for the SALICON dataset. For
Table 5.1: Performance comparison in five metrics between our GreenSaliency method and eleven benchmarking methods on the MIT300 dataset.
Methods AUC-J↑ s-AUC↑ CC↑ SIM↑ NSS↑
ITTI [34] 0.543 0.535 0.131 0.338 0.408
GBVS [26] 0.806 0.630 0.479 0.484 1.246
Judd Model [39] 0.810 0.600 0.470 0.420 1.18
Shallow Convnet [79] 0.800 0.640 0.530 0.460 1.430
GazeGAN [8] 0.860 0.731 0.758 0.649 2.211
EML-NET [36] 0.876 0.746 0.789 0.675 2.487
Deep Convnet [79] 0.830 0.690 0.580 0.520 1.510
SAM-ResNet [15] 0.852 0.739 0.689 0.612 2.062
SalFBNet [19] 0.876 0.785 0.814 0.693 2.470
UNISAL [20] 0.877 0.784 0.785 0.674 2.369
DeepGaze IIE [62] 0.882 0.794 0.824 0.699 2.526
GreenSaliency 0.843 0.700 0.752 0.647 1.713
the MIT1003 dataset, we allocate 900 images for training and 103 for validation. Subsequently, testing is
conducted on the 300 images comprising the MIT300 dataset. All experiments are executed on a server
equipped with an Intel(R) Xeon(R) E5-2620 CPU, ensuring consistency and reliability across computations.
5.3.2 Experimental Results
5.3.2.1 Benchmarking Methods
We conducted a comprehensive performance evaluation of GreenSaliency compared to eleven benchmarking
methods, as summarized in Table 5.1 and Table 5.2. These benchmarking methods encompass conventional
and DL-based image saliency detection methods, which we categorize into two groups for clarity.
• ITTI [34], GBVS [26], and Judd Model [39]. They are conventional image saliency detection methods
that do not rely on neural networks.
• Shallow Convnet [79], GazeGAN [8], EML-NET [36], Deep Convnet [79], SAM-ResNet [15], SalFBNet [19], UNISAL [20], and DeepGaze IIE [62]. This category encompasses diverse DL-based image
saliency detection methods, with transfer learning from outside datasets.
5.3.2.2 Performance Evaluation
We compare the performance of GreenSaliency with eleven benchmarking methods in Table 5.1 and ten
benchmarking methods in Table 5.2. GreenSaliency outperforms all conventional image saliency detection
Table 5.2: Performance comparison in five metrics between our GreenSaliency method and ten benchmarking
methods on the SALICON dataset.
Methods AUC-J↑ s-AUC↑ CC↑ SIM↑ NSS↑
ITTI [34] 0.667 0.610 0.205 0.378 -
GBVS [26] 0.790 0.630 0.421 0.446 -
Shallow Convnet [79] 0.836 0.670 0.596 0.520 1.458
GazeGAN [8] 0.864 0.736 0.879 0.773 1.899
EML-NET [36] 0.866 0.746 0.868 0.774 2.058
Deep Convnet [79] 0.858 0.724 0.622 0.609 1.859
SAM-ResNet [15] 0.865 0.741 0.899 0.793 1.990
SalFBNet [19] 0.868 0.740 0.892 0.772 1.952
UNISAL [20] 0.864 0.739 0.879 0.775 1.952
DeepGaze IIE [62] 0.869 0.767 0.872 0.733 1.996
GreenSaliency 0.839 0.679 0.765 0.683 1.605
methods (i.e., ITTI, GBVS, and Judd Model) and some earlier DL-based methods (i.e., Shallow Convnet and
Deep Convnet) by a substantial margin on both datasets. This shows the effectiveness of GreenSaliency
in extracting multi-layer hybrid features to cover information from disparate receptive fields. GreenSaliency
is also competitive with some DL-based methods (i.e., GazeGAN and SAM-ResNet). Compared with the
state-of-the-art DL-based methods (i.e., EML-NET, SalFBNet, UNISAL, and DeepGaze IIE), there is a gap to reach their performance. However, our model complexity is much lower than theirs, as illustrated
in Section 5.3.3.
Additionally, we conducted a qualitative analysis by comparing the predicted saliency maps generated
by GreenSaliency with those produced by four benchmark methods. Figure 5.7 showcases exemplary images
that illustrate the instances where GreenSaliency outperforms other methods. Typically, the ground truth
saliency maps for images containing multiple objects without a clear dominant focal point exhibit a dispersed
and smooth distribution. GreenSaliency excels in these scenarios by effectively attending to all objects within
an image, demonstrating its capability to accurately predict saliency without the need for transfer learning
from diverse external datasets such as ImageNet. GreenSaliency’s methodological independence from transfer
learning allows it to maintain uniform attention across an entire scene, leading to high performance in
complex images with multiple points of interest. However, in scenarios where images feature prominently
Figure 5.7: Successful cases in GreenSaliency.
Table 5.3: Comparison of no. of model parameters, model sizes (memory usage), no. of GigaFlops, and
latency time of several saliency detection methods tested on the SALICON dataset, where “X” denotes the
multiplier compared to our proposed method.
Model #Params (M)↓ Model Size (MB)↓ GFLOPs↓ Runtime (s)↓
Deep Convnet [79] 25.5 (37.5X) 99 (34.7X) 3.2 (20.0X) 0.412 (10.8X)
GazeGAN [8] 208.9 (307.2X) 879.2 (308.5X) 25 (156.2X) 2.540 (66.8X)
EML-NET [36] 43 (63.2X) 180.2 (63.2X) 9.8 (61.3X) 0.365 (9.6X)
Shallow Convnet [79] 620 (911.7X) 2500 (877.2X) 3.9 (24.4X) 0.672 (17.7X)
SAM-ResNet [15] 128.8 (189.4X) 535 (187.7X) 18.8 (117.5X) 0.858 (22.6X)
SalFBNet [19] 5.9 (8.7X) 23.4 (8.2X) 2.30 (14.4X) 0.180 (4.7X)
UNISAL [20] 3.8 (5.6X) 14.7 (5.2X) 1.98 (12.4X) 0.083 (2.2X)
DeepGaze IIE [62] 98.2 (144.4X) 401 (140.7X) 21.2 (132.5X) 6.436 (169.4X)
GreenSaliency 0.68 (1X) 2.85 (1X) 0.16 (1X) 0.038 (1X)
captivating objects, such as humans and animals, highlighted in Figure 5.8, benchmark methods typically
outperform GreenSaliency. These methods, having been fine-tuned on extensive and diverse datasets, are
better equipped to recognize and prioritize these highly salient features, especially human faces. In such
contexts, our GreenSaliency, which does not utilize specialized transfer learning for distinct object categories,
might not perform optimally, revealing a potential area for further enhancement in future iterations of the
model.
Figure 5.8: Failed cases in GreenSaliency.
5.3.3 Model Complexity
The significance of a lightweight model in saliency detection cannot be overstated, especially when it functions
as a preliminary processing component in various computer vision applications. Moreover, the complexity
of the model plays a pivotal role in determining its suitability for deployment on mobile and edge devices.
Our analysis assesses the model complexity of saliency detection methods across three key dimensions:
model sizes, inference time, and computational complexity measured in terms of floating-point operations
(FLOPs). These metrics are presented comprehensively in Table 5.3, offering insights into different saliency
detection approaches’ efficiency and practical feasibility. It is important to clarify that our analysis primarily
focuses on comparing model complexity with DL-based methods due to two main reasons. First, although
conventional methods are efficient, the prediction accuracy is substantially lower compared to early-stage
DL-based methods and our proposed GreenSaliency method. Second, since conventional methods typically
employ unsupervised filters and are often implemented in Matlab rather than Python, directly comparing their complexity with other benchmarks is challenging and potentially unfair.
5.3.3.1 Model Sizes
The size of a learning model can be assessed through two primary metrics: 1) the total number of model
parameters and 2) the actual memory usage. Model parameters can be represented in either floating-point
or integer format, typically occupying 4 bytes and 2 bytes of memory, respectively. Given that most model
parameters are in floating point, the actual memory usage can be estimated as approximately four times
the number of model parameters (as depicted in Table 5.3). For clarity, we utilize the term “model size” to
denote memory usage throughout the subsequent discussion. The model sizes of GreenSaliency and eight DL-based benchmark methods are detailed in the second column of Table 5.3. Notably, GreenSaliency exhibits a significantly smaller model size than the two DL-based methods featuring lightweight models (i.e., SalFBNet and UNISAL). Compared with other DL-based methods characterized by notably large model sizes (often exceeding 100MB), GreenSaliency's model size is 34 to 877 times smaller.
5.3.3.2 Inference Time
An essential metric for assessing computational efficiency in image saliency detection is the inference time
required to generate a saliency map. Our comparative analysis evaluated the inference time of various
DL-based methods on a server equipped with an Intel(R) Xeon(R) E5-2620 CPU. The inference time for
predicting a single saliency map is documented in the fourth column of Table 5.3. Notably, GreenSaliency
demonstrates a significantly reduced inference time compared to other DL-based methods. Specifically,
GreenSaliency achieves an inference time of 0.038 seconds per saliency map prediction, translating to an
approximate processing speed of 26 frames per second, utilizing solely CPU resources. It is imperative
to acknowledge that as a non-DL-based method, GreenSaliency may not leverage computing acceleration
resources as extensively as DL-based counterparts. Nevertheless, with foreseeable advancements in third-party libraries and coding optimizations, GreenSaliency can realize even greater efficiency benefits in CPU- or GPU-supported environments.
5.3.3.3 Computational Complexity
The assessment of computational complexity in saliency detection methods can be further elucidated by considering the number of floating-point operations (FLOPs) required. To this end, we estimated the FLOPs
required by several DL-based methods to predict a saliency map and compared them with those of GreenSaliency.
The “GFLOPs” column in Table 5.3 presents the number of GFLOPs necessary to execute a model once to
generate a saliency map. In line with our inference time analysis, GreenSaliency exhibits notably lower computational complexity than other DL-based methods. Specifically, GreenSaliency requires 0.16 GFLOPs to predict a single image, whereas other DL-based methods require over 2 GFLOPs, corresponding to a reduction in computational complexity ranging from 12 to 156 times.
Table 5.4: Ablation study for GreenSaliency on SALICON dataset.
Layer AUC-J s-AUC CC SIM NSS
d64 0.822 0.635 0.710 0.616 1.428
d64 + RP 0.827 0.651 0.734 0.655 1.505
d32 0.829 0.639 0.721 0.611 1.425
d32 + RP 0.834 0.658 0.748 0.661 1.516
d16 0.831 0.660 0.726 0.628 1.425
d8 0.830 0.662 0.725 0.635 1.428
Single-path prediction 0.836 0.670 0.754 0.670 1.541
Multi-path ensemble 0.839 0.679 0.765 0.683 1.605
5.3.4 Ablation Study
To evaluate the individual contributions of various components to the overall performance of GreenSaliency,
we conducted an ablation study as outlined in Table 5.4. This study assessed the impact of different feature
sets, namely d64, d32, d16, and d8, along with saliency residual prediction (RP) as depicted in Figure 5.4.
Specifically, we independently investigated the efficacy of each set of hybrid features for layers d64, d32, d16,
and d8. Our findings revealed that employing a single set of features resulted in the highest performance
for the d16 layer, whereas the d64 layer exhibited the lowest performance. Upon incorporating residual
prediction, notable enhancements in performance were observed for the d64 and d32 layers, underscoring
the importance of residual prediction in improving overall performance. Moreover, through ensembling
the performance metrics across all four layers, we observed further improvements across all five evaluation
metrics, culminating in achieving the highest performance values. Additionally, we present the performance
of single-path predictions, which involve the concatenation of features from all four layers to generate a single
saliency map, as described in Section 5.2.2. It is observed that while the single-path prediction surpasses the
performance metrics of individual layers, it does not match the efficacy of the multi-path approach. This
finding substantiates the previous discussion regarding the advantages of employing multi-path for saliency
prediction, emphasizing the enhanced flexibility and accuracy offered by this method.
5.4 Conclusion
This chapter introduces a novel lightweight image saliency detection approach named GreenSaliency, which
operates without using DNNs or pre-training on external datasets. GreenSaliency surpasses all conventional
(non-DL-based) image saliency detection methods and achieves comparable performance with some early-stage DL-based methods. Compared to state-of-the-art DL-based methods, GreenSaliency exhibits lower prediction accuracy but offers advantages in smaller model size, shorter inference time, and reduced computational complexity. The minimal model complexity of GreenSaliency suggests more efficient energy usage,
making it suitable for integration into extensive image processing systems.
Chapter 6
GSBIQA: Green Saliency-guided Blind Image Quality Assessment
Method
6.1 Introduction
Objective image quality assessment (IQA) is pivotal in various multimedia applications. It can be categorized into three distinct types: Full-Reference IQA (FR-IQA), Reduced-Reference IQA (RR-IQA), and
No-Reference IQA (NR-IQA). FR-IQA directly compares a distorted image against a reference or original
image to assess quality. RR-IQA, on the other hand, uses partial information from the reference images to
evaluate the quality of the target images. NR-IQA, also known as blind image quality assessment (BIQA),
becomes essential in scenarios where reference images are unavailable, such as at the receiver’s end or for
user-generated content on social media. The demand for BIQA has surged with the increasing popularity of
such platforms.
Research in BIQA has gained significant momentum over recent years, branching into two primary approaches: conventional methods and deep-learning-based (DL-based) methods. Conventional BIQA methods
typically follow a structured pipeline involving quality-aware feature extraction followed by regression to map
these features to quality scores. Over the past two decades, various conventional BIQA methods have been developed, including those based on Natural Scene Statistics (NSS) [72] and codebook-based approaches [115].
More recently, inspired by the success of DNNs in computer vision, researchers have developed DL-based
methods [6, 42] to solve the BIQA problem. However, the high cost of collecting large-scale annotated IQA
datasets and the tendency of DL-based methods to overfit on limited-sized datasets pose significant challenges. Some advanced DL-based methods have adopted large pre-trained models from external datasets,
such as ImageNet [17], to address this issue. Building on that, a further direction [14] is to enhance dataset variability
by sampling patches randomly from quality-annotated images and incorporating saliency-guided approaches
to improve accuracy, reflecting the intrinsic link between perceptual quality prediction and saliency prediction.
However, due to their large model sizes and high computational complexity, DL-based BIQA methods face
substantial challenges, particularly in their deployment on mobile or edge devices. Moreover, the conventional
saliency predictors often used in these methods do not significantly enhance perceptual quality predictions.
In response to these challenges, this work introduces a novel, efficient, and saliency-assisted BIQA method
named Green Saliency-guided Blind Image Quality Assessment (GSBIQA). It is particularly suited for real-world applications with authentic distortions and offers a modular design with a straightforward training
pipeline.
We should point out that green image saliency detection (GreenSaliency) and green blind image quality
assessment (GreenBIQA) were initially presented in [71] and [70], respectively. This work offers a significant expansion of these initial efforts by combining and extending these two lightweight models into a
comprehensive and advanced BIQA method.
Figure 6.1: An overview of the proposed GSBIQA method.
6.2 Methodology
An overview of the proposed GSBIQA method is illustrated in Figure 6.1. The GSBIQA approach features a
modularized architecture comprising five distinct modules: (1) green saliency detection, (2) saliency-guided
data cropping, (3) green BIQA feature extraction, (4) local patch prediction, and (5) saliency-guided global
prediction. They are elaborated below.
6.2.1 Green Saliency Detection
Drawing inspiration from GreenSaliency [70], a simplified version is implemented within the proposed green
saliency detection model. Contrary to the original method, which extracts features from five distinct layers
or scales, this model selects only the three lowest layers: d4, d8, and d16. Overlapped patches are collected
from each of these layers, and two consecutive Saab transforms of sizes 2×2 and 4×4 are employed to extract
features, resulting in d4, d8, and d16 features derived from the corresponding layers. It is important to note
that the Saab transforms [51] are applied independently across different layers, unlike the interconnected
Saab transform modules found in GreenSaliency.
In the saliency map prediction phase, features from all three layers are concatenated. The most significant
subset of these features is identified through a Relevant Feature Test (RFT) [113], and subsequently, an
XGBoost regressor is utilized to estimate the saliency values for each patch. An up-sampling operation is
then executed to generate the predicted saliency map. The predicted saliency map facilitates data cropping
and decision ensemble modules, as detailed in Sections 6.2.2 and 6.2.5, respectively. Given the absence of
ground truth saliency maps in most IQA datasets, the green image saliency prediction model was pre-trained
on the SALICON [37] dataset, a widely used resource for image saliency prediction studies.
6.2.2 Saliency-guided Data Cropping
In the saliency-guided data cropping process, input images are uniformly cropped into overlapping sub-images
of size 256 × 256, each retaining the original image’s quality score. Then, the predicted saliency map is used
to calculate an Average Saliency Score (ASS) for each sub-image, defined as:
ASS_i = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} S(n, m), (6.1)

where S(n, m) represents the saliency value at pixel coordinates (n, m) in the i-th sub-image, and N × M
represents the size of sub-images. The ASS indicates how well sub-images attract human attention, with
higher scores indicating more significant impacts on perceptual judgments. Sub-images with the highest ASS
are selected for further analysis, ensuring focus on the most salient parts of the image to optimize subsequent
quality assessments.
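A minimal sketch of the saliency-guided cropping is given below; the stride is an illustrative assumption, while the 256 × 256 patch size and the selection of 35 sub-images per image follow Sections 6.2.2 and 6.3.1.3.

import numpy as np

def crop_by_saliency(image, saliency, patch=256, stride=128, top_k=35):
    # Uniformly crop overlapping patches and keep the top_k by Average Saliency Score.
    candidates = []
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            ass = saliency[y:y + patch, x:x + patch].mean()   # Eq. (6.1)
            candidates.append((ass, image[y:y + patch, x:x + patch]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [crop for _, crop in candidates[:top_k]]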
6.2.3 Green BIQA Feature Extraction
In this part, three sets of features are extracted and detailed.
Figure 6.2: Structure of spatial features extraction.
6.2.3.1 Spatial Features
The pipeline of spatial feature extraction is depicted in Figure 6.2. Initially, input sub-images are segmented
into non-overlapping 8 × 8 blocks, and Discrete Cosine Transform (DCT) coefficients are computed via a
block DCT transform. These coefficients are then ordered in a zigzag pattern, producing one DC coefficient
and sixty-three AC coefficients labeled AC1 to AC63, then divided into 64 channels. For further processing
and to decorrelate DC coefficients while deriving higher-level representations, a dual-stage Saab transform
approach is utilized, consisting of two successive transformations, referred to as Hop1 and Hop2:
• Hop1 Processing: The 32 × 32 DC coefficients are subdivided into non-overlapping 4 × 4 blocks. These
blocks undergo a Saab transform, resulting in one DC channel and 15 AC channels for each block.
The resulting 8 × 8 array of DC coefficients is then advanced to Hop2.
• Hop2 Processing: A second Saab transform is applied to non-overlapping 4 × 4 blocks of the DC coefficients from Hop1, yielding one DC and 15 AC channels.
After maximum pooling of Saab coefficients from Hop1 and DCT coefficients from the top layer, we take
further steps to reduce the number of representations. This process includes:
• Computing the standard deviation as shown by the green downward arrow in the diagram.
• Conducting a PCA transform on spatially adjacent regions of these coefficients, as indicated by the orange downward arrow in the diagram.
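The block-DCT front end of this pipeline can be sketched as follows; the function name and the use of SciPy's dctn are assumptions, while the 8 × 8 blocks and the zigzag ordering into 64 channels follow the description above.

import numpy as np
from scipy.fft import dctn

def block_dct_channels(sub_image):
    # Split a (256, 256) luminance sub-image into 8x8 blocks and return 64 DCT
    # channels of shape (32, 32, 64): channel 0 is DC, the rest are AC1..AC63
    # in zigzag order.
    h, w = sub_image.shape
    blocks = sub_image.reshape(h // 8, 8, w // 8, 8).transpose(0, 2, 1, 3)
    coeffs = dctn(blocks, axes=(2, 3), norm="ortho")   # per-block 8x8 DCT

    # Zigzag index order for an 8x8 block.
    idx = sorted(((u, v) for u in range(8) for v in range(8)),
                 key=lambda p: (p[0] + p[1],
                                p[1] if (p[0] + p[1]) % 2 == 0 else p[0]))
    return np.stack([coeffs[:, :, u, v] for u, v in idx], axis=-1)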
6.2.3.2 Spatio-color Features
The extraction process for spatio-color features begins by converting sub-images from the YUV color space to the RGB color space. These converted sub-images form spatio-color cuboids with dimensions H × W × C, where
H and W represent the height and width of the sub-image, respectively, and C = 3 denotes the number of
color channels. These cuboids are then input into a two-hop hierarchical structure, similar to that in the
spatial features extraction part.
• In the first hop, the input cuboids are segmented into non-overlapping smaller cuboids of size 4 ×4×3.
A 3D Saab transform is applied to each cuboid independently, resulting in one DC channel and 47 AC
channels labeled AC1 through AC47. Each channel maintains a spatial resolution of 64 × 64.
• Given the spatial correlation of the DC coefficients, a 2D Saab transform is utilized in Hop2. Here,
the DC channel, sized 64 × 64, is further subdivided into 16 × 16 non-overlapping blocks, each 4 × 4 in
dimension.
Similar to those in the spatial features extraction part, all channels of coefficients in each layer are first
downsampled by max pooling. Then, PCA coefficients and standard deviation values are computed.
6.2.3.3 Saliency Features
Building on the insights that multi-level features can boost the accuracy of IQA methods, our approach
integrates saliency features from multiple layers into the green BIQA feature extraction process. These
features, initially extracted in the green image saliency detection module and shown as d4 and d8 features
in Figure 6.1, are resized for compatibility with BIQA dimensions and seamlessly concatenated with BIQA
features. This integration enhances the feature set by leveraging the spatial coherence and contextual
relevance of the saliency data, offering a more comprehensive understanding of image quality and potentially
enhancing prediction precision without additional computational overhead.
6.2.4 Local Patch Score Prediction
The feature set derived from the green BIQA feature extraction module is characterized by its large dimensionality. To address this, we adopt the Relevant Feature Test (RFT), as proposed in [113], to identify
and select the most discriminant features. RFT is a supervised feature selection technique that evaluates
the significance of each feature based on its ability to improve the predictive model’s performance. It uses
statistical methods to ascertain the relevance of features, ensuring that only the most powerful features are
retained for model training.
Due to the lack of Mean Opinion Scores (MOS) for individual sub-images and only a global MOS for the
entire image, directly assigning the global MOS to all local sub-images is ineffective. To overcome this, we
employ an indirect method that allows each sub-image to predict more adaptive local scores while maintaining
the average global image score close to the ground truth. The local quality score of the i-th sub-image in the j-th image and the global quality score of the j-th image are denoted as q_ij and Q_j, respectively. We initiate the training process by uniformly distributing the global quality score across all local sub-images. An XGBoost regressor is trained to predict q_ij for each sub-image. Q_j for the entire image is computed as the average of all q_ij from the same image. The discrepancy between Q_j and its corresponding label is used to update the gradient, which is then propagated back to each local patch. This iterative process, characterized by successive relaxation and adjustment, converges to a stable set of local quality scores q_ij. This approach
effectively captures the diverse quality variations across sub-images, providing a more nuanced image quality
assessment.
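One way to realize the successive-relaxation training described above is sketched below; the explicit update rule, the relaxation rate, the number of iterations, and the regressor hyperparameters are assumptions, since the thesis specifies only the overall iterative scheme.

import numpy as np
import xgboost as xgb

def train_local_scores(features, image_ids, global_mos, num_iters=5, rate=0.5):
    # features: (num_patches, dim); image_ids: image index per patch;
    # global_mos: (num_images,) array of ground-truth MOS values.
    # Initialize every local label with its image-level MOS.
    local_labels = global_mos[image_ids].astype(float)
    for _ in range(num_iters):
        reg = xgb.XGBRegressor(n_estimators=300, max_depth=6)
        reg.fit(features, local_labels)
        local_pred = reg.predict(features)
        # Average local predictions per image and compare with the global MOS.
        num_images = len(global_mos)
        sums = np.bincount(image_ids, weights=local_pred, minlength=num_images)
        counts = np.bincount(image_ids, minlength=num_images)
        image_pred = sums / np.maximum(counts, 1)
        # Propagate the image-level discrepancy back to each local label.
        local_labels = local_pred + rate * (global_mos - image_pred)[image_ids]
    return reg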
6.2.5 Saliency-guided Global Quality Score Prediction
Upon calculating local quality scores for each sub-image, we integrate these scores to form a global quality
assessment using an additional XGBoost regressor, which utilizes a detailed set of input features. The feature
collection and preparation involve:
• Collection of all local quality scores from the same image to capture quality variance.
• Computation of statistical metrics such as average, standard deviation, and maximum value of saliency
local scores from the same image.
• Integration of downsampled d16 features from the green saliency detection module refined by the RFT.
The prepared features enable the XGBoost regressor to provide an accurate global image quality prediction,
offering a thorough evaluation that reflects both local detail and global perceptual impact.
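A simplified sketch of assembling the global-prediction features is given below; the exact statistics and dimensions used in the thesis (e.g., 165 saliency statistics and 600 RFT-selected saliency features in Section 6.3.1.3) are not reproduced, so the layout shown here is an assumption.

import numpy as np

def global_features(local_scores, saliency_map, d16_features):
    # local_scores: predicted scores of the selected sub-images of one image;
    # saliency_map: predicted saliency map of the image;
    # d16_features: RFT-selected, downsampled d16 saliency features.
    stats = np.array([saliency_map.mean(), saliency_map.std(), saliency_map.max()])
    # Local scores are sorted only to fix their ordering (an assumption).
    return np.concatenate([np.sort(local_scores), stats, d16_features.ravel()])

# A second XGBoost regressor maps these vectors to the final quality score.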
6.3 Experiments
6.3.1 Experimental Setup
6.3.1.1 Datasets
The green image saliency detection module was pre-trained using the SALICON [37] dataset. For the
evaluation of the GSBIQA method, we utilized the LIVE-C [23] and KonIQ-10K [29] datasets. These
datasets are particularly relevant for authentic IQA as they encompass various real-world images captured
by users featuring assorted distortions.
6.3.1.2 Evaluation Metrics
The effectiveness of the proposed method is quantified using two widely recognized metrics: the Pearson
Linear Correlation Coefficient (PLCC) and the Spearman Rank Order Correlation Coefficient (SROCC).
PLCC is computed as follows:
PLCC = \frac{\sum_i (p_i - p_m)(\hat{p}_i - \hat{p}_m)}{\sqrt{\sum_i (p_i - p_m)^2}\,\sqrt{\sum_i (\hat{p}_i - \hat{p}_m)^2}}, (6.2)

where p_i represents the predicted scores, \hat{p}_i denotes the subjective scores, and p_m and \hat{p}_m are the mean values of the predicted and subjective scores, respectively. SROCC is calculated as

SROCC = 1 - \frac{6 \sum_{i=1}^{L} (m_i - n_i)^2}{L(L^2 - 1)}, (6.3)

where m_i and n_i are the ranks of the predicted and actual subjective scores, respectively, and L represents the total number of samples, which corresponds to the number of images evaluated.
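Both metrics can be computed directly with SciPy, as in the short sketch below; the function name is ours.

from scipy.stats import pearsonr, spearmanr

def evaluate(predicted, subjective):
    # PLCC (Eq. 6.2) and SROCC (Eq. 6.3) of predicted quality scores.
    plcc, _ = pearsonr(predicted, subjective)
    srocc, _ = spearmanr(predicted, subjective)
    return plcc, srocc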
6.3.1.3 Implementation Details
In the training stage, we crop and select 35 sub-images of size 256 × 256 for each image in the two authentic
datasets. The RFT selects 3,000 feature dimensions for local patch score prediction. To perform
global score prediction, the input features include 35 predicted scores from local patch score prediction, 165
Table 6.1: Performance comparison in PLCC and SROCC metrics between our GSBIQA method and nine
benchmarking methods on two IQA databases. The best performance numbers are shown in boldface, and
”X” denotes the multiplier compared to our proposed method.
BIQA Method LIVE-C SROCC↑ LIVE-C PLCC↑ KonIQ-10K SROCC↑ KonIQ-10K PLCC↑ Model Size (MB)↓ GFLOPs↓ Inference Time (ms)↓
BRISQUE [72] 0.608 0.629 0.665 0.681 - - -
CORNIA [115] 0.632 0.661 0.780 0.795 7.4 (3.1X) - -
BIECON [42] 0.595 0.613 0.618 0.651 35.2 (14.6X) 2.82 (2.2X) 26 (1.4X)
WaDIQaM [6] 0.671 0.680 0.797 0.805 25.2 (10.5X) 4.38 (3.5X) 26 (1.4X)
PQR [120] 0.857 0.882 0.880 0.884 235.9 (98.3X) 157 (125.6X) 100 (5.3X)
DBCNN [130] 0.851 0.869 0.875 0.884 54.6 (22.7X) 242 (193.6X) 55 (2.9X)
SGDNet [112] 0.851 0.872 0.897 0.917 323.6 (134.8X) 260 (208X) 141 (7.4X)
HyperIQA [97] 0.859 0.882 0.906 0.917 104.7 (43.6X) 145 (116X) 54 (2.8X)
TReS [24] 0.846 0.877 0.915 0.928 582 (242.5X) 290 (232X) 260 (13.7X)
GSBIQA(ours) 0.830 0.839 0.875 0.883 2.4 (1X) 1.25 (1X) 19 (1X)
dimensions of features from the statistics of the saliency map, and 600 dimensions of features from saliency
features selected by RFT. We adopt the standard evaluation procedure by splitting each dataset into 80% for
training and 20% for testing. Furthermore, 10% of training data is used for validation. We ran experiments
ten times and reported median PLCC and SROCC values.
6.3.2 Experimental Results
6.3.2.1 Benchmarking Methods
BRISQUE [72] and CORNIA [115] are conventional BIQA methods. BIECON [42] and WaDIQaM [6] are
DL-based BIQA methods without pre-trained models on external datasets. PQR [120], DBCNN [130],
SGDNet [112], HyperIQA [97], and TReS [24] are DL-based BIQA methods with pre-trained models on
external datasets, adding advanced technologies.
6.3.2.2 Comparison among Benchmarking Methods
The evaluation of two datasets featuring authentic distortions, LIVE-C and KonIQ-10K, reveals that our
GSBIQA method surpasses both the conventional BIQA and simple DL-based methods, as detailed in Table
6.1. This superior performance underscores the efficacy of the quality-aware features extracted by GSBIQA,
highlighting its potential for robust and accurate image quality assessment for real-world image distortions.
While GSBIQA exhibits promising performance, a performance gap remains compared to state-of-the-art
DL-based methods. However, it is essential to consider the significantly lower model complexity of GSBIQA
as a beneficial trade-off. The details of model complexity are discussed in Section 6.3.3.
6.3.2.3 Qualitative Analysis on Exemplary Images
We conducted a qualitative analysis to evaluate the performance of the GSBIQA method by comparing
the predicted MOS against the ground truth. Figure 6.3 presents exemplary images demonstrating successful outcomes. The ground truth MOS and GSBIQA-predicted MOS values are denoted as MOS(G) and
MOS(P), respectively. These images are characterized by salient objects, effectively detected by the green
image saliency detector. The saliency guidance provided by this component enables GSBIQA to accurately
capture these features, significantly enhancing the model’s performance. These images confirm that GSBIQA is proficient in scenarios where salient objects dominate the visual field. Conversely, GSBIQA exhibits
limitations in images that lack distinct salient objects. Examples of such images include those predominantly
featuring full backgrounds or textures, as shown in Figure 6.4. In these cases, the absence of clear, distinguishable objects leads to challenges in saliency detection, resulting in poorer performance.
Figure 6.3: Successful cases in GSBIQA.
(a) MOS(G)=3.80, MOS(P)=3.80, (b) MOS(G)=3.76, MOS(P)=3.76, (c) MOS(G)=3.70, MOS(P)=3.70.
Figure 6.4: Failed cases in GSBIQA.
(a) MOS(G)=2.61, MOS(P)=2.41, (b) MOS(G)=3.23, MOS(P)=3.00, (c) MOS(G)=2.30, MOS(P)=2.07.
6.3.3 Model Complexity
We examine the complexity of BIQA methods from three perspectives: model size, computational complexity
as measured by floating-point operations (FLOPs), and inference time, as shown in Table 6.1.
6.3.3.1 Model Size
Compared to conventional methods such as CORNIA, GSBIQA demonstrates superior performance and
maintains a smaller model footprint. Additionally, GSBIQA outperforms two early-stage DL-based methods,
BIECON and WaDIQaM, while achieving a significantly reduced model size. In comparison with advanced
DL-based methods, namely PQR, DBCNN, SGDNet, HyperIQA, and TReS, GSBIQA offers competitive
performance on the LIVE-C and KonIQ-10K datasets, yet with a considerably smaller model size. Notably,
advanced DL-based methods typically rely on extensive pre-trained networks, often exceeding 100MB, which
underscores the efficiency of GSBIQA in maintaining lower resource usage.
6.3.3.2 Computational Complexity
The “GFLOPs” column in Table 6.1 details the GFLOPs required to process a single image. Early-stage DL-based methods such as WaDIQaM and BIECON use smaller networks and require fewer FLOPs, yet their performance still falls short of GSBIQA. Conversely, more advanced DL-based methods
such as TReS and HyperIQA, while surpassing GSBIQA in accuracy, also possess considerably larger model
sizes and higher computational complexities. Specifically, the FLOPs for TReS and HyperIQA are 232
and 116 times greater than those for GSBIQA, respectively. It is crucial to highlight that non-DL-based
methods like GSBIQA might benefit less from GPU acceleration due to a lack of optimized hardware-software integration compared to DL-based methods. However, the notably low computational complexity
of GSBIQA underscores its potential for effective operation in GPU-supported environments, pending further
improvements in third-party libraries and coding optimizations.
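As a quick arithmetic check, the relative complexity figures quoted above follow directly from the GFLOPs-per-image column of Table 6.1; the short snippet below simply reproduces the ratios.
# Relative FLOP counts derived from the GFLOPs column of Table 6.1.
gflops = {"GSBIQA": 1.25, "HyperIQA": 145.0, "TReS": 290.0}  # GFLOPs per image
for name in ("HyperIQA", "TReS"):
    ratio = gflops[name] / gflops["GSBIQA"]
    print(f"{name} requires {ratio:.0f}x the FLOPs of GSBIQA")  # 116x and 232x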
6.3.3.3 Inference Time
Inference time is another critical factor, particularly for applications on mobile or edge devices. Table 6.1
presents the inference times, measured in milliseconds per image, across various methods on two datasets.
Each method was evaluated in the same environment using a single CPU. GSBIQA offers significant advantages over benchmarked methods by jointly considering performance and inference time. It is noteworthy
that GSBIQA can process approximately 52 images per second using a single CPU, thus meeting the real-time requirements for processing videos at 30 frames per second. Furthermore, one can reduce the inference
time for GSBIQA through code optimization and the integration of more advanced library support.
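For reference, the throughput reported above can be measured with a simple wall-clock loop on a single CPU; the sketch below assumes a hypothetical predict_quality(image) callable that stands in for the GSBIQA predictor.
import time

def measure_throughput(predict_quality, images, warmup=5):
    """Return average latency (ms/image) and throughput (images/s) on one CPU."""
    for img in images[:warmup]:              # warm-up runs exclude one-time setup costs
        predict_quality(img)
    start = time.perf_counter()
    for img in images:
        predict_quality(img)
    elapsed = time.perf_counter() - start
    ms_per_image = 1000.0 * elapsed / len(images)
    images_per_second = len(images) / elapsed
    return ms_per_image, images_per_second   # ~19 ms/image corresponds to ~52 images/s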
Table 6.2: Ablation study for GSBIQA on KonIQ-10K dataset.
Data Cropping (saliency-guided) | Global Prediction (saliency-guided) | Local Prediction | SROCC | PLCC
- | - | - | 0.832 | 0.845
✓ | - | - | 0.848 | 0.859
- | ✓ | - | 0.850 | 0.860
- | - | ✓ | 0.862 | 0.866
✓ | ✓ | - | 0.862 | 0.868
✓ | ✓ | ✓ | 0.875 | 0.883
6.3.4 Ablation Study
We conduct an ablation study in Table 6.2 by focusing on three critical components of GSBIQA: data
cropping, global quality score prediction, and local patch prediction. First, the study assessed the impact
of each component on its own, as shown in the second to fourth rows. Both SROCC and PLCC metrics
on the KonIQ-10K dataset improved when each component was implemented independently, underscoring
the significant role each plays in enhancing model performance. Next, a combined implementation of data
cropping and global quality score prediction, guided by saliency detection, was analyzed. The results in the
fifth row demonstrated substantial improvements, validating the effectiveness of integrating saliency guidance
into these processes. Finally, when all components are combined, as shown in the sixth row, SROCC and PLCC improve further and reach their highest values.
6.4 Conclusion
A novel and lightweight saliency-guided BIQA method, called GSBIQA, was proposed. GSBIQA outperforms conventional and early-stage DL-based methods on two authentic distortion datasets. It also offers
competitive performance with a significantly smaller model size than state-of-the-art methods. It processes
52 images per second using only CPU resources, making it well-suited for mobile and edge devices.
Chapter 7
Conclusion and Future Work
7.1 Summary of the Research
Besides the introduction and background, this dissertation consists of four main chapters.
First, this dissertation introduces a lightweight yet high-performance approach for blind image quality
assessment, denoted as GreenBIQA, in Chapter 3. The efficacy of GreenBIQA is evaluated in terms of
its PLCC and SROCC performances on both synthetic-distortion and authentic-distortion datasets. In
comparison to conventional BIQA methodologies and basic DL-based methods, GreenBIQA demonstrates
superior performance across all four datasets. Furthermore, when benchmarked against state-of-the-art
advanced DL-based methods equipped with pre-trained models, GreenBIQA maintains its dominance in
synthetic datasets while also delivering near-optimal performance in authentic datasets. Notably, GreenBIQA
distinguishes itself through its compact model size, rapid inference speed, and low computational complexity
in terms of Floating Point Operations (FLOPs). These attributes position GreenBIQA as a compelling
choice for BIQA applications within the domain of mobile and edge devices.
Second, this dissertation introduces GreenBVQA, a lightweight blind video quality assessment methodology, in Chapter 4. Rigorous evaluation of GreenBVQA’s predictive performance, as gauged by its SROCC
and PLCC, is conducted across three widely recognized video quality assessment datasets. In the pursuit of a comprehensive validation, GreenBVQA’s performance is juxtaposed against both conventional and
state-of-the-art DL-based BVQA methodologies. Notably, GreenBVQA emerges as a frontrunner, surpassing
conventional BVQA techniques while concurrently achieving performance that stands in proximity to the
state-of-the-art DL-based counterparts. Significantly, the modest model size and notably low computational
complexity inherent in GreenBVQA render it a particularly fitting candidate for seamless integration within
edge-based video systems. Moreover, GreenBVQA’s rapid inference capabilities empower real-time prediction of perceptual video quality scores solely utilizing the central processing unit (CPU), further affirming
its practical utility.
Third, this dissertation introduces a novel lightweight image saliency detection approach named GreenSaliency, which operates without using DNNs or pre-training on external datasets, as illustrated in Chapter
5. GreenSaliency surpasses all conventional (non-DL-based) image saliency detection methods and achieves
comparable performance with some early-stage DL-based methods. Compared to state-of-the-art DL-based
methods, GreenSaliency exhibits lower prediction accuracy but offers advantages in smaller model sizes,
shorter inference time, and reduced computational complexity. The minimal model complexity of GreenSaliency suggests a more efficient energy usage, making it suitable for integration into extensive image
processing systems.
Fourth, a novel and lightweight saliency-guided BIQA method, called GSBIQA, was proposed in Chapter 6. GSBIQA outperforms conventional and early-stage DL-based methods on two authentic distortion
datasets. It also offers competitive performance with a significantly smaller model size than state-of-the-art
methods. It processes 52 images per second using only CPU resources, making it well-suited for mobile and
edge devices.
7.2 Future Work on GreenBIQA
Future research on GreenBIQA could significantly benefit from focusing on its generalization to a broader
array of content types. One promising approach involves content-specific tuning, which entails customizing
GreenBIQA to effectively handle specialized applications such as medical imaging, gaming, or virtual reality
(VR). Tailoring the model to these distinct domains could enhance its accuracy and efficacy, making it more
suitable for the unique challenges presented by each application. For instance, in medical imaging, where
precision and sensitivity are paramount, a GreenBIQA model fine-tuned for detecting subtle variations in
image quality could be invaluable. Similarly, in gaming and VR, where user experience is heavily dependent
on visual quality, a customized GreenBIQA could provide real-time assessments that enhance immersion and
realism.
Moreover, integrating GreenBIQA into content creation tools, such as image editing software, represents
another promising avenue for future work. In this context, GreenBIQA could offer real-time feedback on
image quality, enabling content creators to optimize their work before sharing or publishing it. This real-time
assessment capability could be particularly valuable for professionals in fields such as photography, graphic
design, and digital media production, where high image quality is crucial. By providing instant feedback
on potential quality issues, GreenBIQA could help creators make informed adjustments, ensuring that their
content meets the highest standards before it reaches the audience.
Additionally, expanding GreenBIQA’s functionality to address aesthetic quality prediction could be a
valuable direction for future development. Recent advances in aesthetic prediction often leverage BIQA
scores as a foundational metric. Extending GreenBIQA’s capabilities to encompass the more subjective
and complex domain of aesthetic quality could serve as a core component of advanced aesthetic prediction
systems. This expansion would involve developing the model to not only assess technical image quality but
also to predict how visually pleasing or artistically valuable an image might be. This would require a nuanced
understanding of both objective quality metrics and subjective aesthetic criteria, potentially opening up new
applications for GreenBIQA in areas such as digital art, photography, and content curation.
7.3 Future Work on GreenBVQA
The trajectory of future research endeavors invites exploration across several promising avenues, each holding
the potential to significantly advance the capabilities of GreenBVQA. One particularly compelling direction
lies in the realm of high frame rate (HFR) video capture and transmission. As edge devices increasingly adopt
capabilities for capturing videos at higher frame rates, the integration of adaptive bitrate (ABR) streaming
with variable frame rates (VFR) emerges as a vital innovation. In this scenario, GreenBVQA is poised to play
a pivotal role in enabling effective management of VFR videos, thereby optimizing adaptive bitrate video
transmission. By facilitating seamless transitions and adjustments in response to varying network conditions
and content characteristics, GreenBVQA could contribute to the development of more sophisticated and
responsive adaptation mechanisms in video streaming.
Furthermore, as the ecosystem of user-generated content (UGC) continues to diversify, encompassing a
wide array of content types such as gaming, virtual reality (VR), and live streaming, the need to customize
GreenBVQA to these specific genres becomes increasingly evident. Each content type presents unique challenges and characteristics that can impact video quality in different ways. For instance, gaming videos often
feature rapid movements and high-contrast scenes, while VR content demands immersive, high-resolution
experiences. By tailoring GreenBVQA to the specific nuances of these content types, the model can be fine-tuned to provide more accurate and contextually relevant quality assessments. This specialization would
enhance GreenBVQA’s ability to deliver precise evaluations across a broad spectrum of visual contexts,
ensuring that it remains relevant and effective in a rapidly evolving digital landscape.
Additionally, the development of a saliency-guided BVQA method represents another promising direction
for future research. The success of GSBIQA, which incorporates saliency detection to improve image quality
assessment, suggests that a similar approach could be beneficial in the video domain. A saliency-guided
BVQA model would leverage the principles of visual attention, focusing on the most perceptually important
regions of a video when assessing its quality. This approach could lead to more efficient and accurate
video quality assessments, particularly in scenarios where certain areas of a video are more critical to the
viewer’s experience than others. By prioritizing these regions, a saliency-guided BVQA method could offer
a more nuanced and effective evaluation of video quality, further enhancing the applicability and utility of
GreenBVQA in real-world scenarios.
7.4 Future Work on GreenSaliency
Our proposed GreenSaliency method, while effective in many respects, exhibits several limitations that warrant further attention in future iterations. One significant challenge is its difficulty in accurately predicting
the saliency of particularly prominent objects of interest, such as human faces and animals. Instead, GreenSaliency tends to distribute emphasis across all objects within an image rather than prioritizing those that
are most likely to draw human attention, as illustrated in Figure 5.8. To address this, future developments
could benefit from incorporating a mechanism that ranks the saliency levels of different objects within an
image. By refining the model to prioritize objects that are generally more appealing or significant to viewers,
such as faces or animals, the overall prediction accuracy could be markedly improved. This would enable
GreenSaliency to provide more focused and relevant saliency predictions, aligning more closely with human
visual attention patterns.
Additionally, there is considerable potential to extend the scope of GreenSaliency from image-based
saliency detection to video-based saliency detection. This expansion presents both a challenge and an
opportunity, as it would require the model to incorporate temporal information in its analysis. Unlike static
images, videos involve dynamic changes and interactions between objects over time, making the task of
identifying salient regions more complex but also more impactful. By integrating temporal dynamics, future
iterations of GreenSaliency could enhance its ability to track and predict the saliency of moving objects,
offering a more comprehensive and nuanced understanding of visual attention in videos.
Bibliography
[1] Mirko Agarla, Luigi Celona, and Raimondo Schettini. No-reference quality assessment of in-capture
distorted videos. Journal of Imaging, 6(8):74, 2020.
[2] Mirko Agarla, Luigi Celona, and Raimondo Schettini. An efficient method for no-reference video
quality assessment. Journal of Imaging, 7(3):55, 2021.
[3] Mariette Awad and Rahul Khanna. Support vector regression. In Efficient learning machines, pages
67–80. Springer, 2015.
[4] Abdelhak Bentaleb, Bayan Taani, Ali C Begen, Christian Timmerer, and Roger Zimmermann. A
survey on bitrate adaptation schemes for streaming media over http. IEEE Communications Surveys
& Tutorials, 21(1):562–585, 2018.
[5] Ali Borji and Laurent Itti. Cat2000: A large scale fixation dataset for boosting saliency research.
arXiv preprint arXiv:1505.03581, 2015.
[6] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek.
Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions
on image processing, 27(1):206–219, 2017.
[7] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis
and machine intelligence, (6):679–698, 1986.
[8] Zhaohui Che, Ali Borji, Guangtao Zhai, Xiongkuo Min, Guodong Guo, and Patrick Le Callet. How is
gaze influenced by image transformations? dataset and model. IEEE Transactions on Image Processing, 29:2287–2300, 2019.
[9] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and C-C Jay Kuo.
Defakehop: A light-weight high-performance deepfake detector. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2021.
[10] Jiasi Chen and Xukan Ran. Deep learning with edge computing: A review. Proceedings of the IEEE,
107(8):1655–1674, 2019.
[11] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the
22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794,
2016.
[12] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (SSL) method for object
recognition. Journal of Visual Communication and Image Representation, page 102749, 2020.
[13] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. Pixelhop++:
A small successive-subspace-learning-based (SSL-based) model for image classification. In 2020 IEEE
International Conference on Image Processing (ICIP), pages 3294–3298. IEEE, 2020.
[14] Aladine Chetouani. Convolutional neural network and saliency selection for blind image quality assessment. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2835–2839.
IEEE, 2018.
[15] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting human eye fixations
via an lstm-based saliency attentive model. IEEE Transactions on Image Processing, 27(10):5142–5154,
2018.
[16] Sathya Veera Reddy Dendi and Sumohana S Channappayya. No-reference video quality assessment
using natural spatiotemporal scene statistics. IEEE Transactions on Image Processing, 29:5612–5624,
2020.
[17] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition,
pages 248–255. Ieee, 2009.
[18] Guanqun Ding, Nevrez İmamoğlu, Ali Caglayan, Masahiro Murakawa, and Ryosuke Nakamura. Fbnet:
Feedback-recursive cnn for saliency detection. In 2021 17th International Conference on Machine
Vision and Applications (MVA), pages 1–5. IEEE, 2021.
[19] Guanqun Ding, Nevrez İmamoğlu, Ali Caglayan, Masahiro Murakawa, and Ryosuke Nakamura. Salfbnet: Learning pseudo-saliency distribution via feedback convolutional networks. Image and Vision
Computing, 120:104395, 2022.
[20] Richard Droste, Jianbo Jiao, and J Alison Noble. Unified image and video saliency modeling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings,
Part V 16, pages 419–435. Springer, 2020.
[21] Zhengfang Duanmu, Wentao Liu, Zhuoran Li, Diqi Chen, Zhou Wang, Yizhou Wang, and Wen Gao. Assessing the quality-of-experience of adaptive bitrate video streaming. arXiv preprint arXiv:2008.08804,
2020.
[22] Liming Ge, Wei Bao, Dong Yuan, and Bing B Zhou. Edge-assisted deep video denoising and super-resolution for real-time surveillance at night. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, pages 783–785, 2022.
[23] Deepti Ghadiyaram and Alan C Bovik. Massive online crowdsourced study of subjective and objective
picture quality. IEEE Transactions on Image Processing, 25(1):372–387, 2015.
[24] S Alireza Golestaneh, Saba Dadsetan, and Kris M Kitani. No-reference image quality assessment
via transformers, relative ranking, and self-consistency. In Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision, pages 1220–1230, 2022.
[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM,
63(11):139–144, 2020.
[26] Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. Advances in neural
information processing systems, 19, 2006.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778,
2016.
[28] Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe. The konstanz natural video database (konvid-1k). In 2017 Ninth international conference
on quality of multimedia experience (QoMEX), pages 1–6. IEEE, 2017.
[29] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database
for deep learning of blind image quality assessment. IEEE Transactions on Image Processing, 29:4041–
4056, 2020.
[30] Weilong Hou and Xinbo Gao. Saliency-guided deep framework for image quality assessment. IEEE
MultiMedia, 22(2):46–55, 2014.
[31] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,
Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861, 2017.
[32] Chien-Chun Hung, Ganesh Ananthanarayanan, Peter Bodik, Leana Golubchik, Minlan Yu, Paramvir
Bahl, and Matthai Philipose. Videoedge: Processing camera streams using hierarchical clusters. In
2018 IEEE/ACM Symposium on Edge Computing (SEC), pages 115–131. IEEE, 2018.
[33] Aya Abdelsalam Ismail, Hector Corrada Bravo, and Soheil Feizi. Improving deep learning interpretability by saliency guided training. Advances in Neural Information Processing Systems, 34:26726–26739,
2021.
[34] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259,
1998.
[35] Jiayu Ji, Ke Xiang, and Xuanyin Wang. Scvs: blind image quality assessment based on spatial
correlation and visual saliency. The Visual Computer, pages 1–16, 2023.
[36] Sen Jia and Neil DB Bruce. Eml-net: An expandable multi-layer network for saliency prediction.
Image and vision computing, 95:103887, 2020.
[37] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. Salicon: Saliency in context. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1072–1080,
2015.
[38] Tilke Judd, Frédo Durand, and Antonio Torralba. A benchmark of computational models of saliency
to predict human fixations. 2012.
[39] Tilke Judd, Krista Ehinger, Frédo Durand, and Antonio Torralba. Learning to predict where humans
look. In 2009 IEEE 12th international conference on computer vision, pages 2106–2113. IEEE, 2009.
[40] Le Kang, Peng Ye, Yi Li, and David Doermann. Convolutional neural networks for no-reference image
quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1733–1740, 2014.
[41] Wolf Kienzle, Felix A Wichmann, Matthias Franz, and Bernhard Schölkopf. A nonparametric approach
to bottom-up visual saliency. Advances in neural information processing systems, 19, 2006.
[42] Jongyoo Kim and Sanghoon Lee. Fully deep blind image quality predictor. IEEE Journal of selected
topics in signal processing, 11(1):206–220, 2016.
[43] Jari Korhonen. Two-level approach for no-reference consumer video quality assessment. IEEE Transactions on Image Processing, 28(12):5923–5938, 2019.
[44] Jari Korhonen, Yicheng Su, and Junyong You. Blind natural video quality prediction via statistical
temporal features and deep spatial features. In Proceedings of the 28th ACM International Conference
on Multimedia, pages 3311–3319, 2020.
[45] Matthias Kümmerer, Lucas Theis, and Matthias Bethge. Deep gaze i: Boosting saliency prediction
with feature maps trained on imagenet. arXiv preprint arXiv:1411.1045, 2014.
[46] Matthias Kümmerer, Thomas SA Wallis, and Matthias Bethge. Saliency benchmarking made easy:
Separating models, maps and metrics. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 770–787, 2018.
[47] Matthias Kümmerer, Thomas SA Wallis, Leon A Gatys, and Matthias Bethge. Understanding low- and
high-level contributions to fixation prediction. In Proceedings of the IEEE international conference on
computer vision, pages 4789–4798, 2017.
[48] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical model. Journal of
Visual Communication and Image Representation, 41:406–413, 2016.
[49] C-C Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook. Journal of
Visual Communication and Image Representation, page 103685, 2022.
[50] C-C Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook. Journal of
Visual Communication and Image Representation, 90:103685, 2023.
[51] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. Interpretable convolutional neural
networks via feedforward design. Journal of Visual Communication and Image Representation, 60:346–
359, 2019.
[52] Eric Cooper Larson and Damon Michael Chandler. Most apparent distortion: full-reference image
quality assessment and the role of strategy. Journal of electronic imaging, 19(1):011006, 2010.
[53] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. Tghop: an explainable, efficient, and
lightweight method for texture generation. APSIPA Transactions on Signal and Information Processing, 10:e17, 2021.
[54] Chunyi Li, May Lim, Abdelhak Bentaleb, and Roger Zimmermann. A real-time blind quality-of-experience assessment metric for http adaptive streaming. In 2023 IEEE International Conference on
Multimedia and Expo (ICME), pages 1661–1666. IEEE, 2023.
[55] Dingquan Li, Tingting Jiang, and Ming Jiang. Quality assessment of in-the-wild videos. In Proceedings
of the 27th ACM International Conference on Multimedia, pages 2351–2359, 2019.
[56] Dingquan Li, Tingting Jiang, and Ming Jiang. Unified quality assessment of in-the-wild videos with
mixed datasets training. International Journal of Computer Vision, 129(4):1238–1257, 2021.
[57] Xuelong Li, Qun Guo, and Xiaoqiang Lu. Spatiotemporal statistics for video quality assessment. IEEE
Transactions on Image Processing, 25(7):3329–3342, 2016.
[58] Zhi Li, Anne Aaron, Ioannis Katsavounidis, Anush Moorthy, and Megha Manohara. Toward a practical
perceptual video quality metric. The Netflix Tech Blog, 6(2):2, 2016.
[59] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. Kadid-10k: A large-scale artificially distorted iqa database.
In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3.
IEEE, 2019.
[60] Joe Yuchieh Lin, Tsung-Jung Liu, Eddy Chi-Hao Wu, and C-C Jay Kuo. A fusion-based video quality assessment (fvqa) index. In Signal and Information Processing Association Annual Summit and
Conference (APSIPA), 2014 Asia-Pacific, pages 1–5. IEEE, 2014.
[61] Kwan-Yee Lin and Guanxiang Wang. Hallucinated-iqa: No-reference image quality assessment via
adversarial learning. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 732–741, 2018.
[62] Akis Linardos, Matthias Kümmerer, Ori Press, and Matthias Bethge. Deepgaze iie: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 12919–12928, 2021.
[63] Tsung-Jung Liu, Weisi Lin, and C-C Jay Kuo. Image quality assessment using multi-method fusion.
IEEE Transactions on image processing, 22(5):1793–1807, 2012.
[64] Wentao Liu, Zhengfang Duanmu, and Zhou Wang. End-to-end blind quality assessment of compressed
videos using deep neural networks. In ACM Multimedia, pages 546–554, 2018.
[65] Yongxu Liu, Jinjian Wu, Leida Li, Weisheng Dong, Jinpeng Zhang, and Guangming Shi. Spatiotemporal representation learning for blind video quality assessment. IEEE Transactions on Circuits and
Systems for Video Technology, 32(6):3500–3513, 2021.
[66] Yutong Liu, Linghe Kong, Guihai Chen, Fangqin Xu, and Zhanquan Wang. Light-weight ai and iot
collaboration for surveillance video pre-processing. Journal of Systems Architecture, 114:101934, 2021.
[67] Kede Ma, Wentao Liu, Kai Zhang, Zhengfang Duanmu, Zhou Wang, and Wangmeng Zuo. End-to-end
blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing,
27(3):1202–1213, 2017.
[68] Zhanxuan Mei, Yun-Cheng Wang, Xingze He, and C-C Jay Kuo. Greenbiqa: A lightweight blind
image quality assessment method. In 2022 IEEE 24th International Workshop on Multimedia Signal
Processing (MMSP), pages 1–6. IEEE, 2022.
[69] Zhanxuan Mei, Yun-Cheng Wang, Xingze He, Yong Yan, and C-C Jay Kuo. Lightweight high-performance blind image quality assessment. arXiv preprint arXiv:2303.13057, 2023.
[70] Zhanxuan Mei, Yun-Cheng Wang, and C-C Jay Kuo. Greensaliency: A lightweight and efficient image
saliency detection method. arXiv preprint arXiv:2404.00253, 2024.
[71] Zhanxuan Mei, Yun-Cheng Wang, C-C Jay Kuo, et al. Lightweight high-performance blind image
quality assessment. APSIPA Transactions on Signal and Information Processing, 13(1), 2024.
[72] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment
in the spatial domain. IEEE Transactions on image processing, 21(12):4695–4708, 2012.
[73] Anish Mittal, Michele A Saad, and Alan C Bovik. A completely blind video integrity oracle. IEEE
Transactions on Image Processing, 25(1):289–300, 2015.
[74] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a completely blind image quality
analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.
[75] Anush Krishna Moorthy and Alan Conrad Bovik. A two-step framework for constructing blind image
quality indices. IEEE Signal processing letters, 17(5):513–516, 2010.
[76] Anush Krishna Moorthy and Alan Conrad Bovik. Blind image quality assessment: From natural scene
statistics to perceptual quality. IEEE transactions on Image Processing, 20(12):3350–3364, 2011.
[77] Mikko Nuutinen, Toni Virtanen, Mikko Vaahteranoksa, Tero Vuori, Pirkko Oittinen, and Jukka Häkkinen. Cvd2014—a database for evaluating no-reference video quality assessment algorithms. IEEE
Transactions on Image Processing, 25(7):3073–3086, 2016.
[78] Fu-Zhao Ou, Yuan-Gen Wang, and Guopu Zhu. A novel blind image quality assessment method
based on refined natural scene statistics. In 2019 IEEE International Conference on Image Processing
(ICIP), pages 1004–1008. IEEE, 2019.
[79] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Kevin McGuinness, and Noel E O’Connor. Shallow
and deep convolutional networks for saliency prediction. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 598–606, 2016.
[80] Junting Pan, Elisa Sayrol, Xavier Giro-i Nieto, Cristian Canton Ferrer, Jordi Torres, Kevin McGuinness, and Noel E O'Connor. Salgan: Visual saliency prediction with adversarial networks. In CVPR
scene understanding workshop (SUNw), 2017.
[81] Yash Patel, Srikar Appalaraju, and R Manmatha. Saliency driven perceptual image compression. In
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 227–236,
2021.
[82] Xuan Qi and Chen Liu. Enabling deep learning on iot edge: Approaches and evaluation. In 2018
IEEE/ACM Symposium on Edge Computing (SEC), pages 367–372. IEEE, 2018.
[83] Abdul Rehman and Zhou Wang. Reduced-reference image quality assessment by structural similarity
estimation. IEEE transactions on image processing, 21(8):3378–3389, 2012.
[84] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. Low-resolution face
recognition in resource-constrained environments. Pattern Recognition Letters, 149:193–199, 2021.
[85] Michele A Saad and Alan C Bovik. Blind quality assessment of videos using a model of natural scene
statistics and motion coherency. In 2012 Conference Record of the Forty Sixth Asilomar Conference
on Signals, Systems and Computers (ASILOMAR), pages 332–336. IEEE, 2012.
[86] Michele A Saad, Alan C Bovik, and Christophe Charrier. Blind image quality assessment: A natural
scene statistics approach in the dct domain. IEEE transactions on Image Processing, 21(8):3339–3352,
2012.
[87] Michele A Saad, Alan C Bovik, and Christophe Charrier. Blind prediction of natural video quality.
IEEE Transactions on Image Processing, 23(3):1352–1365, 2014.
[88] Pooyan Safari, Behnam Shariati, David Przewozny, Paul Chojecki, Johannes Karl Fischer, Ronald
Freund, Axel Vick, and Moritz Chemnitz. Edge cloud based visual inspection for automatic quality
assurance in production. In 2022 13th International Symposium on Communication Systems, Networks
and Digital Signal Processing (CSNDSP), pages 473–476. IEEE, 2022.
[89] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[90] Yusuf Sani, Andreas Mauthe, and Christopher Edwards. Adaptive bitrate selection: A survey. IEEE
Communications Surveys & Tutorials, 19(4):2985–3014, 2017.
[91] Burr Settles. Active learning literature survey. 2009.
[92] Lili Shen, Chuhe Zhang, and Chunping Hou. Saliency-based feature fusion convolutional network for
blind image quality assessment. Signal, Image and Video Processing, 16(2):419–427, 2022.
[93] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.
[94] Zeina Sinno and Alan Conrad Bovik. Large-scale study of perceptual video quality. IEEE Transactions
on Image Processing, 28(2):612–627, 2018.
[95] Jacob Søgaard, Søren Forchhammer, and Jari Korhonen. No-reference video quality assessment using
codec analysis. IEEE Transactions on Circuits and Systems for Video Technology, 25(10):1637–1650,
2015.
[96] Rajiv Soundararajan and Alan C Bovik. Video quality assessment by reduced reference spatio-temporal
entropic differencing. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):684–
694, 2012.
[97] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 3667–3676, 2020.
[98] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the
inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2818–2826, 2016.
[99] Hossein Talebi and Peyman Milanfar. Nima: Neural image assessment. IEEE transactions on image
processing, 27(8):3998–4011, 2018.
[100] Zhengzhong Tu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C Bovik. Ugc-vqa: Benchmarking blind video quality assessment for user generated content. IEEE Transactions on Image Processing,
30:4449–4464, 2021.
[101] Zhengzhong Tu, Xiangxu Yu, Yilin Wang, Neil Birkbeck, Balu Adsumilli, and Alan C Bovik. Rapique:
Rapid and accurate video quality prediction of user generated content. IEEE Open Journal of Signal
Processing, 2:425–440, 2021.
[102] AFM Uddin, Mst Monira, Wheemyung Shin, TaeChoong Chung, Sung-Ho Bae, et al. Saliencymix: A
saliency guided data augmentation strategy for better regularization. arXiv preprint arXiv:2006.01791,
2020.
[103] Giuseppe Valenzise, Stefano Magni, Marco Tagliasacchi, and Stefano Tubaro. No-reference pixel video
quality monitoring of channel-induced distortion. IEEE transactions on circuits and systems for video
technology, 22(4):605–618, 2011.
[104] Eleonora Vig, Michael Dorr, and David Cox. Large-scale optimization of hierarchical features for
saliency prediction in natural images. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2798–2805, 2014.
[105] Chunfeng Wang, Li Su, and Weigang Zhang. Come for no-reference video quality assessment. In 2018
IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 232–237. IEEE,
2018.
[106] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[107] Yirui Wu, Haifeng Guo, Chinmay Chakraborty, Mohammad Khosravi, Stefano Berretti, and Shaohua
Wan. Edge computing driven low-light image dynamic enhancement for object detection. IEEE
Transactions on Network Science and Engineering, 2022.
[108] Zhujun Xiao, Zhengxu Xia, Haitao Zheng, Ben Y Zhao, and Junchen Jiang. Towards performance
clarity of edge video analytics. In 2021 IEEE/ACM Symposium on Edge Computing (SEC), pages
148–164. IEEE, 2021.
[109] Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing,
25(9):4444–4457, 2016.
[110] Jingtao Xu, Peng Ye, Yong Liu, and David Doermann. No-reference video quality assessment via
feature learning. In 2014 IEEE international conference on image processing (ICIP), pages 491–495.
IEEE, 2014.
[111] Wufeng Xue, Xuanqin Mou, Lei Zhang, Alan C Bovik, and Xiangchu Feng. Blind image quality
assessment using joint statistics of gradient magnitude and laplacian features. IEEE Transactions on
Image Processing, 23(11):4850–4862, 2014.
[112] Sheng Yang, Qiuping Jiang, Weisi Lin, and Yongtao Wang. Sgdnet: An end-to-end saliency-guided
deep neural network for no-reference image quality assessment. In Proceedings of the 27th ACM
international conference on multimedia, pages 1383–1391, 2019.
[113] Yijing Yang, Wei Wang, Hongyu Fu, C-C Jay Kuo, et al. On supervised feature selection from high
dimensional feature spaces. APSIPA Transactions on Signal and Information Processing, 11(1), 2022.
[114] Peng Ye and David Doermann. No-reference image quality assessment using visual codebooks. IEEE
Transactions on Image Processing, 21(7):3129–3138, 2012.
[115] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework
for no-reference image quality assessment. In 2012 IEEE conference on computer vision and pattern
recognition, pages 1098–1105. IEEE, 2012.
[116] Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik. Patch-vq: 'Patching up' the
video quality problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 14019–14029, 2021.
[117] Jongwon Yoon and Suman Banerjee. Hardware-assisted, low-cost video transcoding solution in wireless
networks. IEEE Transactions on Mobile Computing, 19(3):581–597, 2019.
[118] Junyong You and Jari Korhonen. Deep neural networks for no-reference video quality assessment. In
2019 IEEE International Conference on Image Processing (ICIP), pages 2349–2353. IEEE, 2019.
[119] Hui Zeng, Lei Zhang, and Alan C Bovik. A probabilistic quality representation approach to deep blind
image quality prediction. arXiv preprint arXiv:1708.08190, 2017.
[120] Hui Zeng, Lei Zhang, and Alan C Bovik. Blind image quality assessment with a probabilistic quality
representation. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 609–
613. IEEE, 2018.
[121] Dingwen Zhang, Junwei Han, Yu Zhang, and Dong Xu. Synthesizing supervision for learning deep
saliency network without human annotation. IEEE transactions on pattern analysis and machine
intelligence, 42(7):1755–1769, 2019.
[122] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay Kuo. Anomalyhop: an ssl-based image anomaly localization method. In 2021 International Conference on Visual
Communications and Image Processing (VCIP), pages 1–5. IEEE, 2021.
[123] Lin Zhang, Zhongyi Gu, and Hongyu Li. Sdsp: A novel saliency detection method by combining simple
priors. In 2013 IEEE international conference on image processing, pages 171–175. IEEE, 2013.
[124] Lin Zhang, Zhongyi Gu, Xiaoxu Liu, Hongyu Li, and Jianwei Lu. Training quality-aware filters for
no-reference image quality assessment. IEEE MultiMedia, 21(4):67–75, 2014.
[125] Lin Zhang, Lei Zhang, and Alan C Bovik. A feature-enriched completely blind image quality evaluator.
IEEE Transactions on Image Processing, 24(8):2579–2591, 2015.
[126] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image
quality assessment. IEEE transactions on Image Processing, 20(8):2378–2386, 2011.
[127] Lingyun Zhang, Matthew H Tong, Tim K Marks, Honghao Shan, and Garrison W Cottrell. Sun: A
bayesian framework for saliency using natural statistics. Journal of vision, 8(7):32–32, 2008.
[128] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop++: A lightweight
learning model on point sets for 3d classification. arXiv preprint arXiv:2002.03281, 2020.
[129] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An explainable
machine learning method for point cloud classification. IEEE Transactions on Multimedia, 2020.
[130] Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. Blind image quality assessment
using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for
Video Technology, 30(1):36–47, 2018.
Abstract
The domain of blind visual quality assessment has garnered escalating attention in recent times, driven by the escalating influx of user-generated data captured by diverse end devices and users. These data are subsequently disseminated over the Internet without access to the reference, thereby accentuating the significance of perceptual quality evaluation. Such user-generated content, spanning images and videos, constitutes a substantial portion of the data transmitted over the Internet. Consequently, their integration within many image- and video-related processing domains, such as compression, transmission, and pre-processing, necessitates the aid of perceptual quality assessment. However, the absence of references restricts the application of reliable full-reference visual quality assessment methodologies. Therefore, no-reference visual quality assessment, commonly denoted as blind visual quality assessment, emerges as the quintessential approach to automatically and efficiently predict perceptual quality. In this dissertation, we mainly focus on developing lightweight and efficient methods for two research problems in the field of blind visual quality assessment: 1) blind image quality assessment (BIQA) and 2) blind video quality assessment (BVQA). We begin by presenting our proposed GreenBIQA method, a novel approach to BIQA characterized by its compact model size, low computational complexity, and high performance. Building on the foundation of GreenBIQA, we extend its application to BVQA through the development of GreenBVQA. To further enhance the performance of GreenBIQA, we introduce a lightweight and efficient image saliency detection method, termed GreenSaliency. Ultimately, we integrate GreenSaliency with GreenBIQA, culminating in the development of the Green Saliency-guided BIQA method (GSBIQA).