A DATA-DRIVEN APPROACH TO COMPRESSED VIDEO QUALITY ASSESSMENT USING JUST NOTICEABLE DIFFERENCE

by

Haiqiang Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2018

Copyright 2018 Haiqiang Wang

Acknowledgments

I would like to express my sincere gratitude to Professor Kuo for his support, patience, encouragement and thoughtful feedback throughout my graduate studies. Without his help, I would not have been able to see it through. Professor Kuo was and remains my best role model for a scientist, mentor and teacher.

I am also grateful to Dr. Ioannis Katsavounidis for his scientific advice and knowledge and many insightful discussions and suggestions. The research project was funded by Netflix.

I would like to thank my committee members, Professor Aiichiro Nakano, Professor Shrikanth Narayanan, Professor Antonio Ortega and Professor Alexander Sawchuk for providing many valuable comments that improved the presentation and contents of the dissertation.

Finally, I really appreciate the unconditional love and care from my family. I would not have made it this far without them. I truly thank my wife, Tammy, for sticking by my side during good and bad times. We both learned a lot about life and strengthened our commitment and determination to each other.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Significance of the Research
  1.2 Review of Related Research
  1.3 Contributions of the Research
  1.4 Organization of the Dissertation
2 Background
  2.1 Subjective Video Quality Assessment Methodologies
  2.2 Objective Video Quality Assessment Indices
  2.3 Evaluation Criteria
3 VideoSet: A Large-Scale Compressed Video Quality Dataset Based on JND Measurement
  3.1 Introduction
  3.2 Review on Perceptual Visual Coding
  3.3 Source and Compressed Video Content
    3.3.1 Source Video
    3.3.2 Video Encoding
  3.4 Subjective Test Environment and Procedure
    3.4.1 Subjective Test Environment
    3.4.2 Subjective Test Procedure
  3.5 JND Data Post-Processing via Outlier Removal
    3.5.1 Unreliable Subjects
    3.5.2 Outlying Samples
    3.5.3 Normality of Post-processed JND Samples
  3.6 Relationship between Source Content and JND Location
  3.7 Significance and Implications of VideoSet
  3.8 Conclusion
4 Prediction of Satisfied User Ratio and JND Points
  4.1 Introduction
  4.2 JND and SUR for Coded Video
  4.3 Proposed SUR Prediction System
    4.3.1 Spatial-Temporal Segment Creation
    4.3.2 Local Quality Assessment
    4.3.3 Significant Segments Selection
    4.3.4 Quality Degradation Features
    4.3.5 Masking Features
    4.3.6 Prediction of SUR Curves and JND Points
  4.4 Experiment Results
  4.5 Conclusion
5 JND-based Video Quality Model and Its Application
  5.1 Introduction
  5.2 Related Work
  5.3 Derivation of JND-based VQA Model
    5.3.1 Binary Decisions in Subjective JND Tests
    5.3.2 JND Localization by Integrating Multiple Binary Decisions
    5.3.3 Decomposing JND into Content and Subject Factors
  5.4 Proposed JND-based User Model
    5.4.1 Parameter Inference and User Profiling
    5.4.2 Satisfied User Ratio on a Specific User Group
  5.5 Experiment Results
    5.5.1 Experiment Settings
    5.5.2 Experiments on Raw JND Data
    5.5.3 Experiments on Cleaned JND Data
    5.5.4 SUR on Different User Groups
  5.6 Conclusion
6 Conclusion and Future Work
  6.1 Summary of the Research
  6.2 Future Research Directions
    6.2.1 JND Against Flexible References
    6.2.2 Cross-Resolution JND Analysis
Bibliography

List of Tables

3.1 Summarization of source video formats in the VideoSet.
3.2 Summary on test stations and monitor profiling results. The peak luminance and the black luminance columns show the numbers of stations that meet ITU-R BT.1788 in the corresponding metrics, respectively. The color difference column indicates the number of stations whose ΔE value is smaller than a JND threshold. The H value indicates the active picture height.
3.3 The percentages of JND samples that pass the normality test, where the total sequence number is 220.
4.1 SUR and JND prediction setting. The main difference is about the reference used to predict the second and the third JND.
4.2 Summary of averaged prediction errors of the first JND for video clips in four resolutions.
4.3 Mean Absolute Error of predicted SUR, i.e., SUR for the second and the third JND.
4.4 Mean Absolute Error of predicted JND location, i.e., QP for the second and the third JND.
List of Figures

3.1 Comparison between the traditional R-D function and the newly observed stair R-D function. The former does not take the nonlinear human perception process into account.
3.2 Display of 30 representative thumbnails of video clips from the VideoSet, where video scenes in the first three rows are from two long sequences 'El Fuente' and 'Chimera' [53], those in the fourth and fifth rows are from the CableLab sequences [6], while those in the last row are from 'Tears of Steel' [3].
3.3 A photo taken at a test station at Shenzhen University. It represents a typical subjective test environment.
3.4 Results of monitor profiling: (a) chromaticity of the white color in the CIE 1931 color space, (b) the color difference between a specific monitor and the standard, where ΔE ≤ 2.3 corresponds to a JND [63], (c) the peak luminance of the screen, and (d) the luminance ratio of the screen (i.e., the luminance of the black level relative to the peak white).
3.5 Quality voting interface presented to the viewer. Each time, two clips were displayed one after another. Clips for the next comparison were dynamically updated based on the current voting decision.
3.6 An example of the proposed JND search procedure. It is assumed that the votes are 'YES' for the first comparison between (0, 26) and 'NO' for the second comparison between (0, 19), respectively.
3.7 Illustration of unreliable subject detection and removal: (a) the boxplot of z-scores and (b) the dispersion plot of all subjects participating in the same test session, where subjects #4, #7, #8, #9, #32 and #37 are detected as outliers. Subject #15 is kept after removing one sample.
3.8 The boxplot of JND samples of the first 50 sequences with resolution 1080p: (a) the first JND, (b) the second JND, and (c) the third JND. The bottom, the center and the top edges of the box indicate the first, the second and the third quartiles, respectively. The bottom and top whiskers correspond to an interval of [-2.7σ, 2.7σ], which covers 99.3% of all collected samples.
3.9 Representative frames from source sequences #15 (a) and #37 (b), where sequence #15 (tunnel) is the scene of a car driving through a tunnel with the camera mounted on the windshield, while source #37 (dinner) is the scene of a dinner table with a still camera focusing on a male speaker.
3.10 The histograms of three JND points with all 220 sequences included.
3.11 The scatter plots of the mean/std pairs of JND samples: (a) 1080p, (b) 720p, (c) 540p and (d) 360p.
3.12 The JND and the SUR plots of sequence #15 (μ = 30.5, σ = 7.5) and sequence #37 (μ = 22.6, σ = 4.5).
3.13 Comparison of perceptual quality of two coded video sequences. Top row: (a) the reference frame with QP = 0, (b) the coded frame with QP = 25, (c) the coded frame with QP = 36, and (d) the coded frame with QP = 38, of sequence #15. Bottom row: (e) the reference frame with QP = 0, (f) the coded frame with QP = 19, (g) the coded frame with QP = 22, and (h) the coded frame with QP = 27, of sequence #37.
4.1 Representative frames from source sequences (a) #37 and (b) #90.
4.2 SUR modeling from JND samples. The first column and the second column are about sequence #37 and #90, respectively. The top row, the middle row and the bottom row are about the SUR of the first, the second, and the third JND, respectively.
4.3 The block diagram of the proposed SUR prediction system.
4.4 The block diagram of the VMAF quality metric.
4.5 Spatial randomness prediction based on neighborhood information.
4.6 Spatial randomness maps for sequences (a) #37 and (b) #90.
4.7 Temporal randomness maps for sequences (a) #37 and (b) #90.
4.8 JND prediction results: (a), (c), and (e) show the histogram of SUR; (b), (d), and (f) show the predicted vs. the ground-truth JND location. The top row is about the first JND, the middle row about the second JND, and the bottom row about the third JND, respectively.
5.1 Representative frames from 15 source contents.
5.2 Consecutive frames of contents #11 (top) and #203 (bottom), respectively.
5.3 Experimental results: (a) raw JND data, where each pixel represents one JND location and a brighter pixel means the JND occurs at a larger QP; (b) estimated subject bias and inconsistency on raw JND data; (c) estimated content difficulty based on raw JND data, using the proposed VQA+MLE method; (d) estimated JND locations based on raw JND data, using both the proposed VQA+MLE method and the MOS method. Error bars in all subfigures represent the 95% confidence interval.
5.4 Experimental results: (a) cleaned JND data, where each pixel represents one JND location and a brighter pixel means the JND occurs at a larger QP; (b) estimated subject bias and inconsistency (i.e., v_s) on cleaned JND data; (c) estimated content difficulty (i.e., v_c) based on cleaned JND data using the proposed VQA+MLE method; (d) estimated JND locations based on cleaned JND data, using both the proposed VQA+MLE method and the MOS method. Error bars in all subfigures represent the 95% confidence interval.
5.5 Illustration of subject factors. Left: the histogram of the subject bias. Middle: the histogram of subject inconsistency. Right: the scatter plot of subject inconsistency versus the subject bias.
5.6 Illustration of the proposed user model. The blue and red curves demonstrate the SUR of EC and HC contents, respectively. For each content, the three curves show the SUR difference between different user groups.
6.1 Illustration of the fixed reference model and the flexible reference model.
6.2 Scaling factors in a practical video streaming service pipeline.

Abstract

The problem of human-centric compressed video quality assessment (VQA) is studied in this research.
Our studies include three major topics: 1) proposing a new methodology for compressed video quality measurement and assessment based on the just-noticeable-difference (JND) notion and building a large-scale dataset accordingly, 2) measuring the JND-based video quality using the satisfied user ratio (SUR) curve and designing an SUR prediction method with video quality degradation features and masking features, and 3) proposing a probabilistic JND-based video quality model to quantify the influence of subject variabilities as well as content variabilities, and building a user model based on viewers' capability to address the inter-group difference.

For the first topic, the process of building a large-scale coded H.264/AVC video quality dataset, which measures human subjective experience based on the just-noticeable-difference (JND), is described in Chapter 3. The dataset, called the VideoSet, measures the first three JND points of 220 5-second sequences, each at four resolutions (i.e., 1920×1080, 1280×720, 960×540 and 640×360). Each of these 880 video clips was encoded with the H.264/AVC standard with QP = 1, ..., 51. An improved bisection search algorithm was adopted to speed up the subjective test without loss of robustness. We present the subjective test procedure, the detection and removal of outlying measured data, and the properties of the collected JND data.

For the second topic, we propose a machine learning method to predict the satisfied-user-ratio (SUR) curves based on the VideoSet and then derive the JND points accordingly. Our method consists of the following steps. First, we partition a video clip into local spatial-temporal segments and evaluate the quality of each segment using the VMAF quality index. Next, we aggregate these local VMAF measures to derive a global index. Then, significant segments are selected based on the slope of quality scores between neighboring coded clips. After that, we incorporate the masking effect that reflects the unique characteristics of each video clip. Finally, we use support vector regression (SVR) to minimize the L2 distance of the SUR curves, and derive the JND point accordingly.

For the third topic, we propose a JND-based VQA model that takes subject variabilities and content variabilities into account. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. We use subject inconsistency to filter out unreliable video quality scores. Moreover, we build a user model by utilizing the user's capability to discern the quality difference. We study the SUR difference as it varies with the user profile as well as with content of variable difficulty. The proposed model aggregates quality ratings per user group to address the inter-group difference.

Chapter 1
Introduction

1.1 Significance of the Research

Real-time video streaming contributed the largest amount of traffic in both fixed and mobile networks according to the latest global Internet phenomena report [59] in 2016. It accounted for 71% of the downstream bytes of fixed access and 40% of the downstream bytes of mobile access, respectively. Thousands of new titles are added to streaming libraries on a monthly basis. Each new title is scaled and compressed at various resolutions and bit rates to match clients' end terminals and bandwidths.
Streaming service providers, such as Netflix, Youtube, and Amazon, strive to provide their members with the best viewing experience such as smooth video playback and free of picture artifacts. A significant part of this endeavor is to deliver video streams with the best perceptual quality possi- ble, given the constraints of network bandwidths and viewing device resolutions. It is essential to develop a Video Quality Metric (VQM) that can accurately and consistently measure human perceptual quality of a video stream. Traditionally, two methods have been extensively studied to evaluate video qual- ity: 1) subjective visual testing and 2) calculation of quality metrics such as PSNR, or more recently, SSIM [77]. Human satisfaction is the ultimate criteria to determine video quality. However, the subjective test is time-consuming and it is infeasible to con- duct subjective evaluation on every new content. The peak-signal-to-noise-ratio (PSNR) has been used as an objective metric in video coding standards such as MPEG-2 [10], 1 H.264/A VC [57], and more recently, High Efficiency Video Coding (HEVC) [67]. How- ever, it is widely agreed that it is a poor visual quality metric [39]. Absolute Category Rating (ACR) is one of the most commonly used subjective test methods. Test video clips are displayed on a screen for a certain amount of time and observers rate their perceived quality using an abstract scale [25], such as “Excellent (5)”, “Good (4)”, “Fair (3)”, “Poor (2)” and “Bad (1)”. There are two approaches in aggregating multiple scores on a given clip. They are the mean opinion score (MOS) and the difference mean opinion score (DMOS). The MOS is computed as the average score from all subjects while the DMOS is calculated from the difference between the raw quality scores of the reference and the test images. Both MOS and DMOS are popular in the quality assessment community. However, they have several limitations [7, 87]. The MOS scale is as an interval scale rather than an ordinal scale. It is assumed that there is a linear relationship between the MOS distance and the cognitive distance. For example, a quality drop from “Excellent” to “Good” is treated the same as that from “Poor” to “Bad”. There is no difference to a metric learning system as the same ordinal distance is preserved (i.e. the quality distance is 1 for both cases in the aforementioned 5-level scale). However, human viewing experience is quite different when the quality changes at different levels. It is also rare to find a video clip exhibiting poor or bad quality in real-life video applications. As a consequence, the number of useful quality levels drops from five to three. It is too coarse for video quality measurement. The second limitation is that scores from subjects are typically assumed to be inde- pendently and identically distributed (i.i.d.) random variables. This assumption rarely holds. Given multiple quality votings on the same content, individual voting contributes equally in the MOS aggregation method [82]. Subjects may have different levels of expertise on perceived video quality. A critical viewer may give low quality ratings on 2 coded clips whose quality is still good to the majority [31]. The same phenomenon occurs in all presented stimuli. The absolute category rating method is confusing to subjects as they have different understanding and interpretation of the rating scale. Subjective data are noisy due to the nature of “subjective opinion”. 
In the extreme case, some subjects submit random answers rather than good-faith attempts to label. Even worse, adversary votings may happen due to malice or a systematic misinterpre- tation of the task. The collected subjective scores should go through a cleaning and modeling process before being used to validate the performance of objective video qual- ity assessment metrics. Thus, it is critical to study subject capability and reliability to alleviate their effects in the VQA task. There has been a large amount of efforts in developing new visual quality indices to address this problem. Examples include SSIM [77], FSIM [89], DLM [32], VMAF [35], etc. These indices offer, by definition, users’ subjective test results and thus corre- late better than PSNR with their mean (called mean opinion score, or MOS). However, there is one common shortcoming with these indices. That is, the difference of selected contents for ranking is sufficiently large for a great majority of subjects. Since the dif- ference is higher than the just-noticeable-difference (JND) threshold for most people, disparities between visual content pairs are easier to tell. Humans cannot perceive small pixel variation in coded image/video until the differ- ence reaches a certain level. These phenomena are so called spatial masking effects, which have been studied in [4, 69, 14]. For video, temporal masking effect has a sig- nificant impact on perceived video quality as reported in [19, 2]. The masking effects are created by the characteristics of video contents. The Contrast Sensitivity Function (CSF), developed in Daly’s work [13], was used extensively in visual difference pre- diction studies [28, 81, 46, 49]. However, it is difficult to find closed-form functions to 3 other spatial masking effects and build corresponding quality assessment system. Tem- poral masking effect is still an open problem and the interaction between spatial and temporal masking effects makes VQA even complicated. Recently, machine learning based VQA indices[48, 42, 83, 33, 35, 24] are introduced to assess video quality in a data-driven approach rather than struggling with the interrelationship between human brain and perception systems. Supervised learning relies on a large amount of labeled data to perform well. Our goal is to predict visual difference threshold on video quality, and as such, we need accurate, diversified, and representative ground truth to develop our algorithm. There are quite a few video quality datasets have been well accepted to the public, such as LIVE [62], VQEG-HD [15], LIVE mobile [47], MCL-V [38], and NETFLIX-TEST [35]. However, they are limited in the following areas. 1) Most of them contain 10-20 source videos. 2) Distorted clips have supra-threshold difference. Both of them make it difficult to develop our algorithm. We begin with building a JND-based large-scale dataset with diversified contents and developed a JND prediction system based on the collected data. 1.2 Review of Related Research Subjective quality evaluation is the process of employing human viewers for grading video quality based on individual perception. Formal methods and guidelines for sub- jective quality assessments are specified in various ITU recommendations. The most relevant ones to our work are ITU-T Rec. P.910 [56], which defines subjective video quality assessment methods for multimedia applications, and ITU-R Rec. BT.500 [60], which defines a methodology for the subjective assessment of the quality of television 4 pictures, and ITU-R. Rec. 
BT.2022 [5], which gives general viewing conditions for sub- jective assessment of quality of SDTV and HDTV television pictures. These specifica- tions describe a number of test methods with distinct presentation and scoring schemes along with the recommended viewing conditions. When each content is evaluated several times by different subjects, a straightfor- ward approach is to use the most common label or the mean opinion score (MOS) as the true label. Recently, efforts have been made to examine MOS-based subjective test methods. Various methods were proposed from different perspectives to address the aforementioned limitations. A theoretical subject model [26] was proposed to model the three major factors that influence MOS accuracy: subject bias, subject inaccuracy, and stimulus scoring difficulty. It was reported that the distribution of these three factors spanned about25% of the rating scale. Especially, the subject error terms explained previously observed inconsistencies both within a single subject’s data and the lab-to- lab differences. A perceptually weighted rank correlation indicator [85] was proposed, which rewarded the capability of corrected ranking high-quality images and suppressed the attention towards insensitive rank mistakes. A generative model [36] was proposed to jointly recover content and subject factors by solving a maximum likelihood esti- mation problem. However, these models were proposed for the traditional MOS-based approaches. Objective Video Quality Assessment algorithms can be classified into three cate- gories according to the information extracted from pristine reference videos: full ref- erence VQA (FR-VQA), reduced reference VQA (RR-VQA) and no reference VQA (NR-VQA). FR-VQA methods [77, 64, 89, 32, 35, 52] calculate the similarity/difference 5 between distorted clip and reference clip to predict the quality of distorted clips. RR- VQA metrics [66, 44, 17] need only partial information of the reference clip and NR- VQA metrics [58, 33, 34] operate without any information from reference clip. FR- VQA is still a challenging task due to highly non-stationary nature of video signals and an incomplete understanding of the human visual system (HVS). The evidence for this claim is the fact that the state-of-the-art FR-VQA methods have recently been approach- ing a reasonable level of correlation (typically (0:75; 0:9)) with subjective scores. There are two categories of FR-VQA algorithms based on the underlying principles. In the first category, the characteristics of the Human Visual System are explored. A video quality metric tries to simulate the perception process of the HVS. The HVS is modeled as an error-prone communication channel. Video quality assessment is formu- lated as a signal distortion summation problem. Typically, a closed-form expression was created to indicate perceptual quality. In the second category, a quality metric extracts quality relevant features and uses machine learning algorithm to predict the perceptual quality. The Motion-tuned Video Integrity Evaluator (MOVIE) [61] is an HVS-inspired algo- rithm where the response of the visual system to video stimulus is modeled as a function of linear spatio-temporal bandpass filter outputs. The distortion is calculated based on the difference of Gabor coefficients between reference and distorted videos. The Visual Information Fidelity (VIF) [64] adopted an approach to tune the orientation of a set of 3D Gabor filters according to local motion based on optical flow. 
Gabor filter responses are then incorporated into the SSIM to measure video quality. Recently, several learning based VQA metrics were proposed and they all achieved state-of-the-art performance. In VQM [54, 83], the author extracts handcrafted spatio- temporal features based on edge detection filters. A neural network is trained and it 6 achieves the best performance. A multi-method fusion (MMF) method [42] was pro- posed for image quality assessment. A regression approach is used to combine scores of multiple image quality metrics in the MMF. The MMF score is obtained by a non- linear fusion with parameters optimized by a training process. The Video Multimethod Assessment Fusion metric (VMAF) [35] was recently developed by Netflix. It pre- dicts subjective quality by combining multiple elementary quality metrics. The basic rationale is that each elementary metric may have its own strengths and weakness with respect to the source content characteristics, type of artifacts, and degree of distortion. Then elemental scores were fused together using support vector machine. The model is trained and tested using the opinion scores obtained through a subjective experiment and achieves state-of-the-art performance. 1.3 Contributions of the Research Several contributions are made in this research. To address the shortcoming of existing VQA datasets and indices, we propose a JND-based methodology to model the thresh- old of quality degradation of compressed videos. Chapter 3 starts with building a new dataset to facilitate following research. The specific contributions are given below. We propose a new methodology to model the perceived quality of compressed videos. It measures near-threshold quality differences that takes the nonlinearity of the HVS into account. We conducted a large-scale subjective test to measure the first 3 JND points of each source video. The proposed dataset, VideoSet, contains 220 5-second source video clips in four resolutions, i.e., 1920 1080, 1280 720, 960 540, 640 360. It is the most comprehensive collection of public-domain Full-HD video sequences with representative and diversified contents. 7 We adopt an improved bisection search algorithm to speed up subjective test with- out loss of robustness. We conduct statistical analysis on the collected JND samples. It is shown that JND follows normal distribution and a consistent video quality index, the Satisfied User Ratio, was proposed and studied. In Chapter 4, we propose a Satisfied User Ratio (SUR) prediction method using a machine learning framework. Intuitively, solving the JND problem is equivalent to addressing the following question: When, where, and to what extent will the distortion make a coded clip different from its reference? To achieve this objective, we partition each source and the coded clip pair into local spatial-temporal segments and evaluate their similarities using the state-of-the-art VMAF video quality index. Then, these local segments similarity descriptors are aggregated to give a compact global representation. Finally, we incorporate the masking effect that reflects the unique characteristics of each video clip globally. We use the support vector regression (SVR) to minimize the L 2 distance of the SUR curves and derive the JND points accordingly. The contributions of the proposed SUR prediction method are listed below. Video quality are assessed locally using a state-of-the-art video quality index. 
Local segments similarity descriptors are aggregated and a dynamic feature set is extracted to characterize global quality degradation of a coded clip. A quality-degradation-slope-based selection method was proposed to identify sig- nificant segments which are more correlated to quality difference. We extract masking effects features to reflects the unique characteristics of refer- ence video clip. It compensates threshold difference due to content diversity (the first JND). It also captures threshold change incurred by compression (the second and the third JND). 8 We use the support vector regression (SVR) to minimize the L 2 distance of the SUR curves. The JND location could be derived by thresholding with a specific target SUR. In Chapter 5, we build a JND-based VQA model using a probabilistic framework to explain that the JND location is a normally distributed random variable. While most traditional VQA models focus on content variabilities, our proposed VQA model takes both subject and content variabilities into account. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. Estimated subject inconsistency is used to filter out unreliable video quality scores. We build a user model by using user’s capability to discern the quality difference. We study the SUR difference as it varies with user profile as well as content with variable level of difficulty. The proposed model aggregates qual- ity ratings per user group to address inter-group difference. It is demonstrated that the model is flexible to predict SUR distribution of a specific user group. The contributions of the proposed user model are listed below. We model video quality ratings by taking subject variabilities and content vari- abilities into account. A probabilistic model is derived to explain that the JND location is a normally distributed random variable. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. We use estimated subject inconsistency to identify and filter out inconsistent sub- jects. We build a user model by profiling users based on their capability to discern quality difference. 9 We study the SUR difference as it varies with user profile as well as content with variable level of difficulty. The proposed model aggregates quality ratings per user group to address inter-group difference. 1.4 Organization of the Dissertation The rest of this dissertation is organized as follows. The background of this research is described in Chapter 2. It briefly revisits full-reference video quality assessment metrics. Then, a JND based video quality assessment dataset is described in Chapter 3. The satisfied user ratio regression method and the JND prediction method are presented in Chapter 4. Video quality scores analysis and the proposed user model are presented in 5. Finally, concluding remarks and future research are given in Chapter 6. 10 Chapter 2 Background 2.1 Subjective Video Quality Assessment Methodologies Subjective test is the ultimate means to measure quality of experience (QoE). It pools viewing experience from a group of subjects and pins down Ground Truth in terms of mean opinion score (MOS). However, reliable and meaningful results can only be obtained if experiments are properly designed and subjective test was conducted follow- ing a strict methodology. 
Several international recommendations have been published to provide guidelines for conducting subjective visual quality assessment, such as ITU-R BT.1788, 2007; ITU-T P.910, 2008, ITU-R BT.500, 2012; ITU-R BT.2022, etc. These recommendations result from experience gathered by different groups, e.g. Video Quality Experts Group (VQEG), JPEG, MPEG, Video Coding Experts Group (VCEG), and some ITU study groups. The different recommendations cover the selec- tion of test material, set up of the viewing environment, choice of test method, pre- and post-screening of the subjects, and analysis of data. They serve as a set of best practices and guidelines that should be followed when designing a subjective test. The proposed Just Noticeable Difference methodology follows most of the relevant recommendations. The only difference between JND and traditional work is about test method because we are interested in the threshold of visual quality differences. In Single Stimulus Absolute Category Rating (SS-ACR), the stimuli are presented one at a time and are rated independently on a category scale. Usually, a five- grade quality scale is used with 5 being the best and 1 the worst quality, e.g., 11 5) Excellent, 4) Good, 3) Fair, 2) Poor, 1) Bad. This test method is suitable to evaluate contents that cover a large quality range. In Double Stimulus Impairment Scale (DS-IS) method, subjects are presented with pairs of video sequences in a fixed order (the impaired reference comes first and then the paired clip). Subjects are asked to rate the impairments scale of the second stimulus, e.g., 1) Imperceptible, 2) Perceptible, but not annoying, 3) Slightly annoying, 4) Annoying and 5) Very annoying. Labels associated with Imperceptible/Perceptible are valuable for high quality systems. The proposed Pair Comparison Just Noticeable Difference (PC-JND) achieves the highest discriminatory power. Two stimulus are randomly presented to the subjects rather than in a fixed order. Subjects are asked to answer whether they are perceptually the same. The proposed method directly models the boundary between perceptually lossy/lossless visual experience. 2.2 Objective Video Quality Assessment Indices Video quality assessment is a more complex problem than that of image because of the temporal dimension and increased dimension of data. Thus, a common approach con- sists of performing a frame-by-frame analysis using standard image quality metrics and pooling individual scores to compute an overall quality score. More complex video qual- ity assessments consist of spatio-temporal analysis, motion-based temporal alignment or slicing. We briefly review several image/video quality metrics that are milestones in the quality assessment field. 1) Natural Visual Statistics: SSIM [77] is the most commonly used perceptual FR image quality metric. It is based on the assumption that structure information percep- tion plays an important role in perceived quality. It can be made very fast and delivers 12 highly competitive image quality predictions against human judgments, particularly in multiscale implementation [79]. SSIM is defined as a product of three terms computed over small image patches: SSIM(x;y) =jl(x;y)j jc(x;y)j js(x;y)j ; (2.1) where x and y denote corresponding patches from reference and compressed images. l(x;y) is the luminance similarity term, c(x;y) is the contrast similarity term, and s(x;y) is the structural similarity term. 
A more compact notation of SSIM is defined as:

SSIM(x, y) = [(2 μ_x μ_y + C_1)(2 σ_xy + C_2)] / [(μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)],   (2.2)

where μ_x and σ_x (μ_y and σ_y) are the mean and standard deviation of the local luminance of patch x (patch y), respectively, and σ_xy is the cross-covariance between the two patches. The constant terms C_1 and C_2 are included to avoid division by zero. The block size is 11 × 11. A block-wise SSIM map is generated first, and the global index is the average of its entries.

For video sequences, the VSSIM [78] metric measures the quality of the distorted video at three levels, namely the local region level, the frame level, and the sequence level. The local quality index is obtained as a function of the SSIM indices for the Y, Cb, and Cr components as:

SSIM_ij = W_Y SSIM_ij^Y + W_Cb SSIM_ij^Cb + W_Cr SSIM_ij^Cr,   (2.3)

where W_Y, W_Cb, and W_Cr are weights for the Y, Cb, and Cr components, respectively.

In addition to SSIM and VSSIM, the MultiScale-SSIM (MS-SSIM) [79] has been proposed. The MS-SSIM is an extension of the single-scale approach used in SSIM and provides more flexibility by incorporating variations of the image resolution and viewing conditions. At every stage (also referred to as a scale), the MS-SSIM method applies a low-pass filter to the reference and distorted images and downsamples the filtered images by a factor of two. At the m-th scale, contrast and structure comparisons are evaluated separately. Similarities at different stages are combined as:

MS-SSIM(x, y) = [l_M(x, y)]^{α_M} Π_{m=1}^{M} [c_m(x, y)]^{β_m} [s_m(x, y)]^{γ_m}.   (2.4)

The MS-SSIM index can be extended to video by applying it frame-by-frame on the luminance component of the video, and the overall MS-SSIM index for the video is computed as the average of the frame-level quality scores.

2) Natural Visual Features: The Video Quality Metric (VQM) software tools [54], developed by the National Telecommunications and Information Administration (NTIA), provide standardized methods to measure the perceived video quality of digital video systems. The main impairments considered include blurring, block distortion, jerky/unnatural motion, noise in the luminance and chrominance channels, and error blocks (e.g., transmission errors). A core component of both VQM and VQM-VFD [83] is a spatial information (SI) filter that detects long edges. This filter is similar to the classical Sobel filter in that separate horizontal and vertical filters are applied. It assumes that subjects focus on long edges and tend to ignore short edges. As the filter size increases, individual pixels and small details have a decreasing impact on the edge strength. By contrast, the Sobel filter responds identically to short and long edges. The optimal filter size depends upon the resolution of the target video; a filter size of 13 works well for HD video. The blurring information is computed using a 13-pixel spatial information filter (SI13). Jerky/unnatural motion is detected by considering the shift of horizontal and vertical edges with respect to the diagonal orientation due to heavy blurring. Then, the shift of edges from the diagonal to horizontal and vertical orientations due to tiling or blocking artifacts is considered. A weighted linear combination of all the impairment metrics is used to arrive at the VQM rating. The NTIA/VQM General Model was the first model that broke the 0.9 threshold of the Pearson correlation coefficient on the VQEG FRTV Phase II test database [11]. It was standardized by ANSI in July 2003 (ANSI T1.801.03-2003) and included as a normative model in ITU Recommendations ITU-T J.144 and ITU-R BT.1683 (both adopted in 2004).
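Before turning to the learning-based metrics, the block-wise computation of Eq. (2.2) can be made concrete with a minimal numpy sketch. This is not the reference implementation of [77]: the constants follow the values commonly used for 8-bit images (K1 = 0.01, K2 = 0.03, L = 255), non-overlapping blocks replace the usual sliding 11×11 Gaussian window for brevity, and the helper names are ours.

import numpy as np

def ssim_block(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """SSIM of two co-located luminance patches, following Eq. (2.2)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_mean(ref, dist, block=11):
    """Average block-wise SSIM over non-overlapping blocks (a simplification)."""
    h, w = ref.shape
    scores = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            scores.append(ssim_block(ref[i:i + block, j:j + block],
                                     dist[i:i + block, j:j + block]))
    return float(np.mean(scores))

Averaging the per-frame values of ssim_mean over the luminance channel gives the frame-pooled video score described above.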
3) Machine Learning Metrics: The Video Multimethod Assessment Fusion (VMAF) [35] metric is an open-source full-reference perceptual video quality index that aims to capture the perceptual quality of compressed video. It is intended to be useful as an absolute score across all types of content, and it focuses on quality degradation due to compression as well as rescaling.

VMAF first estimates the quality score of a video clip with multiple high-performance image quality indices on a frame-by-frame basis. Then, these image quality scores are fused together using a support vector machine (SVM) at each frame. Currently, two image fidelity metrics and one temporal signal have been chosen as features for the SVM. These elementary metrics and features were chosen from amongst other candidates through iterations of testing and validation.

Detail Loss Measure (DLM) [32]. DLM is an image quality metric based on the rationale of separately measuring the loss of details, which affects content visibility, and the redundant impairment, which distracts viewer attention. It estimates the blurriness component in the distortion signal using a wavelet decomposition. It uses the contrast sensitivity function (CSF) to model the human visual system (HVS), and the wavelet coefficients are weighted based on CSF thresholds.

Visual Information Fidelity (VIF) [64]. VIF is based on visual statistics combined with HVS modeling. It quantifies the Shannon information shared between the source and the distortion relative to the information contained in the source itself. The masking and sensitivity aspects of the HVS are modeled through a zero-mean, additive white Gaussian noise model in the wavelet domain that is applied to both the reference image and the distorted image model.

Motion information. Motion in videos is chosen as the temporal signal, since the HVS is less sensitive to quality degradation in high-motion frames. The global motion value of a frame is the mean co-located pixel difference of a frame with respect to the previous frame. Since noise in the video can be misinterpreted as motion, a low-pass filter is applied before the difference calculation.

Results on various video quality databases show that VMAF outperforms other video quality indices such as PSNR, SSIM [77], Multiscale Fast-SSIM [8], and PSNR-HVS [55] in terms of the Spearman Rank Correlation Coefficient (SRCC), the Pearson Correlation Coefficient (PCC), and the root-mean-square error (RMSE) criteria. VMAF achieves comparable performance to, or outperforms, the state-of-the-art video index, the VQM-VFD index [83], on several publicly available databases.

2.3 Evaluation Criteria

As described in the VQEG report [11], there are three commonly used metrics for evaluating the performance of objective video quality metrics. These include the following:

The Pearson correlation coefficient (PCC) is the linear correlation coefficient between the predicted MOS (DMOS) and the subjective MOS (DMOS). It measures the prediction accuracy of a metric, i.e., the ability to predict the subjective quality ratings with low error. For N data pairs (x_i, y_i), with x̄ and ȳ being the means of the respective data sets, the PCC is given by:

PCC = Σ_i (x_i − x̄)(y_i − ȳ) / ( √(Σ_i (x_i − x̄)²) √(Σ_i (y_i − ȳ)²) ).   (2.5)

Typically, the PCC is computed after performing a nonlinear regression using a logistic function, as described in [11], in order to fit the objective metric quality scores to the subjective quality scores:
y = β_1 (0.5 − 1 / (1 + e^{β_2 (x − β_3)})) + β_4 x + β_5,   (2.6)

where x is an objective quality score and β_i, i = 1, ..., 5, are fitting parameters.

The Spearman rank order correlation coefficient (SROCC) is the correlation coefficient between the ranks of the predicted MOS (DMOS) and the subjective MOS (DMOS). It measures the prediction monotonicity of a metric, i.e., the degree to which the predictions of a metric agree with the relative magnitudes of the subjective quality ratings. The SROCC is defined as:

SROCC = Σ_i (X_i − X̄)(Y_i − Ȳ) / ( √(Σ_i (X_i − X̄)²) √(Σ_i (Y_i − Ȳ)²) ),   (2.7)

where X_i is the rank of x_i and Y_i is the rank of y_i in the ordered data series, and X̄ and Ȳ denote the respective midranks.

The Root Mean Square Error (RMSE) for N data points x_i, i = 1, ..., N, with x̄ being the mean of the data set, is defined as:

RMSE = √( (1/N) Σ_i (x_i − x̄)² ).   (2.8)

The fidelity of an objective quality assessment metric to the subjective assessment is considered high if the Pearson and Spearman correlation coefficients are close to 1. Some studies use the Root Mean Square Error (RMSE) to measure the degree of accuracy of the predicted objective scores.
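As an illustration of how these criteria are computed in practice, the following is a minimal Python sketch using scipy. The input arrays, the initial guess for the logistic fit, and the choice to compute the RMSE on the residuals after the fit are illustrative assumptions rather than part of the VQEG specification.

import numpy as np
from scipy import stats, optimize

def logistic(x, b1, b2, b3, b4, b5):
    # Five-parameter logistic mapping of Eq. (2.6).
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def evaluate_metric(objective, subjective):
    """Return (PCC, SROCC, RMSE) of an objective metric against subjective scores."""
    objective = np.asarray(objective, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    # SROCC is rank-based, so it is computed on the raw metric outputs.
    srocc = stats.spearmanr(objective, subjective).correlation
    # Fit the logistic mapping before computing PCC; the initial guess is a heuristic.
    p0 = [np.ptp(subjective), 0.1, np.mean(objective), 0.0, np.mean(subjective)]
    params, _ = optimize.curve_fit(logistic, objective, subjective, p0=p0, maxfev=20000)
    fitted = logistic(objective, *params)
    pcc = stats.pearsonr(fitted, subjective)[0]
    rmse = float(np.sqrt(np.mean((fitted - subjective) ** 2)))
    return pcc, srocc, rmse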
Chapter 3
VideoSet: A Large-Scale Compressed Video Quality Dataset Based on JND Measurement

3.1 Introduction

Digital video plays an important role in our daily life. About 70% of today's Internet traffic is attributed to video, and it will continue to grow to the 80-90% range within a couple of years. It is critical to have a major breakthrough in video coding technology to accommodate the rapid growth of video traffic. Despite the introduction of a set of fine-tuned coding tools in the standardization of H.264/AVC and H.265 (or HEVC), a major breakthrough in video coding technology is needed to meet the practical demand. To address this problem, we need to examine the limitations of today's video coding methodology.

Today's video coding technology is based on Shannon's source coding theorem, where a continuous and convex rate-distortion (R-D) function for a probabilistic source is derived and exploited (see the black curve in Fig. 3.1). However, humans cannot perceive small variations in pixel differences. Psychophysical studies on the just-noticeable-distortion (JND) clearly demonstrated the nonlinear relation between human perception and physical changes. The traditional R-D function does not take this nonlinear human perception process into account. In the context of image/video coding, recent subjective studies in [37] show that humans can only perceive discrete-scale distortion levels over a wide range of coding bitrates (see the red curve in Fig. 3.1). Without loss of generality, we use H.264/AVC video as an example to explain it. The quantization parameter (QP) is used to control its quality; the smaller the QP, the better the quality. Although one can choose a wide range of QP values, humans can only differentiate a small number of discrete distortion levels among them. In contrast with the conventional R-D function, the perceived R-D curve is neither continuous nor convex. Rather, it is a stair function that contains a couple of jump points (or the JND points). The JND is a statistical quantity that accounts for the maximum difference unnoticeable to a human being.

Figure 3.1: Comparison between the traditional R-D function and the newly observed stair R-D function. The former does not take the nonlinear human perception process into account.

The measure of coded image/video quality using the JND notion was first proposed in [37]. As a follow-up, two small-scale JND-based image/video quality datasets were released by the Media Communications Lab at the University of Southern California. They are the MCL-JCI dataset [29] and the MCL-JCV dataset [70], targeting JPEG images and H.264/AVC video, respectively. To build a large-scale JND-based video quality dataset, an alliance of academic and industrial organizations was formed and the subjective test data were acquired in Fall 2016. The resulting dataset is called the 'VideoSet', an acronym for 'Video Subject Evaluation Test (SET)'. The VideoSet consists of 220 5-second sequences in four resolutions (i.e., 1920×1080, 1280×720, 960×540 and 640×360). Each of the 880 video clips is encoded using the x264 [1] encoder implementation of the H.264/AVC standard with QP = 1, ..., 51, and its first three JND points are measured with 30+ subjects. All source/coded video clips as well as the measured JND data included in the VideoSet are available to the public in the IEEE DataPort [72].

3.2 Review on Perceptual Visual Coding

By perceptual visual coding (PVC), one attempts to exploit the perception characteristics of the human visual system (HVS) to reduce the psychovisual redundancy of coded video. A couple of PVC techniques were proposed recently, which are reviewed below. The structural similarity (SSIM) index [77] was incorporated in the rate-distortion optimization and, then, an optimum bit allocation and a perceptual rate control scheme were developed to achieve considerable bitrate reduction in [51]. An SSIM-inspired residual divisive normalization scheme was proposed to replace the conventional SAD and SSD in [74] and [75]. A quality-consistent encoder was proposed in [84] by incorporating the SSIM index and adjusting the quantization parameter at the MB level adaptively. Saliency-based PVC algorithms were developed based on the assumption that only a small region has the highest resolution on the fovea and that visual acuity decreases quickly away from the fixation point. Foveation theory has become popular in quality evaluation [12, 88, 22]. It is intuitive to allocate more bits in regions with strong visual saliency.

The JND provides an alternative to the design of PVC algorithms by considering the distortion visibility threshold. An explicit spatio-temporal JND estimation model was proposed in [28] by integrating the spatio-temporal contrast sensitivity function (CSF), eye movement, luminance adaptation, and intra- and inter-band contrast masking. An adaptive block-size transform (ABT) based JND model was proposed in [45], where both spatial and temporal similarities were exploited to decide the optimal block size of the JND map. More recently, standard-compliant JND models were proposed in [43, 30]. An estimated masking threshold was integrated into the quantization process and followed by the so-called rate-perceptual-distortion optimization (RPDO). Significant bit reduction was reported with little perceptual quality degradation. A JND-based perceptual quality optimization was proposed for JPEG in [92]. A foveated JND (FJND) model was proposed in [9] to adjust the spatial and temporal JND thresholds. The quantization parameter was optimized at the MB level based on the improved distortion visibility threshold. Yet highly visible or annoying artifacts may change the saliency map derived directly from the reference.
A saliency-preserving framework was recently proposed in [18] to improve this drawback, where the saliency map's change on coded frames was taken into consideration in quantization parameter (QP) selection and RDO mode decision.

However, a benchmarking dataset is lacking in the PVC community. Researchers either resort to image/video quality datasets that do not aim at PVC (e.g., VQEG [15], LIVE [62] and MCL-V [38]) or build a small in-house dataset by themselves. This motivates us to build a large-scale subjective test dataset on perceived video quality. It is worthwhile to point out that subjective tests in traditional visual coding were only conducted by very few experts called golden eyes, which corresponds to a worst-case analysis. With the emergence of big data science and engineering, the worst-case analysis cannot reflect the statistical behavior of the group-based quality of experience (QoE). When the subjective test is conducted among a viewer group, it is more meaningful to study their QoE statistically to yield an aggregated result.

3.3 Source and Compressed Video Content

We describe both the source and the compressed video content in this section.

3.3.1 Source Video

Figure 3.2: Display of 30 representative thumbnails of video clips from the VideoSet, where video scenes in the first three rows are from two long sequences 'El Fuente' and 'Chimera' [53], those in the fourth and fifth rows are from the CableLab sequences [6], while those in the last row are from 'Tears of Steel' [3].

The VideoSet consists of 220 source video clips, each of which has a duration of 5 seconds. We show thumbnail images for 30 representative video clips in Fig. 3.2. The source video clips were collected from publicly available datasets in [53, 6, 3]. The original sequences have multiple spatial resolutions (i.e., 4096×2160, 4096×1714 and 3840×2160), frame rates (i.e., 60, 30, 24) and color formats (i.e., YUV444p, YUV422p, YUV420p). We pay special attention to the selection of these source video clips to avoid redundancy and enrich the diversity of the selected contents.

Table 3.1: Summarization of source video formats in the VideoSet.

Source              | Selected | Frame rate (orig./trim.) | Spatial resolution (orig./trim.) | Pixel format (orig./trim.)
El Fuente           | 31       | 60 / 30                  | 4096×2160 / 3840×2160            | YUV444p / YUV420p
Chimera             | 59       | 30 / 30                  | 4096×2160 / 3840×2160            | YUV422p / YUV420p
Ancient Thought     | 11       | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Eldorado            | 14       | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Indoor Soccer       | 5        | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Life Untouched      | 15       | 60 / 30                  | 3840×2160 / 3840×2160            | YUV444p / YUV420p
Lifting Off         | 13       | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Moment of Intensity | 10       | 60 / 30                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Skateboarding       | 9        | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Unspoken Friend     | 13       | 24 / 24                  | 3840×2160 / 3840×2160            | YUV422p / YUV420p
Tears of Steel      | 40       | 24 / 24                  | 4096×1714 / 3840×2160            | YUV420p / YUV420p

After content selection, we process each 5-second video clip to ensure that they are in a similar format. Their formats are summarized in Table 3.1, where the first column shows the names of the source video material of longer duration and the second column indicates the number of video clips selected from each source material. The third, fourth and fifth columns describe the frame rate, the spatial resolution and the pixel format, respectively. They are further explained below, and a command-line sketch of this preparation follows the list.

Frame Rate. The frame rate affects the perceptual quality of certain contents significantly [52]. Contents of a higher frame rate (e.g., 60 fps) demand a more powerful CPU and a larger memory to avoid impairments in playback. For this reason, if the original frame rate is 60 fps, we convert it to 30 fps to ensure smooth playback in a typical environment. If the original frame rate is not greater than 30 fps, no frame rate conversion is needed.

Spatial Resolution. The aspect ratio of most commonly used display resolutions for web users is 16:9. For inconsistent aspect ratios, we scale them to 16:9 by padding black horizontal bars above and below the active video window. As a result, all video clips are of the same spatial resolution, 3840×2160.

Pixel Format. We down-sample the trimmed spatial resolution 3840×2160 (2160p) to four lower resolutions: 1920×1080 (1080p), 1280×720 (720p), 960×540 (540p) and 640×360 (360p) for the subjective test in building the VideoSet. In the spatial down-sampling process, the Lanczos interpolation [68] is used to keep a good compromise between low- and high-frequency components. Also, the YCbCr 4:2:0 chroma sampling is adopted for maximum compatibility. It is worthwhile to point out that 1080p and 720p are the two most dominant video formats on the web nowadays, while 540p and 360p are included to capture the viewing experience on tablets or mobile phones.

After the above-mentioned processing, we obtain 880 uncompressed sequences in total.
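The kind of conversion described above (Lanczos downscaling, frame-rate conversion and YCbCr 4:2:0 output) can be scripted, for example, by invoking ffmpeg from Python. The function, file names and parameter choices below are hypothetical and only illustrate the idea; the exact recipe used for the VideoSet is the one documented with the released dataset.

import subprocess

def preprocess(src, dst, width, height, fps=30):
    """Downscale with Lanczos, set the frame rate and convert to YCbCr 4:2:0.

    Illustrative sketch only, not the actual VideoSet preparation script.
    """
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}:flags=lanczos",  # Lanczos resampling
        "-r", str(fps),                                  # target frame rate
        "-pix_fmt", "yuv420p",                           # 4:2:0 chroma sampling
        dst,
    ]
    subprocess.run(cmd, check=True)

# e.g., preprocess("source_2160p.y4m", "clip_1080p.y4m", 1920, 1080)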
3.3.2 Video Encoding

We use the H.264/AVC [1] high profile to encode each of the 880 sequences and choose the constant quantization parameter (CQP) mode as the primary bit rate control method. The adaptive QP adjustment is reduced to the minimum amount, since our primary goal is to understand the direct relationship between the quantization parameter and perceptual quality. The encoding recipe is included in the read-me file of the released dataset.

The QP values under our close inspection lie in [8, 47]. It is unlikely to observe any perceptual difference between the source and coded clips with a QP value smaller than 8. Furthermore, coded video clips with a QP value larger than 47 will not be able to offer acceptable quality. On the other hand, it is ideal to examine the full QP range, namely [0, 51], in the subjective test, since the JND measure depends on the anchor video that serves as a fixed reference. To find a practical solution, we adopt the following modified scheme. The reference is losslessly encoded and referred to as QP = 0. We use the source QP = 0 to substitute for all sequences with a QP value smaller than 8. Similarly, sequences with a QP value larger than 47 are substituted by that with QP = 47. The modification has no influence on the subjective test result. This will become transparent when we describe the JND search procedure in Sec. 3.4.2. By including the source and all coded video clips, there are 220 × 4 × 52 = 45,760 video clips in the VideoSet.
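For concreteness, the constant-QP encoding loop can be sketched with the x264 command-line encoder as follows. The file-naming scheme and the option subset shown here are assumptions for illustration; the authoritative encoding recipe is the one in the dataset's read-me file.

import subprocess

def encode_cqp(src_y4m, out_prefix, qp_values=range(1, 52)):
    """Encode one source clip once per QP with x264 in constant-QP mode.

    A sketch only: the full VideoSet recipe (profile details, GOP structure,
    etc.) is given in the dataset's read-me file.
    """
    for qp in qp_values:
        subprocess.run(
            ["x264",
             "--qp", str(qp),          # constant quantization parameter
             "--profile", "high",      # H.264/AVC high profile
             "--output", f"{out_prefix}_qp{qp:02d}.264",
             src_y4m],
            check=True,
        )

# e.g., encode_cqp("clip_1080p.y4m", "clip_1080p") covers QP = 1, ..., 51.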
3.4 Subjective Test Environment and Procedure

The subjective test environment and procedure are described in detail in this section.

3.4.1 Subjective Test Environment

The subjective test was conducted in six universities in the city of Shenzhen, China. There were 58 stations dedicated to the subjective test. Each station offered a controlled, non-distracting laboratory environment. The viewing distance was set based on the ITU-R BT.2022 recommendation. The background chromaticity and luminance were set up as an environment of a common office/laboratory. We did not conduct monitor calibration in different test stations. The monitors were adjusted to a comfortable setting for test subjects. The uncalibrated monitors provided a natural platform to capture the practical viewing experience in our daily life. On the other hand, the monitors used in the subjective test were profiled for completeness. A photo is given in Fig. 3.3 to show the test environment. Monitor profiling results are given in Fig. 3.4 and summarized in Table 3.2. As shown in Table 3.2, most stations comply with the ITU recommendations.

Figure 3.3: A photo taken at a test station at Shenzhen University. It represents a typical subjective test environment.

Figure 3.4: Results of monitor profiling: (a) chromaticity of white color in the CIE 1931 color space, (b) the color difference between a specific monitor and the standard, where ΔE ≤ 2.3 corresponds to one JND [63], (c) the peak luminance of the screen, and (d) the luminance ratio of the screen (i.e., the ratio of the black level luminance to the peak white).

Table 3.2: Summary of test stations and monitor profiling results. The peak luminance and the black luminance columns show the numbers of stations that meet ITU-R BT.1788 in the corresponding metrics, respectively. The color difference column indicates the number of stations whose ΔE value is smaller than a JND threshold. The H value indicates the active picture height.

Resolution | Station Number | Peak Luminance (cd/m²) | Black Luminance | Color Difference | Viewing Distance (H)
1080p | 15 | 15 | 15 | 13 | 3.2
720p | 15 | 15 | 15 | 15 | 4.8
540p | 14 | 14 | 14 | 13 | 6.4
360p | 14 | 13 | 14 | 11 | 7

We indexed each video clip with a content ID and a resolution ID and partitioned the 880 video clips into 58 packages. Each package contains 14 or 15 sequence sets of a content/resolution ID pair, and each sequence set contains one source video clip and all of its coded video clips. One subject can complete one JND point search for one package in one test session. The duration of one test session was around 35 minutes with a 5-minute break in the middle. Video sequences were displayed in their native resolution without scaling on the monitor. The color of the inactive screen was set to light gray.

We randomly recruited around 800 students to participate in the subjective test. A brief training session was given to each subject before a test session started. In the training session, we used different video clips to show quality degradation of coded video contents. The scenario of our intended application, namely the daily video streaming experience, was explained. Any question from the subject about the subjective test was also answered.

3.4.2 Subjective Test Procedure

In the subjective test, each subject compares the quality of two clips displayed one after another, and determines whether these two sequences are noticeably different or not. The subject should choose either 'YES' or 'NO' to proceed. The subject has an option to ask to play the two sequences one more time. A screenshot of the voting interface is given in Fig. 3.5. The comparison pair is updated based on the response.

Figure 3.5: Quality voting interface prompted to the viewer. Each time, two clips were displayed one after another.
Clips for the next comparison were dynamically updated based on the current voting decision.

One aggressive binary search procedure was described in [70] to speed up the JND search process. At the first comparison, the procedure asked a subject whether there was any noticeable difference between QP = 0 and QP = 25. If a subject made an unconfident decision of 'YES' at the first comparison, the test procedure would exclude the interval QP = [26, 51] from the next comparison. Even if the subject then selected 'Unnoticeable Difference' in all comparisons afterwards, the final JND location would stay at QP = 25; it could no longer belong to QP = [26, 51]. A similar problem arose if a subject made an unconfident decision of 'NO' at the first comparison.

To fix this problem, we adopt a more robust binary search procedure in our current subjective test. Instead of eliminating the entire left or right half interval, only one quarter of the original interval at the farthest location with respect to the region of interest is dropped in the new test procedure. Thus, if a subject makes an unconfident decision of 'YES' at the first comparison, the test procedure removes the interval QP = [39, 51] so that the updated interval is QP = [0, 38]. An example is given in Fig. 3.6. The dashed arrows indicate how the search range is updated as the test proceeds. The new binary search procedure allows a buffer even if a wrong decision is made. The comparison points may oscillate around the final JND position but still converge to it. The new binary search procedure proves to be more robust than the previous one at the cost of a slightly increased number of comparisons (i.e., from 6 comparisons in the previous procedure to 8 comparisons in the new procedure).

Let $x_n \in [0, 51]$ be the QP used to encode a source sequence. We use $x_s$ and $x_e$ as the start and end QP values of a search interval, $[x_s, x_e]$, at a certain round. Since $x_s < x_e$, the quality of the coded video clip with QP $= x_s$ is better than that with QP $= x_e$. We use $x_a$ to denote the QP value of the anchor video clip. It is fixed in the entire binary search procedure until the JND point is found. The QP value, $x_c$, of the comparison video is updated within $[x_s, x_e]$. One round of the binary search procedure is described in Algorithm 1. The global JND search algorithm is stated below.

Algorithm 1: One round of the JND search procedure.
  Data: QP range $[x_s, x_e]$
  Result: JND location $x_n$
  $x_a = x_s$; $x_l = x_s$; $x_r = x_e$; flag = true;
  while flag do
    if $x_a$ and $x_c$ have a quality difference then
      $x_n = x_c$;
      if $x_c - x_l \le 1$ then flag = false;
      else $x_r = \lfloor (x_l + 3x_r)/4 \rfloor$; $x_c = \lfloor (x_l + x_r)/2 \rfloor$;
      end
    else
      if $x_r - x_c \le 1$ then flag = false;
      else $x_l = \lceil (3x_l + x_r)/4 \rceil$; $x_c = \lceil (x_l + x_r)/2 \rceil$;
      end
    end
  end

Initialization. We set $x_s = 0$ and $x_e = 51$.

Search range update. If $x_a$ and $x_c$ exhibit a noticeable quality difference, update $x_r$ to the third quartile of the range. Otherwise, update $x_l$ to the first quartile of the range. The floor and ceiling rounding operations, denoted by $\lfloor\cdot\rfloor$ and $\lceil\cdot\rceil$, are used in the update process as shown in Algorithm 1.

Comparison video update. The QP value of the comparison video clip is set to the middle point of the range under evaluation, rounded to an integer.

Termination. There are two termination cases. First, if $x_c - x_l \le 1$ and the comparison result is 'Noticeable Difference', the search process is terminated and $x_c$ is set to the JND point. Second, if $x_r - x_c \le 1$ and the comparison result is 'Unnoticeable Difference', the process is terminated and the JND is the latest $x_c$ for which the comparison result was 'Noticeable Difference'.
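A minimal Python sketch of one search round follows. The subject's vote is abstracted by a hypothetical callback has_noticeable_difference(x_a, x_c); the initial comparison point and the fallback value when no difference is ever reported are assumptions, not part of the released test software.

```python
import math

def jnd_search_round(x_s, x_e, has_noticeable_difference):
    """One round of the modified binary search (a sketch of Algorithm 1)."""
    x_a = x_s                      # anchor, fixed for the whole round
    x_l, x_r = x_s, x_e
    x_c = (x_l + x_r) // 2         # assumed initial comparison point (e.g., 25 for [0, 51])
    x_n = x_e                      # assumed fallback if no difference is ever reported
    while True:
        if has_noticeable_difference(x_a, x_c):
            x_n = x_c
            if x_c - x_l <= 1:
                break
            x_r = math.floor((x_l + 3 * x_r) / 4)   # drop only the farthest quarter
            x_c = math.floor((x_l + x_r) / 2)
        else:
            if x_r - x_c <= 1:
                break
            x_l = math.ceil((3 * x_l + x_r) / 4)
            x_c = math.ceil((x_l + x_r) / 2)
    return x_n

# First JND of one clip: jnd_search_round(0, 51, vote), where vote() records the subject's answer.
```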
Figure 3.6: An example of the proposed JND search procedure. It is assumed that the votes are 'YES' for the first comparison between (0, 26) and 'NO' for the second comparison between (0, 19), respectively.

The JND location depends on the characteristics of the underlying video content, the visual discriminant power of a subject and the viewing environment. Each JND point can be modeled as a random variable with respect to a group of test subjects. We search and report three JND points for each video clip in the VideoSet. It will be argued in Sec. 3.6 that the acquisition of three JND values is sufficient for practical applications.

For a coded video clip set, the same anchor video is used for all test subjects. The anchor video selection procedure is given below. We plot the histogram of the current JND point collected from all subjects and then set the QP value at its first quartile as the anchor video in the search of the next JND point. For this QP value, 75% of test subjects cannot notice a difference. We select this value rather than the median value, where 50% of test subjects cannot see a difference, so as to set up a higher bar for the next JND point. The first JND point search is conducted for QP belonging to [0, 51]. Let $x_N$ be the QP value of the $N$th JND point for a given sequence. The QP search range for the $(N+1)$th JND is $[x_N, 51]$.

3.5 JND Data Post-Processing via Outlier Removal

Outliers refer to observations that are significantly different from the majority of other observations. The notion applies to both test subjects and collected samples. In practice, outliers should be eliminated to allow more reliable conclusions. For JND data post-processing, we apply outlier detection and removal based on the individual subject and the collected JND samples. They are described below.

3.5.1 Unreliable Subjects

As described in Sec. 3.3.2, video clips are encoded with QP = [8, 47] while QP = 0 denotes the source video without any quality degradation. The QP range is further extended to [0, 51] by substituting video of QP = [1, 7] with video of QP = 0, and video of QP = [48, 51] with video of QP = 47. With this substitution, the video for QP = [1, 7] is actually lossless, and no JND point should lie in this range. If a JND sample of a subject falls in this interval, the subject is treated as an outlier. All collected samples from this subject are removed.

The ITU-R BT.1788 document provides a statistical procedure for subject screening. It examines the score consistency of a subject against all subjects in a test session, where the scores typically range from 1 to 5, denoting from the poorest to the best quality levels. This is achieved by evaluating the correlation coefficient between the scores of a particular subject and the mean scores of all subjects for the whole test session, where the Pearson correlation or the Spearman rank correlation is compared against a pre-selected threshold. However, this procedure does not apply to the collected JND data properly since our JND data is the QP value of the coded video that meets the just noticeable difference criterion.

Alternatively, we adopt a z-score consistency check. Let $x_n^m$ be the sample obtained from subject $m$ on the video sequence set with video index $n$, where $m = 1, 2, \ldots, M$ and $n = 1, 2, \ldots, N$.
For subject $m$, we can form a vector of his/her associated samples as
$$x^m = (x_1^m, x_2^m, \ldots, x_N^m). \qquad (3.1)$$
Its mean and standard deviation (SD) vectors against all subjects can be written as
$$\mu = (\mu_1, \mu_2, \ldots, \mu_N), \quad \mu_n = \frac{1}{M}\sum_{m=1}^{M} x_n^m, \qquad (3.2)$$
$$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N), \quad \sigma_n = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M} \left(x_n^m - \mu_n\right)^2}. \qquad (3.3)$$
Then, the z-score vector of subject $m$ is defined as
$$z^m = (z_1^m, z_2^m, \ldots, z_N^m), \quad z_n^m = \frac{x_n^m - \mu_n}{\sigma_n}. \qquad (3.4)$$
The quantity $z_n^m$ indicates the distance between the raw score and the population mean in the SD unit for subject $m$ and video clip $n$. The dispersion of the z-score vector shows the consistency of an individual subject with respect to the majority. Both the range and the SD of the z-score vector, $z^m$, are used as dispersion metrics. They are defined as
$$R = \max(z^m) - \min(z^m), \quad \text{and} \quad D = \mathrm{std}(z^m), \qquad (3.5)$$
respectively. A larger dispersion indicates that the corresponding subject gives inconsistent evaluation results in the test. A subject is identified as an outlier if the associated range and SD values of the z-score vector are both large.

An example is shown in Fig. 3.7. We provide the boxplot of z-scores for all 32 subjects in Fig. 3.7a and the corresponding dispersion plot in Fig. 3.7b. The horizontal and vertical axes of Fig. 3.7b are the range and the SD metrics, respectively. For this particular test example, subjects #8, #9 and #32 are detected as outliers because some of their JND samples have QP = 1. Subjects #4, #7 and #27 are removed since their range and SD are both large. For subject #15, the SD value is small yet the range is large due to one sample. We remove that sample and keep the others.

Figure 3.7: Illustration of unreliable subject detection and removal: (a) the boxplot of z-scores and (b) the dispersion plot of all subjects participating in the same test session, where subjects #4, #7, #8, #9, #32 and #37 are detected as outliers. Subject #15 is kept after removing one sample.

3.5.2 Outlying Samples

Besides unreliable subjects, we consider outlying samples for a given test content. This may be caused by the impact of the unique characteristics of different video contents on the perceived quality of an individual. Here, we use the Grubbs' test [16] to detect and remove outliers. It detects one outlier at a time. If one sample is declared as an outlier, it is removed from the dataset, and the test is repeated until no outliers are detected.

We use $s = (s_1, s_2, \ldots, s_N)$ to denote a set of raw samples collected for one test sequence. The test statistic is the largest absolute deviation of a sample from the sample mean in the SD unit. Mathematically, the test statistic can be expressed as
$$G = \max_{i=1,\ldots,N} \frac{|s_i - \bar{s}|}{\hat{\sigma}}, \qquad (3.6)$$
where $\bar{s}$ and $\hat{\sigma}$ are the sample mean and the sample SD, respectively. At a given significance level denoted by $\alpha$, a sample is declared as an outlier if
$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}, \qquad (3.7)$$
where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t-distribution with $N-2$ degrees of freedom. In our subjective test, the sample size is around $N = 30$ after removing unreliable subjects and outlying samples. We set the significance level at $\alpha = 0.05$ as a common scientific practice. Then, a sample is identified as an outlier if its distance to the sample mean is larger than 2.9085 SD units.
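The screening steps of this section, together with the normality check of Sec. 3.5.3, can be sketched in a few lines of Python; the dispersion thresholds that decide when a range/SD pair counts as "large" are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

def z_scores(X):
    """Per-clip z-scores (Eqs. 3.2-3.4). X is an M x N array of JND samples
    (M subjects, N clips); each column is standardized with the unbiased SD."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return (X - mu) / sd

def unreliable_subjects(X, r_thr=3.0, d_thr=1.0):
    """Flag subjects whose z-score range and SD are both large (Eq. 3.5).
    The thresholds r_thr and d_thr are illustrative assumptions."""
    Z = z_scores(X)
    R = Z.max(axis=1) - Z.min(axis=1)
    D = Z.std(axis=1, ddof=1)
    return np.where((R > r_thr) & (D > d_thr))[0]

def grubbs_filter(s, alpha=0.05):
    """Iteratively remove the most deviant sample per Grubbs' test (Eqs. 3.6-3.7)."""
    s = np.asarray(s, dtype=float)
    while s.size > 2:
        N = s.size
        i = int(np.argmax(np.abs(s - s.mean())))
        G = abs(s[i] - s.mean()) / s.std(ddof=1)
        t = stats.t.ppf(1 - alpha / (2 * N), N - 2)
        g_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
        if G > g_crit:
            s = np.delete(s, i)
        else:
            break
    return s

# Normality check of the cleaned samples (Sec. 3.5.3), e.g.:
# jb_stat, p_value = stats.jarque_bera(grubbs_filter(raw_samples))
```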
3.5.3 Normality of Post-processed JND Samples

Each JND point is a random variable. We would like to check whether it can be approximated by a Gaussian random variable [70] after outlier removal. The χ² test was suggested in ITU-R BT.500 to test whether a collected set of samples is normal or not. It calculates the kurtosis coefficient of the data samples and asserts that the distribution is Gaussian if the kurtosis is between 2 and 4.

Here, we adopt the Jarque-Bera test [27] to conduct the normality test. It is a two-sided goodness-of-fit test for normality of observations with unknown parameters. Its test statistic is defined as
$$JB = \frac{n}{6}\left(s^2 + \frac{(k-3)^2}{4}\right), \qquad (3.8)$$
where $n$ is the sample size, $s$ is the sample skewness and $k$ is the sample kurtosis. The test rejects the null hypothesis if the statistic JB in Eq. (3.8) is larger than the precomputed critical value at a given significance level, $\alpha$, which can be interpreted as the probability of rejecting the null hypothesis given that it is true.

We show the percentage of sequences passing the normality test in Table 3.3. It is clear from the table that a great majority of JND points do follow the Gaussian distribution after the post-processing procedure.

Table 3.3: The percentages of JND samples that pass the normality test, where the total number of sequences is 220.

Resolution | The first JND | The second JND | The third JND
1080p | 95.9% | 95.9% | 93.2%
720p | 94.1% | 98.2% | 95.9%
540p | 94.5% | 97.7% | 96.4%
360p | 95.9% | 97.7% | 95.5%

3.6 Relationship between Source Content and JND Location

We show the JND distribution of the first 50 sequences (out of 220 sequences in total) at resolution 1080p in Fig. 3.8. The figure includes three sub-figures which show the distributions of the first, the second, and the third JND points, respectively. Generally speaking, there are large variations among JND points across different sequences.

Figure 3.8: The boxplot of JND samples of the first 50 sequences at resolution 1080p: (a) the first JND, (b) the second JND, and (c) the third JND. The bottom, the center and the top edges of the box indicate the first, the second and the third quartiles, respectively. The bottom and top whiskers correspond to an interval of ±2.7σ, which covers 99.3% of all collected samples.

We examine sequences #15 (tunnel) and #37 (dinner) to offer deeper insights into the JND distribution. Representative frames are given in Fig. 3.9. Sequence #15 is a scene with fast motion and rapid background change. As a result, the masking effect is strong. It is not a surprise that the JND samples vary a lot among different subjects. As shown in Fig. 3.8a, the JND samples of this sequence have the largest deviation among the 50 sequences in the plot. This property is clearly revealed by the collected JND samples. Sequence #37 is a scene captured around a dinner table. It focuses on a male speaker against a dark background.
The face of the man offers visual saliency that attracts the attention of most people. Thus, the quality variation of this sequence is more noticeable than that of others and its JND distribution is more compact. As shown in Fig. 3.8a, sequence #37 has the smallest SD among the 50 sequences.

Figure 3.9: Representative frames from source sequences #15 (a) and #37 (b), where sequence #15 (tunnel) is the scene of a car driving through a tunnel with the camera mounted on the windshield, while sequence #37 (dinner) is the scene of a dinner table with a still camera focusing on a male speaker.

Furthermore, we plot the histograms of the first, the second, and the third JND points of all 220 sequences in Fig. 3.10. They are centered around QP = 27, 31 and 34, respectively. For daily video services such as over-the-top (OTT) content, the QP values are in the range of 18 to 35. Furthermore, take the traditional 5-level quality criteria as an example (i.e., excellent, good, fair, poor, bad): the quality at the third JND is between fair and poor. For these reasons, we argue that it is sufficient to measure 3 JND points. The quality of coded video clips that go beyond this range is too bad to be acceptable by today's viewers in practical Internet video streaming scenarios.

Figure 3.10: The histograms of three JND points with all 220 sequences included.

Figure 3.11: The scatter plots of the mean/std pairs of JND samples: (a) 1080p, (b) 720p, (c) 540p and (d) 360p.

The scatter plots of the mean and SD pairs of JND samples at the four resolutions are shown in Fig. 3.11. We observe similar general trends in all four resolutions. For example, the SD values of the second and the third JND points are significantly smaller than that of the first JND point. The first JND point, which is the boundary between perceptually lossy and lossless coded video, is the most difficult for subjects to determine. The main source of observed artifacts is slight blurriness. In contrast, subjects are more confident in their decisions on the second and the third JND points. The dominant factor is noticeable blockiness.

The masking effect plays an important role in the visibility of artifacts. For sequences with a large SD value, such as sequence #15 in Fig. 3.8a, the masking effect is strong. On one hand, the JND arrives earlier for some people who are less affected by the masking effect so that they can see the compression artifacts easily. On the other hand, the compression artifacts are masked for others so that the coding artifacts are less visible. For the same reason, the masking effect is weaker for sequences with a smaller SD value.

3.7 Significance and Implications of VideoSet

The peak signal-to-noise ratio (PSNR) value has been used extensively in the video coding community as the video quality measure. Although it is easy to measure, it is not exactly correlated with the subjective human visual experience [39].
The JND measure demands a great amount of effort in conducting the subjective evaluation test. However, once a sufficient amount of data is collected, it is possible to use machine learning techniques to predict the JND value within a short interval. The construction of the VideoSet serves this purpose.

In general, we can convert a set of measured JND samples from a test sequence to its satisfied user ratio (SUR) curve through integration from the smallest to the largest JND values. For the discrete case, we can change the integration operation to the summation operation. For example, to satisfy p% of viewers with respect to the first JND, we can divide all viewers into two subsets, the first (100−p)% and the remaining p%, according to ordered JND values. Then, we can set the boundary QP value, QP_p, between the two subsets as the target QP value in video coding. For the first subset of viewers, their JND value is smaller than QP_p so that they can see the difference between the source and coded video clips. For the second subset of viewers, their JND value is larger than QP_p so that they cannot see the difference between the source and coded video clips. We call the latter group the satisfied user group.

When we model the JND distribution as a normal distribution, the SUR curve becomes the Q-function. Two examples are given in Fig. 3.12, where the first JND points of sequences #15 and #37 are plotted based on their approximating normal distributions, with the mean and SD values derived from the subjective test data. Their corresponding Q-functions are also plotted. The Q-function is the same as the SUR curve. For example, the top quartile of the Q-function gives the QP value to encode the video content whose quality will satisfy 75% of viewers in the sense that they cannot see the difference between the coded video and the source video. In other words, it is perceptually lossless compression for these viewers.

Figure 3.12: The JND and the SUR plots of sequence #15 (μ = 30.5, σ = 7.5) and sequence #37 (μ = 22.6, σ = 4.5).

We show four representative thumbnail images from the two examples in Fig. 3.13. The top and bottom rows are encoded results of sequence #15 and sequence #37, respectively. The first column has the best quality with QP = 0. Columns 2-4 are encoded with the QP values at the first quartiles of the first, the second, and the third JND points. For a great majority of viewers (say, 75%), the video clip at the first JND point is perceptually lossless with respect to the reference shown in the first column. The video clip at the second JND point begins to exhibit noticeable artifacts. The quality of the video clip at the third JND point is significantly worse.

The VideoSet and the SUR quality metric have the following four important implications.

1. It is well known that the comparison of PSNR values of coded video of different contents does not make much sense. In contrast, we can compare the SUR values of coded video of different contents. In other words, the SUR value offers a universal quality metric.

2. We are not able to tell whether a certain PSNR value is sufficient for some video contents. It is determined by an empirical rule. In contrast, we can determine the proper QP value to satisfy a certain percentage of targeted viewers. It provides a practical and theoretically solid foundation for selecting the operating QP for rate control.
3. To the best of our knowledge, the VideoSet is the largest-scale subjective test ever conducted to measure the response of the human visual system (HVS) to coded video. It goes beyond the PSNR quality metric and opens a new door for video coding research and standardization, i.e., data-driven perceptual coding.

4. Based on the SUR curve, we can find out the reason for the existence of the first JND point. Then, we can try to mask the noticeable artifacts with novel methods so as to shift the first JND point to a larger QP value. It could be easier to fool human eyes than to improve the PSNR value.

Figure 3.13: Comparison of perceptual quality of two coded video sequences. Top row: (a) the reference frame with QP=0, (b) the coded frame with QP=25, (c) the coded frame with QP=36, and (d) the coded frame with QP=38, of sequence #15. Bottom row: (e) the reference frame with QP=0, (f) the coded frame with QP=19, (g) the coded frame with QP=22, and (h) the coded frame with QP=27, of sequence #37.

3.8 Conclusion

The construction of a large-scale compressed video quality dataset based on the JND measurement, called the VideoSet, was described in detail in this paper. The subjective test procedure, the detection and removal of outlying measured data, and the properties of the collected JND data were detailed. The significance and implications of the VideoSet for future video coding research and standardization efforts were presented. It points out a clear path to data-driven perceptual coding.

One of the follow-up tasks is to determine the relationship between the JND point location and the video content. We need to predict the mean and the variance of the first, second and third JND points based on the calibrated dataset, namely the VideoSet. The application of machine learning techniques to the VideoSet for accurate and efficient JND prediction over a short time interval is challenging but an essential step to make data-driven perceptual coding practical for real-world applications. Another follow-up task is to find out the artifacts caused by today's coding technology to which humans are sensitive. Once we know the reason, it is possible to mask the artifacts with some novel methods so that the first JND point can be shifted to a larger QP value. The perceptual coder can achieve an even higher coding gain if we take this into account in the next-generation video coding standard.

Chapter 4
Prediction of Satisfied User Ratio and JND Points

4.1 Introduction

A large amount of bandwidth of fixed and mobile networks is consumed by real-time video streaming. It is desired to lower the bandwidth requirement by taking human visual perception into account. Although the peak signal-to-noise ratio (PSNR) has been used as an objective measure in video coding standards for years, it is generally agreed that it is a poor visual quality metric that does not correlate with human visual experience well [39]. There has been a large amount of effort in developing new visual quality indices to address this problem, including SSIM [77], FSIM [89], DLM [32], etc. Humans are asked to evaluate the quality of visual contents with a set of discrete or continuous values called opinion scores, typically in the range 1-5, with 5 being the best and 1 the worst quality. Such scores are, by definition, users' subjective test results, and the above indices correlate better than PSNR with their mean (called the mean opinion score, or MOS). However, there is one shortcoming with these indices.
That is, the difference between the selected contents for ranking is sufficiently large for a great majority of subjects. Since the difference is higher than the just-noticeable-difference (JND) threshold for most people, disparities between visual content pairs are easier to tell.

Humans cannot perceive small pixel variations in coded image/video until the difference reaches a certain level. There is a recent trend to measure the JND threshold directly for each individual subject. The idea was first proposed in [37]. An assessor is asked to compare a pair of coded image/video contents and determine whether they are the same or not in the subjective test, and a bisection search is adopted to reduce the number of comparisons. Two small-scale JND-based image/video quality datasets were built by the Media Communications Lab at the University of Southern California. They are the MCL-JCI dataset [29] and the MCL-JCV dataset [70]. They target the JND measurement of JPEG coded images and H.264/AVC coded video, respectively.

The number of JPEG coded images reported in [29] is 50 while the number of subjects is 30. The distribution of multiple JND points was modeled by a Gaussian Mixture Model (GMM) in [23], where the number of mixtures was determined by the Bayesian Information Criterion (BIC). The MCL-JCV dataset in [70] consists of 30 video clips of wide content variety and each of them was evaluated by 50 subjects. Differences between consecutive JND points were analyzed with outlier removal. It was also shown in [70] that the distribution of the first JND samples of multiple subjects can be well approximated by the normal distribution. The JND measure was further applied to HEVC coded clips and, more importantly, a JND prediction method was proposed in [24]. The masking effect was considered, related features were derived from the source video, and a spatial-temporal sensitive map (STSM) was defined to capture the unique characteristics of the source content. The JND prediction problem was treated as a regression problem.

More recently, a large-scale JND-based video quality dataset, called the VideoSet, was built and reported in [72]. The VideoSet consists of 220 5-second sequences, each at four resolutions (i.e., 1920×1080, 1280×720, 960×540 and 640×360). Each of these 880 video clips was encoded by the x264 encoder implementation [1] of the H.264/AVC standard with QP = 1, …, 51, and the first three JND points were evaluated by 30+ subjects. The VideoSet dataset is available to the public in the IEEE DataPort [?]. It includes all source/coded video clips and the measured JND data.

Figure 4.1: Representative frames from source sequences (a) #37 and (b) #90.

In this work, we focus on the prediction of the satisfied user ratio (SUR) curves for the VideoSet and derive the JND points from the predicted curves. This is different from the approach in [24], which attempted to predict the JND point directly. Here, we adopt a machine learning framework for the SUR curve prediction. First, we partition a video clip into local spatial-temporal segments and evaluate the quality of each segment using the VMAF [35] quality index. Then, we aggregate these local VMAF measures to derive a global one. Finally, the masking effect is incorporated and support vector regression (SVR) is used to predict the SUR curves, from which the JND points can be derived. Experimental results are given to demonstrate the performance of the proposed SUR prediction method.
The rest of this chapter is organized as follows. The SUR curve prediction problem is defined in Sec. 4.2. The SUR prediction method is detailed in Sec. 4.3. Experimental results are provided in Sec. 4.4. Finally, concluding remarks and future research directions are given in Sec. 4.5.

4.2 JND and SUR for Coded Video

Consider a set of clips $d_i$, $i = 0, 1, 2, \ldots, 51$, coded from the same source video $r$, where $i$ is the quantization parameter (QP) index used in H.264/AVC. Typically, clip $d_i$ has a higher PSNR value than clip $d_j$ if $i < j$, and $d_0$ is the losslessly coded copy of $r$. We use the first JND to demonstrate the process of modeling the SUR curve from JND samples. The same methodology also applies to the second and the third JND.

The first JND location is the transitional index $i$ that lies on the boundary between the perceptually lossless and lossy visual experience for a subject. The first JND is a random variable rather than a fixed quantity since it varies with several factors, including the visual content under evaluation, the test subject and the test environment. Based on the study in [?], the JND position can be approximated by a Gaussian distribution in the form of
$$X \sim \mathcal{N}(\bar{x}, s^2), \qquad (4.1)$$
where $\bar{x}$ and $s$ are the sample mean and sample standard deviation, respectively.

We say that a viewer is satisfied if the compressed video appears to be perceptually the same as the reference. Mathematically, the satisfied user ratio (SUR) of video clip $d_i$ can be expressed as
$$S_i = 1 - \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}_m(d_i), \qquad (4.2)$$
where $M$ is the total number of subjects and $\mathbf{1}_m(d_i) = 1$ or $0$ if the $m$th subject can or cannot see the difference between compressed clip $d_i$ and its reference, respectively. The summation term on the right-hand side of Eq. (4.2) is the empirical cumulative distribution function (CDF) of the random variable $X$ given in Eq. (4.1). Then, by plugging Eq. (4.1) into Eq. (4.2), we can obtain a compact formula for the SUR curve as
$$S_i = Q(d_i \mid \bar{x}, s^2), \qquad (4.3)$$
where $Q(\cdot)$ is the Q-function of the normal distribution.

Consecutive JND points were modeled sequentially in the subjective test. The satisfied user ratios for the second and the third JND points are defined similarly. The only difference is the reference used for quality comparison. To be specific, the losslessly encoded clip, i.e., $d_0$, is used as the reference clip to search for the first JND. We modeled the first SUR curve as described above when the subjective test of the first JND was finished. The 75% point of the first SUR curve is used as the reference for the second JND. Then, the reference for the third JND is the 75% point of the second SUR curve.

An example of SUR curve modeling is given in Fig. 4.2 (a), (c), (e). They show the three JND points of one sequence. The reference for the first JND is the perceptually lossless coded clip $d_0$ and the location of the 75% SUR point is $d_{18}$. Then $d_{18}$ is used as the reference for the second JND, and the collected JND samples for the second JND are shown in Fig. 4.2 (c). Lastly, the third JND uses $d_{23}$ as its reference for quality comparison. The corresponding JND samples and SUR curve are shown in Fig. 4.2 (e).
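Equations (4.1)-(4.3) translate directly into a few lines of NumPy/SciPy. The sketch below fits the Gaussian model to the JND samples of one clip, evaluates the SUR with the Gaussian survival function (the Q-function), and reads off the 75% point that serves as the anchor of the next JND.

```python
import numpy as np
from scipy import stats

def sur_curve(jnd_samples, qp_grid=np.arange(0, 52)):
    """SUR curve under the Gaussian model of Eq. (4.1): S_i is the survival
    function (Q-function) of N(mean, std^2) evaluated at QP index i."""
    mean = np.mean(jnd_samples)
    std = np.std(jnd_samples, ddof=1)
    return stats.norm.sf(qp_grid, loc=mean, scale=std)

def qp_at_sur(jnd_samples, target=0.75):
    """Largest QP whose modeled SUR is still at least `target`, e.g. the 75%
    point used as the anchor for the next JND search round."""
    sur = sur_curve(jnd_samples)
    return int(np.max(np.where(sur >= target)[0]))

# With the sequence #37 statistics quoted in Fig. 3.12 (mean 22.6, SD 4.5),
# the 75% point lands near QP = 19, matching the first-JND quartile shown in Fig. 3.13.
```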
4.3 Proposed SUR Prediction System

The SUR curve is primarily determined by two factors: 1) quality degradation due to compression and 2) the masking effect. To shed light on the impact of the masking effect, we use sequences #37 (DinnerTable) and #90 (ToddlerFountain) as examples. Their representative frames are shown in Fig. 4.1 (a) and (b) and the first JND data distributions are given in Fig. 4.2 (a) and (b), respectively. Sequence #37 is a scene captured around a dining table. It focuses on a male speaker against a still, dark background. His face is the visually salient region that attracts people's attention. The masking effect is weak and, as a result, the JND point arrives earlier (i.e., a smaller $i$ value in $d_i$). On the other hand, sequence #90 is a scene about a toddler playing in a fountain. The masking effect is strong due to the water drops in the background and fast object movement. As a result, compression artifacts are difficult to perceive and the JND point arrives later.

Figure 4.2: SUR modeling from JND samples. The first column and the second column are about sequence #37 and #90, respectively. The top row, the middle row and the bottom row are about the SUR of the first, the second, and the third JND, respectively.

The masking effect also leads to significant differences in the second and the third JND. We can observe this phenomenon by comparing the second SUR curve of sequence #37, Fig. 4.2 (c), with that of sequence #90, Fig. 4.2 (d). DinnerTable has much smaller JND points than ToddlerFountain. Indeed, the JND points of the third JND of DinnerTable, i.e., Fig. 4.2 (e), are smaller than the first JND points of ToddlerFountain, i.e., Fig. 4.2 (b). Thus, we need to pay special attention to the masking effect while predicting SUR curves.

The block diagram of the proposed SUR prediction system is given in Fig. 4.3. When a subject evaluates a pair of video clips, different spatial-temporal segments of the two video clips are successively assessed. The segment dimensions are spatially and temporally bounded. The spatial dimension is determined by the area where the sequence is projected on the fovea. The temporal dimension is limited by the fixation duration or the smooth pursuit duration, where a noticeable difference is more likely to happen than during saccades [20, 50]. Thus, the proposed SUR prediction system first evaluates the quality of local spatial-temporal segments. Then, similarity indices in these local segments are aggregated to give a compact global index. Next, significant segments are selected based on the slope of quality scores between neighboring coded clips. After that, we incorporate the masking effect that reflects the unique characteristics of each video clip. Finally, we use support vector regression (SVR) to minimize the $L_2$ distance of the SUR curves, and derive the JND point accordingly. Several major modules of the system are detailed below.

Figure 4.3: The block diagram of the proposed SUR prediction system.

4.3.1 Spatial-Temporal Segment Creation

The perception of video distortions is closely linked to visual attention mechanisms. The HVS is intrinsically a limited system. The visual inspection of the visual field is performed through many visual attention mechanisms.
The eye movements can be mainly decomposed into three types [20]: saccades, fixations, and smooth pursuits. Saccades are very rapid eye movements allowing humans to explore the visual field. Fixation is a residual movement of the eye when the gaze is fixed on a particular area of the visual field. Pursuit movement is the ability of the eyes to smoothly track the image of a moving object. Saccades allow us to mobilize the visual sensory resources (i.e., all parts of the HVS dedicated to processing the visual signal coming from the central part of the retina, the fovea) on the different parts of a scene. Between two saccade periods, a fixation (or smooth pursuit) occurs.

When a human observer assesses a video sequence, different spatiotemporal segments of the video sequence are successively assessed. These segments are spatially limited by the area of the sequence projected on the fovea. Furthermore, these segments are temporally limited by the fixation duration, or by the smooth pursuit duration. The perception of video distortions is likely to happen during a fixation or during a smooth pursuit.

The purpose of this module is to divide a video clip into multiple spatial-temporal segments and evaluate their quality at the eye fixation or smooth pursuit level. The dimension of a spatial-temporal segment is W × H × T. In the case of eye pursuit, the spatial dimension should be large enough while the temporal dimension should be short enough to ensure that the moving object is still covered in one segment. In the case of eye fixation, the spatial dimension should not be too large and the temporal dimension should not be too long to represent quality well at the fixation level. Based on the studies in [83, 50, 76], we set W = 320, H = 180 and T = 0.5s here. The neighboring segments overlap by 50% in the spatial dimension. For example, the original dimension of a 720p video is 1280 × 720 × 5s, and there are 7 × 7 × 10 = 490 segments created from each clip.
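The segment layout of this subsection can be written down in a few lines. The sketch below assumes a 30 fps clip and reproduces the 7 × 7 × 10 = 490 count quoted above for a 5-second 720p clip.

```python
def segment_grid(width, height, frames, fps,
                 seg_w=320, seg_h=180, seg_t=0.5, overlap=0.5):
    """Top-left corners and start frames of the W x H x T spatial-temporal
    segments with 50% spatial overlap, as described in Sec. 4.3.1."""
    step_w = int(seg_w * (1 - overlap))          # 160-pixel horizontal stride
    step_h = int(seg_h * (1 - overlap))          # 90-pixel vertical stride
    step_t = int(seg_t * fps)                    # non-overlapping 0.5 s windows
    xs = range(0, width - seg_w + 1, step_w)
    ys = range(0, height - seg_h + 1, step_h)
    ts = range(0, frames - step_t + 1, step_t)
    return [(x, y, t) for t in ts for y in ys for x in xs]

# A 5-second 720p clip at 30 fps yields 7 x 7 x 10 = 490 segments.
segments = segment_grid(width=1280, height=720, frames=150, fps=30)
assert len(segments) == 490
```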
Figure 4.4: The block diagram of the VMAF quality metric.

4.3.2 Local Quality Assessment

We choose the Video Multimethod Assessment Fusion (VMAF) [35] as the primary quality index to assess the quality degradation of compressed segments. VMAF is an open-source full-reference perceptual video quality index that aims to capture the perceptual quality of compressed video. It first estimates the quality score of a video clip with multiple high-performance image quality indices on a frame-by-frame basis. Then, these image quality scores are fused together using a support vector machine (SVM) at each frame. Currently, two image fidelity metrics and one temporal signal have been chosen as features for the SVM. These elementary metrics and features were chosen from among other candidates through iterations of testing and validation.

Detail Loss Measure (DLM) [32]. DLM is an image quality metric based on the rationale of separately measuring the loss of details, which affects content visibility, and the redundant impairment, which distracts viewer attention. It estimates the blurriness component in the distortion signal using wavelet decomposition. It uses the contrast sensitivity function (CSF) to model the human visual system (HVS), and the wavelet coefficients are weighted based on CSF thresholds.

Visual Information Fidelity (VIF) [64]. VIF is based on visual statistics combined with HVS modeling. It quantifies the Shannon information shared between the source and the distortion relative to the information contained in the source itself. The masking and sensitivity aspects of the HVS are modeled through a zero-mean, additive white Gaussian noise model in the wavelet domain that is applied to both the reference image and the distorted image model.

Motion information. Motion in videos is chosen as the temporal signal, since the HVS is less sensitive to quality degradation in high-motion frames. The global motion value of a frame is the mean co-located pixel difference of a frame with respect to the previous frame. Since noise in the video can be misinterpreted as motion, a low-pass filter is applied before the difference calculation.

Results on various video quality databases show that VMAF outperforms other video quality indices such as PSNR, SSIM [77], Multiscale Fast-SSIM [8], and PSNR-HVS [55] in terms of the Spearman Rank Correlation Coefficient (SRCC), the Pearson Correlation Coefficient (PCC), and the root-mean-square error (RMSE) criteria. VMAF is comparable to or outperforms the state-of-the-art video index, VQM-VFD [83], on several publicly available databases. For more details about VMAF, we refer interested readers to [35].

4.3.3 Significant Segments Selection

VMAF is typically applied to all spatial-temporal segments. However, not all segments contribute equally to the final quality of the entire clip [90, 91, 86]. To select significant segments that are more relevant to our objective, we examine the local quality degradation slope, which is defined as
$$\nabla V(S^{d_i}_{wht}) = \frac{V(S^{d_{i-k}}_{wht}) - V(S^{d_i}_{wht})}{k}, \qquad (4.4)$$
or, with the spatial-temporal indices omitted,
$$\nabla V(S_i) = \frac{V(S_{i-k}) - V(S_i)}{k}, \qquad (4.5)$$
where $V(S^{d_i}_{wht})$ is the VMAF score of segment $S^{d_i}_{wht}$ that is cropped from compressed clip $d_i$ with spatial indices $(w, h)$ and temporal index $t$, respectively. The slope in Eq. (4.5) evaluates how much the VMAF score of the current segment $S^{d_i}_{wht}$ differs from that of its counterpart in the neighboring compressed clip, $S^{d_{i-k}}_{wht}$, where $k = 2$ is the QP difference between them. If the slope is small, the local quality does not change much and the probability of the associated coding index $i$ being a JND point is lower. We order all spatial-temporal segments based on their slopes and select the $p$ percent of them with larger slope values. We set $p = 80\%$ in our experiment. The goal is to filter out less important segments before we extract a representative feature vector.

4.3.4 Quality Degradation Features

A cumulative quality degradation curve is computed for every coded clip based on the change of VMAF scores in significant segments. Its computation consists of two steps. First, we compute the difference of VMAF scores between a significant segment from compressed clip $d_i$ and its reference $r$ as
$$\Delta V(S^{d_i}_{wht}) = V(S^{r}_{wht}) - V(S^{d_i}_{wht}). \qquad (4.6)$$
The values $\Delta V(S^{d_i}_{wht})$ collected from all significant segments can be viewed as samples of a random variable denoted by $\Delta V(S^{d_i})$. Then, based on the distribution of $\Delta V(S^{d_i})$, we can compute the cumulative quality degradation curve as
$$F_{d_i}(n) = \mathrm{Prob}\left[\Delta V(S^{d_i}) \le 2n\right], \quad \text{for } n = 1, \ldots, 20, \qquad (4.7)$$
which captures the cumulative histogram of VMAF score differences for coded video $d_i$. As shown in Eq. (4.7), the cumulative quality degradation curve is represented in the form of a 20-D feature vector.
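Assuming per-segment VMAF scores are already available (e.g., computed with the open-source VMAF tool on each cropped segment), the significant-segment selection and the 20-D degradation feature of Eqs. (4.4)-(4.7) can be sketched as follows.

```python
import numpy as np

def degradation_feature(vmaf_ref, vmaf_qp, vmaf_qp_minus_k, k=2, p=0.8):
    """20-D quality degradation feature for one coded clip (Eqs. 4.4-4.7).

    All inputs are 1-D arrays of per-segment VMAF scores (one entry per
    spatial-temporal segment): for the reference clip, the clip coded at
    QP = i, and the neighboring clip coded at QP = i - k.
    """
    # Eq. (4.4): local quality degradation slope between neighboring coded clips.
    slope = (vmaf_qp_minus_k - vmaf_qp) / k

    # Sec. 4.3.3: keep the p percent of segments with the largest slopes.
    n_keep = int(np.ceil(p * slope.size))
    keep = np.argsort(slope)[::-1][:n_keep]

    # Eq. (4.6): VMAF drop of each significant segment w.r.t. the reference.
    delta_v = vmaf_ref[keep] - vmaf_qp[keep]

    # Eq. (4.7): cumulative histogram F(n) = Prob[delta_v <= 2n], n = 1..20.
    return np.array([np.mean(delta_v <= 2 * n) for n in range(1, 21)])
```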
4.3.5 Masking Features

As mentioned earlier, quality degradation in a spatial-temporal segment is more difficult to observe if there exists a masking effect in the segment. Here, we use the spatial randomness and temporal randomness proposed in [21, 22] to measure the masking effect. The process is sketched below.

First, high-frequency components of distortions are removed by applying a low-pass filter, inspired by the contrast sensitivity function (CSF), in the pre-processing step. Then, we use the spatial randomness (SR) model [21] and the temporal randomness (TR) model [22] to compute the spatial and temporal regularity in each spatial-temporal segment generated in Step 1.

Spatial randomness is measured quantitatively using the spatial estimation error from neighboring pixels, as shown in Fig. 4.5. The red pixel $Y$ is the central region we want to predict from its neighboring information $X$, which is shown as blue dots around $Y$. The model can be expressed as
$$\hat{Y} = H X(u), \qquad (4.8)$$
where $u$ is an indication of the spatial location. The optimal prediction can be derived by optimizing the transform matrix $H$ using minimum mean squared error optimization as
$$\hat{H} = R_{YX} R_X^{-1}, \qquad (4.9)$$
where $R_{YX} = E[Y(u)X(u)^T]$ is the cross-correlation matrix between $X(u)$ and $Y(u)$ and $R_X = E[X(u)X(u)^T]$ is the correlation matrix of $X(u)$. Here $R_{YX}$ and $R_X$ specifically preserve the structure information of the corresponding image content. However, since $R_X$ is not always invertible, we can replace $R_X^{-1}$ by the approximated pseudo-inverse $\hat{R}_X^{-1}$ as
$$\hat{R}_X^{-1} = U_m \Sigma_m^{-1} U_m^T, \qquad (4.10)$$
where $\Sigma_m$ is the eigenvalue matrix with the top $m$ non-zero eigenvalues of matrix $R_X$ and $U_m$ is the corresponding eigenvector matrix. Spatial randomness is the estimation error from the neighborhood with structural correlation,
$$SR(u) = \left\| Y(u) - R_{YX} \hat{R}_X^{-1} X(u) \right\|. \qquad (4.11)$$
A large value of $SR(u)$ indicates that the structure is more irregular and thus contains more randomness. On the other hand, $SR(u)$ would be close to zero for regular regions.

Figure 4.5: Spatial randomness prediction based on neighborhood information.
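A compact sketch of the spatial randomness measure is given below. The 3×3 neighborhood and the plain pseudo-inverse are simplifying assumptions; Eq. (4.10) truncates R_X to its top-m eigenvectors instead, and the original model operates on CSF-filtered segments rather than raw blocks.

```python
import numpy as np

def spatial_randomness(block, r=1):
    """Mean linear-prediction residual of each pixel from its (2r+1)^2 - 1
    neighbors, a sketch of Eqs. (4.8)-(4.11). `block` is a 2-D grayscale array."""
    H, W = block.shape
    xs, ys = [], []
    center = (2 * r + 1) ** 2 // 2
    for i in range(r, H - r):
        for j in range(r, W - r):
            nb = block[i - r:i + r + 1, j - r:j + r + 1].astype(float).ravel()
            ys.append(nb[center])                  # pixel Y to be predicted
            xs.append(np.delete(nb, center))       # its neighbors X
    X = np.array(xs)                               # N x d neighbor vectors
    Y = np.array(ys)                               # N center pixels
    R_yx = Y @ X / len(Y)                          # cross-correlation (Eq. 4.9)
    R_x = X.T @ X / len(Y)                         # correlation matrix of X
    H_hat = R_yx @ np.linalg.pinv(R_x)             # pinv stands in for Eq. (4.10)
    residual = Y - X @ H_hat                       # estimation error (Eq. 4.11)
    return float(np.mean(np.abs(residual)))
```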
Temporal randomness is an important characteristic of video, which is highly related to masking activities. Usually, distortion is highly masked in massive and random motions while less masked in regular and smooth motions. For regular motion, future frames can be predicted from past frames by learning the temporal behavior of a short video clip in the past. Thus, the prediction error reflects the randomness of motion. Motion randomness serves as a proper feature to measure the homogeneity of motion and to predict the possible temporal influence on perceptual quality. Based on this, we follow the method in [22] and use prediction residues as an indication of motion randomness, in a similar way to spatial randomness.

Specifically, we divide a sequence segment $Y_k^l$, from the $k$-th frame to the $l$-th frame, into two parts with a one-frame distance, each representing a set of motion information in a certain period of time. We then model the change of motion information as
$$Y_{k+1}^l = C X_{k+1}^l + W_{k+1}^l, \quad X_{k+1}^l = A X_k^{l-1} + V_{k+1}^l, \qquad (4.12)$$
where $X_{k+1}^l = [x(k+1), \ldots, x(l)]$ and $X_k^{l-1} = [x(k), \ldots, x(l-1)]$ are the state sequences of $Y_{k+1}^l$ and $Y_k^{l-1}$, respectively. $A$ is the state transition matrix encoding the regular motion information. $V_{k+1}^l$ is the sequence of motion noise that cannot be captured by regular motion. $C$ is the observation matrix encoding the shapes of objects within the frames. $W_{k+1}^l$ is the sequence of observation noise that cannot be represented by regular shapes. We apply the singular value decomposition to keep the $n$ largest singular values,
$$Y_{k+1}^l = U \Sigma V^T + W_{k+1}^l, \qquad (4.13)$$
where $\Sigma = \mathrm{diag}[\sigma_1, \ldots, \sigma_n]$ contains the $n$ largest singular values and $U$, $V$ are the corresponding decomposition vectors. By setting $X_{k+1}^l = \Sigma V^T$ and $C(l) = U$, we can determine the state sequences and the model parameters. Moreover, the state transition matrix $A$ is expected to capture the motion information between frames. Similar to spatial randomness, we can get the optimal prediction of $A$ by minimizing the squared estimation error as
$$\hat{A}(l) = \arg\min_A \left\| X_{k+1}^l - A X_k^{l-1} \right\| = X_{k+1}^l \left(X_k^{l-1}\right)^{+}, \qquad (4.14)$$
where $\left(X_k^{l-1}\right)^{+}$ is the pseudo-inverse of $X_k^{l-1}$. Finally, we can predict the future frame $y(l+1)$ based on the obtained model parameters:
$$TR(l+1) = \left|y(l+1) - C(l) A(l) x(l)\right|, \qquad (4.15)$$
where $TR(l+1)$ is the noise that cannot be predicted with regular information.

We show the spatial and temporal randomness maps in Figs. 4.6 and 4.7. It is obvious that the randomness of DinnerTable is low in background regions. By contrast, the water drops in ToddlerFountain result in large randomness. The differences in temporal randomness between these two sequences are even more obvious, as shown in Fig. 4.7. The spatial randomness is small in smooth or highly structured regions. Similarly, the temporal randomness is small if there is little or regular motion between adjacent frames. When the SR and TR values are higher, the spatial and temporal masking effects are stronger.

Figure 4.6: Spatial randomness maps for sequences (a) #37 and (b) #90.

Figure 4.7: Temporal randomness maps for sequences (a) #37 and (b) #90.

The masking features, $M_s$, are extracted from the reference clip only. The histograms of the SR and the TR are concatenated to yield the final masking feature vector:
$$M_{d_0} = [\mathrm{Hist}_{10}(SR), \mathrm{Hist}_{10}(TR)]. \qquad (4.16)$$

4.3.6 Prediction of SUR Curves and JND Points

The final feature vector is the concatenation of two feature vectors. The first one is the quality degradation feature vector of dimension 20 given in Eq. (4.7). The second one is the masking feature vector of dimension 20 given in Eq. (4.16). Thus, the dimension of the final concatenated feature vector is 40. The SUR prediction problem is treated as a regression problem and solved by the support vector regressor (SVR) [65]. Specifically, we adopt the ν-SVR with the radial basis function kernel.
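One way to realize the regression stage with scikit-learn is sketched below. Mapping the 40-D feature to the SUR values sampled on the QP grid with one ν-SVR per output is an assumption about the output parameterization, and the hyper-parameters are illustrative, not the trained system's values.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import NuSVR

def fit_sur_regressor(X, Y):
    """nu-SVR with an RBF kernel.

    X: (n_clips, 40) concatenated degradation + masking features.
    Y: (n_clips, n_qp) ground-truth SUR values sampled on the QP grid.
    """
    model = MultiOutputRegressor(NuSVR(kernel="rbf", nu=0.5, C=1.0))
    return model.fit(X, Y)

def predict_sur_and_jnd(model, feature_40d, qp_grid, target=0.75):
    """Predicted SUR curve and the JND read off at the 75% SUR point."""
    sur = np.clip(model.predict(feature_40d.reshape(1, -1))[0], 0.0, 1.0)
    idx = np.where(sur >= target)[0]
    jnd = int(qp_grid[idx.max()]) if idx.size else int(qp_grid[0])
    return sur, jnd
```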
4.4 Experiment Results

In this section, we present the prediction results of the proposed SUR prediction framework. The VideoSet consists of 220 videos at 4 resolutions with three JND points per resolution per video clip. Here, we focus on the SUR prediction of all three JND points and conduct this task for each video resolution independently. For each prediction task, we trained and tested the 220 video clips using 5-fold cross-validation. That is, we choose 80% (i.e., 176 video clips) as the training set and the remaining 20% (i.e., 44 video clips) as the testing set. We rotate the 20% testing set five times so that each video clip is tested once. Since the JND location is chosen to be the QP value where the SUR value is equal to 75% in the VideoSet, we adopt the same rule here so that the JND position can be easily computed from the predicted SUR curve.

We tested the proposed system under the different settings shown in Table 4.1. There are three settings, which differ in the reference used to predict the second and the third JND.

Setting 1: ground-truth reference locations are used. There is no calibration error on the reference.

Setting 2: operates on the predicted reference from the first SUR curve. This is the practical scenario when no subjective data is available.

Setting 3: uses the same reference as the first JND point. The JND is reference dependent and we want to alleviate these dependencies in a prediction system.

Table 4.1: SUR and JND prediction settings. The main difference is the reference used to predict the second and the third JND.

Models | Order | Reference | Samples | JND
Subjective Test | 1st | R_0 | [1, 51] | Y_1
Subjective Test | 2nd | Y_1 | [Y_1+1, 51] | Y_2
Subjective Test | 3rd | Y_2 | [Y_2+1, 51] | Y_3
Setting 1 (ground truth reference) | 1st | R_0 | [1, 51] | Ŷ_1
Setting 1 (ground truth reference) | 2nd | Y_1 | [Y_1+1, 51] | Ŷ_2
Setting 1 (ground truth reference) | 3rd | Y_2 | [Y_2+1, 51] | Ŷ_3
Setting 2 (predicted reference) | 2nd | Ŷ_1 | [Ŷ_1+1, 51] | Ŷ_2
Setting 2 (predicted reference) | 3rd | Ŷ_2 | [Ŷ_2+1, 51] | Ŷ_3
Setting 3 (same reference) | 2nd | R_0 | [1, 51] | Ŷ_2
Setting 3 (same reference) | 3rd | R_0 | [1, 51] | Ŷ_3

The averaged prediction errors of the first SUR curve and the first JND position for video clips at the four resolutions are summarized in Table 4.2. Meanwhile, Table 4.3 and Table 4.4 give ΔSUR and ΔQP for the second and the third JND under the different settings, respectively. We see that the prediction errors increase as the resolution becomes lower for the first JND point. This is probably due to the use of fixed W and H values in generating the spatial-temporal segments, as described in Sec. 4.3.1. We will fine-tune these parameters to evaluate their influence on the prediction results in the future. It is also clear that Setting 1 always achieves the best performance and Setting 3 yields the largest ΔQP in Table 4.4. However, we do not observe the same phenomenon for ΔSUR in Table 4.3. The reason is that the third JND point (predicted) is the 75% point on the curve. The remaining QP range is limited and the SUR tends to be close to zero for large QP, say QP > 40.

Table 4.2: Summary of averaged prediction errors of the first JND for video clips at four resolutions.

 | 1080p | 720p | 540p | 360p
ΔSUR | 0.039 | 0.038 | 0.037 | 0.042
ΔQP | 1.218 | 1.273 | 1.345 | 1.605

Table 4.3: Mean absolute error of the predicted SUR (ΔSUR) for the second and the third JND.

Resolution | Setting 1 (2nd) | Setting 1 (3rd) | Setting 2 (2nd) | Setting 2 (3rd) | Setting 3 (2nd) | Setting 3 (3rd)
1080p | 0.062 | 0.029 | 0.063 | 0.065 | 0.057 | 0.056
720p | 0.054 | 0.032 | 0.057 | 0.060 | 0.055 | 0.056
540p | 0.050 | 0.030 | 0.054 | 0.052 | 0.046 | 0.049
360p | 0.052 | 0.030 | 0.058 | 0.056 | 0.048 | 0.053

Table 4.4: Mean absolute error of the predicted JND location (ΔQP) for the second and the third JND.

Resolution | Setting 1 (2nd) | Setting 1 (3rd) | Setting 2 (2nd) | Setting 2 (3rd) | Setting 3 (2nd) | Setting 3 (3rd)
1080p | 1.618 | 0.709 | 2.009 | 2.245 | 2.364 | 2.445
720p | 1.227 | 0.750 | 1.709 | 1.927 | 2.209 | 2.227
540p | 1.223 | 0.773 | 1.523 | 1.700 | 1.873 | 2.009
360p | 1.341 | 0.745 | 1.750 | 1.836 | 1.923 | 2.105
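The evaluation protocol above (5-fold rotation, JND read at the 75% SUR point, mean absolute errors ΔSUR and ΔQP) can be outlined as follows. How ΔSUR is measured exactly is not restated here, so the curve-level MAE used in this sketch is an assumption, and fit_fn is any regressor factory such as the fit_sur_regressor sketch above.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_sur(X, Y, qp_grid, fit_fn, target=0.75):
    """5-fold rotation over the clips of one resolution. Reports the MAE of
    the predicted SUR curve and of the JND location (75% SUR point)."""
    sur_err, qp_err = [], []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        model = fit_fn(X[train_idx], Y[train_idx])
        pred = np.clip(model.predict(X[test_idx]), 0.0, 1.0)
        sur_err.append(np.mean(np.abs(pred - Y[test_idx])))
        for p, y in zip(pred, Y[test_idx]):
            qp_p = qp_grid[np.where(p >= target)[0].max()] if (p >= target).any() else qp_grid[0]
            qp_t = qp_grid[np.where(y >= target)[0].max()] if (y >= target).any() else qp_grid[0]
            qp_err.append(abs(int(qp_p) - int(qp_t)))
    return float(np.mean(sur_err)), float(np.mean(qp_err))
```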
To see the prediction performance on each individual clip, we use the 720p video as an example. The histograms of the SUR prediction error are given in Fig. 4.8 (a), (c), and (e) for the first three SUR curves, where the mean absolute errors (MAE) are 0.038, 0.054, and 0.032, respectively. The predicted JND location versus the ground-truth JND location is plotted in Fig. 4.8 (b), (d), and (f), where each dot denotes one video clip. As shown in the figure, most dots are distributed along the 45-degree line, which indicates that the predicted JND is very close to the ground truth JND for most sequences.

Figure 4.8: JND prediction results: (a), (c), and (e) show the histograms of the SUR prediction error; (b), (d), and (f) show the predicted versus the ground-truth JND location. The top row is about the first JND, the middle row the second JND, and the bottom row the third JND, respectively.

4.5 Conclusion

A satisfied user ratio (SUR) prediction framework for H.264/AVC coded video is proposed in this work. The proposed video quality index, i.e., the SUR, seamlessly reflects the perceived quality of compressed video clips. We present a SUR prediction framework for the first three JND points. It takes both the local quality degradation and the masking effect into consideration and extracts a compact feature vector. We train a support vector regressor to obtain the predicted SUR curve. The JND point can be derived accordingly. The system achieves good performance at all resolutions. We will continue to fine-tune the framework, for example the segment dimensions, the percentage of key segments, and more sophisticated spatial-temporal pooling methods, in the near future.

Chapter 5
JND-based Video Quality Model and Its Application

5.1 Introduction

Machine learning-based video quality assessment (VQA) systems rely heavily on the quality of collected subjective scores. Obtaining accurate and robust labels based on subjective votes provided by human observers is a critical step in Quality of Experience (QoE) evaluation. A typical pipeline consists of three main steps. First, a group of human viewers is recruited to grade video quality based on individual perception. Second, the noisy subjective data should be cleaned and combined to provide an estimate of the actual video quality. Third, a machine learning model is trained and tested on the calibrated datasets, and the performance is reported in terms of evaluation criteria. They are called the data collection, cleaning and analysis steps, respectively.

Subjective quality evaluation is the ultimate means to measure the quality of experience (QoE) of users. Formal methods and guidelines for subjective quality assessments are specified in various ITU recommendations, such as ITU-T P.910 [25], ITU-R BT.500 [60], etc. Several datasets on video quality assessment have been proposed, such as the LIVE dataset [62], the Netflix Public dataset [35], the VQEG HD3 dataset [15] and the VideoSet [72]. Furthermore, efforts have been made in developing objective quality metrics such as VQM-VFD [83], MOVIE [61] and VMAF [35].

Absolute Category Rating (ACR) is one of the most commonly used subjective test methods. Test video clips are displayed on a screen for a certain amount of time and observers rate their perceived quality using an abstract scale [25], such as "Excellent (5)", "Good (4)", "Fair (3)", "Poor (2)" and "Bad (1)". When each content is evaluated several times by different subjects, a straightforward approach is to use the most common label as the true label [82, 40]. There are two approaches to aggregating multiple scores on a given clip: the mean opinion score (MOS) and the difference mean opinion score (DMOS). The MOS is computed as the average score from all subjects, while the DMOS is calculated from the difference between the raw quality scores of the reference and the test images. Both MOS and DMOS are popular in the quality assessment community.
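The two aggregation rules can be written compactly. The sketch below assumes raw ACR scores on the 5-level scale and forms the DMOS from per-subject differences between the reference and test scores, which is one common formulation.

```python
import numpy as np

def mos(scores):
    """Mean opinion score: average of the raw scores for one stimulus."""
    return float(np.mean(scores))

def dmos(ref_scores, test_scores):
    """Difference mean opinion score: mean per-subject difference between the
    raw scores of the reference and the test stimulus (one common formulation)."""
    return float(np.mean(np.asarray(ref_scores) - np.asarray(test_scores)))

# Example: five subjects rating a reference and a coded clip on the 1-5 ACR scale.
print(mos([5, 4, 4, 5, 4]), dmos([5, 4, 4, 5, 4], [3, 3, 4, 4, 3]))
```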
However, they have several limitations [7, 87, 93, 80]. The first limitation is that the MOS scale is treated as an interval scale rather than an ordinal scale. It is assumed that there is a linear relationship between the MOS distance and the cognitive distance. For example, a quality drop from "Excellent" to "Good" is treated the same as one from "Poor" to "Bad". There is no difference to a metric learning system, as the same ordinal distance is preserved (i.e., the quality distance is 1 for both cases in the aforementioned 5-level scale). However, the human viewing experience is quite different when the quality changes at different levels. It is also rare to find a video clip exhibiting poor or bad quality in real-life video applications. As a consequence, the number of useful quality levels drops from five to three, which is too coarse for video quality measurement.

The second limitation is that scores from subjects are typically assumed to be independently and identically distributed (i.i.d.) random variables. This assumption rarely holds. Given multiple quality votes on the same content, each individual vote contributes equally in the MOS aggregation method [82]. However, subjects may have different levels of expertise in judging perceived video quality. A critical viewer may give low quality ratings to coded clips whose quality is still good to the majority [31], and the same phenomenon occurs across all presented stimuli. The absolute category rating method can also be confusing to subjects, as they have different understandings and interpretations of the rating scale.

To overcome the limitations of the MOS method, the just-noticeable-difference (JND) based VQA methodology was proposed in [37] as an alternative. A viewer is asked to compare a pair of coded clips and determine whether a noticeable difference can be observed or not. The pair consists of two stimuli, i.e., a distorted stimulus (the comparison) and an anchor preserving the targeted quality. A bisection search is adopted to reduce the number of pairwise comparisons. The JND reflects the boundary between perceived quality levels, which is well suited to determining the optimal image/video quality at the minimum bit rate. For example, the first JND, whose anchor is the source clip, is the boundary between the "Excellent" and "Good" categories. The boundary is subjectively decided rather than empirically selected by the experiment designer.

Data cleaning is a necessary step once subjective data are obtained. In both MOS- and JND-based VQA methods, subjective data are noisy due to the nature of "subjective opinion". In the extreme case, some subjects submit random answers rather than good-faith attempts to label. Even worse, adversarial votes may occur due to malice or a systematic misinterpretation of the task. In this chapter, we first propose a novel method for the data cleaning step, which is essential for a variety of video contents viewed by different individuals. This is a challenging problem due to the following variabilities:
- Inter-subject variability: subjects may have different vision capabilities.
- Intra-subject variability: the same subject may give different scores for the same content in multiple rounds.
- Content variability: video contents have varying characteristics.

Thus, it is critical to study subject capability and reliability to alleviate their effects in the VQA task. Based on the Just-Noticeable-Difference (JND) criterion, a VQA dataset, called the VideoSet [72], was constructed recently.
Motivated by [36], we develop a probabilistic VQA model to quantify the influence of subject and content factors on JND-based VQA scores. We then show that this new method provides a powerful data cleaning tool for JND-based VQA datasets. Furthermore, we propose a user model that takes subject bias and inconsistency into account. The perceived quality of compressed video is characterized by the satisfied user ratio (SUR). The SUR value is a continuous random variable depending on subject and content factors. We study how the SUR varies with the user profile as well as with contents of varying difficulty. The proposed model aggregates quality ratings per user group to address inter-group differences. The proposed user model is validated on the data collected in the VideoSet [72]. It is demonstrated that the model is flexible enough to predict the SUR distribution of a specific user group.

5.2 Related Work

Several popular datasets are available in the video quality assessment community, such as LIVE [62], VQEG-HD [15], MCL-V [38], and NETFLIX-TEST [36], all of which use the MOS aggregation approach. Recently, efforts have been made to examine MOS-based subjective test methods, and various methods have been proposed from different perspectives to address the limitations mentioned in Section 5.1.

The impacts of subject and content variabilities on video quality scores are often analyzed separately. A z-score consistency test was used as a preprocessing step to identify unreliable subjects in the VideoSet. Another method was proposed in [41], which built a probabilistic model for the quality evaluation process and then estimated the model parameters with a standard inference approach. A perceptually weighted rank correlation indicator [85] was proposed, which rewards the capability of correctly ranking high-quality images and suppresses the attention paid to insensitive ranking mistakes.

A subject model was proposed in [26] to study the influence of subjects on test scores. An additive model was adopted, and the model parameters were estimated using real data obtained by repeating experiments on the same content. It models three major factors that influence MOS accuracy: subject bias, subject inaccuracy, and stimulus scoring difficulty. It was reported that the distributions of these three factors span about ±25% of the rating scale. In particular, the subject error terms explained previously observed inconsistencies within a single subject's data as well as lab-to-lab differences. More recently, a generative model was proposed in [36] that treats content and subject factors jointly by solving a maximum likelihood estimation (MLE) problem. That model was developed for the traditional mean-opinion-score (MOS) data acquisition process with continuous degradation category rating. However, all of these models target traditional MOS-based approaches.

Recently, there has been a large amount of effort in JND-based video quality analysis. The human visual system (HVS) cannot perceive small pixel variations in coded video until the difference reaches a certain level. In the traditional MOS-based framework, the quality difference between the contents selected for rating is sufficiently large for the majority of subjects, whereas fine-grained quality analysis can be conducted by directly measuring the JND threshold of each subject. The JND-based VQA methodology thus provides a new framework for fine-grained video quality score acquisition.
Several JND-based VQA datasets were constructed [72, 29, 70], and JND location prediction methods were examined in [24, 71]. However, the JND location was analyzed in a data-driven fashion: it was simply modeled by the mean value of multiple JND samples combined with a heuristic subject rejection approach. Inspired by [36], we develop a JND-based VQA model that considers subject and content variabilities jointly. The proposed generative model decomposes the JND-based video quality score into subject and content factors. A closed-form expression is derived to estimate the JND location by aggregating multiple binary decisions. It is shown that the JND samples follow a Normal distribution parameterized by the subject and content factors. These unknown factors are jointly optimized by solving a maximum likelihood estimation (MLE) problem.

5.3 Derivation of JND-based VQA Model

Consider a VQA dataset containing $C$ video contents, where each source video clip is denoted by $c$, $c = 1, \dots, C$. Each source clip is encoded into a set of coded clips $d_i$, $i = 0, 1, 2, \dots, 51$, where $i$ is the quantization parameter (QP) index used in the H.264/AVC standard. By design, clip $d_i$ has a higher PSNR value than clip $d_j$ if $i < j$, and $d_0$ is the losslessly coded copy of $c$. The JND of this set of coded clips characterizes the distortion visibility threshold with respect to a given anchor, $d_i$. Through subjective experiments, JND points can be obtained from a sequence of consecutive noticeable/unnoticeable difference tests between clip pairs $(d_i, d_j)$, where $j \in \{i+1, \dots, 51\}$.

5.3.1 Binary Decisions in Subjective JND Tests

The anchor, $d_i$, is fixed while searching for the JND location. With a binary search procedure to update $d_j$, it takes at most $L = 6$ rounds to find the JND location. Here, we use $l$, $l = 1, \dots, L$, to indicate the round number and $s$, $s = 1, \dots, S$, to indicate the subject index, respectively. The test result obtained from subject $s$ at round $l$ on content $c$ is a binary decision: noticeable or unnoticeable difference. This is denoted by the random variable $X_{c,s,l} \in \{0, 1\}$. If the decision is "unnoticeable difference", we set $X_{c,s,l} = 1$; otherwise, $X_{c,s,l} = 0$. The probability of $X_{c,s,l}$ can be written as
$$\Pr(X_{c,s,l} = 1) = p_{c,s,l} \quad \text{and} \quad \Pr(X_{c,s,l} = 0) = 1 - p_{c,s,l}, \qquad (5.1)$$
where the random variable $p_{c,s,l} \in [0, 1]$ models the probability of making the "unnoticeable difference" decision at a given comparison. We say that a decision was made confidently if all subjects made the same decision, no matter whether it was "noticeable difference" or "unnoticeable difference". On the other hand, a decision was made least confidently if the two decisions had equal sample sizes. In light of these observations, $p_{c,s,l}$ should be closer to zero for smaller $l$, since the quality difference between the two clips is more obvious in earlier test rounds, and close to 0.5 for larger $l$, as the coded clip approaches the final JND location.

5.3.2 JND Localization by Integrating Multiple Binary Decisions

During the subjective test, a JND sample is obtained through multiple binary decisions. Let $\mathbf{X}_{c,s} = [X_{c,s,1}, \dots, X_{c,s,L}]$ denote the sequence of decisions made by subject $s$ on content $c$. The random variable $X_{c,s,l}$ is assumed to be independently and identically distributed (i.i.d.) across the subject index $s$. Furthermore, $X_{c,s,l}$ is independent of the content index $c$, since the binary search approaches the ultimate JND location at the same rate regardless of the content.
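To make the decision sequence concrete, the sketch below simulates the bisection procedure that generates the binary outcomes: the anchor stays fixed, the compared QP is updated by halving the search interval, and at most six rounds are needed over the QP range [1, 51]. The decision oracle and its deterministic threshold are hypothetical stand-ins for a human subject, and the exact boundary conventions of the VideoSet test protocol may differ.

```python
def bisection_jnd(can_see_difference, lo=1, hi=51, rounds=6):
    """Binary search for the smallest QP at which the subject reports a
    noticeable difference against the fixed anchor.

    can_see_difference(qp) -> bool stands in for one round of the subjective
    test, i.e. one binary decision for the pair (anchor, clip coded at qp).
    """
    for _ in range(rounds):
        mid = (lo + hi) // 2
        if can_see_difference(mid):
            hi = mid          # difference visible: the JND is at mid or below
        else:
            lo = mid + 1      # not visible yet: search at larger QP
    return lo

# Hypothetical deterministic subject whose true visibility threshold is QP 27.
print(bisection_jnd(lambda qp: qp >= 27))   # 27
```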
The search interval at round $l$, denoted by $\Delta QP_l$, can be expressed as
$$\Delta QP_l = \Delta QP_0 \left(\tfrac{1}{2}\right)^l, \qquad (5.2)$$
where $\Delta QP_0 = 51$ is the initial search interval for the first JND. The anchor location is $QP_0$, i.e., the reference, and the JND is searched for within $[QP_1, QP_{51}]$. We skip the comparison between the clip pair $(QP_0, QP_{51})$ since it is a trivial one.

By definition, the JND location is the coded clip at the transition point from unnoticeable difference to noticeable difference against the anchor. It is located at the last round after a sequence of "noticeable difference" decisions. Thus, the JND location on content $c$ for subject $s$ can be obtained by integrating the search intervals based on the decision sequence $\mathbf{X}_{c,s}$ as
$$Y_{c,s} = \sum_{l=1}^{L} X_{c,s,l} \, \Delta QP_l, \qquad (5.3)$$
since we need to add $\Delta QP_l$ to the offset (i.e., the left end point) of the current search interval if $X_{c,s,l} = 1$ and keep the same offset if $X_{c,s,l} = 0$. The distance between the left end point of the search interval and the JND location is no larger than one QP when the search procedure converges. Then, the JND location can be expressed as a function of the confidence of subjects on content $c$:
$$Y_{c,s} = \Delta QP_0 \sum_{l=1}^{L} p_{c,s,l} \left(\tfrac{1}{2}\right)^l. \qquad (5.4)$$

5.3.3 Decomposing JND into Content and Subject Factors

The JND locations depend on several causal factors: 1) the bias of a subject, 2) the consistency of a subject, 3) the averaged JND location, and 4) the difficulty of a content to evaluate. To provide a closed-form expression of the JND location, we adopt the following probabilistic model for the confidence random variable:
$$p_{c,s,l} = \bar{p}_l + \alpha\,\delta_c + \beta\,\delta_s, \qquad (5.5)$$
where $\bar{p}_l = \tfrac{1}{2}(1 - e^{-\lambda l})$ is the average confidence, and $\delta_c \sim \mathcal{N}(\mu_c, \sigma_c^2)$ and $\delta_s \sim \mathcal{N}(\mu_s, \sigma_s^2)$ are two Gaussian random variables that capture the content and subject factors, respectively. $\alpha$ and $\beta$ are weights that control the effects of these factors. We set $\lambda = 0.7$, $\alpha = 1$ and $\beta = 1$ empirically.

Figure 5.1: Representative frames from the 15 source contents (indices 011, 026, 041, 056, 071, 086, 101, 116, 131, 146, 161, 175, 189, 203, 217).

By plugging Eq. (5.5) into Eq. (5.4), we can express the JND location as
$$
Y_{c,s} = \Delta QP_0 \sum_{l=1}^{L} \left(\tfrac{1}{2}\right)^l p_{c,s,l}
        = \Delta QP_0 \sum_{l=1}^{L} \left(\tfrac{1}{2}\right)^l \bar{p}_l
        + \Delta QP_0 \sum_{l=1}^{L} \left(\tfrac{1}{2}\right)^l \alpha\mu_c
        + \Delta QP_0 \sum_{l=1}^{L} \left(\tfrac{1}{2}\right)^l \alpha\,\mathcal{N}(0, \sigma_c^2)
        + \Delta QP_0 \sum_{l=1}^{L} \left(\tfrac{1}{2}\right)^l \beta\,\mathcal{N}(\mu_s, \sigma_s^2)
        = y_c + \mathcal{N}(0, v_c^2) + \mathcal{N}(b_s, v_s^2), \qquad (5.6)
$$
where $y_c = \Delta QP_0 \sum_{l=1}^{L} (\tfrac{1}{2})^l (\bar{p}_l + \alpha\mu_c)$ and $v_c^2 = \kappa^2\alpha^2\sigma_c^2$ are content factors, $b_s = \kappa\beta\mu_s$ and $v_s^2 = \kappa^2\beta^2\sigma_s^2$ are subject factors, and $\kappa = \Delta QP_0 \sum_{l=1}^{L} (\tfrac{1}{2})^l \approx 50$ is a constant.
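The decomposition above implies that, once the content factors $(y_c, v_c)$ and the subject factors $(b_s, v_s)$ are fixed, a JND sample is simply a Gaussian draw. The short sketch below illustrates this generative view; the parameter values are hypothetical and the function name is ours.

```python
import numpy as np

def sample_jnd(y_c, v_c, b_s, v_s, size=1, seed=None):
    """Draw JND locations from the decomposition of Eq. (5.6):
    Y ~ N(y_c + b_s, v_c**2 + v_s**2)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=y_c + b_s, scale=np.sqrt(v_c**2 + v_s**2), size=size)

# Hypothetical parameters: an average JND near QP 31, a moderately hard
# content, and a slightly hard-to-satisfy subject.
print(np.round(sample_jnd(y_c=31.0, v_c=3.0, b_s=-2.0, v_s=2.0, size=5, seed=0), 1))
```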
5.4 Proposed JND-based User Model

In this section, we present the proposed user model based on the JND methodology. Consider a VQA dataset consisting of $C$ contents and $S$ subjects; the JND data are arranged in a matrix $\mathbf{Y} \in \mathbb{R}^{C \times S}$. Each individual JND location $Y_{c,s}$, for $s = 1, \dots, S$ and $c = 1, \dots, C$, is obtained through six rounds of comparison. The following analysis is conducted on the data matrix to recover the underlying subject and content factors.

It was shown in Section 5.3.3 that the perceived video quality depends on several causal factors: 1) the bias of a subject, 2) the inconsistency of a subject, 3) the average JND location, and 4) the difficulty of a content to evaluate. The JND location of content $c$ from subject $s$ can be expressed as
$$Y_{c,s} = y_c + \mathcal{N}(0, v_c^2) + \mathcal{N}(b_s, v_s^2), \qquad (5.7)$$
where $y_c$ and $v_c^2$ are content factors while $b_s$ and $v_s^2$ are subject factors.

The difficulty of a content is modeled by $v_c^2 \in [0, \infty)$. A larger $v_c^2$ value means that the masking effect of the content is stronger, so that even experienced experts have difficulty spotting artifacts in its compressed clips. The bias of a subject is modeled by the parameter $b_s \in (-\infty, +\infty)$. If $b_s < 0$, the subject is more sensitive to quality degradation in compressed video clips; if $b_s > 0$, the subject is less sensitive to distortions. An average subject has a bias around $b_s = 0$. Moreover, the subject variance, $v_s^2$, captures the inconsistency of the quality votes from subject $s$; a consistent subject evaluates all sequences attentively.

Under the assumption that content and subject factors affect the perceived video quality independently, the JND position can be expressed by a Gaussian distribution of the form
$$Y_{c,s} \sim \mathcal{N}(\mu_Y, \sigma_Y^2), \qquad (5.8)$$
where $\mu_Y = y_c + b_s$ and $\sigma_Y^2 = v_c^2 + v_s^2$. The unknown parameters are $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$ for $c = 1, \dots, C$ and $s = 1, \dots, S$, where $\{\cdot\}$ denotes the corresponding parameter set. All unknown parameters can be jointly estimated via the maximum likelihood estimation (MLE) method given the subjective data matrix $\mathbf{Y} \in \mathbb{R}^{C \times S}$. This is a well-formulated parameter inference approach, and we refer interested readers to [36, 73] for more details.

5.4.1 Parameter Inference and User Profiling

The JND-based VQA model in Eq. (5.6) has a set of parameters to determine, namely $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$ with $c = 1, \dots, C$ and $s = 1, \dots, S$. It was shown that the JND location can be expressed by a Gaussian distribution of the form
$$Y_{c,s} \sim \mathcal{N}(\mu_{c,s}, \sigma_{c,s}^2), \qquad (5.9)$$
where $\mu_{c,s} = y_c + b_s$ and $\sigma_{c,s}^2 = v_c^2 + v_s^2$. The task is to estimate the unknown parameters jointly, given observations on a set of contents from a group of subjects. A standard inference method to recover the true MOS score was studied in [36]. Here, we extend the procedure to estimate the parameters of the JND-based VQA model.

Let $L(\theta) = \log p(\{y_{c,s}\} \mid \theta)$ be the log-likelihood function. One can show that the optimal estimator of $\theta$ is given by $\hat{\theta} = \arg\max_{\theta} L(\theta)$. By omitting constant terms, we can express the log-likelihood function as
$$
L(\theta) = \log p(\{y_{c,s}\} \mid \theta)
          = \log \prod_{c,s} p(y_{c,s} \mid y_c, b_s, v_c, v_s)
          = \sum_{c,s} \log p(y_{c,s} \mid y_c, b_s, v_c, v_s)
          \propto \sum_{c,s} \left[ -\log(v_c^2 + v_s^2) - \frac{(y_{c,s} - y_c - b_s)^2}{v_c^2 + v_s^2} \right]. \qquad (5.10)
$$
The first- and second-order derivatives of $L(\theta)$ with respect to each parameter can be derived. They are used to update the parameters at each iteration according to the Newton-Raphson rule, i.e., with a step of $(\partial L/\partial \theta) / (\partial^2 L/\partial \theta^2)$.

Among the four parameter sets in $\theta = (\{y_c\}, \{v_c\}, \{b_s\}, \{v_s\})$, we have limited control over the content factors, i.e., $y_c$ and $v_c$. Content factors should be independent parameters that serve as input to a quality model. In practice, it is difficult, sometimes even impossible, to model subject inconsistency (i.e., the $v_s$ term), as it is the viewer's freedom to decide how much attention to pay to the video content. On the other hand, the subject bias term (i.e., $b_s$) is a consistent prior of each subject. It is therefore reasonable to model the subject bias and integrate it into a SUR model. We can roughly classify users into three groups based on the bias estimated via MLE. The user model aims to provide a flexible system to accommodate different viewer groups:
- viewers who are easy to satisfy (ES), corresponding to a larger $b_s$;
- viewers who have normal sensitivity (NS), corresponding to a neutral $b_s$;
- viewers who are hard to satisfy (HS), corresponding to a smaller $b_s$.

Figure 5.2: Consecutive frames of contents #11 (top) and #203 (bottom), respectively.
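Because Eq. (5.10) is a Gaussian log-likelihood summed over all (c, s) pairs, the joint estimation can also be carried out with an off-the-shelf optimizer. The sketch below is a generic illustration of this step, not the released implementation: it uses SciPy's L-BFGS-B solver in place of the Newton-Raphson updates described above, parameterizes the standard deviations by their logarithms to keep them positive, and re-centers the estimated biases because y_c and b_s are only identified up to a common shift.

```python
import numpy as np
from scipy.optimize import minimize

def fit_jnd_model(Y):
    """Jointly estimate content means y_c, content difficulty v_c, subject
    bias b_s and subject inconsistency v_s for the Gaussian model
    Y[c, s] ~ N(y_c + b_s, v_c**2 + v_s**2) by maximum likelihood.

    Y is a (C, S) array of JND locations; missing entries may be NaN.
    """
    C, S = Y.shape
    observed = ~np.isnan(Y)
    Y_filled = np.where(observed, Y, 0.0)      # placeholders, masked out below

    def unpack(theta):
        y_c = theta[:C]
        v_c = np.exp(theta[C:2 * C])           # std devs kept positive via log
        b_s = theta[2 * C:2 * C + S]
        v_s = np.exp(theta[2 * C + S:])
        return y_c, v_c, b_s, v_s

    def neg_log_likelihood(theta):
        y_c, v_c, b_s, v_s = unpack(theta)
        mu = y_c[:, None] + b_s[None, :]
        var = v_c[:, None] ** 2 + v_s[None, :] ** 2
        ll = -0.5 * (np.log(var) + (Y_filled - mu) ** 2 / var)
        return -np.sum(ll[observed])

    theta0 = np.concatenate([
        np.nanmean(Y, axis=1),                 # y_c: per-content sample means
        np.zeros(C),                           # log v_c (i.e. v_c = 1)
        np.zeros(S),                           # b_s
        np.zeros(S),                           # log v_s
    ])
    res = minimize(neg_log_likelihood, theta0, method="L-BFGS-B")
    y_c, v_c, b_s, v_s = unpack(res.x)
    # y_c and b_s are only identified up to a common shift; center the biases.
    shift = b_s.mean()
    return y_c + shift, v_c, b_s - shift, v_s
```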
5.4.2 Satisfied User Ratio on a Specific User Group

A viewer is said to be satisfied if he or she cannot perceive a quality difference between the compressed clip and its anchor. The Satisfied User Ratio (SUR) of video clip $e_i$ for user group $j$ can be expressed as
$$Z_{i,j} = 1 - \frac{1}{|S_j|} \sum_{s \in S_j} \mathbb{1}_s(e_i), \qquad (5.11)$$
where $S_j$ is the $j$th group of subjects and $|\cdot|$ denotes cardinality, and $\mathbb{1}_s(e_i) = 1$ or $0$ if the $s$th subject can or cannot see the difference between compressed clip $e_i$ and its anchor, respectively. The summation term on the right-hand side of Eq. (5.11) is the empirical cumulative distribution function (CDF) of the random variable $Y_{c,s}$. Then, by substituting Eq. (5.9) into Eq. (5.11), we obtain a compact expression for the SUR curve as
$$Z_{i,j} = Q(e_i \mid \mu_Y, \sigma_Y^2) = Q(e_i \mid y_c + b_s, v_c^2 + v_s^2), \quad \text{for } s \in S_j, \qquad (5.12)$$
where $Q(\cdot)$ is the Q-function of the normal distribution. By dividing users into different groups, the model achieves small intra-group variance and large inter-group variance, so that the JND and SUR can be modeled more precisely. Alternatively, a universal model can be obtained by replacing $S_j$ with the union of all subjects, i.e., $S = \bigcup_j S_j$.

5.5 Experimental Results

In this section, we evaluate the performance of the proposed model using real JND data from the VideoSet and compare it with another commonly used method. For reproducibility, the source code of the proposed model is available at https://github.com/JohnhqWang/sureal.

5.5.1 Experiment Settings

The VideoSet contains 220 video contents in four resolutions and three JND points per resolution per content. During the subjective test, the dataset was split into 15 subsets, and each subset was evaluated independently by a group of subjects. We adopt a subset of the first JND points on 720p video in this experiment. The subset contains 15 video contents evaluated by 37 subjects. One representative frame from each of the 15 video clips is shown in Fig. 5.1, and the measured raw JND scores are shown in Fig. 5.3a.

Standard procedures have been provided by the ITU for subject screening and data modeling. For example, a subject rejection method is given in the ITU-R BT.500 Recommendation [60], and the differential MOS is defined in the ITU-T P.910 Recommendation [25] to alleviate the influence of subject and content factors. However, these procedures do not directly apply to the collected JND VQA data due to the different methodology: traditional VQA subjective tests evaluate video quality with a score, while JND-based VQA subjective tests target the distortion visibility threshold.

Here, we compare the proposed VQA model, whose parameters are estimated by the MLE method, with the standard MOS approach [36] in two different settings. First, we compare them on the raw JND data without any cleaning. Second, we clean unreliable data using the proposed VQA model and compare the two methods on the cleaned JND data.
5.5.2 Experiments on Raw JND Data

Figure 5.3: Experimental results on raw JND data: (a) the raw JND data, where each pixel represents one JND location and a brighter pixel means the JND occurs at a larger QP; (b) the estimated subject bias and inconsistency; (c) the estimated content difficulty, using the proposed VQA+MLE method; (d) the estimated JND locations, using both the proposed VQA+MLE method and the MOS method. Error bars in all subfigures represent the 95% confidence interval.

The first experiment was conducted on the raw JND data without outlier removal. Some subjects completed the subjective test hastily without sufficient attention. By jointly estimating the content and subject factors, a good VQA data model can identify such outlying quality ratings from unreliable subjects. The estimated subject bias and inconsistency are shown in Fig. 5.3b. The proposed JND-based VQA model indicates that the biases of subjects #04 and #16 are very significant (more than 10 QPs) compared with the others. Furthermore, the proposed model suggests that subjects #16, #26 and #36 exhibit large inconsistency. The observation is evidenced by the noticeable dark dots on some contents.

Fig. 5.3c shows the estimated content difficulty. Content #11 is a scene of toddlers playing in a fountain. Its masking effect is strong due to the water drops in the background and the moving objects; compression artifacts are therefore difficult to perceive, and it has the highest content difficulty. On the other hand, content #203 is a scene captured with a still camera. It focuses on speakers against a static, blurred background. Its content difficulty is low, as the masking effect is weak and compression artifacts are more noticeable. Representative frame thumbnails are given in Fig. 5.2.

The estimated JND locations using the proposed VQA+MLE method and the MOS method are compared in Fig. 5.3d. The proposed method offers more confident estimates, as its confidence intervals are much tighter than those estimated by the MOS method. More importantly, the estimates of the proposed method are significantly different from those of the MOS method. It is well known that the mean value is vulnerable to outliers for a fixed sample size. The proposed method is more robust to noisy subjective scores, which tend to have a negative bias in general.

5.5.3 Experiments on Cleaned JND Data

Next, we remove the outlying JND samples detected by the proposed model; they come from subjects with a large bias value or inconsistent measures. The cleaned JND scores are shown in Fig. 5.4a, and the estimated subject bias and inconsistency are shown in Fig. 5.4b. Note that 5 subjects were identified as unreliable and their quality votes were removed.
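A minimal sketch of such a screening rule is given below. It flags subjects whose estimated bias or inconsistency falls outside a chosen acceptance range; the default limits mirror the "reasonable range" quoted in the next subsection ([-4, 4] for bias and [0, 3.5] for inconsistency) and are assumptions, not the exact rejection rule used for the VideoSet.

```python
import numpy as np

def screen_subjects(b_s, v_s, bias_limit=4.0, inconsistency_limit=3.5):
    """Return a boolean mask of subjects to keep, given their estimated bias
    b_s and inconsistency v_s. The limits are illustrative thresholds."""
    b_s = np.asarray(b_s, dtype=float)
    v_s = np.asarray(v_s, dtype=float)
    return (np.abs(b_s) <= bias_limit) & (v_s <= inconsistency_limit)

# Usage: drop the columns of the JND matrix belonging to flagged subjects.
# keep = screen_subjects(b_s_hat, v_s_hat)     # estimates from the MLE fit
# Y_clean = Y[:, keep]
```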
Figure 5.4: Experimental results on cleaned JND data: (a) the cleaned JND data, where each pixel represents one JND location and a brighter pixel means the JND occurs at a larger QP; (b) the estimated subject bias and inconsistency (i.e., $v_s$); (c) the estimated content difficulty (i.e., $v_c$), using the proposed VQA+MLE method; (d) the estimated JND locations, using both the proposed VQA+MLE method and the MOS method. Error bars in all subfigures represent the 95% confidence interval.

We show the estimated content difficulty in Fig. 5.4c and compare the estimated JND locations of the proposed method and the MOS method in Fig. 5.4d on the cleaned dataset. We see that the proposed VQA+MLE method estimates the relative content difficulty accurately. We also notice that the estimates change considerably for some contents. The reason is that a considerable portion (5/37, or about 13.5%) of the subjects were removed, and the bias and inconsistency of the removed scores have a large influence on the conclusions for those contents.

The estimated JND locations obtained with the MLE method and the MOS method are compared in Fig. 5.4d. The MLE approach offers more reliable estimation, as its confidence intervals are much tighter than those estimated by the MOS method. By comparing Figs. 5.3d and 5.4d, we observe that outlying samples change the distribution of the recovered JND locations in both methods. First, the confidence intervals of the MOS method shrink considerably after cleaning, which reflects the vulnerability of the MOS method to noisy samples; in contrast, the proposed VQA+MLE method is more robust. Second, the recovered JND location increases by 0.5 to 1 QP in both methods after removing the noisy samples. This demonstrates the importance of building a good VQA model and using it to filter out noisy samples.

5.5.4 SUR on Different User Groups

We classify viewers into different groups based on the subject bias estimated from the cleaned JND data. The distributions of the subject bias and inconsistency are given in Fig. 5.5: the left and middle panels show the histograms of the two statistics, and the right panel is a scatter plot of the two factors. For a large percentage of viewers, the bias and inconsistency lie in a reasonable range (i.e., $[-4, 4]$ for the subject bias and $[0, 3.5]$ for the subject inconsistency, respectively), and we do not observe a strong correlation between the two factors. In the following, we use videos #11 and #203 as input contents to demonstrate the effectiveness of the proposed user model.
Under fixed content factors, we compare the SUR differences between different viewer groups. Content #11 has a strong masking effect, so it is difficult to evaluate (HC, "Hard Content"); content #203 has a weak masking effect, so it is easy to evaluate (EC, "Easy Content").

Figure 5.5: Illustration of subject factors. Left: the histogram of the subject bias. Middle: the histogram of the subject inconsistency. Right: the scatter plot of subject inconsistency versus subject bias.

Figure 5.6 shows the effect of subject factors on the SUR curve. There are six SUR curves, obtained by combining the different content factors and subject factors. The input parameters are obtained from the MLE and are set as follows. The subject bias is set to -4, 0, and 4 for HS, NS and ES, respectively, and the subject inconsistency is set to 2 for all subjects. The averaged JND locations are set to 31.7 and 30.39, and the content difficulty levels to 3.962 and 1.326, for clips #11 and #203, respectively.

Figure 5.6: Illustration of the proposed user model. The blue and red curves demonstrate the SUR of the EC and HC contents, respectively. For each content, the three curves show the SUR difference between the user groups.

We have the following two observations.

1. SUR difference for normal users. Consider the middle curves of the EC and HC contents. Subjects in this group have normal sensitivity, and we use this group to represent the majority. Intuitively, the content diversity is large if we visually examine the two clips. However, if we target SUR = 0.75, which is the counterpart of the mean value in the MOS method, the QP locations obtained from the modeled SUR curves are quite close. The difference increases as the SUR deviates from the SUR = 0.75 location. Contents with a weak masking effect (blue curves) are less resistant to compression distortion, and their SUR drops sharply once artifacts become noticeable. In contrast, contents with a strong masking effect (red curves) have better discriminatory power with respect to subject capability, so their SUR curves drop slowly. Given the same extra bitrate quota, we can expect a higher SUR gain from EC than from HC; it takes much more effort to satisfy critical users when the content has a strong masking effect. We conclude that it is essential to study content difficulty and subject capability to better model the perceived quality of compressed video.

2. SUR difference for different user groups. The SUR difference among different user groups on the same content is considerably large; we observe a gap between the three curves for both contents. The SUR curve of normal users is shifted by the subject bias $b_s$ in Eq. (5.12). Although the neutral user group covers the majority of users, we believe that a quality model would better characterize QoE by taking the user capability into consideration.

The above observations can be easily explained using the proposed user model, which shows the value and power of our study.
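The six curves in Fig. 5.6 follow directly from Eq. (5.12). The sketch below reproduces that computation with the parameter values quoted above, using SciPy's survival function as the Gaussian Q-function; the function and variable names are our own.

```python
import numpy as np
from scipy.stats import norm

def sur_curve(qp, y_c, v_c, b_s, v_s):
    """SUR as in Eq. (5.12): Q(qp | y_c + b_s, v_c**2 + v_s**2), i.e. the
    probability that a group member's JND lies beyond the given QP."""
    return norm.sf(qp, loc=y_c + b_s, scale=np.sqrt(v_c**2 + v_s**2))

qp = np.arange(1, 52)
contents = {"HC (#11)": (31.70, 3.962), "EC (#203)": (30.39, 1.326)}
groups = {"HS": -4.0, "NS": 0.0, "ES": 4.0}

curves = {(name, grp): sur_curve(qp, y_c, v_c, b_s, v_s=2.0)
          for name, (y_c, v_c) in contents.items()
          for grp, b_s in groups.items()}

# QP at which each normal-sensitivity curve first drops below SUR = 0.75;
# both contents cross near the same QP, consistent with observation 1 above.
for name in contents:
    sur = curves[(name, "NS")]
    print(name, int(qp[np.argmax(sur < 0.75)]))
```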
5.6 Conclusion

A JND-based VQA model was proposed to analyze measured JND-based VQA data. The model considers subject and content variabilities and determines its parameters by solving an MLE problem iteratively. The technique can be used to remove biased and inconsistent samples and to estimate the content difficulty and the JND locations. Experimental results showed that the proposed methodology is more robust to noisy subjects than the traditional MOS method.

A flexible user model was proposed by considering the subject and content factors in the JND framework. The QoE of a group of users is characterized by the Satisfied User Ratio (SUR), while the JND location of content $c$ from subject $s$ is modeled as a random variable parameterized by subject and content factors. The model parameters can be estimated by the MLE method using a set of JND-based subjective test data. As an application of the proposed user model, we studied how SUR curves are influenced by different user profiles and by contents of different difficulty levels. It was shown that the subject capability significantly affects the SUR curves, especially in the middle range of the quality curve.

The MLE optimization problem may have multiple local maxima, and the iterative optimization procedure may not converge to the global maximum. We would like to investigate this problem more deeply in the future, and we may also look for other parameter estimation methods that are more efficient and robust. The proposed user model provides valuable insights into the quality assessment problem, and we would like to exploit these insights for better SUR prediction on new contents in the future.

Chapter 6
Conclusion and Future Work

6.1 Summary of the Research

In this dissertation, we studied problems related to human-centric video quality assessment: 1) we proposed a JND-based methodology, which measures the distortion threshold and provides fine-grained analysis of perceived video quality; 2) we proposed a Satisfied User Ratio (SUR) prediction system, a machine-learning-based framework that incorporates video quality degradation features and masking effects; and 3) we proposed a probabilistic quality model to quantify the influence of subject and content variabilities, and built a user model based on subjects' capability to discern video quality differences.

The construction of a large-scale compressed video quality dataset based on JND measurement, called the VideoSet, is described in Chapter 3. The VideoSet contains 220 5-second source video clips in four resolutions. We conducted a large-scale subjective test to measure the first three JND points of each source video. The subjective test procedure, the detection and removal of outlying measured data, and the properties of the collected JND data were detailed.

The satisfied-user-ratio (SUR) curve regression and the JND point prediction are discussed in Chapter 4. The proposed method takes both local quality degradation and the masking effect into consideration to extract a compact feature vector. A support vector regressor is trained that is able to predict the SUR curve of new contents, and the JND points can then be derived accordingly. The proposed method achieves good performance on video sequences of all resolutions, with a mean absolute error (MAE) of the SUR smaller than 0.05 on average.

A JND-based VQA model is proposed in Chapter 5 to analyze measured JND-based VQA data. The model considers subject and content variabilities, and the model parameters are estimated by the MLE method using a set of JND-based subjective test data. The model parameters provide many insights into the JND framework.
We propose a user model based on users' capability to discern quality differences, and we study SUR curves as they are influenced by different user profiles and contents of different difficulty levels. It is shown that the subject capability significantly affects the SUR curves, especially in the middle range of the quality curve.

6.2 Future Research Directions

Recent developments in video quality assessment (VQA) have achieved good performance that correlates well with human perception. The JND-based VQA methodology proposed in this dissertation provides an alternative to the traditional MOS-based methods. However, the pristine reference is not always available to the system in many practical applications, such as online video conferencing, real-time video streaming, and live broadcasting, just to name a few. In addition, video quality has so far been analyzed on a resolution-by-resolution basis, whereas in video streaming services scaling is as common as compression. It is therefore interesting and important to extend the JND prediction model to the case where video contents are captured in a "distorted" format and delivered in multiple resolutions.

There are several research directions for extending the proposed JND-based framework: 1) flexible-reference JND analysis, which can provide a practical solution to the adaptive video streaming problem by determining a suitable bitrate for any target SUR; and 2) cross-resolution JND analysis, which takes the scaling factor into consideration. We bring up the following specific problems for future research.
1. Flexible-reference JND analysis: incorporate the three modeled SUR curves into one global SUR curve. We could then take any point on the aggregated SUR curve as a new "reference" and derive the corresponding SUR curves.
2. Cross-resolution JND analysis: introduce a scaling factor into the JND VQA model. The scaling factor can be estimated by studying the relationship between the SUR curves of different resolutions.

6.2.1 JND Against Flexible References

The JND/SUR measurements conducted in the VideoSet adopted fixed reference locations. For example, the first JND uses the losslessly coded clip (QP = 0) as the reference; the first JND then serves as the reference for the second JND, the second JND serves as the reference for the third JND, and so on. However, it is desirable in practice to quantify the upper and lower JND locations with respect to the current operational bitrate. The "reference" in this case can be at any bitrate. Since it is not feasible to conduct subjective tests for all operational points of interest, it is an important task to generalize the current JND/SUR framework to a flexible reference.

The idea is illustrated in Fig. 6.1. In the fixed reference model, multiple JND points are obtained from small to large QP values in a sequential manner, i.e., we need to obtain JND #1 before approaching JND #2. This is a special case of the flexible reference model, where the reference location can be any bitrate. The left JND point gives the bitrate target we need to go beyond if we want to improve the current subjective video quality. On the other hand, we should keep the bitrate slightly higher than the right JND point when performing bitrate reduction.

Figure 6.1: Illustration of the fixed reference model and the flexible reference model.

We would like to extend our current JND prediction system to handle the flexible reference case and demonstrate its effectiveness in experiments.
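As a rough sketch of how a flexible-reference system might use such information at run time, the snippet below looks up the JND points that bracket the current operating point. It works on the QP axis (lower QP means higher bitrate and quality) rather than the bitrate axis used in the discussion above, and both the list of JND QPs and the helper name are hypothetical; how those JND points would actually be predicted for an arbitrary anchor remains the open problem stated here.

```python
import bisect

def neighboring_jnd(jnd_qps, current_qp):
    """Given a clip's JND locations as QPs in increasing order, return the
    JND just below and just above the current operating QP. Crossing the
    lower one (smaller QP, i.e. higher bitrate) is needed to visibly improve
    quality, while staying just below the upper one keeps a bitrate
    reduction perceptually transparent."""
    i = bisect.bisect_left(jnd_qps, current_qp)
    lower = jnd_qps[i - 1] if i > 0 else None
    upper = jnd_qps[i] if i < len(jnd_qps) else None
    return lower, upper

# Hypothetical JND QPs for one clip and a current operating point of QP 30.
print(neighboring_jnd([27, 33, 38], 30))   # (27, 33)
```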
6.2.2 Cross-Resolution JND Analysis

We have so far focused on JND analysis and prediction among video clips of the same resolution; videos of different resolutions are processed independently. In video streaming services, scaling is as common as compression, so it is interesting and important to extend the JND prediction model to video contents delivered in multiple resolutions.

The cross-resolution JND study is motivated by the following observations. First, it is not feasible to conduct JND subjective tests for every resolution, so a good prediction model should generalize well to new resolutions. The VideoSet has JND data for four resolutions: 360p, 540p, 720p and 1080p. It is meaningful to study the correlations between the observed JND points for the same content at different resolutions. Second, it is common for a streaming service provider to deliver a low-resolution version of the content, which is then up-scaled to match the resolution of the monitor at the user end. Fig. 6.2 shows the pipeline where up-scaling is introduced.

Figure 6.2: Scaling factors in a practical video streaming service pipeline.

We may compare the following scenarios:
1. A content with original resolution 1080p, encoded at two bitrates (say, 3 Mbps and 10 Mbps), and then displayed at 1080p;
2. The same content down-scaled to 720p, encoded at the same two bitrates (say, 3 Mbps and 10 Mbps), and then up-scaled to 1080p for display.

At the high bitrate (10 Mbps), the higher-resolution clip will yield better quality. However, this is not necessarily true at the low bitrate (3 Mbps) due to the overhead of the higher-resolution content. Thus, it is essential to incorporate the scaling factor into the JND prediction model.

Bibliography

[1] L. Aimar, L. Merritt, E. Petit, M. Chen, J. Clay, M. Rullgård, C. Heine, and A. Izvorski. x264 - a free H.264/AVC encoder. http://www.videolan.org/developers/x264.html, 2005. Accessed: 04/01/07.
[2] U. Ansorge, G. Francis, M. H. Herzog, and H. Öğmen. Visual masking and the dynamics of human perception, cognition, and consciousness: a century of progress, a contemporary synthesis, and future directions. Advances in Cognitive Psychology, 3(1-2):1, 2007.
[3] C. Blender Foundation. mango.blender.org. Accessed: 2016-11-22.
[4] B. G. Breitmeyer and L. Ganz. Implications of sustained and transient channels for theories of visual pattern masking, saccadic suppression, and information processing. Psychological Review, 83(1):1, 1976.
[5] ITU-R BT.2022. General viewing conditions for subjective assessment of quality of SDTV and HDTV television pictures on flat panel displays. International Telecommunication Union, 2012.
[6] CableLabs. http://www.cablelabs.com/resources/4k/. Accessed: 2016-11-22.
[7] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei. A crowdsourceable QoE evaluation framework for multimedia content. In Proceedings of the 17th ACM International Conference on Multimedia, pages 491–500. ACM, 2009.
[8] M.-J. Chen and A. C. Bovik. Fast structural similarity index algorithm. Journal of Real-Time Image Processing, 6(4):281–287, 2011.
[9] Z. Chen and C. Guillemot. Perceptually-friendly H.264/AVC video coding based on foveated just-noticeable-distortion model. IEEE Transactions on Circuits and Systems for Video Technology, 20(6):806–819, 2010.
[10] V. D. E. Committee et al. Generic coding of moving pictures and associated audio. Recommendation H.262, 13818–2, 1994.
[11] P. Corriveau and A. Webster.
Final report from the video quality experts group on the validation of objective models of video quality assessment, phase ii. Tech. Rep., 2003. [12] D. Culibrk, M. Mirkovic, V . Zlokolica, M. Pokric, V . Crnojevic, and D. Kukolj. Salient motion features for video quality assessment. IEEE Transactions on Image Processing, 20(4):948–958, 2011. [13] S. J. Daly. Visible differences predictor: an algorithm for the assessment of image fidelity. In Human Vision, Visual Processing, and Digital Display III, volume 1666, pages 2–16. International Society for Optics and Photonics, 1992. [14] J. M. Foley. Human luminance pattern-vision mechanisms: masking experiments require a new model. JOSA A, 11(6):1710–1719, 1994. [15] V . Q. E. Group et al. Report on the validation of video quality models for high def- inition video content. VQEG, Geneva, Switzerland, Tech. Rep.[Online]. Available: http://www. its. bldrdoc. gov/vqeg/projects/hdtv/hdtv. aspx, 2010. [16] F. E. Grubbs. Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, pages 27–58, 1950. [17] I. P. Gunawan and M. Ghanbari. Reduced-reference video quality assessment using discriminative local harmonic strength with motion consideration. IEEE Transac- tions on Circuits and Systems for Video Technology, 18(1):71–83, 2008. [18] H. Hadizadeh and I. V . Bajic. Saliency-aware video compression. IEEE Transac- tions on Image Processing, 23(1):19–33, 2014. [19] F. Hermens, G. Luksys, W. Gerstner, M. H. Herzog, and U. Ernst. Modeling spatial and temporal aspects of visual backward masking. Psychological Review, 115(1):83, 2008. [20] J. E. Hoffman. Visual attention and eye movements. Attention, 31:119–153, 1998. [21] S. Hu, L. Jin, H. Wang, Y . Zhang, S. Kwong, and C.-C. J. Kuo. Compressed image quality metric based on perceptually weighted distortion. IEEE Transactions on Image Processing, 24(12):5594–5608, 2015. [22] S. Hu, L. Jin, H. Wang, Y . Zhang, S. Kwong, and C.-C. J. Kuo. Objective video quality assessment based on perceptually weighted mean squared error. IEEE Transactions on Circuits and Systems for Video Technology, 27(9):1844–1855, 2017. 97 [23] S. Hu, H. Wang, and C.-C. J. Kuo. A GMM-based stair quality model for human perceived JPEG images. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 1070–1074. IEEE, 2016. [24] Q. Huang, H. Wang, S. C. Lim, H. Y . Kim, S. Y . Jeong, and C.-C. J. Kuo. Measure and prediction of HEVC perceptually lossy/lossless boundary QP values. In Data Compression Conference (DCC), 2017, pages 42–51. IEEE, 2017. [25] ITU-T P.910. Subjective video quality assessment methods for multimedia appli- cations. 1999. [26] L. Janowski and M. Pinson. The accuracy of subjects in a quality experiment: A theoretical subject model. IEEE Transactions on Multimedia, 17(12):2210–2224, 2015. [27] C. M. Jarque and A. K. Bera. A test for normality of observations and regres- sion residuals. International Statistical Review/Revue Internationale de Statis- tique, pages 163–172, 1987. [28] Y . Jia, W. Lin, and A. A. Kassim. Estimating just-noticeable distortion for video. IEEE Transactions on Circuits and Systems for Video Technology, 16(7):820–829, 2006. [29] L. Jin, J. Y . Lin, S. Hu, H. Wang, P. Wang, I. Katsavounidis, A. Aaron, and C.- C. J. Kuo. Statistical study on perceived JPEG image quality via MCL-JCI dataset construction and analysis. Electronic Imaging, 2016(13):1–9, 2016. [30] J. Kim, S.-H. Bae, and M. Kim. 
An HEVC-compliant perceptual video coding scheme based on JND models for variable block-sized transform kernels. IEEE Transactions on Circuits and Systems for Video Technology, 25(11):1786–1800, 2015. [31] Q. Li, Y . Li, J. Gao, L. Su, B. Zhao, M. Demirbas, W. Fan, and J. Han. A confidence-aware approach for truth discovery on long-tail data. Proceedings of the VLDB Endowment, 8(4):425–436, 2014. [32] S. Li, F. Zhang, L. Ma, and K. N. Ngan. Image quality assessment by separately evaluating detail losses and additive impairments. IEEE Transactions on Multime- dia, 13(5):935–949, 2011. [33] X. Li, Q. Guo, and X. Lu. Spatiotemporal statistics for video quality assessment. IEEE Transactions on Image Processing, 25(7):3329–3342, 2016. [34] Y . Li, L.-M. Po, C.-H. Cheung, X. Xu, L. Feng, F. Yuan, and K.-W. Cheung. No- reference video quality assessment with 3D shearlet transform and convolutional 98 neural networks. IEEE Transactions on Circuits and Systems for Video Technology, 26(6):1044–1057, 2016. [35] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara. Toward a prac- tical perceptual video quality metric. The Netflix Tech Blog, 6, 2016. [36] Z. Li and C. G. Bampis. Recover subjective quality scores from noisy measure- ments. In Data Compression Conference (DCC), 2017, pages 52–61. IEEE, 2017. [37] J. Y . Lin, L. Jin, S. Hu, I. Katsavounidis, Z. Li, A. Aaron, and C.-C. J. Kuo. Exper- imental design and analysis of JND test on coded image/video. In SPIE Opti- cal Engineering+ Applications, pages 95990Z–95990Z. International Society for Optics and Photonics, 2015. [38] J. Y . Lin, R. Song, C.-H. Wu, T. Liu, H. Wang, and C.-C. J. Kuo. MCL-V: A streaming video quality assessment database. Journal of Visual Communication and Image Representation, 30:1 – 9, 2015. [39] W. Lin and C.-C. J. Kuo. Perceptual visual quality metrics: A survey. Journal of Visual Communication and Image Representation, 22(4):297–312, 2011. [40] Q. Liu, A. T. Ihler, and M. Steyvers. Scoring workers in crowdsourcing: How many control questions are enough? In Advances in Neural Information Processing Systems, pages 1914–1922, 2013. [41] Q. Liu, J. Peng, and A. T. Ihler. Variational inference for crowdsourcing. In Advances in neural information processing systems, pages 692–700, 2012. [42] T.-J. Liu, W. Lin, and C.-C. J. Kuo. Image quality assessment using multi-method fusion. IEEE Transactions on Image Processing, 22(5):1793–1807, 2013. [43] Z. Luo, L. Song, S. Zheng, and N. Ling. H. 264/advanced video control perceptual optimization coding based on JND-directed coefficient suppression. IEEE Trans- actions on Circuits and Systems for Video Technology, 23(6):935–948, 2013. [44] L. Ma, S. Li, and K. N. Ngan. Reduced-reference video quality assessment of com- pressed video sequences. IEEE Transactions on Circuits and Systems for Video Technology, 22(10):1441–1456, 2012. [45] L. Ma, K. N. Ngan, F. Zhang, and S. Li. Adaptive block-size transform based just-noticeable difference model for images/videos. Signal Processing: Image Communication, 26(3):162–174, 2011. [46] R. Mantiuk, K. J. Kim, A. G. Rempel, and W. Heidrich. HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. In ACM Transactions on Graphics (TOG), volume 30, page 40. ACM, 2011. 99 [47] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. De Veciana. Video quality assess- ment on mobile devices: Subjective, behavioral and objective studies. 
IEEE Jour- nal of Selected Topics in Signal Processing, 6(6):652–671, 2012. [48] M. Narwaria and W. Lin. Objective image quality assessment based on support vector regression. IEEE Transactions on Neural Networks, 21(3):515–519, 2010. [49] M. Narwaria, R. K. Mantiuk, M. P. Da Silva, and P. Le Callet. HDR-VDP-2.2: a calibrated method for objective quality prediction of high-dynamic range and standard images. Journal of Electronic Imaging, 24(1):010501–010501, 2015. [50] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba. Considering temporal vari- ations of spatial visual distortions in video quality assessment. IEEE Journal of Selected Topics in Signal Processing, 3(2):253–265, 2009. [51] T.-S. Ou, Y .-H. Huang, and H. H. Chen. SSIM-based perceptual rate control for video coding. IEEE Transactions on Circuits and Systems for Video Technology, 21(5):682–691, 2011. [52] Y .-F. Ou, Y . Xue, and Y . Wang. Q-star: a perceptual video quality model consider- ing impact of spatial, temporal, and amplitude resolutions. IEEE Transactions on Image Processing, 23(6):2473–2486, 2014. [53] M. H. Pinson. The consumer digital video library [best of the web]. http: //www.cdvl.org/resources/index.php, 2013. [54] M. H. Pinson and S. Wolf. A new standardized method for objectively measuring video quality. IEEE Transactions on Broadcasting, 50(3):312–322, 2004. [55] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, and V . Lukin. On between-coefficient contrast masking of DCT basis functions. In Proceedings of the Third International Workshop on Video Processing and Quality Metrics, volume 4, 2007. [56] I. Recommendation. P. 910,subjective video quality assessment methods for mul- timedia applications,. International Telecommunication Union, Tech. Rep, 2008. [57] I.-T. Recommendation. H.264: Advanced video coding for generic audiovisual services, 2003. [58] M. A. Saad, A. C. Bovik, and C. Charrier. Blind prediction of natural video quality. IEEE Transactions on Image Processing, 23(3):1352–1365, 2014. [59] I. SANDVINE. Global internet phenomena report. 2016. 100 [60] B. Series. Methodology for the subjective assessment of the quality of television pictures. Recommendation ITU-R BT, pages 500–13, 2012. [61] K. Seshadrinathan and A. C. Bovik. Motion tuned spatio-temporal quality assess- ment of natural videos. IEEE Transactions on Image Processing, 19(2):335–350, 2010. [62] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack. Study of subjective and objective quality assessment of video. IEEE Transactions on Image Processing, 19(6):1427–1441, 2010. [63] G. Sharma and R. Bala. Digital color imaging handbook. CRC press, 2002. [64] H. R. Sheikh and A. C. Bovik. Image information and visual quality. IEEE Trans- actions on Image Processing, 15(2):430–444, 2006. [65] A. J. Smola and B. Sch¨ olkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004. [66] R. Soundararajan and A. C. Bovik. Video quality assessment by reduced reference spatio-temporal entropic differencing. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):684–694, 2013. [67] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1649–1668, 2012. [68] K. Turkowski. Graphics gems. chapter Filters for Common Resampling Tasks, pages 147–165. Academic Press Professional, Inc., San Diego, CA, USA, 1990. [69] R. Vanrullen and S. J. Thorpe. 
The time course of visual processing: from early perception to decision-making. Journal of Cognitive Neuroscience, 13(4):454– 461, 2001. [70] H. Wang, W. Gan, S. Hu, J. Y . Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo. MCL-JCV: a JND-based H.264/A VC video quality assessment dataset. In Image Processing (ICIP), 2016 IEEE International Confer- ence on, pages 1509–1513. IEEE, 2016. [71] H. Wang, I. Katsavounidis, Q. Huang, X. Zhou, and C.-C. J. Kuo. Prediction of satisfied user ratio for compressed video. arXiv preprint arXiv:1710.11090, 2017. [72] H. Wang, I. Katsavounidis, J. Zhou, J. Park, S. Lei, X. Zhou, M.-O. Pun, X. Jin, R. Wang, X. Wang, et al. VideoSet: A large-scale compressed video quality dataset based on JND measurement. Journal of Visual Communication and Image Repre- sentation, 46:292–302, 2017. 101 [73] H. Wang, X. Zhang, C. Yang, and C.-C. J. Kuo. A JND-based video quality assess- ment model and its application. arXiv preprint arXiv:1807.00920, 2018. [74] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao. SSIM-motivated rate- distortion optimization for video coding. IEEE Transactions on Circuits and Sys- tems for Video Technology, 22(4):516–529, 2012. [75] S. Wang, A. Rehman, Z. Wang, S. Ma, and W. Gao. Perceptual video coding based on SSIM-inspired divisive normalization. IEEE Transactions on Image Process- ing, 22(4):1418–1429, 2013. [76] Y . Wang, T. Jiang, S. Ma, and W. Gao. Novel spatio-temporal structural infor- mation based video quality metric. IEEE transactions on circuits and systems for video technology, 22(7):989–998, 2012. [77] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assess- ment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004. [78] Z. Wang, L. Lu, and A. C. Bovik. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2):121– 132, 2004. [79] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, volume 2, pages 1398– 1402. IEEE, 2003. [80] F. L. Wauthier and M. I. Jordan. Bayesian bias mitigation for crowdsourcing. In Advances in neural information processing systems, pages 1800–1808, 2011. [81] Z. Wei and K. N. Ngan. Spatio-temporal just noticeable distortion profile for grey scale image/video in DCT domain. IEEE Transactions on Circuits and Systems for Video Technology, 19(3):337–346, 2009. [82] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035– 2043, 2009. [83] S. Wolf and M. Pinson. Video quality model for variable frame delay (VQM VFD). National Telecommunications and Information Administration NTIA Tech- nical Memorandum TM-11-482, 2011. 102 [84] H. R. Wu, A. R. Reibman, W. Lin, F. Pereira, and S. S. Hemami. Perceptual visual signal compression and transmission. Proceedings of the IEEE, 101(9):2025– 2043, 2013. [85] Q. Wu, H. Li, F. Meng, and K. N. Ngan. A perceptually weighted rank correlation indicator for objective image quality assessment. IEEE Transactions on Image Processing, 27(5):2499–2513, May 2018. [86] Z. Xu, Y . Yang, and A. G. Hauptmann. A discriminative cnn video representation for event detection. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1798–1807, 2015. [87] P. Ye and D. Doermann. Active sampling for subjective image quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, pages 4249–4256, 2014. [88] J. You, T. Ebrahimi, and A. Perkis. Attention driven foveated video quality assess- ment. IEEE Transactions on Image Processing, 23(1):200–213, 2014. [89] L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378– 2386, 2011. [90] W. Zhang and H. Liu. Study of saliency in objective video quality assessment. IEEE Transactions on Image Processing, 26(3):1275–1288, 2017. [91] W. Zhang, R. R. Martin, and H. Liu. A saliency dispersion measure for improving saliency-based image quality metrics. IEEE Transactions on Circuits and Systems for Video Technology, 28(6):1462–1466, 2018. [92] X. Zhang, S. Wang, K. Gu, W. Lin, S. Ma, and W. Gao. Just-noticeable difference based perceptual optimization for JPEG compression. IEEE Signal Processing Letters, 24(1):96–100, 2017. [93] D. Zhou, S. Basu, Y . Mao, and J. C. Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in neural information processing systems, pages 2195–2203, 2012. 103
Abstract
The problem of human-centric compressed video quality assessment (VQA) is studied in this research. Our studies include three major topics: 1) proposing a new methodology for compressed video quality measurement and assessment based on the just-noticeable-difference (JND) notion and building a large-scale dataset accordingly, 2) measuring the JND-based video quality using the satisfied user ratio (SUR) curve and designing an SUR prediction method with video quality degradation features and masking features, and 3) proposing a probabilistic JND-based video quality model to quantify the influence of subject variabilities as well as content variabilities and building a user model based on viewers' capability to address inter-group differences.

For the first topic, the process of building a large-scale coded H.264/AVC video quality dataset, which measures human subjective experience based on the just-noticeable-difference (JND), is described in Chapter 3. The dataset, called the VideoSet, measures the first three JND points of 220 5-second sequences, each at four resolutions (i.e., 1920 × 1080, 1280 × 720, 960 × 540 and 640 × 360). Each of these 880 video clips was encoded by the H.264/AVC standard with QP = 1, ..., 51. An improved bisection search algorithm was adopted to speed up the subjective test without loss of robustness. We present the subjective test procedure, the detection and removal of outlying measured data, and the properties of the collected JND data.

For the second topic, we propose a machine learning method to predict the satisfied-user-ratio (SUR) curves based on the VideoSet and then derive the JND points accordingly. Our method consists of the following steps. First, we partition a video clip into local spatial-temporal segments and evaluate the quality of each segment using the VMAF quality index. Next, we aggregate these local VMAF measures to derive a global index. Then, significant segments are selected based on the slope of quality scores between neighboring coded clips. After that, we incorporate the masking effect that reflects the unique characteristics of each video clip. Finally, we use support vector regression (SVR) to minimize the L₂ distance of the SUR curves, and derive the JND point accordingly.

For the third topic, we propose a JND-based VQA model that takes subject variabilities and content variabilities into account. The model parameters used to describe subject and content variabilities are jointly optimized by solving a maximum likelihood estimation (MLE) problem. We use subject inconsistency to filter out unreliable video quality scores. Moreover, we build a user model by utilizing users' capability to discern the quality difference. We study the SUR difference as it varies with the user profile as well as with content of varying difficulty. The proposed model aggregates quality ratings per user group to address inter-group differences.