FOCUS MISMATCH COMPENSATION AND COMPLEXITY REDUCTION TECHNIQUES FOR MULTIVIEW VIDEO CODING

by PoLin Lai

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2009

Copyright 2009 PoLin Lai

Dedication

To my family.

Acknowledgements

I would like to thank my advisor, Prof. Antonio Ortega, who, through the years, has shaped my idea of what research is all about. I would like to thank Prof. C.-C. Jay Kuo and Prof. Ulrich Neumann for serving on my dissertation committee, and Prof. Alexander A. Sawchuk and Prof. Ramakant Nevatia for serving on my qualifying exam committee. It is a privilege to have their valuable advice on my work.

I would like to thank Yeping Su, Peng Yin, Purvin Pandit, Dong Tian and Cristina Gomila from Thomson Corporate Research, who provided tremendous support in all these years. I would also like to thank Congxia Dai and Yunfei Zheng for fruitful discussions during my summer internships. To all the colleagues I met in our Compression Group: it has been a great pleasure to work with you. In particular, I would like to thank Jae Hoon Kim for unforgettable years of collaboration and for all the sharing. I would also like to thank Ivy Tseng, May-Chen Kuo, C.K. Chen, Helen Ho from USC, and Julan Hsu from UCLA, for the wonderful friendship that strengthens my will.

Mom, Dad, and my dear brother, thank you for the love, the belief, and the encouragement, which make me a person with strong faith in life, to stand, to enjoy, and to embrace.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Multiview Video
  1.2 Multiview Video Coding (MVC)
  1.3 Contributions of This Research
Chapter 2: Focus Mismatch in Video Content
  2.1 Introduction
  2.2 Review: Characteristics of Images Captured with Lens
  2.3 Focus Mismatch Due to Focus Setting Differences
    2.3.1 Multiview with 1D Parallel Camera Arrangement
    2.3.2 A Numerical Example
  2.4 Summary
Chapter 3: Adaptive Reference Filtering (ARF) for Video Coding
  3.1 Introduction
  3.2 Filtering Model for Video Coding with Focus Mismatch
  3.3 Adaptive Reference Filtering
    3.3.1 Frame Partition for Adaptive Filter Design
    3.3.2 Filter Design by Estimating Mismatch Kernels
    3.3.3 Encoding with Filtered References
  3.4 ARF for MVC Bi-directional Disparity Compensation with Focus Mismatch
    3.4.1 Inter-view Bi-directional Prediction with Focus Mismatch
    3.4.2 ARF and Bi-directional Disparity Search
      3.4.2.1 Filter Design for Averaged Bi-predictor
      3.4.2.2 Filter Design for Predictors from Each Reference List
  3.5 Simulation Results
  3.6 Complexity Analysis
    3.6.1 Complexity of ARF Filter Design
  3.7 Conclusions
Chapter 4: Computationally Efficient ARF for MVC Inter-view Coding Based on Rate-distortion Prediction and Filter Sharing
  4.1 Motivation
  4.2 Computationally Efficient ARF for Inter-view Coding
    4.2.1 Rate-distortion Analysis and View-wise ARF Adaptation
    4.2.2 Filter Correlation and Filter Updating Using Depth Composition
  4.3 Simulation Results
  4.4 Conclusions
Chapter 5: Predictive Fast Motion/Disparity Search for Multiview Video Coding
  5.1 Introduction
  5.2 Predictive Fast Search with Candidate Vectors Obtained from Local Motion/Disparity Information
    5.2.1 Predictive Search I: Disparity then Motion (DtM)
    5.2.2 Predictive Search II: Motion then Disparity (MtD)
  5.3 Displacement Estimation under Illumination Changes
  5.4 Experiment Design and Results
    5.4.1 Investigation of Efficient Search Patterns
    5.4.2 H.264/AVC-based Simulations
  5.5 Conclusions
Chapter 6: Conclusions and Future Work
References

List of Tables

3.1 Filter design methods for bi-directional disparity compensation
4.1 Encoding selection of the proposed efficient ARF: Ballroom (20 anchors per view. The first four, a1–a4, are encoded with two-pass ARF. If it is determined that the remaining anchors also need filtering, the anchors at which filters are updated are also listed in the table, and the other anchors are encoded by re-using the filters.)
4.2 Encoding selection of the proposed efficient ARF: Rena (20 anchors per view. For views that require filtering, since the depth composition remains unchanged, the filters are re-used throughout the remaining anchors.)
4.3 Encoding selection of the proposed efficient ARF: Race1 (35 anchors per view. a1–a4 and a21–a24 are encoded with two-pass ARF. If it is determined that the remaining anchors need filtering, the anchors at which filters are updated are also listed in the table, and the other anchors are encoded by re-using the filters.)
5.1 Comparison of different sets of MV candidates, PSNR of residue images
5.2 Parameters of the sequences and simulation settings for Section 5.4

List of Figures

1.1 Simulcast for multiview video coding
1.2 An example of MVC structure which utilizes joint MCP/DCP. The arrows indicate prediction direction.
2.1 Camera arrangement that causes localized focus mismatch
2.2 Portions of frame 15 from different views of Race1
2.3 Example of focus change in monoscopic video. (Images from Stanford Computer Graphics Lab., Light Field Photography with a Hand-Held Plenoptic Camera, © Ren Ng)
2.4 The model of a camera equipped with a lens
2.5 (a) Projected images via a lens for points at different depths, (b) Variations of Z*, (c) Variations of β
2.6 β as functions of k and 1/k (Z*_{V1} = 0.7 Z*_{V2}, c = 0.7)
2.7 The magnitude of the OTF at different k (Z*_{V1} = 0.7 Z*_{V2}, i.e., c = 0.7)
2.8 Filter responses at different k', with c = 0.7, (1/2)(1 + 1/c) ≈ 1.2143
2.9 Variation of the ±3dB frequencies
2.10 A numerical example of focus mismatch in a multiview system
3.1 Flowchart of Adaptive Reference Filtering for video coding
3.2 Disparity vectors from view 6 to view 7 at the 1st frame in Ballroom: histogram and GMM
3.3 The corresponding frame partition result of Fig. 3.2
3.4 Frame partition result: Breakdancer View 1 frame 0
3.5 Frame partition result: Race1 View 2 frame 30
3.6 Variations of MSE when shifting f_mb parameters away from the MMSE solution (3.4) by (Δa, Δb, Δc) = n·u_i. The four curves represent results with different u_i. ×: u_1 = (−0.05, 0.025, 0.1); ◦: u_2 = (0.025, −0.05, 0.1); □: u_3 = (−0.07, 0.0513, 0.0748); △: u_4 = (0.07, −0.0805, 0.0419). Note that (i) the norms of these u_i are the same, and (ii) 4Δa + 4Δb + Δc = 0 such that the shifted filters will still be on P_f.
3.7 An example of frame partition based on focus changes
3.8 Performance of ARF using different filter sizes/constraints (QP22)
3.9 Frequency responses of estimated filters when performing inter-view prediction from Race1 V2 to V3 at Anchor 9. V3 is slightly blurred w.r.t. reference V2.
3.10 Frequency responses of estimated filters when performing inter-view prediction from Race1 V4 to V3 at Anchor 3. Reference V4 is blurred w.r.t. V3.
3.11 Frequency responses of estimated filters when performing inter-view prediction from Race1 V6 to V5 at Anchor 7. V5 is strongly blurred w.r.t. reference V6.
3.12 Encoding selection with adaptive filtering
3.13 An example of focus mismatch in multiview bi-prediction, with Z*_{V1} = 1.9m, Z*_{V2} = 2.0m, and Z*_{V3} = 2.3m. We consider image sensor type 1/2" (H×W = 6.4mm×4.8mm) with a resolution of 640×480 pixels, i.e., the spacing between pixels is 0.01mm (Nyquist rate 100/2 = 50 cycles/mm). In the polar system, q = √(50² + 50²) ≈ 70.71, which corresponds to Ω = π in (b) and (c).
3.14 Comparison between two ARF methods on inter-view coding
3.15 Comparison of different techniques applied to inter-view coding
3.16 Rate-distortion performance of the proposed ARF
3.17 Performance of constrained fast search for ARF-B
3.18 Rate-distortion comparison of different approaches
4.1 ARF performance in different views of Ballroom
4.2 ARF performance in different views of Race1
4.3 Frame-wise RD-cost reduction provided by ARF
4.4 Correlation of estimated filter coefficients at different timestamps
4.5 Examples of RD performance when filters are re-used
4.6 Encoding results of the proposed coding scheme (QP = 37, 32, 27, 22)
4.7 Images at different anchor timestamps for the sequences tested
5.1 The regularization constraint on motion/disparity fields
5.2 (a) Left: the structure of predicting the motion field from the reference view (DtM). (b) Right: obtaining MV candidates to predict the motion field
5.3 (a) Left: the structure of predicting the disparity field from a time instance (MtD). (b) Right: obtaining DV candidates to predict the disparity field
5.4 The search pattern that uses all candidate vectors
5.5 Comparison of different search patterns
5.6 The effect of changing QP for the anchor frames
5.7 Predictive search simulation results for the Aqua sequence
5.8 Predictive search simulation results for the Ballroom sequence
5.9 Predictive search simulation results for the ST sequence

Abstract

Multiview video systems utilize multiple cameras to simultaneously capture the scene from different viewpoints. They provide video data for new applications such as 3D television and free-viewpoint video. The amount of data in multiview video is very large as compared to monoscopic video. Multiview video coding (MVC) is an emerging research field that focuses on compression of multiview video data. In this dissertation, by exploiting special characteristics of multiview video, we develop techniques that improve MVC efficiency while also taking complexity into account.

First, we analyze the focus mismatch exhibited in video content, which is caused by camera focus setting differences. We use geometrical optics to demonstrate how focus settings affect the captured images. We show that the focus mismatch can be represented in terms of the focus setting parameters (camera dependency) and the depths of objects (depth dependency). For 1D parallel camera arrangements in multiview systems, we relate the focus mismatch to the disparity exhibited in frames from different views. The analytical results provide properties that can be exploited to design focus mismatch compensation techniques.

Based on this analysis, we propose a novel adaptive reference filtering (ARF) approach. For MVC inter-view prediction, we exploit the depth-dependency property by utilizing disparity information to partition frames into depth levels, which are prone to suffer from different types of focus mismatch. For each level, a 2D filter is designed by minimizing the prediction error. Filtered references are then generated for predictive coding. We also extend ARF to monoscopic video, where no disparity information is available. Simulation results demonstrate higher coding efficiency as compared to multiple-reference prediction and adaptive interpolation filtering methods.

The third part of this thesis presents complexity reduction techniques for MVC. By analyzing ARF results, we propose i) view-wise ARF adaptation based on RD-cost prediction, and ii) filter updating based on depth-composition change, to achieve computationally efficient ARF schemes. By exploiting the relationship between motion and disparity, we propose predictive fast search algorithms that can be used when one of the fields is available and we wish to estimate the other field efficiently. Simulation results show that significant complexity reduction can be achieved without significant impact on coding efficiency.

Chapter 1: Introduction

1.1 Multiview Video

In multiview video systems, scenes are captured simultaneously by multiple cameras. These cameras are set to shoot the scenes from different viewpoints. Depending on the application scenario, multiview systems can be built with cameras arranged on a horizontal line with parallel orientation, on an arc with viewing angles converging to the center of the scene, or even with cameras mounted as a two-dimensional camera array. These systems provide digital video data that is essential for two main application categories: 3D television (3DTV) and free-viewpoint video (FVV) [40].

3DTV aims to provide 3D perception experiences using advanced display technologies which take multiview video as input. A general approach to achieve this goal is to simultaneously render multiple videos from different viewing angles, such that when the user views the display device from different physical locations, different stereoscopic views are perceived. Examples of such display systems include the one proposed by Mitsubishi Electric Research Laboratories (MERL), which uses multiple projectors [33], and a special TV system designed by Philips Electronics [9], which processes multiview video plus depth data to render video for different viewing angles. The main applications for 3DTV are live-event broadcasting, advanced theater, and immersive virtual reality.

As for FVV applications, at the receiver side users can navigate across different viewpoints by changing the viewing angle to be displayed on a conventional screen. Assume N views are captured by the multiview system. If the requested viewing angle coincides with one of the captured views, the corresponding video can be displayed directly. On the other hand, if the requested viewing angle differs from all of the captured views, a view synthesis process has to be applied, which takes n ≤ N adjacent views to render and display that particular viewpoint. Applications of such FVV systems include education, such as archives and medical surgery; entertainment, such as sports broadcasting and gaming systems; and security, such as surveillance and monitoring tasks.

To provide the content for these applications, multiview video has to be captured, stored, and transmitted. It contains very large amounts of data as compared to conventional monoscopic video. If each view is considered independently, the amount of added data will increase with the number of views in the multiview system. Multiview video coding (MVC), which focuses on compression for efficient storage and transmission of multiview video data, has been recognized as a key technology to support any application that requires multiview video. In the next section, we briefly review recent developments in MVC and introduce the terminology that describes the coding schemes used throughout this dissertation.
1.2 Multiview Video Coding (MVC)

Simulcast is a straightforward approach for multiview video coding, in which each view sequence is encoded independently (see Fig. 1.1). This allows temporal redundancy to be exploited using conventional block-based motion compensated prediction (MCP) techniques. In a multiview video scenario, because different views are capturing the same scene, there exists an additional source of redundancy, namely, inter-view redundancy. Similar to motion compensated prediction, we can use a block matching procedure to find block correspondences between neighboring views and encode the displacement vector and prediction difference (residue), through disparity compensated prediction (DCP). This inter-view redundancy is not exploited in the straightforward simulcast scheme. An MVC structure that exploits both temporal and inter-view redundancy can be constructed as follows: a given frame in view v at time t can use reconstructed frames within view v as temporal reference(s) for MCP, while using reconstructed frames from other views as inter-view reference(s) for DCP. We will denote such a coding method as joint MCP/DCP. In the MPEG-2 standard, a multiview profile (MVP) [8] was defined which utilizes a two-layer coding scheme (base layer and enhancement layer). View sequences in the enhancement layer can use either MCP, DCP, or joint MCP/DCP. It has been demonstrated that allowing joint MCP/DCP in MPEG-2 MVP achieves higher coding efficiency as compared to simulcast [8]. This coding scheme can be extended to MPEG-4 with its temporal scalability tool [56]. The coding efficiency is further improved by introducing weighted averaging combined with block partition for joint MCP/DCP [11].

Figure 1.1: Simulcast for multiview video coding

With coding tools such as variable block size MCP, quarter-pixel motion estimation, and new content-adaptive entropy coding, the H.264/AVC standard [54], established by the Joint Video Team (JVT), achieves superior coding efficiency as compared to the preceding video coding standards. For multiview video coding, Li and He [30] proposed to use the multiple reference prediction tool in H.264/AVC to facilitate temporal and inter-view references, thus supporting joint MCP/DCP. This became a very popular technique, which has been used in recent literature to construct various MVC structures based on H.264/AVC [3,12,24,32,34], including our work in [24], confirming that utilizing joint MCP/DCP leads to higher coding efficiency than simulcast. Furthermore, the studies in [13] and [37] show that, in most cases, the joint MCP/DCP method which utilizes inter-view references from the immediately neighboring views at the same timestamp can achieve coding efficiency comparable to schemes that use additional, farther-away inter-view references. In 2005, recognizing the importance of MVC for future applications, the MPEG committee issued a Call for Proposals on Multi-view Video Coding [18], aiming to establish a standard for MVC. Conforming to recent research trends, all the received MVC proposals were based on the H.264/AVC framework [19]. While differing in the prediction structures, they all feature joint MCP/DCP in order to exploit temporal and inter-view redundancy.

Figure 1.2: An example of MVC structure which utilizes joint MCP/DCP. The arrows indicate prediction direction.

Fig. 1.2 illustrates an MVC coding structure with 32 frames forming one group of pictures (GOP). This GOP contains 4 views in the spatial direction and 8 frames in the temporal direction. To facilitate random access, within one GOP the first frame of each view is encoded with only inter-view DCP, i.e., no temporal references are needed. We denote these frames anchor frames. Similar to I frames in monoscopic video, these anchor frames serve as temporal access points for multiview video. The remaining frames in a GOP, which we will denote as non-anchor frames, can use both temporal and inter-view references to exploit the redundancy in the two directions. For example, in Fig. 1.2, a block in frame (V2,t1) can switch between the two corresponding MCP/DCP blocks from frame (V2,t0) and frame (V1,t1), respectively. Note that for simplicity, this diagram only includes frames coded as I-pictures and P-pictures. However, the joint MCP/DCP concept can be easily extended to B-pictures with both temporal and inter-view references available for prediction.

As compared to MCP, the main difficulty in DCP is that, due to differences in camera settings and shooting positions/orientations, frames from different views are more likely to exhibit non-translational discrepancies, such as illumination/color mismatch and focus mismatch. To compensate for these mismatches, advanced coding tools such as illumination compensation [31] and our adaptive reference filtering method [23,25–28] (which will be explained in detail in Chapters 3 and 4) have been developed. Significant coding gain can be achieved when applying them to inter-view prediction in MVC.
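As a concrete illustration of the joint MCP/DCP structure of Fig. 1.2, the sketch below enumerates the candidate reference frames for a given frame position in the GOP. The function name and the simple (view, time) indexing are illustrative assumptions for this dissertation's prediction structure, not code from any MVC reference software.

```python
def candidate_references(view, t, gop_len=8):
    """Candidate reference frames for frame (view, t) in a GOP structured as in Fig. 1.2.

    Anchor frames (first frame of each view in the GOP) use only inter-view DCP;
    non-anchor frames may use both the temporal (MCP) and the inter-view (DCP)
    reference. Returns a list of (view, time) tuples.
    """
    refs = []
    if t % gop_len != 0:              # non-anchor frame: temporal reference available
        refs.append((view, t - 1))    # MCP from the previous frame in the same view
    if view > 0:                      # every view except the base view has an inter-view reference
        refs.append((view - 1, t))    # DCP from the immediately neighboring view
    return refs

# A block in frame (V2, t1) can switch between MCP from (V2, t0) and DCP from (V1, t1):
print(candidate_references(2, 1))     # -> [(2, 0), (1, 1)]
```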
1.3 Contributions of This Research

In this research, we investigate the special characteristics exhibited in multiview video data and develop coding techniques that improve MVC coding efficiency while taking complexity into account. In particular, coding techniques are designed by exploiting multiview video properties. The main contributions of this thesis can be summarized as follows:

• We consider the problem of focus mismatch. To understand its characteristics in video coding, we analyze focus mismatch based on geometrical optics. We show that focus differences between images can be represented in terms of the focus setting differences and the depths of objects. The focus mismatch kernels are circularly symmetric, with their shapes varying across different depths. For multiview systems with a 1D parallel camera arrangement, we further demonstrate that the disparity exhibited in the images can be utilized to reliably identify different types of depth-dependent focus mismatches. Our analytical results provide insight that can be exploited to design coding techniques to compensate for focus mismatch in video content.

• We propose adaptive reference filtering (ARF) approaches to compensate for the depth-dependent focus mismatch. Based on our analysis, we model predictive coding with focus mismatch using point spread functions. Depending on the coding scenario (inter-view or temporal), we estimate different block-wise parameters as features for classification, such that an image can be partitioned into regions suffering from different types of focus mismatch. Since focus mismatch is depth-dependent, for MVC inter-view coding we exploit the block-wise disparity field to partition frames into regions corresponding to different depth levels. As for monoscopic video that undergoes camera focus changes across time, we propose to estimate the localized focus changes and partition frames into regions consisting of macroblocks that suffer from a similar type of focus change. In both methods, a 2D Wiener filter is designed for each region (class) to minimize the prediction error. Filtered references are generated for the encoder to perform rate-distortion (RD) optimized coding selection. For video sources containing strong localized focus mismatches, the proposed methods provide higher coding efficiency as compared to other techniques such as multiple-reference prediction and adaptive interpolation filtering.

• The ARF method is extended to inter-view bi-prediction (B-frames) in MVC, in which the two predictors from different views may exhibit different types of focus mismatch. We investigate the interaction between filter estimation and bi-directional search. We show that designing filters only for the averaged bi-predictor could lead to a suboptimal solution when combined with conventional bi-directional search schemes. Instead, we propose a filter design approach that estimates two sets of depth-related filters, each set compensating for the focus mismatch occurring in one of the two references used for bi-directional prediction.

• We analyze the encoding results when using ARF for inter-view prediction. The coding gains demonstrate a strong view dependency, while the estimated filters at different timestamps exhibit strong correlation when the objects' depths remain similar in the corresponding captured scene. These results conform with the analysis based on geometrical optics. We propose i) a rate-distortion prediction method to achieve view-wise ARF adaptation, such that ARF is applied only to views in which substantial gain can be achieved, and ii) a filter updating mechanism based on depth-composition change, which allows the same set of filters to be used by several consecutive frames until there is a significant change in the depth composition within the scene. These two techniques lead to significantly reduced complexity while preserving coding efficiency.

• We propose fast predictive search algorithms for joint MCP/DCP, which exploit the relationship between motion and disparity fields. After one of the motion/disparity fields is estimated, our method obtains good candidate vectors to perform the estimation of the other field with very low complexity. We construct a model and analytically demonstrate how mismatches, such as illumination change, will affect the accuracy of the first estimated displacement field (motion or disparity), which consequently will affect the reliability of the candidate vectors obtained with our fast predictive search methods. Analysis and simulation results both indicate that it is more efficient to perform motion estimation first and then apply our fast predictive search to disparity estimation, rather than the other way around.

The rest of this dissertation is organized as follows. First, in Chapter 2, we analyze focus mismatch based on geometrical optics and derive properties of focus mismatch kernels. Chapter 3 describes in detail the proposed adaptive reference filtering approach to compensate for focus mismatch in video content. Then, in Chapter 4, we study the performance of ARF and introduce methods to reduce ARF complexity while maintaining coding efficiency. In Chapter 5, for joint MCP/DCP, we propose predictive fast search algorithms with new candidate vectors. Finally, conclusions and possible future work are presented in Chapter 6.
Chapter 2: Focus Mismatch in Video Content

2.1 Introduction

Multiview video systems utilize multiple cameras to simultaneously capture scenes from different viewpoints. As compared to conventional monoscopic video, frames from different views are prone to suffer from mismatches other than simple displacement, due to differences in camera settings and/or shooting positions/orientations. These mismatches across different views could be obstacles to achieving high coding efficiency, as conventional block matching in video coding may not be effective in compensating for them.

In this dissertation, we consider one particular type of non-translational mismatch exhibited in video content: focus mismatch, which results in blurriness/sharpness discrepancies among frames from different views. With multiple cameras in MVC systems, focus mismatch is most likely to be caused by heterogeneous camera settings. For example, camera parameters may be inconsistent, so that the focus settings differ from view to view. This can cause localized focus mismatch, as objects may not always be in sharp focus across different views. Another source of focus mismatch in multiview systems could be the camera arrangement. Consider the example in Fig. 2.1, where object A appears at a greater depth (z1) in View 1 than in View 3 (z3). Assume all cameras are set with the same perfect-in-focus depth at z1; then object A may appear in focus in View 1 while it will likely be out of focus (blurred) in View 3. On the other hand, object B will appear sharper in View 3 as compared to View 1. The efficiency of inter-view disparity compensation could deteriorate in the presence of these mismatches.

Figure 2.1: Camera arrangement that causes localized focus mismatch

Among the multiview video test sequences provided in the initial MVC Call for Proposals document [18], the sequence Race1 exhibits the most clearly perceivable focus discrepancy among frames from different views. It consists of 8 parallel views captured by cameras with a 20cm spacing between each, which we will denote as View 0 – View 7. The frames in View 3 are blurred as compared to the frames in View 2; similarly, the frames in View 5 are blurred as compared to the frames in View 6. Fig. 2.2 shows portions of the frames from different views in Race1. It can be seen that, besides displacement of the scene, frames from different views also exhibit blurriness mismatch.

Figure 2.2: Portions of frame 15 from different views of Race1

In addition to inter-view prediction in MVC, focus mismatch may also occur in monoscopic video. One often-observed example of focus change across time happens in dialog scenes, in which the camera shifts its focus from one character to another at a different scene depth. The first character becomes blurred (out of focus) while the second gets sharpened (in focus). On other occasions, focus changes are created as transitions during scene changes. Fig. 2.3 demonstrates an example of focus change in monoscopic video. From the left image to the right one, we can see that the focused depth is shifting from the back to the front. In particular, the second and third person (counting from the front) are becoming more "in focus" while people in the back are getting blurred.
Figure 2.3: Example of focus change in monoscopic video. (Images from Stanford Computer Graphics Lab., Light Field Photography with a Hand-Held Plenoptic Camera, © Ren Ng)

A focus mismatch compensation technique for video coding can be established by first estimating different focus mismatch kernels and then creating filtered versions of the reference frame in order to provide a better match to the current frame. Higher coding efficiency can be achieved by using these filtered references for prediction and transmitting the filter coefficients as side information, so that the filtered reference frames can be generated at the decoder. However, without prior knowledge about focus mismatches, estimating the mismatch kernels can be very complicated, as we would have to consider a very large solution space (in terms of the shape and support of the kernels). Thus, in order to reliably and efficiently estimate focus mismatch kernels, it is necessary to understand how focus setting differences affect the focus mismatches. Furthermore, as described in the aforementioned example, objects at different depths could undergo different types of focus mismatches. Thus, we also need to investigate how focus mismatch changes at different depths.

In the literature, in order to model how images are captured by a camera, a simplified model is typically used to approximate a camera as an imaging system with a single lens. Geometrical optics is widely utilized to derive some well-known properties of the projected images under a fixed focus setting with the simplified single-lens model [29,36]. In this chapter, we further extend these previous results to analyze how focus setting differences affect the captured images. We will show that the mismatch exhibited in the images can be represented in terms of the focus setting parameters (camera dependency) and the depths of objects (depth dependency). Then we consider multiple cameras which capture the scene from different viewpoints. For a 1D parallel camera arrangement in a multiview system, we relate the focus mismatch to the disparity exhibited in frames from different views. The analytical results provide a better understanding of the focus mismatch problem and can be exploited to design coding tools aiming to compensate for such mismatch.

The remainder of this chapter is organized as follows: in Section 2.2 we first review the characteristics of images captured with a lens. Then, in Section 2.3, we derive special properties of images under the influence of focus setting differences. We discuss a multiview system with a 1D parallel camera arrangement, and also provide a numerical example to illustrate the analytical results. Finally, useful characteristics that can be exploited to design focus compensation tools are summarized in Section 2.4.

2.2 Review: Characteristics of Images Captured with Lens

A camera is typically modeled as an imaging system consisting of a film, a lens with focal length f, and an aperture with diameter a. For digital cameras, the film is made up of an array of image sensors. The plane which contains the film is called the image plane, which is parallel to the lens at distance d from it. In Fig. 2.4, we construct a coordinate system with its origin located at the center of the lens and its xy-plane parallel to the image plane. The z-axis, which passes through the lens center, is also called the optical axis. Note that the coordinate system in Fig. 2.4 uses a left-handed orientation, such that points in front of the camera have positive depth values Z.
The two points on the optical axis with |z| = f are called the focal points, which have special properties that will be discussed shortly.

Figure 2.4: The model of a camera equipped with a lens

We analyze the effects of light via geometrical optics, which treats light as rays [29]. In Fig. 2.4, let us consider a point P at location (X,Y,Z) which is visible to the camera. Light rays passing through the lens center are not refracted. Therefore, the light ray that originates from P and passes through O will be projected on the image plane at point P' with coordinates (X',Y',Z') such that (based on the congruence of triangles):

(X', Y', Z') = \left(-\frac{d}{Z}X, \; -\frac{d}{Z}Y, \; -d\right)   (2.1)

The minus signs in (2.1) represent the fact that the projected image on the image plane is reversed and upside down (rotated 180 degrees) as compared to the original appearance in the scene. Similarly, the projections of other visible points produced by light rays passing through O can also be determined by (2.1). For a given object, its projection on the image plane becomes smaller as it moves away from the camera (increasing Z). The principle in (2.1), which describes the light rays passing through the lens center, is called perspective projection. It is widely used in geometric camera models when the focusing effect of the lens is ignored [15]. However, in this work, our goal is to analyze the effect of focus introduced by the lens. To take it into account, light passing through other parts of the lens has to be considered as well. According to geometrical optics, light rays parallel to the optical axis on one side of the lens are refracted to pass through the focal point on the other side of the lens (see the examples in Fig. 2.5). Furthermore, light rays originating from a point P with depth Z converge to a point \hat{P} on the other side of the lens at a distance \hat{Z} that satisfies:

\frac{1}{Z} + \frac{1}{\hat{Z}} = \frac{1}{f}   (2.2)

Fig. 2.5(a) depicts three points P, P_1, P_2 at different depths, the projections P', P'_1, P'_2 of light rays from these points passing through the lens center, and their converged image points \hat{P}, \hat{P}_1, \hat{P}_2. The dashed light paths are determined after finding the converging points based on the light paths depicted with solid lines. As a result, on the image plane at distance d from the lens, a visible point will produce a point projection (perfectly focused) only if it is at a particular depth Z^* that satisfies:

\frac{1}{Z^*} + \frac{1}{d} = \frac{1}{f} \;\Rightarrow\; Z^* = \frac{d \cdot f}{d - f}   (2.3)

When capturing the scene, under a fixed zoom parameter set by f, we can focus on a specific distance Z^* by fine tuning d (d ≥ f). Operating in a very narrow range, a slight change in d can cause a relatively large variation in Z^* (refer to Fig. 2.5(b)). This can be achieved by using auto-focus (AF) or by manually adjusting the focus ring. For points at distances other than Z^*, their projections on the image plane will be uniform circles with diameter β, as depicted in Fig. 2.5(a). Using again the congruence of triangles, β can be calculated as follows.

Depth smaller than Z^* (Fig. 2.5(a), P_1):

\frac{\beta}{a} = \frac{\hat{Z}_1 - d}{\hat{Z}_1} = \frac{\frac{Z_1 f}{Z_1 - f} - \frac{Z^* f}{Z^* - f}}{\frac{Z_1 f}{Z_1 - f}} \;\Rightarrow\; \beta = \frac{a f (Z^* - Z_1)}{Z_1 (Z^* - f)}   (2.4)

Depth greater than Z^* (Fig. 2.5(a), P_2):

\frac{\beta}{a} = \frac{d - \hat{Z}_2}{\hat{Z}_2} = \frac{\frac{Z^* f}{Z^* - f} - \frac{Z_2 f}{Z_2 - f}}{\frac{Z_2 f}{Z_2 - f}} \;\Rightarrow\; \beta = \frac{a f (Z_2 - Z^*)}{Z_2 (Z^* - f)}   (2.5)

Combining (2.4) and (2.5), we get:

\beta = \frac{a f |Z - Z^*|}{Z (Z^* - f)}   (2.6)

It can be seen from (2.3) and (2.6) that the value of β is determined by the parameters a, d, f, and the object depth Z. To demonstrate how β varies with the scene depth Z under a given focus setting, we define:

k = \frac{Z}{Z^*} \;\rightarrow\; Z = k \cdot Z^*, \quad \text{i.e., } k \text{ is the depth normalized by } Z^*   (2.7)

Figure 2.5: (a) Projected images via a lens for points at different depths, (b) Variations of Z^*, (c) Variations of β

Then, from (2.6), β can be represented as:

\beta = \frac{a f |Z - Z^*|}{Z (Z^* - f)} = \frac{a f |k - 1| Z^*}{k Z^* (Z^* - f)} = \frac{a f |k - 1|}{k (Z^* - f)} \approx \frac{a f}{Z^*} \cdot \frac{|k - 1|}{k}   (2.8)

The approximation in (2.8) is based on the fact that, typically, f ≪ Z^* (e.g., f can be less than 100mm while Z^* is of the order of several meters). Fig. 2.5(c) illustrates the variation of β as a function of the normalized depth. As the depth Z deviates from Z^* (thus k ≠ 1), the diameter β increases, leading to stronger out-of-focus blurriness: instead of forming a point image, the image intensity on the image plane is spread over a circle with diameter β (area π(β/2)²). Therefore, for a point at depth Z, the point spread function (PSF) is a circular disk:

PSF_Z(x,y) = \begin{cases} 4/(\pi\beta^2), & \text{if } x^2 + y^2 \le (\beta/2)^2 \\ 0, & \text{otherwise} \end{cases}, \quad \text{where } \beta = \frac{a f |Z - Z^*|}{Z (Z^* - f)}   (2.9)

From Fig. 2.5(a), the center of the point spread function produced by a visible point P at (X,Y,Z) will be located at coordinate (X',Y') on the image plane, as specified by (2.1). The intensity of the projection resulting from P, denoted I_P, can then be described as:

I_P(x,y) = K_P \cdot PSF_Z(x - X', y - Y'),   (2.10)

where K_P represents the light intensity produced by P at the converging point \hat{P}, i.e., the intensity if perfectly focused. On the image plane, this value is spread over a disk as described by (2.10). The total light intensity at (x,y) on the image plane, denoted J(x,y), is the superposition of all the projection circles centered at different locations that contribute non-zero values at position (x,y). In general, projections centered at nearby locations can have different diameters, as β depends on the depth of the visible point that produces the projection. However, for typical scenes, points within a small visible region often have very similar depth Z, so that their corresponding projections can be well approximated by the same point spread function (corresponding to a fixed Z). Also, the converged light intensity K_P produced by points with similar depth Z can be approximated as a function of only X and Y, which can then be represented as a function of the projected locations (X',Y'): K_P = K(X,Y,Z) ≈ K_Z(X,Y) = K_Z(X',Y'). The subscript indicates that this representation is valid for points that are at a given Z.
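As a quick numerical illustration of (2.3), (2.6), and (2.9), the sketch below computes the perfect in-focus depth Z^*, the blur-circle diameter β, and a sampled disk PSF. The function names, the grid size, and the chosen parameter values are assumptions for illustration only, not values taken from the dissertation's experiments.

```python
import numpy as np

def in_focus_depth(d, f):
    """Perfect in-focus depth Z* from (2.3): 1/Z* + 1/d = 1/f."""
    return d * f / (d - f)

def blur_diameter(Z, Zstar, a, f):
    """Blur-circle diameter beta from (2.6)."""
    return a * f * abs(Z - Zstar) / (Z * (Zstar - f))

def disk_psf(beta, pixel_pitch, size=15):
    """Sampled circular-disk PSF of (2.9) on a size x size pixel grid."""
    r = (np.arange(size) - size // 2) * pixel_pitch
    xx, yy = np.meshgrid(r, r)
    psf = (xx ** 2 + yy ** 2 <= (beta / 2) ** 2).astype(float)
    return psf / psf.sum()                      # normalize to unit energy

# Illustrative numbers: f = 20 mm, a = f/8, image plane at d = 20.20 mm (all in mm).
f, a, d = 20.0, 20.0 / 8, 20.20
Zstar = in_focus_depth(d, f)                    # ~2020 mm, i.e. roughly 2 m
beta = blur_diameter(4000.0, Zstar, a, f)       # blur diameter for an object at 4 m
print(Zstar, beta, disk_psf(beta, 0.01).shape)  # 0.01 mm pixel pitch, cf. Section 2.3.2
```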
The projected image of this region with approximately fixed Z can then be derived as:

J_Z(x,y) = \iint K_Z(X',Y') \cdot PSF_Z(x - X', y - Y') \, dX' \, dY' = K_Z(u,v) * PSF_Z(u,v), \quad \text{where } (u,v) \text{ are dummy variables}   (2.11)

Since K_Z(X',Y') represents the light intensity that would be observed under perfect focus, (2.11) indicates that, for a region with depth Z, the effect of the focus setting can be modeled as the perfectly focused image convolved with the point spread function PSF_Z.

2.3 Focus Mismatch Due to Focus Setting Differences

In Section 2.2, we reviewed the imaging characteristics of a single camera equipped with a lens. We now discuss how to model the effect of capturing the same scene under different focus settings. Let us consider two cameras V1 and V2 in a multiview system. Assume they have the same focal length setting f (same zoom) and the same aperture setting a. (For most cameras currently on the market, both f and a are set to specific values chosen from a finite, discrete set, e.g., f = 35mm, 70mm, ... and a = f/2.7, f/5.6, ....) However, their perfect in-focus depths Z^* are not equal: Z^*_{V1} ≠ Z^*_{V2}. For example, if Z^* is determined by auto-focus, it depends on the scene contents, which are not exactly the same for different cameras. From (2.6), this results in differences in how β varies as a function of Z. Following the derivation in (2.7), we define:

k = \frac{Z}{Z^*_{V2}} \quad (Z = k \cdot Z^*_{V2})   (2.12)

c = \frac{Z^*_{V1}}{Z^*_{V2}} \quad (Z^*_{V1} = c \, Z^*_{V2}),   (2.13)

where k is the normalized depth using Z^*_{V2} as a reference, and c measures the degree of focus setting mismatch. For the two cameras, the respective β_{V1} and β_{V2} can be written as:

\beta_{V2} = \frac{af|Z - Z^*_{V2}|}{Z(Z^*_{V2} - f)} = \frac{af|k-1| Z^*_{V2}}{k Z^*_{V2}(Z^*_{V2} - f)} \approx \frac{af}{Z^*_{V2}} \cdot \frac{|k-1|}{k} = \frac{af}{Z^*_{V2}} \left|1 - \frac{1}{k}\right|   (2.14)

\beta_{V1} = \frac{af|Z - Z^*_{V1}|}{Z(Z^*_{V1} - f)} = \frac{af|k-c| Z^*_{V2}}{k Z^*_{V2}(c Z^*_{V2} - f)} \approx \frac{af}{Z^*_{V2}} \cdot \frac{|k-c|}{ck} = \frac{af}{Z^*_{V2}} \left|\frac{1}{c} - \frac{1}{k}\right|   (2.15)

where the approximations in (2.14) and (2.15) are based on the same assumptions as for (2.8). Note that here we also represent β in terms of the reciprocal of the normalized depth, i.e., 1/k. Fig. 2.6 demonstrates how the two β's change as functions of the depth k and its reciprocal 1/k, in the case where c = 0.7 (Z^*_{V1} = 0.7 Z^*_{V2}).

Figure 2.6: β as functions of k and 1/k (Z^*_{V1} = 0.7 Z^*_{V2}, c = 0.7)

From Fig. 2.6, we can see that the two β curves change rapidly when k is small, while they vary more slowly as k increases. On the other hand, from (2.14) and (2.15), the value of β is approximately a linear function of 1/k. The intersection of the two β curves, which occurs at 1/k = (1/2)(1 + 1/c) (see Fig. 2.6), divides the depth range into two regions: β_{V1} > β_{V2} and β_{V1} < β_{V2}. An object at a given depth Z (i.e., at a given k or 1/k) will appear differently in V1 and V2, under the influence of the different β. From (2.9) and (2.11), a larger β indicates that the light is spread over a larger area, resulting in stronger out-of-focus blurriness. If we want to match an image from V1 to the corresponding image from V2 at the same timestamp, objects with depths in the first region require sharpening filters, as they are more blurred in V1 than in V2, while objects at depths in the second region need lowpass filtering.
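For completeness, the intersection point quoted above follows directly from (2.14) and (2.15) by equating the two blur diameters in the interval between the two in-focus depths (illustrated here for c < 1, where 1 - 1/k ≤ 0 and 1/c - 1/k ≥ 0):

\frac{af}{Z^*_{V2}}\left(\frac{1}{c} - \frac{1}{k}\right) = \frac{af}{Z^*_{V2}}\left(\frac{1}{k} - 1\right)
\;\Longrightarrow\; \frac{2}{k} = 1 + \frac{1}{c}
\;\Longrightarrow\; \frac{1}{k} = \frac{1}{2}\left(1 + \frac{1}{c}\right).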
To better illustrate how such a difference in β affects the images, we plot the optical transfer function (OTF), which is the frequency transform of PSF_Z. That is, Fr{PSF_Z(x,y)} = OTF(v_x, v_y), where Fr{·} denotes the transform from the spatial to the frequency domain. The following transform pair can be derived [5] by applying the Hankel transform to obtain a frequency-domain representation in the polar coordinate system:

PSF_Z(r) = \begin{cases} 4/(\pi\beta^2), & \text{if } r^2 \le (\beta/2)^2 \\ 0, & \text{otherwise} \end{cases} \;\rightarrow\; OTF_Z(q) = \frac{2 J_1(\pi\beta q)}{\pi\beta q}   (2.16)

In (2.16), r = \sqrt{x^2 + y^2}, q = \sqrt{v_x^2 + v_y^2}, with v_x and v_y representing the horizontal and vertical frequencies, and J_1 is the Bessel function of the first kind of order 1. By plugging (2.14) and (2.15) into (2.16), and defining \Delta = \frac{\pi a f}{Z^*_{V2}}, we get:

OTF^{V2}_Z(q) = \frac{2 J_1(|1 - \frac{1}{k}| \Delta q)}{|1 - \frac{1}{k}| \Delta q}   (2.17)

OTF^{V1}_Z(q) = \frac{2 J_1(|\frac{1}{c} - \frac{1}{k}| \Delta q)}{|\frac{1}{c} - \frac{1}{k}| \Delta q}   (2.18)

Figure 2.7: The magnitude of the OTF at different k (Z^*_{V1} = 0.7 Z^*_{V2}, i.e., c = 0.7)

Fig. 2.7 illustrates examples of the magnitude of the OTF with 1/k values from the two regions in Fig. 2.6. It can be seen clearly that in the first region a sharpening filter is required to match OTF^{V1} to OTF^{V2}, while in the second region we need a blurring filter.

For simplicity, let us denote k' = 1/k. For a given k' (a given depth Z), to match the OTF from V1 to V2, the frequency response H_Z of the required filter can be represented as:

H_Z(q) = \frac{OTF^{V2}_Z(q)}{OTF^{V1}_Z(q)} = \frac{\beta_{V1}}{\beta_{V2}} \cdot \frac{J_1(\pi\beta_{V2} q)}{J_1(\pi\beta_{V1} q)} = \frac{|\frac{1}{c} - k'|}{|1 - k'|} \cdot \frac{J_1(|1 - k'| \Delta q)}{J_1(|\frac{1}{c} - k'| \Delta q)}   (2.19)

From Fig. 2.6, it can be observed that the relationship between the two β curves is symmetric with respect to k' = (1/2)(1 + 1/c), if we exchange the roles of β_{V1} and β_{V2} when crossing (1/2)(1 + 1/c). This leads to the following interesting property: the filter responses are reciprocals on the two sides of (1/2)(1 + 1/c). In the following, we derive this result. Let us define two points k'_+ and k'_- at the same distance κ from (1/2)(1 + 1/c), but on opposite sides, i.e., k'_+ = (1/2)(1 + 1/c) + κ and k'_- = (1/2)(1 + 1/c) - κ (κ > 0).

For k'_+:

\frac{1}{c} - k'_+ = \frac{1}{c} - \frac{1}{2}\left(1 + \frac{1}{c}\right) - \kappa = \frac{1}{2c} - \frac{1}{2} - \kappa, \qquad 1 - k'_+ = 1 - \frac{1}{2}\left(1 + \frac{1}{c}\right) - \kappa = -\left(\frac{1}{2c} - \frac{1}{2}\right) - \kappa

H_{k'_+}(q) = \frac{|\alpha - \kappa|}{|-\alpha - \kappa|} \cdot \frac{J_1(|-\alpha - \kappa| \Delta q)}{J_1(|\alpha - \kappa| \Delta q)}, \quad \text{where } \alpha = \frac{1}{2c} - \frac{1}{2}   (2.20)

For k'_-:

\frac{1}{c} - k'_- = \frac{1}{2c} - \frac{1}{2} + \kappa = |\alpha + \kappa|, \qquad 1 - k'_- = -\left(\frac{1}{2c} - \frac{1}{2}\right) + \kappa = |-\alpha + \kappa|

H_{k'_-}(q) = \frac{|\alpha + \kappa|}{|-\alpha + \kappa|} \cdot \frac{J_1(|-\alpha + \kappa| \Delta q)}{J_1(|\alpha + \kappa| \Delta q)} = \frac{1}{H_{k'_+}(q)}   (2.21)

The reciprocal relationship H_{k'_-} = H^{-1}_{k'_+} can readily be seen, since |α + κ| = |-α - κ| and |-α + κ| = |α - κ|.

Figure 2.8: Filter responses at different k', with c = 0.7, (1/2)(1 + 1/c) ≈ 1.2143

In Fig. 2.8, we show an example of the filter responses at different depths, represented in terms of k'. Note that, for illustration purposes, only the portion of each response within the main lobe (see the examples in Fig. 2.7) is shown. We can see that for k' < (1/2)(1 + 1/c) the filters perform sharpening, and for k' > (1/2)(1 + 1/c) the filters are lowpass ones. (Here we demonstrate an example with c < 1, namely c = 0.7. It is straightforward to show from Fig. 2.6 that for c > 1 we instead have lowpass H_{k'} for k' < (1/2)(1 + 1/c) and sharpening filters for k' > (1/2)(1 + 1/c).) The filter shape also changes across different depths k'.
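The reciprocal property (2.20)-(2.21) can be checked numerically with the closed-form response (2.19). The short script below evaluates H at two depths placed symmetrically around (1/2)(1 + 1/c) and verifies that their product is 1; it is only an illustration of the analysis, and the function and variable names are not from the dissertation.

```python
import numpy as np
from scipy.special import j1       # Bessel function of the first kind, order 1

def H(q, kp, c, delta=1.0):
    """Mismatch-filter response of (2.19) at polar frequency q, for depth k' = 1/k."""
    b2 = abs(1.0 - kp)             # proportional to beta_V2
    b1 = abs(1.0 / c - kp)         # proportional to beta_V1
    return (b1 / b2) * j1(b2 * delta * q) / j1(b1 * delta * q)

c = 0.7
mid = 0.5 * (1.0 + 1.0 / c)        # crossover of the two beta curves, ~1.2143
kappa = 0.3
q = np.linspace(0.1, 2.0, 5)       # a few frequencies inside the main lobe

# Responses at the symmetric depths k'_+ and k'_- multiply to 1 (up to float error).
print(H(q, mid + kappa, c) * H(q, mid - kappa, c))
```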
To further analyze the change in filter responses with different values of k', we solve for the -3dB/+3dB frequencies within the main lobe (such as those shown in Fig. 2.8) of the blurring/sharpening filters. A polynomial approximation, J_1(x) \approx \sum_{m=0}^{n} b_m x^m, is used to represent J_1(x) from x = 0 to its first zero-crossing location (see Fig. 2.7), such that the ±3dB frequency can be calculated as (X = \sqrt{1/2} for a lowpass filter, \sqrt{2} for a sharpening filter):

H_Z(q) = \frac{|\frac{1}{c} - k'|}{|1 - k'|} \cdot \frac{J_1(|1 - k'| \Delta q)}{J_1(|\frac{1}{c} - k'| \Delta q)} \approx \frac{|\frac{1}{c} - k'|}{|1 - k'|} \cdot \frac{\sum_{m=0}^{n} b_m (|1 - k'| \Delta q)^m}{\sum_{m=0}^{n} b_m (|\frac{1}{c} - k'| \Delta q)^m} = X

\Rightarrow\; \sum_{m=0}^{n} b_m |1 - k'|^{m-1} (\Delta q)^m = X \sum_{m=0}^{n} b_m \left|\frac{1}{c} - k'\right|^{m-1} (\Delta q)^m

\Rightarrow\; \sum_{m=0}^{n} b_m \left(|1 - k'|^{m-1} - X \left|\frac{1}{c} - k'\right|^{m-1}\right) (\Delta q)^m = 0   (2.22)

The coefficients b_m can be determined using approximations proposed in the literature [35], or simply by solving a least-squares regression over samples of J_1(x) taken between 0 and its first zero crossing. We investigated both methods and found that the lowest-order polynomial achieving a very close approximation over the entire range between 0 and the first zero crossing of J_1(x) has order four (n = 4 in (2.22)). However, there is no direct analytical form to represent the roots of a 4th-order polynomial using its coefficients. Thus, we illustrate how the ±3dB frequency changes with k', calculated using (2.22) with the 4th-order approximation, as curves in Fig. 2.9. As shown by the analysis of (2.20) and (2.21), in Fig. 2.9(a), for a given c, the curves on the two sides of (1/2)(1 + 1/c) are symmetric, with one being the +3dB frequency and the other being the -3dB frequency. The ±3dB frequency changes much more significantly as k' approaches (1/2)(1 + 1/c). Subtracting (1/2)(1 + 1/c) from k', Fig. 2.9(b) illustrates the symmetry property clearly. It also reveals two important properties of focus mismatches: (i) a larger focus setting difference (c away from 1) results in mismatch kernels with lower ±3dB frequencies as compared to a smaller focus setting difference (c closer to 1), leading to a stronger impact on the images (i.e., stronger focus mismatch between the two views); (ii) a smaller focus setting difference results in more rapid change in the filter responses as k' changes, as compared to a larger focus setting difference. (Note, however, that as described in (i), this change corresponds to higher ±3dB frequencies.)

Figure 2.9: Variation of the ±3dB frequencies

While we have used multiview as the example, the analysis above can easily be applied to monoscopic video where focus changes occur over time. Instead of considering the two curves in Fig. 2.6 as corresponding to two different views, we can simply regard them as the β curves of the same camera at two different times t1 and t2, while the focus setting is changing (e.g., adjusting d to focus on different depths as in the dialog example in Section 2.1). Thus, from t1 to t2, objects at k' < (1/2)(1 + 1/c) are getting sharpened as their β decreases, while objects at k' > (1/2)(1 + 1/c) are getting blurred (β increases).
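The ±3dB frequencies underlying Fig. 2.9 can also be obtained by direct numerical root finding on (2.19), which is a convenient cross-check of the 4th-order polynomial route in (2.22). The sketch below does exactly that; it is an illustrative alternative under the stated assumptions (frequencies expressed in units of Δ, crossing bracketed inside the main lobe), not the dissertation's computation.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import j1

FIRST_ZERO = 3.8317                               # first positive zero of J1

def H(q, kp, c):
    """Mismatch-filter response (2.19); q is expressed in units of Delta."""
    w2, w1 = abs(1.0 - kp), abs(1.0 / c - kp)     # ~beta_V2, ~beta_V1
    return (w1 / w2) * j1(w2 * q) / j1(w1 * q)

def pm3db_frequency(kp, c):
    """+3 dB (sharpening) or -3 dB (lowpass) frequency of H within the main lobe."""
    w2, w1 = abs(1.0 - kp), abs(1.0 / c - kp)
    X = np.sqrt(2.0) if w1 > w2 else np.sqrt(0.5)  # sharpening vs. lowpass target level
    lobe = FIRST_ZERO / max(w1, w2)                # stay inside both main lobes
    return brentq(lambda q: H(q, kp, c) - X, 1e-6, 0.999 * lobe)

c = 0.7
print(pm3db_frequency(0.5, c))   # k' = 0.5: sharpening region, +3 dB crossing
print(pm3db_frequency(2.0, c))   # k' = 2.0: lowpass region, -3 dB crossing
```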
In the following subsection, we consider a particular case of a multiview system in which the cameras are arranged on a one-dimensional horizontal line with parallel shooting orientation. In this scenario, we demonstrate that the disparity exhibited by images from different views can be exploited as an indication to identify depth-dependent focus mismatch.

2.3.1 Multiview with 1D Parallel Camera Arrangement

One of the most common multiview settings uses a 1D horizontal camera arrangement: cameras are positioned along a horizontal line with equal spacing b between neighboring cameras, and their optical axes (Fig. 2.4) are parallel. Consider two neighboring cameras V1 and V2, where V1 is located to the left of V2. We once again assume they have the same f and a, and that the focus mismatch is caused by a very slight difference in d, which results in different Z^* and PSF_Z. Since V2 is to the right of V1 at a distance b, as compared to the scene captured by V1, we can regard the scene as shifted by -b along the x-axis for the coordinate system centered at the lens of V2. From (2.1), due to the shift of -b, for a visible point with depth Z, the centers of its projections on the image planes of V1 and V2 will be located at P'_{V1} and P'_{V2}, respectively, as:

P \text{ at } (X,Y,Z)_{V1}: \quad P'_{V1} \text{ at } (X',Y') = \left(-\frac{d_{V1}}{Z} X, \; -\frac{d_{V1}}{Z} Y\right)

P \text{ at } (X-b,Y,Z)_{V2}: \quad P'_{V2} \text{ at } \left(-\frac{d_{V2}}{Z}(X-b), \; -\frac{d_{V2}}{Z} Y\right) = \left(-\frac{d_{V2}}{Z} X + \delta_Z, \; -\frac{d_{V2}}{Z} Y\right), \quad \text{where } \delta_Z = \frac{b \cdot d_{V2}}{Z}   (2.23)

If the difference between d_{V1} and d_{V2} is negligible, which is most likely the case, then from (2.23) the projection centers of a point with depth Z appear with a disparity of δ_Z between the images from V1 and V2. The disparity δ_Z and the depth Z are reciprocals: objects closer to the cameras (smaller depth) have a larger disparity, while objects far away (larger depth) have a smaller disparity. The relationship between the two projections in V1 and V2 can be represented as follows, where K(x,y) is the perfectly focused light intensity as in (2.11):

K_{Z,V2}(x,y) = K_{Z,V1}(x - \delta_Z, y)   (2.24)

Since the focus settings of the two cameras are not identical (hence they focus on different depths, Z^*_{V1} ≠ Z^*_{V2}), they will have different PSF_Z due to the different β. For a visible region with depth Z, from (2.11) and (2.24) the corresponding images are:

J_{Z,V1}(x,y) = K_{Z,V1}(x,y) * PSF_{Z,V1}(x,y), \qquad J_{Z,V2}(x,y) = K_{Z,V1}(x - \delta_Z, y) * PSF_{Z,V2}(x,y)   (2.25)

Equation (2.25) means that, for a 1D parallel camera arrangement with a focus setting difference, the corresponding images in V1 and V2 for a region with depth Z are displaced by the disparity δ_Z and are affected by two different PSF_Z. If we can align the two images (e.g., via block matching) so that regions with depth Z match each other after disparity compensation, then the focus mismatch can be analyzed as described in Section 2.3. Furthermore, since the disparity δ_Z is a reciprocal of the depth Z, it is proportional to the quantity 1/k as defined in (2.8). From (2.14) and (2.15), we can see that the value of β, which determines the PSF/OTF and consequently the filter H_Z, is approximately a linear function of δ_Z. Thus, a difference in disparity translates linearly into a difference in β, and therefore we can identify regions with different focus mismatch based on their disparity δ_Z; i.e., regions with similar disparity are likely to suffer from similar focus mismatch. Since no direct measurement of depth is available, this property provides a reliable method to partition an image into depth levels that suffer from different types of focus mismatch.
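The observation that β, and hence the mismatch kernel, varies roughly linearly with disparity suggests a very simple way to group blocks into depth levels before designing compensation filters. The sketch below quantizes a block-wise disparity field into a few levels; it is only a schematic illustration of the idea (the ARF method of Chapter 3 uses a GMM-based partition of the disparity field, cf. Fig. 3.2), and the function name and toy values are assumptions.

```python
import numpy as np

def partition_by_disparity(disparity_field, num_levels=3):
    """Group blocks with similar horizontal disparity into depth levels.

    disparity_field: 2-D array of block-wise disparities (pixels).
    Returns an integer label map of the same shape; blocks sharing a label
    are expected to exhibit a similar type of focus mismatch.
    """
    d = np.asarray(disparity_field, dtype=float)
    # Quantile edges give roughly equally populated levels, so each level
    # has enough blocks to support a reliable filter estimate later on.
    edges = np.quantile(d, np.linspace(0.0, 1.0, num_levels + 1)[1:-1])
    return np.digitize(d, edges)

# Toy example: near objects (large disparity) vs. background (small disparity).
field = np.array([[48, 50, 3], [47, 2, 3], [46, 2, 4]])
print(partition_by_disparity(field, num_levels=2))
```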
2.3.2 A Numerical Example

In this subsection, we introduce a numerical example of focus mismatch in which specific values can be calculated, to help the reader follow the analytical results. Let us consider a multiview system with three cameras V1, V2, and V3. They are set with the same focal length f = 20mm (same zoom) and the same aperture setting a = f/8. However, the fine tuning of their perfect in-focus depths Z^* was not done consistently, with Z^*_{V1} = 1.9m, Z^*_{V2} = 2.0m, and Z^*_{V3} = 2.3m. We can calculate their image plane distances d using (2.3); the corresponding values are shown in Fig. 2.10(a). Once again we see that a slight change in d (-0.17%, from 20.21 to 20.175) results in a significant change in Z^* (+21.05%, from 1.9 to 2.3).

Figure 2.10: A numerical example of focus mismatch in a multiview system. (a) β as a function of Z (f = 20mm, a = f/8), for V1: Z^* = 1.9m (d = 20.21mm), V2: Z^* = 2.0m (d = 20.20mm), V3: Z^* = 2.3m (d = 20.175mm); (b) |OTF| at Z = 1.2m; (c) |OTF| at Z = 4m.

As shown in Fig. 2.10(a), the focus setting mismatch results in differences among the β values of the cameras as functions of Z. An object at a given depth Z will appear differently in V1, V2, and V3 under the influence of the different β parameters. Before plotting the optical transfer functions (OTF) at different depths, let us describe the frequency range we need to consider. Since for digital cameras the image intensity is sampled by image sensors, only frequencies up to the Nyquist rate have to be taken into account. For a 1/2" sensor type (H×W = 6.4mm×4.8mm), a resolution of 640×480 pixels leads to a sample spacing of 0.01mm between pixels. The Nyquist rate is 100/2 = 50 cycles/mm. In the polar system, q = \sqrt{50^2 + 50^2} ≈ 70.71. Thus, if we plot OTF_Z as in (2.16) with β expressed in mm, we only need to consider the range up to q = 70.71. This value corresponds to Ω = π in the digital domain. Fig. 2.10(b) and (c) show the differences in the corresponding OTFs. If we encode V2 using V1 as a reference (c = Z^*_{V1}/Z^*_{V2} = 1.9/2 = 0.95, (1/2)(1 + 1/c) ≈ 1.026), for image portions corresponding to visible regions at Z = 1.2m (k' = 2/1.2 ≈ 1.67) we need to perform lowpass filtering on the data from V1. For visible regions at Z = 4m (k' = 2/4 = 0.5), the corresponding image portions in V1 need to be slightly sharpened. On the other hand, if we instead use V3 as a reference to encode V2 (c = Z^*_{V3}/Z^*_{V2} = 2.3/2 = 1.15, (1/2)(1 + 1/c) ≈ 0.935), image portions corresponding to depth Z = 1.2m need some sharpening in order to match V2, while regions at Z = 4m have to undergo a significant amount of lowpass filtering.

Now, if the cameras are arranged on a 1D horizontal line with parallel orientation and the spacing between two cameras is 10cm, then from (2.23) the disparity between V1 and V2 for objects at depths 1.2m and 4m can be calculated as:

\delta_{1.2m} = \frac{10}{120} \times 20.20 \approx 1.68 \text{ mm, which corresponds to 168 pixels}

\delta_{4m} = \frac{10}{400} \times 20.20 \approx 0.5 \text{ mm, which corresponds to 50 pixels}

The disparity between V2 and V3 can be calculated in a similar manner. By performing disparity estimation, we can identify regions within a frame that correspond to different depth levels. Based on the depth-dependency property, each level is anticipated to have a different type of focus mismatch kernel, such as those depicted in Fig. 2.10.
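The numbers above can be reproduced in a few lines. The short script below (an illustrative check, not code from the dissertation) evaluates d by inverting (2.3), β from (2.6), and the pixel disparity from (2.23) for the three cameras of this example; all lengths are in millimeters.

```python
f, a, pitch = 20.0, 20.0 / 8, 0.01                 # focal length, aperture, pixel pitch (mm)

def image_plane_distance(Zstar_mm):
    """Invert (2.3): d such that the camera is perfectly focused at Z*."""
    return Zstar_mm * f / (Zstar_mm - f)

def beta(Z_mm, Zstar_mm):
    """Blur-circle diameter (2.6) in mm."""
    return a * f * abs(Z_mm - Zstar_mm) / (Z_mm * (Zstar_mm - f))

for name, Zstar in [("V1", 1900.0), ("V2", 2000.0), ("V3", 2300.0)]:
    d = image_plane_distance(Zstar)
    print(name, round(d, 3),                       # 20.21, 20.20, 20.175 mm
          round(beta(1200.0, Zstar), 4),           # beta at Z = 1.2 m
          round(beta(4000.0, Zstar), 4))           # beta at Z = 4 m

# Disparity (2.23) between neighboring views with a 100 mm baseline:
for Z in (1200.0, 4000.0):
    print(Z, 100.0 * 20.20 / Z / pitch, "pixels")  # ~168 and ~50 pixels
```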
Now, if the cameras are arranged on a 1D horizontal line with parallel orientation, and the spacing between two cameras is 10 cm, then from (2.23) the disparity between V1 and V2 for objects at depths 1.2 m and 4 m can be calculated as:

\delta_{1.2m} = \frac{10}{120} \times 20.20 \approx 1.68 \text{ mm, which corresponds to 168 pixels}
\delta_{4m} = \frac{10}{400} \times 20.20 \approx 0.5 \text{ mm, which corresponds to 50 pixels}

The disparity between V2 and V3 can be calculated in a similar manner. By performing disparity estimation, we can identify regions within a frame that correspond to different depth levels. Based on the depth-dependency property, each level is anticipated to have a different type of focus mismatch kernel, such as those depicted in Fig. 2.10.

2.4 Summary

In this chapter, we analyzed the focus mismatch that occurs in video content as a result of focus setting differences. We utilized geometrical optics to derive the characteristics of images captured with a lens, and then demonstrated how the images are affected by different focus settings. The following analytical results provide useful insight for designing coding techniques that compensate for focus mismatch:

• For lens-based imaging systems, the PSF/OTF is determined by the blur diameter β, which is a function of the depth Z and the camera parameters a, f, d.
• β is approximately a linear function of the reciprocal of depth (1/k).
• In the presence of focus setting differences, the corresponding images will exhibit a depth-dependent mismatch. To deal with different types of focus mismatch at different depths, compensation kernels should be designed according to the depth composition of the scene.
• For a given depth, the mismatch can be modeled using a convolution kernel that captures the difference between two point spread functions (or, in the frequency domain, between two optical transfer functions).
• The focus mismatch kernels can be represented as blurring/sharpening filters that are circularly symmetric in the spatial domain.
• For a given pair of views, if Z*_V2 and Z*_V1 (the perfect in-focus depths) are made available, we can determine c and consequently the value of \tfrac{1}{2}(1 + \tfrac{1}{c}), which divides the disparity (depth) range into two regions that require different filter types: blurring and sharpening.
• In terms of k', on the two sides of \tfrac{1}{2}(1 + \tfrac{1}{c}), the filter responses are reciprocal.
• For a 1D parallel camera arrangement, the disparity δ_Z between frames from different views is also a reciprocal of depth Z (as is the blur diameter β). This property can be exploited to identify regions within the image that suffer from different types of mismatch.
• A larger focus setting difference (c away from 1) results in mismatch kernels with lower ±3 dB frequencies, leading to stronger focus mismatch.
• Across different depth/disparity values, a smaller focus setting difference (c closer to 1) results in a more rapid change in the filter responses than a stronger focus setting mismatch.

In the next chapter, we describe our adaptive reference filtering (ARF) methods to compensate for focus mismatch, which are designed based on the analytical results above.

Chapter 3
Adaptive Reference Filtering (ARF) for Video Coding

3.1 Introduction

In this chapter, we consider the problem of encoding video content that exhibits focus mismatch, such as focus setting differences in inter-view prediction in MVC and focus change over time in monoscopic video. Based on the analysis in Chapter 2, systems for efficient focus mismatch compensation in video coding should be designed with the following requirements in mind.
First, local compensation is needed to address depth-dependent focus mismatch, since different portions of a video frame can undergo different blurriness/sharpness changes with respect to the corresponding areas in the frames used as references. Second, since the characteristics of focus mismatch are determined by the camera parameters and the depth composition of the scene, they change from view to view (in MVC) as well as over time; the compensation process should therefore adapt to the different mismatches exhibited by the video frames. Finally, to optimize overall coding efficiency, the decisions on whether or not to use mismatch compensation, and the selection of the mismatch compensation kernels, should be based on rate-distortion (R-D) criteria.

To improve motion compensation performance in monoscopic video coding, different approaches have been proposed in which the reference frames are filtered to generate new predictors [6,46,51,52]. The general idea behind these methods is that predictors better matched to the current frame can be created after filtering.

Budagavi proposed blur compensation [6], in which a fixed set of blurring (lowpass) filters is used to generate blurred reference frames. For focus changes, which lead to circularly symmetric mismatch kernels as demonstrated in Chapter 2, Budagavi uses simple n×n uniform averaging filters with different sizes n; for camera panning (which leads to directional mismatch kernels), 1×n horizontal and n×1 vertical averaging filters are considered. This technique has two shortcomings for the focus mismatch scenarios we consider. First, the filter selection is made only at the frame level, i.e., applying different filters to different parts of a frame is not considered. However, as noted in Chapter 2, depth-dependent focus mismatch calls for adaptive local compensation. Second, this method relies on a very limited predefined filter set. Sharpening filters (high-frequency enhancement), for example, are not included. As a result, the usefulness of the predefined filter set is very constrained, as it can only cover particular types of mismatch. Instead, our approach estimates the mismatch kernels between the reference frame and the current frame, so that the compensation filters are designed adaptively. In the final motion/disparity search, we allow each block to select the filtered version of the predictor that gives the lowest R-D cost, to ensure optimized coding efficiency.

In [46,51,52], adaptive filtering methods have been proposed for generating subpixel references for motion compensation. Vatis et al. [46], after an initial motion search using the 6-tap interpolation filters defined in H.264/AVC [54], divide the blocks in the current frame into groups based exclusively on the subpixel positions of their motion vectors (for example, (1¾, 23½) and (45¾, 6½) are assigned to the same group). This partitioning is used in order to generate different subpixel interpolation filters for different positions. For each group, a filter is adaptively estimated by minimizing the squared prediction error. The subpixels of the reference frame are then generated using these adapted interpolation filters. In the final motion compensation, the encoder chooses the best match by testing different subpixel positions on the same reference frame. This approach, which we will refer to as adaptive interpolation filtering (AIF), aims to address the aliasing problem and motion estimation error when generating subpixel references. Instead, in our work we design filters by identifying blocks suffering from different types of focus mismatch.
Filtered reference frames are first generated by applying the estimated filters. Then, on each of these filtered reference frames, subpixel interpolation (such as in H.264/AVC) is performed, leading to additional coding gains for disparity compensation.

In this chapter, we propose a novel adaptive reference filtering (ARF) method for encoding video with focus mismatch. Based on the analysis in Section 2.3, we first model predictive coding with focus mismatch using point spread functions and provide a derivation of how the proposed approach is designed. The main contribution is that, to compensate for depth-dependent focus mismatch, we adaptively design multiple filters by estimating the mismatch kernels. In our approach, video frames are first divided into regions that suffer from different types of focus mismatch. For MVC inter-view prediction, we exploit block-wise disparity vectors as a feature to roughly classify image blocks into different scene-depth levels. For monoscopic video with temporal focus change, given that depth information is not directly available, we propose an encoding method that estimates localized focus changes. Frames are then partitioned into regions, each consisting of macroblocks (MB) that suffer from a similar type of focus change (e.g., blurring or sharpening). After frame partitioning, a 2D filter is calculated for each region to compensate for the focus mismatch by minimizing the prediction residue energy (MMSE filter). These filters can be regarded as estimators of the focus mismatch kernels. To provide better coding efficiency, we generate multiple filtered reference frames by applying the obtained filters, and allow each block to be predicted from the reference that provides the lowest R-D cost.

We also extend ARF to MVC inter-view bi-directional prediction (B-frames), in which predictive coding is performed using reference frames from two reference lists (denoted List 0 and List 1). A straightforward extension of ARF to B-frames can be achieved by designing depth-dependent filters that minimize the prediction error between the current blocks and the chosen bi-predictors, which are obtained by averaging two reference blocks, one from each reference list. Note that such an extension would be analogous to the one adopted for bi-prediction in adaptive interpolation filtering (AIF) [49], in which, for a given interpolation position, only one filter is designed and applied to generate interpolated pixel values for references in both List 0 and List 1. This approach does not separately consider the possibility that different types of mismatch may exist with respect to the reference frames in the two lists; instead, mismatches from the two lists are considered jointly in the filter design. Estimating and applying more than one set of filters to different lists is not possible in this framework. For bi-directional prediction, the key observation is that with the approach described above, joint filter design is followed by a conventional independent search for predictors in each list. Because of this mismatch between filter design and search, the gain with respect to unfiltered bi-prediction can be reduced. To tackle this problem, we propose a filter design approach that estimates two sets of depth-dependent filters, each set compensating for the focus mismatch affecting one of the two references used for bi-directional prediction.
This leads to increased gains as compared to the straightforward filter design for the averaged predictors.

The rest of this chapter is organized as follows. In Section 3.2, based on the analysis in Chapter 2, we first provide a focus mismatch model for predictive video coding. The proposed adaptive reference filtering (ARF) method is then described in detail in Section 3.3, where we discuss two scenarios with focus mismatch: inter-view coding in MVC, and focus change in monoscopic video. The extension to MVC inter-view B-frames is presented in Section 3.4. Simulation results based on H.264/AVC are summarized in Section 3.5. In Section 3.6, we analyze the complexity of filter estimation in our ARF approach. Finally, conclusions are provided in Section 3.7.

3.2 Filtering Model for Video Coding with Focus Mismatch

Let I_C(x,y) denote the luminance pixel value of the current frame to be encoded at pixel position (x,y), and let I_R(x,y) denote the corresponding luminance pixel value in the reconstructed reference frame. Let (dv_x, dv_y) denote the displacement vector (i.e., a disparity vector for inter-view prediction or a motion vector in temporal prediction). From the analysis in Chapter 2, in the presence of differences in focus settings, the corresponding images will exhibit a depth-dependent mismatch which, for a given depth, can be modeled using a convolution kernel H_Z that captures the difference between two point spread functions (or, in the frequency domain, between two optical transfer functions, as described in (2.19)). In this chapter, we model predictive video coding with focus mismatch for a pixel (x,y) corresponding to an object at depth Z as:

I_C(x,y) = H_Z * I_R(x + dv_x,\, y + dv_y), \quad \text{where } * \text{ denotes convolution.}   (3.1)

For depth-dependent focus mismatch, multiple mismatch kernels H_Z might be required in order to model the blurriness/sharpness mismatch in different regions within a frame, corresponding to different depths Z, as depicted for example in Fig. 2.8. Based on this model, we propose a coding method in which the reference frame is first filtered by estimators of the mismatch kernels H_Z, chosen to minimize the prediction error with respect to the current frame. In an ideal scenario, for a region at depth Z that undergoes a certain type of focus change, the minimum mean-squared error (MMSE) estimate can be derived by jointly optimizing over both the mismatch and the displacement:

\min_{\psi,\, dv_x,\, dv_y} \sum_{x,y} \left( I_C(x,y) - \psi_Z * I_R(x + dv_x,\, y + dv_y) \right)^2   (3.2)

The filter ψ_Z will be an estimator of the mismatch kernel H_Z for a given region with depth Z. However, an encoding system with such joint optimization would require excessive computation. Instead, we adopt a procedure similar to that proposed for adaptive interpolation filtering [46,51,52]: the displacement is estimated first (e.g., using a block-based motion/disparity search), and then the filter coefficients of ψ_Z are determined. In this approach, the filter is designed based on the displacement-compensated prediction error between the reference frame and the current frame. In the next section, we describe our proposed adaptive reference filtering approach for video coding.

3.3 Adaptive Reference Filtering

To design an adaptive filtering approach for situations in which different regions within a frame may suffer from different types of blurriness/sharpness changes, locally adaptive compensation has to be enabled.
In this work, we propose a coding method in which, after performing an initial motion/disparity search to obtain displacement vectors and establish block correspondence, the current image is partitioned into regions suffering from different types of focus mismatch (here we describe the approach in general terms; the specific frame partition procedures for the multiview and monoscopic cases are explained in Section 3.3.1). Each region D_k (k = 1, 2, 3, ...) is then associated with one adaptive filter ψ_k to be designed in the next step. We call this process frame partition for adaptive filter design. The filter ψ_k for each D_k is optimized to minimize the residual energy over all pixels within D_k, i.e.,

\min_{\psi_k} \sum_{(x,y)\in D_k} \left( I_C(x,y) - \psi_k * I_R(x + dv_x,\, y + dv_y) \right)^2   (3.3)

This approach allows multiple filters to be estimated for different parts of a video frame that undergo different blurriness/sharpness changes with respect to the corresponding areas in the reference frames. These filters are applied to the reference frame to generate filtered references that provide better matches. The final motion/disparity compensated prediction is then performed using both the original and the filtered frames as references. At this stage, each block is allowed to select the reference that provides the lowest R-D cost, regardless of what the initial classification of the block was. Fig. 3.1 provides a flowchart of adaptive reference filtering for video coding. In the following subsections, we describe each step in detail.

Figure 3.1: Flowchart of Adaptive Reference Filtering for video coding (initial search; estimation and classification of block-wise parameters, i.e., frame partition; estimation of filter coefficients; generation of filtered references; final search / encoding).

3.3.1 Frame Partition for Adaptive Filter Design

The first step is to identify the different types of blurriness/sharpness changes in different parts of the current frame. An exhaustive approach would be, after having computed the motion/disparity vectors, to adaptively assign to each block in the frame a filter that minimizes the motion/disparity compensated prediction error. This approach is optimal in the sense that the residual energy is minimized for every block. However, it would significantly increase the bitrate, since filter coefficients would have to be transmitted for every single block. To limit the amount of side information (filter coefficients) while maintaining the ability to compensate for focus mismatch, we use a procedure that roughly partitions a frame into regions, each containing blocks of pixels that suffer from a similar type of focus mismatch. This can be regarded as a region-based approach, in contrast to the global (frame-wise) approach of blur compensation in [6] and to the more local, block-based approach just mentioned. The key issue here is to achieve a frame partition such that the different types of focus mismatch within a frame can be reliably estimated. In this section, we propose solutions for two focus mismatch examples: (i) inter-view prediction in MVC with a 1D parallel camera arrangement, and (ii) monoscopic video in which the focus setting changes over time.

Inter-view prediction in MVC

As discussed in Section 2.3, under a given camera setting, the type of focus mismatch depends on the depth of the scene. To partition an image into regions suffering from different focus mismatch, it is therefore reasonable to consider procedures that identify image regions with different depth levels.
When multiple cameras are employed, as in multiview systems, disparity information has been widely used as an estimate of scene depth [15]. As discussed in Section 2.3.1, for multiview systems with a 1D parallel camera arrangement, the disparity δ_Z between different views is the reciprocal of depth Z: δ_Z = b·d/Z (with d and b again representing the image plane distance and the spacing between two consecutive cameras). Based on this relationship, numerous approaches have been proposed to estimate scene depth by computing the disparity [1,16,20]. While some existing approaches aim to obtain an accurate/smooth disparity map at the pixel level, here we simply aim at separating objects with different scene depths by modeling their disparities. This is similar to video object segmentation methods in which the motion field is used to identify moving objects [58]. To reduce complexity, compressed-domain fast segmentation methods have been proposed that use the block-wise motion vectors, obtained with video coding tools, as input features to classify image blocks [2,21,50]. Similarly, we consider procedures to classify blocks into depth levels based on their corresponding disparity vectors.

As shown in Chapter 2, the blur diameter β can be closely approximated as a linear function of δ_Z (Fig. 2.6). Thus, partitioning a frame into regions of similar δ_Z leads to regions for which a similar β applies, which serves our goal of identifying different types of focus mismatch. For multiview systems with cameras arranged in parallel on the same horizontal line, classification can be achieved by considering only the x component of the disparity vectors. For a 2D camera arrangement, as found in a camera array, the classification could be extended by taking both the x and y components as input features.

We propose to use classification algorithms based on Gaussian mixture models (GMM) to separate blocks into depth-level classes. We adopted an expectation-maximization (EM) algorithm based on the GMM [38] to classify the disparity vectors and their corresponding blocks [2,50,55]. In this work, an unsupervised EM classification tool developed by Bouman [4] is employed. To automatically estimate the number of Gaussian components in the mixture (thus making the approach unsupervised), the software tool performs an order estimation based on the minimum description length (MDL) criterion. The only parameter that must be specified is the maximum number of Gaussian components (K) allowed in the GMM; the tool applies MDL to select the number of components (K, K-1, ..., down to 1). We refer to [4,10,39] for details about such techniques. The parameters of the Gaussian components are estimated using an iterative EM algorithm. Each Gaussian component is used to construct a Gaussian probability density function (pdf) that models one class for classification. Likelihood functions can be calculated based on these Gaussian pdfs. Disparity vectors are classified into different groups by comparing their corresponding likelihood values under each Gaussian component, and blocks are classified accordingly, based on the class labels of their corresponding disparity vectors.

Figure 3.2: Disparity vectors from view 6 to view 7 at the 1st frame in Ballroom: histogram of DV_x (number of blocks) and the GMM model of the histogram, with the resulting depth levels 1-3.
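As a concrete illustration of this classification step, the sketch below is illustrative only: the thesis uses the unsupervised EM/MDL tool of Bouman [4], while here scikit-learn's GaussianMixture with a BIC-based order selection plays a similar role, classifying the block-wise horizontal disparities into depth levels.

```python
# Illustrative stand-in for the EM/MDL classification of disparity vectors.
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_depth_levels(features, max_components=4, seed=0):
    """features: array of disparity features, one value (or one row) per block.
    Returns a depth-level label per block and the selected mixture model."""
    X = np.asarray(features, dtype=float)
    X = X.reshape(len(X), -1)
    best_gmm, best_score = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        score = gmm.bic(X)           # BIC used here as a stand-in for the MDL criterion
        if score < best_score:
            best_gmm, best_score = gmm, score
    return best_gmm.predict(X), best_gmm

# Example with synthetic disparities drawn from three depth levels (far, mid, near)
rng = np.random.default_rng(0)
dv_x = np.concatenate([rng.normal(10, 3, 400),
                       rng.normal(60, 8, 300),
                       rng.normal(150, 15, 100)])
labels, gmm = classify_depth_levels(dv_x)
print("estimated number of depth levels:", len(np.unique(labels)))
```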
Refining processes can also be considered, such as eliminating a class to which a very small number of blocks has been assigned. In the classification result, each class represents a depth level within the current frame, and the blocks classified into a certain level will be associated with one adaptive filter. To illustrate this frame partition based on the classification of disparity vectors, we provide a segmentation result in Figs. 3.2 and 3.3.

Figure 3.3: The corresponding frame partition result of Fig. 3.2 (current frame and depth levels 1-3).

Fig. 3.2 shows the histogram of the x-component of the disparity vectors obtained from the initial disparity estimation. A corresponding GMM is constructed, with the number of components estimated to be 3 (K was set to 4). In Fig. 3.3, the corresponding blocks within each class are shown. It can be observed that after disparity-based classification, depth class 1 corresponds to the far background; class 2 captures the two dancing couples and some audience in the mid-range, along with their reflection on the floor; and class 3 includes the couple in the front. Note that blocks intra-coded in the initial disparity estimation are not involved in the filter association process. In this example, the classification tool successfully separates objects with different depths in the current frame. For each depth level, we will then estimate a focus mismatch kernel. (More examples of the proposed disparity-based frame partition are provided in Fig. 3.4 and Fig. 3.5.)

Figure 3.4: Frame partition result: Breakdancers, View 1, frame 0 (frame to be encoded, intra-coded blocks, and depth classes 1-3).

Temporal prediction in monoscopic video

We now consider compensating for focus change in monoscopic video. Again, our goal is to identify regions in a video frame that suffer from different types of focus mismatch and to design filters that estimate the mismatch kernels. For instance, in the dialog example described in Section 2.1, where the focus changes from the first character to the second one, blocks corresponding to the two characters at different depth levels (D_1 and D_2) will be associated with two different filters, ψ_1 and ψ_2, which will produce blurring and sharpening, respectively.

To achieve such a classification without disparity information available to estimate depth, we considered two possible approaches. First, a set of predefined filters can be chosen to operate on the reference frame. During an initial motion compensation, each block selects the filter that provides the lowest matching error, and blocks with similar filter selections can then be grouped into a class. This approach has a drawback: the predefined filter set may not be complete enough to model all types of focus change within the frame. Blocks exhibiting focus changes that are not covered by the predefined filter set may thus be grouped with blocks having very different characteristics, leading to suboptimal compensation filters. With only limited knowledge of the focus settings (for example, the perfect in-focus depth Z* is not likely to be known), it is hard to build a satisfactory filter set unless a very large set of predefined filters is used.

Figure 3.5: Frame partition result: Race1, View 2, frame 30 (frame to be encoded, intra-coded blocks, and depth levels 1-2).

To avoid this problem, we investigate a second approach: during the initial motion compensation, in order to model the localized focus mismatch, a simple filter is estimated for each MB to minimize the prediction residual energy (MMSE filter). The collection of all these MB-wise MMSE filters provides a more comprehensive description of the various focus changes present in the current frame. MBs are then separated into groups by clustering based on the similarity of their respective filters.
To avoid this problem, we investigate a second approach: During the initial motion compensation, inordertomodelthelocalized focusmismatch, asimplefilteris estimated for each MB to minimize the prediction residual energy (MMSE filter). The collection of all these MB-wise MMSE filters provides a more comprehensive description of various focus changes present in the current frame. MBs will then be separated into groups 47 by clustering based on the similarity of their respective filters. This procedure can be summarized as follows: 1. Initial search to obtain displacement vector (dv x ,dv y ). 2. For each MB, calculate MMSE filter f mb such that: min f mb X (x,y)∈MB (I C (x,y)−f mb ∗I R (x+dv x ,y+dv y )) 2 (3.4) where f mb = a b a b c b a b a 3. Classify f mb into groups. Filter coefficients are considered as features for the clas- sification algorithm. Each MB belongs to the class to which its corresponding filter was assigned. For each class, one adaptive filter will be estimated in the next stage to compensate for the focus mismatch. Theroleoff mb istocapturelocalfocuschangesfromthereferenceframetothecurrent frame. We selected a 3×3 filter with circular symmetry for the following reasons. First, thesef mb areestimatorsofthepointspreadfunctionsrepresentingdifferentfocuschanges, which are isotropic as shown in Chapter 2. Second, we are using their coefficient as the input features for classification. Larger filters with more coefficients will result in a much higher dimensional problem, which increases significantly the classification complexity. More importantly, this could also lead to an over-specified classification, which could be sensitive to filter variations, and may not be suited to our goal of identifying a few rough classes of focus changes within each frame. 48 Taking the coefficients as features, we can group the f mb into classes. Such classifi- cation can be visualized by plotting each set of f mb coefficients (a,b,c) as a point in 3D space. We have observed that the filter points all lie very close to the 4a+4b+c = 1 plane (which we denote asP f ). This is reasonable since the MMSE system is attempting to find a weighted average for pixel values. By performing principal component analysis (PCA), we observea system withavery insignificant thirdeigenvalue as compared to the first two (in the order of 10 −15 ), which indicates that the assumption that (a,b,c) belong to a plane is a reasonable one. To select a classification tool, we performedthe following study onf mb : On planeP f , we shift the filter coefficients away from the MMSE point (a,b,c) by (Δa,Δb,Δc) and record how the MSE changes with different shifts. Statistics are gathered on a frame by frame basis. Fig. 3.6 shows some results for the sequence fondue-multi 3 in which the camera focus is changing back and forth among people at different scene depths. We observe that the increase in MSE away from the optimal point has different gradients in different directions. These findings suggest that the classification algorithm should take directional information intoaccount, inadditiontoconsideringthedistancebetweendata points. Simply using the Euclidean distance to cluster the various filters into classes will not be appropriate as this would implicitly assume that the errors generated by changes in the filter coefficients are equal in all directions. We propose once again to use classification algorithms based on multidimensional GMMs to group the f mb into classes. 
We propose once again to use classification algorithms based on multidimensional GMMs to group the f_mb into classes. GMM techniques incorporate covariance matrices, so that the directional information can be modeled. The filters f_mb are classified into different groups by comparing their corresponding likelihood values under each Gaussian component. Refining processes can also be considered in the GMM-based classification, such as removing points with too low a likelihood from the classes, or eliminating a class to which too few points have been assigned.

Figure 3.6: Variations of MSE when shifting the f_mb parameters away from the MMSE solution (3.4) by (Δa, Δb, Δc) = n·u_i, for frames 132 and 153 (MSE versus shift n·u_i). The four curves represent results with different u_i: ×: u_1 = (-0.05, 0.025, 0.1); ◦: u_2 = (0.025, -0.05, 0.1); □: u_3 = (-0.07, 0.0513, 0.0748); △: u_4 = (0.07, -0.0805, 0.0419). Note that (i) the norms of these u_i are the same, and (ii) 4Δa + 4Δb + Δc = 0, so that the shifted filters remain on P_f.

An example of frame partition results using the proposed method on the sequence fondue-multi is provided in Fig. 3.7. In this example, the camera focus is shifting from the front to the back: the first three people are becoming increasingly blurred, while the others are becoming clearer. Based on the MB-wise focus change estimators f_mb, the classification tool successfully separates these two groups, as can be observed in classes 1 and 2.

Figure 3.7: An example of frame partition based on focus changes.

3.3.2 Filter Design by Estimating Mismatch Kernels

We now discuss how to select a filter for all blocks belonging to a given region/class D_k, which are therefore assumed to have similar focus mismatch. We replace the convolution notation in (3.3) by explicitly expressing the filter operation as

\min_{\psi_k} \sum_{(x,y)\in D_k} \left( I_C(x,y) - \sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j)\cdot I_R(x + dv_x + i,\, y + dv_y + j) \right)^2   (3.5)

The size and shape of the 2D filters can be specified by changing m and n. In Chapter 2 we established that the focus mismatch is expected to be isotropic; hence, in this work we use square-shaped filter kernels with m = n. Our analytical results also show that the frequency responses of these filters depend on the camera parameters and the object depth, as depicted for example in Fig. 2.8: some kernels have very smooth responses while others have sharp transitions. It is well known that in FIR filter design, smooth responses can be realized with short filters, while sharp responses require longer filters. Thus, in order to select a filter size that can reliably estimate the underlying mismatch kernels, we need knowledge of the setting differences in the camera parameters and of the depth composition of the scene. Furthermore, as described in the numerical example in Section 2.3.2, the dimension and resolution of the camera sensor array also have to be known, so that we can convert the filter responses from the analog domain to the digital domain and then calculate the filter size in units of pixels. For example, a given blur diameter β will cover more pixels for sensor arrays with a smaller distance between pixels, and a given analog filter response will lead to a sharper transition in the digital domain if the pixel distance is smaller, i.e., if the Nyquist rate is higher.
Thus, other things being equal (i.e., the same a, f, d, Z, and sensor array dimension), the selected filter length should be proportional to the pixel resolution (inversely proportional to the pixel distance). In practical scenarios, it is very unlikely that we will have full access to all these parameters. Therefore, in this section we investigate filters with different sizes and constraints. In adaptive interpolation filtering (AIF) approaches, even-length (6×6) filters are proposed in order to interpolate subpixels in between integer pixels. In our proposed ARF approach, we apply adaptive filters directly to the reference frame in order to generate filtered references that are better matched to the current frame; odd-length filters centered at the pixel to be filtered are therefore employed in this work.

The filter coefficients ψ_k(i,j) that satisfy (3.5) can be determined by taking derivatives with respect to each coefficient, i.e., for all ψ_k(I,J) with -m ≤ I ≤ m and -n ≤ J ≤ n, we look for a filter such that:

\frac{\partial}{\partial \psi_k(I,J)} \sum_{(x,y)\in D_k} \left( I_C(x,y) - \sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j)\, I_R(x+dv_x+i,\, y+dv_y+j) \right)^2 = 0

\sum_{(x,y)\in D_k} \left( I_C(x,y) - \sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j)\, I_R(x+dv_x+i,\, y+dv_y+j) \right) I_R(x+dv_x+I,\, y+dv_y+J) = 0

\sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j) \sum_{(x,y)\in D_k} I_R(x+dv_x+i,\, y+dv_y+j)\, I_R(x+dv_x+I,\, y+dv_y+J) = \sum_{(x,y)\in D_k} I_C(x,y)\, I_R(x+dv_x+I,\, y+dv_y+J)   (3.6)

These Wiener-Hopf equations lead to the optimal linear Wiener filters. Defining (\tilde{x}, \tilde{y}) = (x + dv_x,\, y + dv_y) and denoting by \tilde{I}_R the disparity-shifted pixel value I_R(x + dv_x,\, y + dv_y), (3.6) can be further written as:

\sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j)\, E[\,I_R(\tilde{x}+i, \tilde{y}+j)\, I_R(\tilde{x}+I, \tilde{y}+J)\,] = E[\,I_C(x,y)\, I_R(\tilde{x}+I, \tilde{y}+J)\,],

that is,

\sum_{j=-n}^{n}\sum_{i=-m}^{m} \psi_k(i,j)\, Cor_{\tilde{I}_R \tilde{I}_R}(I-i,\, J-j) = Cor_{I_C \tilde{I}_R}(I, J)   (3.7)

where E[·] is the expectation operator and Cor denotes the correlation function; both operate over all the blocks classified into the depth level D_k. It can be seen that the linear MMSE Wiener filter is determined by the autocorrelation of the disparity-shifted pixel values \tilde{I}_R and the cross-correlation between the current frame and \tilde{I}_R.

In this linear system, the number of equations equals the number of coefficients in ψ_k. Filters with more unknowns can compensate for blurring/sharpening more effectively and thus reduce the residual energy, but this comes at the expense of having to transmit more filter coefficients. (For example, a circularly symmetric 3×3 filter contains only 3 distinct coefficients, while a full 3×3 matrix has 9.) In Chapter 2 we showed that the focus mismatch is isotropic, which suggests that a circular symmetry constraint can be imposed on the filter coefficients. Thus, in this section we consider two examples of 5×5 filters (m = n = 2) with symmetry constraints:

\psi_{55cir} = \begin{bmatrix} f & e & d & e & f \\ e & c & b & c & e \\ d & b & a & b & d \\ e & c & b & c & e \\ f & e & d & e & f \end{bmatrix}   (3.8)

\psi_{55hv} = \begin{bmatrix} i & g & e & g & i \\ h & d & b & d & h \\ f & c & a & c & f \\ h & d & b & d & h \\ i & g & e & g & i \end{bmatrix}   (3.9)

The filter in (3.8) is circularly symmetric, with only 6 coefficients to be estimated (a-f); we denote it ψ_55cir. The filter in (3.9), which we denote ψ_55hv, can be viewed as a compromise between a full matrix and the circularly symmetric ψ_55cir; it has 9 distinct coefficients (a-i). After partitioning the frame using the methods described in Section 3.3.1, a filter in one of the above forms can be obtained for each depth region by solving (3.6).
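To make the constrained estimation concrete, the sketch below (an illustrative reformulation, not the thesis software) solves the system in (3.6) for ψ_55cir by tying together the coefficients that share a letter in (3.8), which turns the Wiener-Hopf equations into a six-unknown least-squares problem over all pixels of one depth class.

```python
# Grouped least-squares estimation of the circularly symmetric 5x5 filter psi_55cir.
import numpy as np

# Symmetry groups of (3.8): sets of (dy, dx) offsets sharing coefficients a..f.
GROUPS_55CIR = [
    [(0, 0)],                                                                   # a
    [(0, 1), (0, -1), (1, 0), (-1, 0)],                                         # b
    [(1, 1), (1, -1), (-1, 1), (-1, -1)],                                       # c
    [(0, 2), (0, -2), (2, 0), (-2, 0)],                                         # d
    [(1, 2), (1, -2), (-1, 2), (-1, -2), (2, 1), (2, -1), (-2, 1), (-2, -1)],   # e
    [(2, 2), (2, -2), (-2, 2), (-2, -2)],                                       # f
]

def estimate_psi_cir(cur, ref, pixels, groups=GROUPS_55CIR):
    """pixels: iterable of (x, y, dvx, dvy) records, one per pixel of depth class D_k
    (block-wise displacement vectors expanded to pixel level). Positions are assumed
    to lie at least two pixels inside the reference frame. Returns coefficients a..f."""
    A, t = [], []
    for x, y, dvx, dvy in pixels:
        ry, rx = y + dvy, x + dvx                      # disparity-shifted position in ref
        A.append([sum(ref[ry + dy, rx + dx] for dy, dx in g) for g in groups])
        t.append(cur[y, x])
    A = np.asarray(A, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.linalg.lstsq(A, t, rcond=None)[0]        # grouped Wiener-Hopf solution
```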
Figure 3.8: Performance of ARF using different filter sizes/constraints at QP 22 (Kb/frame versus PSNR, IPPPPPPP inter-view coding): Ballroom (640×480, views 0-7), Race1 (640×480, views 0-7), and Rena (640×480, views 38-45), comparing ARF 55HV, ARF 55cir, and ARF 33HV.

Fig. 3.8 provides simulation results of MVC inter-view coding at different timestamps, using ARF with different filter sizes and constraints (refer to Section 3.3.3 and Section 3.5 for details on how the simulations were conducted). We observe that the differences in coding efficiency between using ψ_55cir and ψ_55hv as filter structures are small, with greater differences at higher bitrates (low QP); thus, in Fig. 3.8 we provide results at QP = 22, for which the difference in performance is easier to observe. The performance of the two filters ψ_55cir and ψ_55hv is very similar, e.g., less than 0.025 dB difference in the sequences tested. More importantly, in two of the three sequences (Ballroom and Rena), reducing the number of coefficients by imposing circular symmetry actually provides a slight coding gain. These results indicate that by exploiting the isotropic property of focus mismatch, we can simplify the filter structure, reducing the side information (coefficients) to be transmitted while preserving the ability to reliably estimate the mismatch kernels. In the simulations in Section 3.5, we therefore provide ARF coding results using the circularly symmetric filters ψ_55cir. (As a reference, Fig. 3.8 also includes results using 3×3 filters with horizontal/vertical constraints similar to those in (3.9). At QP = 22, compared to the difference between ψ_55cir and ψ_55hv, using such a filter results in a much larger degradation in coding efficiency, i.e., 0.05 dB for Ballroom and Race1, and 0.1 dB for Rena.)

Figures 3.9, 3.10, and 3.11 show the frequency responses of the calculated ψ_55cir when we perform inter-view coding between view pairs of the Race1 sequence. In each figure, the three curves correspond to filters estimated for different depth levels of a given anchor frame: from left to right, the filters correspond to regions ranging from far (small disparity) to near (large disparity). These estimated depth-dependent focus mismatch kernels behave similarly to the analytical results depicted, for example, in Fig. 2.8. For the scenario in which the current frame is blurred as compared to the reference (Figures 3.9 and 3.11), the filters have a lowpass characteristic; when the blurring affecting the frame is stronger (Fig. 3.11), the resulting filters have sharper transitions. In both figures, the responses change gradually from smaller disparity to larger disparity (see the dotted points). On the other hand, when the reference frame is a blurred version of the current frame (Fig. 3.10), the filters emphasize the higher frequency ranges so that the reference can be sharpened to create a better match; in this example, the responses rise to a peak at about Ω = 0.4π. As for the focus change example in monoscopic video, the filters' frequency responses are shown in Fig. 3.7.
For parts of the image that are blurred (Class 2), the corresponding filter ψ_2 is a blurring (lowpass) filter with a Gaussian-shaped frequency response. For parts that are being sharpened (Class 1), the filter ψ_1 emphasizes the higher frequency ranges (a non-Gaussian shape as compared to ψ_2), so that the reference can be sharpened to create a better match.

Figure 3.9: Frequency responses of the estimated filters when performing inter-view prediction from Race1 V2 to V3 at Anchor 9. V3 is slightly blurred w.r.t. reference V2.

Figure 3.10: Frequency responses of the estimated filters when performing inter-view prediction from Race1 V4 to V3 at Anchor 3. Reference V4 is blurred w.r.t. V3.

3.3.3 Encoding with Filtered References

The optimized filters are applied to the reference frame in order to provide better matches for predictive video coding. In the reference picture list, the original unfiltered reference as well as the multiple filtered references are stored. If subpixel disparity estimation is employed, all of these references are interpolated to generate subpixel values using the interpolation filters specified by the codec (e.g., the 6-tap interpolation filters in H.264/AVC).

Figure 3.11: Frequency responses of the estimated filters when performing inter-view prediction from Race1 V6 to V5 at Anchor 7. V5 is strongly blurred w.r.t. reference V6.

During the final encoding process, the original and filtered references can be regarded as inputs for predictive coding with multiple references, as specified in H.264/AVC [54]. This provides two advantages. First, and most importantly, each block can select a block in any filtered or original reference frame based on R-D optimization, which ensures the highest coding efficiency. (Note that after this stage, the filter selection could be regarded as a new "frame partition" D_k, and the filters ψ_k could be estimated again based on the MBs in the different classes; the estimation of D_k and ψ_k could thus be carried out iteratively until a stopping criterion is met. The complexity of such a process would be fairly high, and in this work we limit ourselves to an algorithm without iteration.) Second, the filter selection of each block can easily be handled by signaling the reference frame index in the bitstream.

To correctly decode the video sequence, the filter coefficients also have to be transmitted. In this work, we directly extend the method proposed in [45,47], in which the filter coefficients are quantized and encoded as frame-level overhead. Using the 5×5 filters with the circular symmetry constraint of (3.8), which have 6 coefficients each, and assuming there are 4 filters (the maximum number of filters allowed in our simulations), there is a total of 24 coefficients of side information. By comparison, for each frame AIF [46] requires 54 coefficients to be transmitted in order to specify the interpolation filters.

Figure 3.12: Encoding selection with adaptive filtering (block-wise selection of no filter, ψ_1, or ψ_2, together with the magnitude responses of filters ψ_1 and ψ_2).
Of course, our method requires additional side information as compared to AIF, namely the reference frame index used to indicate the block-wise filter selection. In Fig. 3.12 we provide the final filter selection result corresponding to the temporal focus change example of Fig. 3.7. Different filters were selected for the front three people and for the others (ψ_1 and ψ_2). The two people in the very back chose the unfiltered reference frame, as they are not altered much by the focus change. One interesting point is that for smooth regions, such as the first three people's foreheads and cheeks, the unfiltered reference is also preferred: for these plain regions, changing the focus does not have much effect on the pixel values. We observed the same phenomenon for other frames as well.

3.4 ARF for MVC Bi-directional Disparity Compensation with Focus Mismatch

Based on the assumption that regions with similar depth will suffer from similar focus mismatch, in the previous section we proposed an adaptive reference filtering (ARF) approach for MVC inter-view coding, in which a frame is partitioned into different depth levels and depth-adaptive MMSE filters are estimated to compensate for focus mismatch. This method was developed for inter-view P-frames, for which a single reference frame is used, taken from one of the neighboring views (IPPP for coding V0-V3, for example).

In this section, we extend focus mismatch compensation to B-frames, where predictive coding is performed using reference frames from two reference lists (denoted List 0 and List 1), which consist of previously encoded frames. In MVC inter-view coding, these lists contain frames from different neighboring views (e.g., frames from the left and right views in List 0 and List 1, respectively). As a result, a B-frame to be encoded may suffer from different types of focus mismatch with respect to the reference frames from List 0 and List 1. In what follows, we first revisit the focus mismatch example of Chapter 2 and then discuss different approaches to designing the filters. In particular, we emphasize the interaction between filter design and bi-predictive search when filtered references are generated.

3.4.1 Inter-view Bi-directional Prediction with Focus Mismatch

Once again, for a camera equipped with a lens, we denote by f the focal length, a the aperture diameter, and d the "image plane distance", i.e., the distance between the image plane and the lens.

Figure 3.13: An example of focus mismatch in multiview bi-prediction, with Z*_V1 = 1.9 m, Z*_V2 = 2.0 m, and Z*_V3 = 2.3 m (d = 20.21, 20.20, and 20.175 mm, respectively). We consider an image sensor of type 1/2" (6.4 mm × 4.8 mm) with a resolution of 640×480 pixels, i.e., the spacing between pixels is 0.01 mm (Nyquist rate 100/2 = 50 cycles/mm). In the polar system, q = \sqrt{50^2 + 50^2} ≈ 70.71, which corresponds to Ω = π in (b) and (c). Panels: (a) β as a function of Z (f = 20 mm, a = f/8); (b) |OTF| at Z = 1.2 m; (c) |OTF| at Z = 4 m, including the averaged predictor (V1+V3)/2.

In Chapter 2, we showed that in the presence of a difference in focus settings, frames from different views will exhibit a mismatch that is a function of the parameters f, a, d and the object depth Z.
For points at depth Z, the corresponding projections on the image plane are uniform circles with diameter \beta = \frac{a f\,|Z - Z^*|}{Z (Z^* - f)}. Here Z* is the specific depth at which an object produces a point projection (perfectly focused) on the image plane: Z^* = \frac{d \cdot f}{d - f}.

Now let us revisit the example described in Section 2.3.2, where three cameras V1, V2, and V3 form a multiview system. Assume that the cameras have the same focal length setting f (same zoom) and that their aperture settings are also identical, a = f/8. However, assume that the fine tuning of their Z* was not done perfectly (Z*_V1 ≠ Z*_V2 ≠ Z*_V3), resulting in differences in their β values as functions of Z. Fig. 3.13 shows the same example as in Section 2.3.2 with these heterogeneous settings. Figures 3.13(b) and (c) demonstrate the differences in the corresponding optical transfer functions (OTF), i.e., the frequency transforms of the point spread functions (PSF). If we encode V2 with bi-directional prediction, putting V1 in List 0 and V3 in List 1 as references, then for image portions corresponding to visible regions at Z = 1.2 m we need to perform blurring on V1 and sharpening on V3 in order to match V2. On the other hand, for visible regions at Z = 4 m, the corresponding image portions in V1 need to be slightly sharpened, while V3 has to undergo a significant amount of blurring. As for the averaged predictor ½(V1+V3) (dotted line) in Fig. 3.13(c), a blurring filter is required to bring its OTF down to that of V2.

If V1, V2, and V3 are arranged on a 1D horizontal line from left to right with equal spacing b between each other, and assuming their image plane distances d are very similar, we discussed in Section 2.3.1 that an object at depth Z results in a disparity δ_Z = b·d/Z from V1 to V2 and also from V2 to V3. Since the blur diameter β is approximately a linear function of δ_Z (see Section 2.3), we can exploit disparity vectors to identify image portions suffering from different types of focus mismatch. In Section 3.4.2, we discuss adaptive filtering methods using the three-view example just described.

3.4.2 ARF and Bi-directional Disparity Search

As in Section 3.3, we propose to utilize a two-pass coding scheme with an initial search (the first coding pass) to obtain the block-wise disparity vectors (DVs) and predictors, which are used for the disparity-based frame partition and for designing the filters. In what follows, we discuss different filter estimation methods, emphasizing in particular how filter estimation interacts with the bi-predictive search when filtered references are generated.

3.4.2.1 Filter Design for Averaged Bi-predictor

In B-frames, for a block that chooses to use bi-prediction, the predictor is the average of two reference blocks, one from the reference frame in List 0 (I_R^L0) and one from the reference frame in List 1 (I_R^L1). A straightforward filter design approach, which minimizes the prediction error between the current blocks and the averaged predictors, would be, for pixels within a given depth level D_i:

\min_{\psi_i^{BI}} \sum_{(x,y)\in D_i} \left( I_C(x,y) - \psi_i^{BI} * \frac{1}{2}\left[ I_R^{L0}(x + dx_0,\, y + dy_0) + I_R^{L1}(x + dx_1,\, y + dy_1) \right] \right)^2   (3.10)

In (3.10), (x,y) is the pixel position within a frame, (dx_0, dy_0) and (dx_1, dy_1) are the disparity vectors for I_R^L0 and I_R^L1, respectively, and * denotes convolution. Using the approach described in Section 3.3.1, we can partition a frame into regions with different depth levels, either by classifying the DVs in one direction (L0 or L1) or by taking both directions as two input features for classification. Since for each depth level D_i the filter is designed for the averaged predictors, it should be applied to both List 0 and List 1, so that the filtered references ψ_i^BI * I_R^L0 and ψ_i^BI * I_R^L1 can be generated. In Table 3.1, we summarize this approach as Method A.
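Before examining how this joint design interacts with the search, it helps to quantify the Z = 4 m case of Fig. 3.13(c). The sketch below is a hedged illustration: it assumes the classical geometric-optics OTF of a uniform disc PSF of diameter β, |2·J1(πβq)/(πβq)|, which may differ from the exact expression (2.16) used in Chapter 2, but the ordering of the curves is the same.

```python
# Hedged illustration of Fig. 3.13(c) at Z = 4 m under a disc-PSF OTF assumption.
import numpy as np
from scipy.special import j1

def disc_otf(beta, q):
    x = np.pi * beta * q
    return np.where(x == 0, 1.0, 2.0 * j1(x) / np.where(x == 0, 1.0, x))

f, a = 20.0, 2.5                                    # mm; a = f/8

def blur_diameter(Z, Zs):
    return a * f * abs(Z - Zs) / (Z * (Zs - f))

q = np.linspace(0.0, 70.71, 200)                    # up to Omega = pi (Section 2.3.2)
Z = 4000.0                                          # the Z = 4 m depth level
otf_v1 = np.abs(disc_otf(blur_diameter(Z, 1900.0), q))
otf_v2 = np.abs(disc_otf(blur_diameter(Z, 2000.0), q))
otf_v3 = np.abs(disc_otf(blur_diameter(Z, 2300.0), q))
otf_avg = 0.5 * (otf_v1 + otf_v3)                   # averaged bi-predictor (V1+V3)/2

# At high frequencies the averaged predictor sits above V2, so a lowpass filter is
# needed to bring it down to V2, as argued in the text; V1 alone sits below V2.
print(otf_avg[-1] > otf_v2[-1], otf_v1[-1] < otf_v2[-1])   # True True
```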
However, whether the optimal pair of predictors can be found depends on how the bi-predictive search is performed. Searching jointly for pairs of vectors from List 0 and List 1 would lead to the optimal solution, but would require high complexity. Typically, simpler search schemes are used, such as independent search, which results in a degradation of coding efficiency as compared to a joint pair-wise search: there is no guarantee that searching independently for the best matching blocks in ψ^BI * I_R^L0 and ψ^BI * I_R^L1 will lead to an optimal solution to the problem of finding the two blocks in List 0 and List 1 that provide the best prediction after averaging and filtering. Clearly, this is also the case for bi-prediction even if no filtering is used [14]. However, in the following we show that the suboptimality is exacerbated when filtering is used.

Consider first the case of independent search, where for each block the encoder independently searches for the best predictor from the references in List 0 and the best predictor from the references in List 1. The bi-predictor is formed by simply averaging the two, without any additional search. For the example in Fig. 3.13(c), at the depth level Z = 4 m, after applying the lowpass filter ψ^BI designed for ½(I_V1 + I_V3), the filtered reference ψ^BI * I_V1 will actually have a stronger mismatch with respect to V2, since its frequency response is further attenuated as compared to that of V1. During the search within List 0, due to the effect of the lowpass filter ψ^BI, the reference ψ^BI * I_V1 may not be preferred over I_V1, i.e., it is less likely to be selected. Consequently, the improved predictor ½ψ^BI * (I_V1 + I_V3) may not even be tested by the encoder. As an alternative, in an iterative search [14], the search is conducted by iteratively fixing the predictor obtained from one side (I_R^L0 or I_R^L1) and estimating the best predictor from the other side. This can improve performance as compared to independent search, as some joint estimation is made possible; however, the iterative process can still be trapped in a local minimum. For example, in Fig. 3.13(c), if the initially selected predictor from List 0 is V1 instead of ψ^BI * I_V1, the resulting predictor after several iterations may still not converge to the optimal predictor ½ψ^BI * (I_V1 + I_V3).

One possible approach to resolving this problem, without performing an exhaustive search over pairs of vectors, is to modify the bi-predictive search so that within each list, instead of picking only a single "best" predictor, we record the best matched predictors from each filtered/non-filtered reference frame {I_R^L0, ψ_1^BI * I_R^L0, ψ_2^BI * I_R^L0, ...} and {I_R^L1, ψ_1^BI * I_R^L1, ψ_2^BI * I_R^L1, ...}. With different combinations of one predictor from each side, multiple averaged predictors can then be evaluated. This increases the complexity on top of the independent search, and also increases the memory requirement, as multiple pairs of vectors have to be recorded for the mode decision.

In addition to the problems due to the search algorithm, the filter design approach in (3.10) has another drawback: it may not lead to improved predictors within each list.
For B-frames, a block is allowed to be encoded using a predictor from only one of the lists, if the rate-distortion (RD) cost of doing so is lower than that of using the averaged bi-predictor. However, in (3.10) the filters are designed jointly for the averaged blocks, so there is no guarantee that, after applying them to the individual frames, they will provide good approximations of the original frame. As discussed in the example of Fig. 3.13(c), the effect of the lowpass ψ^BI in fact leaves the filtered reference ψ^BI * I_V1 with a stronger mismatch when used to compensate I_V2. As a result, the filtered references in each list, when used by themselves, may not provide better coding options.

3.4.2.2 Filter Design for Predictors from Each Reference List

To overcome the drawbacks of the method in (3.10) (limited coding choices and poor integration with the bi-predictive search), we consider an alternative filter design approach that estimates depth-related filters for each reference list. After the first coding pass, assuming a horizontal camera arrangement, we use the same approach as described in Section 3.3.1 to partition the current frame I_C into depth levels D_1, D_2, ..., D_K, taking dx_0 and dx_1 as two features to classify the blocks. In other words, we consider a two-dimensional feature space: objects closer to the cameras have larger disparities dx_0 and dx_1, pointing in opposite directions, while both disparities are small for far-away objects. Instead of minimizing the error with respect to the averaged predictor, two sets of filters are estimated, one for List 0 and the other for List 1:

\Psi^{L0} = \left\{ \psi_i^{L0} :\; \min_{\psi_i^{L0}} \sum_{(x,y)\in D_i} \left( I_C(x,y) - \psi_i^{L0} * I_R^{L0}(x + dx_0,\, y + dy_0) \right)^2 \right\}
\Psi^{L1} = \left\{ \psi_i^{L1} :\; \min_{\psi_i^{L1}} \sum_{(x,y)\in D_i} \left( I_C(x,y) - \psi_i^{L1} * I_R^{L1}(x + dx_1,\, y + dy_1) \right)^2 \right\}   (3.11)

This filter design method directly addresses the potentially different types of depth-dependent mismatch exhibited by the reference frames from List 0 and List 1 (e.g., as in the example depicted in Fig. 3.13). In (3.11), the sets Ψ^L0 and Ψ^L1 both contain K filters; they are applied to List 0 and List 1, respectively, to generate filtered references. In this approach, a given block in I_C participates in both filter estimations, minimizing the prediction errors with respect to the references in List 0 and in List 1.

Note that our approach here is different from a fully independent design, which would involve performing the complete ARF design twice, once for L0 and once for L1, each with its own classification and filter estimation as in Section 3.3. In such a fully independent approach, summarized as Method B in Table 3.1, there are also two sets of filters, but a block in the current frame may belong to two different classes for List 0 and List 1; in our approach (Method C in Table 3.1), there is a single class (joint classification with the two features dx_0 and dx_1) for each block.
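The structure of Method C can be summarized in a short schematic sketch (illustrative only, not the reference-software implementation): one joint classification over (dx_0, dx_1), followed by two independent per-class filter estimations as in (3.11). The `classify` and `estimate_filter` arguments stand in for routines such as the GMM and ψ_55cir sketches given earlier in this chapter.

```python
# Schematic sketch of Method C: joint classification, independent filter estimation.
import numpy as np

def arf_bi_method_c(cur, ref_l0, ref_l1, dvx0, dvx1, pix_l0, pix_l1,
                    classify, estimate_filter):
    """dvx0, dvx1: horizontal disparities towards List 0 / List 1, one per block.
    pix_l0[i], pix_l1[i]: per-pixel (x, y, dvx, dvy) records of block i for each list.
    classify: returns one depth-level label per block; estimate_filter: per-class
    MMSE filter estimator. Both are supplied by the caller (hypothetical helpers)."""
    # One partition D_i, using both disparities as a two-dimensional feature
    labels = classify(np.column_stack([dvx0, dvx1]))

    psi_l0, psi_l1 = {}, {}
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        recs_l0 = [r for i in idx for r in pix_l0[i]]
        recs_l1 = [r for i in idx for r in pix_l1[i]]
        # Two filters per depth level, one towards each reference list, as in (3.11)
        psi_l0[k] = estimate_filter(cur, ref_l0, recs_l0)
        psi_l1[k] = estimate_filter(cur, ref_l1, recs_l1)
    return psi_l0, psi_l1
```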
The proposed two-set filter design, Method C, has the following advantages:

1. Better integration with conventional bi-predictive search schemes: since the filters are optimized independently for each list, the search within each list is likely to find better matched predictors. Since the two predictors are both focus-compensated, they can be used to form the averaged bi-predictor, or serve as the starting point for an iterative search.
2. More coding options: based on (3.11), the filtered references in each list provide better matched predictors that can be used by themselves, i.e., in P-mode instead of B-mode, leaving the encoder more options (a predictor from one of the lists, or the averaged bi-predictor) over which to perform RD optimization.
3. A potential speed-up of the bi-directional search: in our approach, for a given class k, ψ_k^L0 and ψ_k^L1 are designed for the same depth level within the frame I_C. Thus, if a given block selects a particular filtered reference ψ_{k'}^L0 * I_R^L0 after the search within List 0, it is reasonable to constrain the search in List 1 to the reference ψ_{k'}^L1 * I_R^L1. A constrained bi-predictive search can therefore be designed based on the search results from one of the lists.

With the advantages above, the joint-classification-followed-by-independent-filter-estimation approach, Method C, is preferred over Method A and Method B.

Table 3.1: Filter design methods for bi-directional disparity compensation
• Method A (filter design for the averaged bi-predictor; joint design):
  Frame partition: one partition D_i, obtained by classifying dx_0 and dx_1.
  Filter estimation: for blocks in D_i, estimate one MMSE filter ψ_i^BI with respect to the averaged bi-predictor.
  Filtered references: ψ_i^BI * I_R^L0 and ψ_i^BI * I_R^L1.
• Method B (filter design for predictors from each list; independent design, i.e., perform ARF-P twice):
  Frame partition: two partitions, D_i^L0 by classifying dx_0 and D_j^L1 by classifying dx_1.
  Filter estimation: for blocks in D_i^L0, estimate an MMSE filter ψ_i^L0 with respect to the predictor from List 0; for blocks in D_j^L1, estimate an MMSE filter ψ_j^L1 with respect to the predictor from List 1.
  Filtered references: ψ_i^L0 * I_R^L0 and ψ_j^L1 * I_R^L1.
• Method C (filter design for predictors from each list; joint classification followed by independent filter estimation):
  Frame partition: one partition D_i, obtained by classifying dx_0 and dx_1 as two features.
  Filter estimation: for blocks in D_i, estimate one MMSE filter ψ_i^L0 with respect to the predictor from List 0, and one MMSE filter ψ_i^L1 with respect to the predictor from List 1.
  Filtered references: ψ_i^L0 * I_R^L0 and ψ_i^L1 * I_R^L1.

3.5 Simulation Results

We performed simulations based on the H.264/AVC coding standard. To partition a frame, the GMM-based EM classification tool [4] is combined with our encoder. The classifier takes disparity vectors (multiview case) or the coefficients of f_mb (monoscopic video) as input features. To solve for the block-wise MMSE filters f_mb, we extended the code from [45]. For both methods, the reference frame management functions have been modified to store the filtered reference frames. Finally, as described in Section 3.3.3, the filter coefficients are encoded as in [45,47].

The proposed adaptive reference filtering approach (ARF) is compared with AIF [46] and with current H.264/AVC. Our ARF utilizes multiple filtered versions of a single reference frame: if the EM classification generates K classes, each with a corresponding filter, there will be N = K + 1 references in the reference list, including the original unfiltered one. In H.264/AVC, motion compensation with multiple references is a coding tool [54] that also aims to improve coding efficiency by providing better matches. In our simulations, the maximum K allowed is 4; thus, we also compare our method to H.264/AVC with the number of reference frames set to 5.

Inter-view prediction in MVC: P-frames

In Section 3.3, depending on the information available, we proposed two methods to partition frames into regions suffering from different types of focus mismatch. For MVC inter-view prediction, the depth-dependency characteristic of focus mismatch can be exploited by using disparity information to classify the blocks. Here we would first like to verify the efficiency of this design approach.
We conduct simulations of ARF based on the classification of disparity vectors, denoted ARF-Z, and on the classification based on the estimated coefficients of the MB-wise filters f_mb, denoted ARF-f_mb. Using the H.264/AVC reference software JM10.2 [42], we encode anchor frames only at given timestamps using inter-view coding, i.e., we take a sequence of frames captured at the same time from different cameras and feed it to the encoder as if it were a temporal sequence. The intra period is set equal to the number of views so that the coding structure for anchor frames is IPPP, as depicted in Fig. 1.2. Fig. 3.14 shows the rate-distortion (RD) comparison between the two filter design methods, along with H.264/AVC results as references.

Figure 3.14: Comparison between two ARF methods on inter-view coding. (RD curves, PSNR vs. Kb/frame, for Race1, 640x480, Views 0∼7, timestamps 0/10/20/30/40, IPPPPPPP, and Rena, 640x480, Views 38∼45, same timestamps: ARF-Z, ARF-f_mb, and H.264 with 1 reference.)

The four rate points were obtained at (from left to right) QP = 36, 32, 28, and 24. It can be seen that using disparity information to identify the different types of focus mismatch achieves higher coding efficiency than the classification based on the estimated f_mb. The gain is larger at higher bitrates: at QP = 24, ARF-Z has a 0.15 dB gain over ARF-f_mb for Race1 and a 0.25 dB gain for Rena. These results confirm that disparity is a reliable basis for estimating the different focus mismatch kernels. Furthermore, ARF-Z is less complex than ARF-f_mb. In what follows, all our comparisons will be based on ARF-Z.

It can be seen from Fig. 3.15 that for the multiview sequences we tested, ARF-Z provides higher coding efficiency than H.264/AVC, although the gains vary significantly depending on the test sequence. In the Ballroom sequence, almost no cross-view mismatch can be observed. Furthermore, frames from different views have been rectified (properly registered) by applying homography matrices [18]. In this situation, the three enhanced predictive coding schemes (multiple reference frames, AIF, and ARF) all provide very similar coding efficiency, i.e., about 0.3∼0.4 dB gain over H.264/AVC with one reference.

Figure 3.15: Comparison of different techniques applied to inter-view coding. (RD curves, PSNR vs. Kb/frame, for Race1, 640x480, Views 0∼7; Ballroom, 640x480, Views 0∼7; Breakdancers, 1024x768, Views 0∼7, timestamps 0/8/16/24; and Rena, 640x480, Views 38∼45: ARF-Z, AIF, H.264 with 5 references, and H.264 with 1 reference.)

Cameras for the Breakdancers sequence are arranged on an arc. View 4 in this sequence is somewhat blurred while all the other views have very similar subjective visual quality. From the simulation results, we see that adaptive filtering can be used to provide better references for predictive coding. Both the proposed method and AIF achieve higher coding efficiency than the multiple reference method in H.264/AVC. Our ARF method, designed based on scene depth, provides a marginal gain over AIF.
However, it is worth noting that the gain achievable with these enhanced predictive coding tools is relatively modest compared to that achievable with the other multiview sequences in Fig. 3.15. This is because frames in the Breakdancers sequence have large homogeneous areas for which intra coding was selected. Fig. 3.4 provides an example of the disparity-based frame partition and the intra-coded areas during the initial disparity estimation. Due to the relatively large number of intra-coded blocks, our proposed method provides a 0.2 dB gain at 100 Kb/frame over H.264/AVC with one reference.

Race1 is the sequence that suffers from the most severe focus mismatch across views. As a result, the benefit of adaptive filtering is more prominent: AIF provides a 0.3 dB gain over multiple-reference H.264/AVC, and the proposed ARF achieves an additional 0.2∼0.3 dB gain over AIF (0.7∼0.8 dB over H.264/AVC with one reference). Focus mismatch is compensated more efficiently with our depth-dependent adaptive filtering. As for the Rena sequence, in the presence of some degree of inter-view discrepancy, our ARF method again provides a 0.15 dB gain over AIF (0.6 dB gain with respect to H.264/AVC with one reference). In this sequence, due to a much closer camera spacing than in the other test sequences (5 cm versus 20 cm), the standard multiple reference method achieves coding efficiency similar to that of AIF.

Based on the simulation results, it can be concluded that our proposed method is especially helpful for disparity compensation with focus mismatch. It effectively designs filters as estimators of the mismatch kernels to compensate for the possible discrepancies associated with scene depth. For sequences with stronger focus mismatch, it provides larger coding gains over single-reference AIF and over the multiple reference method without adaptive filtering.

Inter-view prediction in MVC: B-frames

The proposed ARF with support for bi-directional prediction (Section 3.4) is integrated with JMVM 5.0, the JVT reference software dedicated to multiview video coding based on H.264/AVC.

Figure 3.16: Rate-distortion performance of the proposed ARF. (PSNR vs. Kb/frame at QP 37/32/27/22, for the B-views of Race1 (V1, V3, V5, V7) and Rena (V39, V41, V43, ...): H.264 + ARF with Ψ^{L0} and Ψ^{L1} versus H.264/AVC.)

We partition a frame into up to three depth-levels and estimate the corresponding filters. Filters of size 5×5 with circular symmetry are used (ψ_55cir, as in equation (3.8)). We encode frames only at given timestamps using inter-view coding with an IBPBP structure. The interval between two timestamps is 0.5 sec (e.g., inter-view coding at every 12th frame for a frame rate of 25 fps, with no temporal prediction). Without making any modification to the bi-predictive search schemes, we performed simulations based on Method A and Method C in Table 3.1 using iterative search (initial search range ±64, plus 4 iterations with refinement search range ±8). For the sequences tested, the two-set filter design Method C achieves higher coding efficiency than the joint filter design Method A. Thus, in Fig. 3.16 we provide the corresponding RD results of the two-set filter design approach. The four rate points (from low to high) were obtained with QP = 37, 32, 27, and 22.
It can be seen that, for views encoded with bi-directional prediction (inter-view B-frames), a 0.6∼0.8 dB gain is achieved on Race1 when applying the proposed two-set ARF design, while the improvement is about 0.3∼0.4 dB for Rena.

We also tested the fast bi-directional search method proposed for filter design Method C, in which the search over multiple filtered references is only performed within one of the lists, and a constrained search is then performed in the other list based on the filter selection in the first list. Since there are two lists, this procedure can obviously be carried out in two different orders: multiple-reference search in List 0 followed by constrained search in List 1, or vice versa. The multiple-reference search should be performed on the list in which blocks are more likely to select the correct filter, i.e., the list in which choosing different filters results in greater differences in prediction error, so that the filter selection is a more reliable constraint for the search in the other list. From the analysis in Section 2.3 and Fig. 2.9, across different disparity values (depths), the filter responses resulting from smaller focus setting mismatches show greater differences than those resulting from larger focus setting mismatches. However, these filters have higher ±3dB frequencies (i.e., a smaller setting mismatch has less impact on the images). For natural images, energy is mostly concentrated in the low-frequency components; as a result, although the differences in filter responses are larger, they may not correspond to larger differences in the prediction error. On the other hand, a larger focus setting mismatch leads to filters that have a stronger effect on the images, but at different depths the difference between the filter responses is not as significant as for filters resulting from a smaller setting mismatch. Thus, without actually knowing the exact setting difference, we conducted simulations for both search orders. The results on different views of Race1 are shown in Fig. 3.17. The two constrained search schemes lead to about 0.1∼0.2 dB degradation compared to multiple-reference search in both lists.

Figure 3.17: Performance of constrained fast search for ARF-B. (PSNR vs. Kb/frame for Race1 V1, V3, and V5, comparing multiple-reference search in both Ψ^{L0} and Ψ^{L1} against the two constrained-search orders; the panels indicate the relative List 0 / List 1 mismatch, e.g., mismatch L0(V0) > L1(V2) for V1, L0(V2) < L1(V4) for V3, and L0(V4) < L1(V6) for V5.)

Temporal prediction in monoscopic video

In Fig. 3.18, we first provide RD coding results for a monoscopic sequence with strong localized focus changes. From the test sequence fondue-multi, we excerpt and encode frames 126∼170, during which the camera focus changes back and forth rapidly among people at different scene depths. It can be seen that the proposed adaptive filtering approach ARF-f_mb provides about 1 dB gain over H.264/AVC with 1 reference, 0.5 dB gain over H.264/AVC with 5 references, and around 0.2∼0.3 dB gain compared to AIF. The results demonstrate that, when a video undergoes local focus changes, adaptive filtering can be used to provide better references for predictive coding. The proposed method and AIF both achieve higher coding efficiency than the multiple reference method in H.264/AVC.
Our approach is more effective in modeling the localized focus changes, with multiple versions of filtered references generated for motion compensation. As a reference, we provide another set of simulation results on the sequence Raven, which exhibits global motion blur: at different timestamps, the entire frame may be blurred due to camera movement. It can be seen that for this particular case, the two adaptive filtering approaches, ARF and AIF, achieve almost identical coding performance:

Figure 3.18: Rate-distortion comparison of different approaches. (RD curves, PSNR vs. Kb/sec, IPPPPPPP, QP 32/28/24/20, for fondue-multi, 288x288, frames 126∼170, and raven, 480x480, 30 fps, frames 0∼12: ARF-f_mb, AIF, H.264 with 5 references, H.264 with 1 reference.)

about 0.3/0.6 dB gain over H.264 with 5/1 references. This result indicates that the proposed ARF method provides more gain than AIF only when the blurring mismatch is localized. For other video sequences extracted from movie trailers containing some degree of focus mismatch, we also achieved 0.4∼0.7 dB gain over H.264/AVC with 1 reference. The gain is much smaller when applying the proposed method to regular sequences with no focus changes. Thus, in a practical system, it would be desirable to develop tools to detect the existence of focus mismatch (e.g., by applying appropriate criteria to residual blocks), so that mismatch compensation is only explored when a potential coding gain can be achieved.

3.6 Complexity Analysis

There are two main factors in our proposed ARF approach that increase encoding complexity. One is the process of estimating the mismatch kernels, which includes the initial motion/disparity search, the classification of disparity vectors or coefficients of f_mb, the calculation of filter coefficients, and the generation of filtered references; this will be discussed in detail in Section 3.6.1. The other factor that introduces additional computation is the motion/disparity search loop over multiple filtered references. Note that the filters in AIF are applied to different subpixel positions of a single reference frame. When conducting simulations, we measured the motion/disparity estimation time with a profiling tool provided in the JM 10.2 software [42]. When full search is applied, the motion/disparity estimation time of our method (including the initial and final search) is similar to that of H.264/AVC with 5 references, and is about 2.5 times as long as that of AIF (note that AIF also involves an initial search). However, unlike the multiple reference method, in which the references come from different views or different timestamps and thus have different disparity/motion, the references in our system are simply different filtered versions of the same frame. Taking this into account, significant complexity reduction can be achieved by reusing the motion/disparity information: as we proceed from the unfiltered reference (used in the initial search) to the filtered ones (final search), a much smaller search range can be applied around the previously computed motion/disparity. We performed simulations changing the final search range from ±64 to ±4, using the vectors obtained from the initial search as predictors. The observed RD degradation is negligible, while the total encoding time is reduced to about 1/4 [23]. In what follows, we will focus on the complexity associated with the filter design.
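As a rough sketch of this reuse idea (assuming integer-pel vectors; the function names and the ±4 window are illustrative, not the exact JM implementation), the final search over a filtered reference can be restricted to a small window centered on the vector found on the unfiltered reference during the initial pass:

```python
import numpy as np

def sad(block, ref, x, y):
    """Sum of absolute differences between `block` and the co-sized reference
    patch whose top-left corner is (x, y)."""
    h, w = block.shape
    return np.abs(block.astype(int) - ref[y:y + h, x:x + w].astype(int)).sum()

def refinement_search(block, x0, y0, ref_filtered, dv_init, rng=4):
    """Search a filtered reference only in a small window around the vector
    found on the unfiltered reference in the initial pass."""
    H, W = ref_filtered.shape
    h, w = block.shape
    best_vec, best_cost = None, np.inf
    for dy in range(dv_init[1] - rng, dv_init[1] + rng + 1):
        for dx in range(dv_init[0] - rng, dv_init[0] + rng + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x <= W - w and 0 <= y <= H - h:
                cost = sad(block, ref_filtered, x, y)
                if cost < best_cost:
                    best_vec, best_cost = (dx, dy), cost
    return best_vec, best_cost
```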
3.6.1 Complexity of ARF Filter Design

The ARF filter design process can be decomposed into three parts: (a) classification of block-wise parameters (disparity vectors or coefficients of f_mb) for frame partition, (b) calculation of the filter coefficients, and (c) generation of the filtered references.

In the frame partition process (Section 3.3.1), with the maximum number of Gaussian components set to K, classification is performed based on the EM algorithm for k = K, K−1, ..., 1 [4], to cluster the block-wise features. The k which provides the lowest minimum description length (MDL), denoted k′, is selected to build the final model. The complexity of this unsupervised classification is proportional to the maximum number K and to the number of input elements to be classified. To speed up this process, one can consider setting a smaller K or performing classification with sub-sampled vectors and then classifying the corresponding blocks. Since the value of K determines the maximum possible number of depth levels in the classification results, ideally it should be set sequence-dependently by observing the scene of each multiview sequence. In our ARF approach, we do not assume any prior knowledge about the scene, and let the classification tool decide the number of classes (unsupervised), with a proper upper bound K to start with. We have observed that for Ballroom, forcing the classification tool to use a reduced K (from 4 to 3 to 2) led to coding degradation that is not negligible. Another possible way to reduce classification complexity is to use fewer elements for classification. For example, in MVC inter-view prediction, we can sub-sample the disparity field. The degradation due to this sub-sampling will only be significant at detailed object boundaries, where more disparity vectors on smaller block sizes have to be used to differentiate disparity values.

To analyze the complexity of the filter calculation, let C denote the number of distinct filter coefficients in a given filter (e.g., 6 coefficients for ψ_55cir), and P_{D_k} the number of pixels in depth class D_k. As shown by (3.6), constructing the Wiener-Hopf equation for one coefficient in one depth class requires P_{D_k}·C + C (left side) + P_{D_k} (right side) addition/multiplication operations to calculate the sums of products. Thus, for all filters, each with C coefficients, the total number of operations is Σ_k C(P_{D_k}·C + C + P_{D_k}). This value is upper bounded by C(PC + C + P) ≈ PC(C+1), as intra-coded blocks are not assigned to any depth class (i.e., Σ_k P_{D_k} ≤ P). Solving the linear system for each filter with a set of Wiener-Hopf equations (3.6) requires a number of addition/multiplication operations on the order of C² (a linear system with C equations and C unknowns). Thus, designing k′ filters requires k′C² operations. For a typical example with k′ = 4, C = 6 (Section 3.3.2), and P = 640×480, the value k′C² is relatively small compared to the previous term PC(C+1). As a result, the total number of operations in (b) can be approximated by PC(C+1); for the example filter with C = 6, the result is P×42. Note that this is similar to the complexity of applying the different interpolation filters to generate frame data at all subpixel positions for P pixels, as specified by H.264/AVC (which can be estimated to be P×66 [48]).
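For reference, the sketch below illustrates why only C = 6 distinct coefficients enter the Wiener-Hopf system for ψ_55cir: under circular symmetry, taps at the same distance from the center share a coefficient, and a 5×5 support has exactly six distinct radii. The grouping by squared distance and the example coefficient values are our own illustration, not values from the thesis.

```python
import numpy as np

def circular_5x5(coeffs):
    """Expand 6 distinct coefficients into a circularly symmetric 5x5 kernel.

    Taps are grouped by their squared Euclidean distance from the center;
    for a 5x5 support there are exactly six such groups: {0, 1, 2, 4, 5, 8},
    consistent with C = 6 distinct coefficients.
    """
    kernel = np.zeros((5, 5))
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            kernel[dy + 2, dx + 2] = coeffs[dx * dx + dy * dy]
    return kernel

# Example: a lowpass-like kernel with illustrative values only.
c = {0: 0.2, 1: 0.1, 2: 0.06, 4: 0.03, 5: 0.02, 8: 0.01}
k = circular_5x5(c)
print(k.sum())   # kernels estimated by ARF need not sum to exactly 1
```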
Finally, to generate the filtered references, the complexity of the convolution depends on the structure of the filters: to generate one filtered pixel, the number of multiplications is equal to the number of distinct filter coefficients, and the number of additions is equal to the number of elements in the filter minus one. For ψ_55cir, which has 6 distinct coefficients, there are 6 multiplications and 5×5−1 = 24 additions per generated pixel. Thus the total number of operations is k′P×30, which is P×120 when k′ = 4 or P×90 when k′ = 3. Again, the complexity of this step is similar to the calculation of sub-pixel values with the interpolation filters in H.264/AVC.

We measured the execution time by performing ARF encoding in three steps: (i) initial motion/disparity search, (ii) frame partition, filter estimation, and generation of filtered references, and (iii) final encoding with the filtered references. On average, without any complexity reduction method, the processes in (ii) together lead to an increase of about 25% in execution time compared to an H.264/AVC coding process with one reference and a ±64 search range.

3.7 Conclusions

We have proposed an adaptive filtering approach for encoding video content exhibiting localized focus mismatch in different regions within a frame. Our approach first performs an initial motion/disparity estimation to obtain motion/disparity information and establish block correspondence. For inter-view prediction in MVC, disparity vectors are exploited to identify regions suffering from different types of focus mismatch; blocks with similar disparity vectors are grouped into classes (scene-depth levels). For monoscopic video with focus changes, we first capture the local variation of the focus changes by estimating MB-wise filters; MBs with similar filters are then grouped together and associated with the adaptive filters to be designed. In both cases, the EM classification algorithm with a GMM basis, which considers directional variations, is applied; it automatically decides the number of classes based on the MDL criterion. Based on the classification result, a filter which serves as an estimator of the focus mismatch kernel is constructed for each class by minimizing the prediction error energy. This filter design is adaptive to the focus changes between the current frame and the reference frame. Filtered references are generated by applying the estimated filters. For the sequences we tested, the proposed method provides higher coding efficiency than the current H.264/AVC with multiple reference frames and than other adaptive filtering approaches such as AIF. Larger coding gains are achieved for sequences with stronger localized focus mismatch.

We also extended the ARF method to MVC inter-view bi-directional prediction with focus mismatch. We showed that the filter design approach for the averaged bi-predictor leads to a suboptimal solution when combined with conventional bi-predictive search schemes. Taking into account the interaction between the filter design and the bi-predictive search with filtered references, we proposed a filter estimation method which designs a set of depth-related filters for each reference list. Simulation results show that for views coded with inter-view bi-directional prediction, the proposed method provides up to 0.8 dB gain over the current H.264/AVC for the sequences we tested. We also evaluated the efficiency of a constrained search within one of the lists based on the filter selection in the other list; the degradation compared to performing the multiple-reference search in both lists is about 0.1 dB.
Chapter 4

Computationally Efficient ARF for MVC Inter-view Coding Based on Rate-distortion Prediction and Filter Sharing

4.1 Motivation

The gain in coding efficiency from the ARF approach described in Chapter 3 comes at the expense of higher encoding complexity, in particular because it is a two-pass encoding scheme (initial disparity estimation, frame partition, filter estimation, and then final encoding with filtered references). Even though the computation during the final disparity estimation can be significantly reduced, without sacrificing coding efficiency, by reusing the disparity information obtained from the initial disparity estimation [23], the other components can still introduce significant complexity (Section 3.6.1).

Without any prior knowledge about the mismatch, the initial search and filter estimation are necessary in order to adaptively design the filters. However, for inter-view prediction in MVC, there are certain multiview characteristics that can be exploited to apply ARF in a more efficient manner. In Chapter 2, we demonstrated that focus mismatch in multiview systems is a function of the focus setting difference (view-dependency) and the object depths (depth-dependency). Assume that during the multiview video capturing process, the cameras being used, the spacing and relative shooting orientations between cameras, as well as the focus settings (parameters a, f and d), are time-invariant (we will refer to this as a "time-invariant multiview setting"). Then the optical transfer function (OTF) of each view will also be time-invariant. The type of focus mismatch will depend on which view pair is being considered, and will also depend on the scene being captured. For a pair of views with a larger focus setting mismatch (a larger difference in their β curves as functions of disparity/depth), the depth-dependent discrepancy will also be stronger, leading to lower coding efficiency in inter-view prediction and potentially higher coding gain if ARF is applied. For a view pair with no focus setting difference, theoretically the images are only affected by disparity (as described by (2.25)); the potential benefit of ARF would then be limited, since there is no depth-dependent mismatch to address.

The multiview test sequences used by the JVT-MVC group [18] were captured under a fixed (time-invariant) multiview setting, i.e., there are no camera adjustments while the video is being captured. It has been reported [17] that these sequences exhibit inter-view mismatches that are strongly view-dependent. Correspondingly, in MVC inter-view coding, we have observed that the coding gain achieved by our ARF method varies significantly among different multiview sequences. Figures 4.1 and 4.2 illustrate the coding performance of ARF for different views of Ballroom and Race1. For the different QP tested, View 1 of Ballroom has about 0.5 dB gain when ARF is applied; however, the coding gain in View 4 is significant only at high bitrates. For views that exhibit strong focus mismatch with respect to the views used for prediction, for example View 4 of Race1, applying ARF provides significant coding gain (greater than 1 dB for the tested QP settings). On the other hand, for a view with no perceivable focus mismatch with respect to its reference view, encoding using ARF leads to very limited (or virtually no) improvement in coding efficiency.
Considering the additional complexity due to ARF, it is desirable to develop prediction methods that determine whether ARF is beneficial for a given view, so that we can avoid applying ARF to views that would not achieve much coding gain.

Figure 4.1: ARF performance in different views of Ballroom. (RD curves, PSNR vs. Kb/frame, QP 22/27/32/37, for View 1 (reference View 0) and View 4 (reference View 3): ARF vs. H.264 with 1 reference.)

Figure 4.2: ARF performance in different views of Race1. (RD curves, PSNR vs. Kb/frame, QP 22/27/32/37, for View 4 (reference View 3) and View 6 (reference View 5): ARF vs. H.264 with 1 reference.)

Furthermore, since the OTFs are time-invariant, for a given pair of views the mismatch associated with an object at depth Z will be the same at different times. Thus, across time, when the captured scene is composed of similar depth levels, the mismatches present in the images will also be alike, leading to similarity in the ARF filters. We have observed that, for a given view, the estimated filters at different timestamps tend to be very similar. Thus, instead of estimating ARF filters at every timestamp, a set of filters can be re-used over a certain time interval, until there is a significant change in the depth-composition of the scene. With time-invariant camera spacing (fixed b in (2.23)), a given depth Z corresponds to the same disparity δ_Z even for images captured at different timestamps. Thus, to determine changes in depth-composition, we can compare the distributions of the block-wise disparity vectors (DVs) at different timestamps. A more efficient filter estimation/updating scheme can be developed by exploiting this property.

In this chapter, we analyze the performance of ARF in MVC inter-view prediction. Besides the variation in gains for different multiview sequences reported in Section 3.5, we further investigate the variations across different views of a given multiview sequence, and across different times for a given view. We show that the gains in coding efficiency can exhibit significant differences from view to view. Furthermore, the estimated filter coefficients at different timestamps demonstrate strong correlation when the depths of the objects in the scene remain similar. By exploiting the properties derived in Section 2.3 and the performance analysis in Section 4.2, we propose two techniques to design an efficient ARF coding scheme which maintains coding efficiency with significantly reduced complexity: i) view-wise ARF adaptation based on RD-cost prediction, which determines whether ARF is beneficial for a given view, and ii) filter updating based on depth-composition change, in which the same set of filters is used (i.e., no new filters are designed) until there is a significant change in the depth-composition of the scene. Simulation results in Section 4.3 will show that significant complexity savings are possible (e.g., the complete two-pass ARF encoding process needs to be applied to only 20%∼35% of the frames) with negligible quality degradation (e.g., around 0.05 dB loss).
4.2 Computationally Efficient ARF for Inter-view Coding

4.2.1 Rate-distortion Analysis and View-wise ARF Adaptation

In state-of-the-art video coding techniques, high coding efficiency is achieved by rate-distortion optimization. For each macroblock (MB), the coding mode which provides the lowest rate-distortion cost (RD-cost) is selected:

\mathrm{RDcost}_{MB} = \min_{\mathrm{mode}} \left( D_{\mathrm{mode}} + \lambda R_{\mathrm{mode}} \right),        (4.1)

where D_mode is the distortion between the original MB and the reconstructed MB using a given mode, R_mode is the bitrate required to encode that mode, and λ is the Lagrange multiplier. In video coding schemes such as H.264, λ is often chosen as a monotonic function of QP [43,53]: a larger QP (low-bitrate scenario) results in a larger λ, putting a greater penalty on rate. To analyze the performance of ARF, we record the frame-wise RD-cost in the initial disparity estimation (with only the unfiltered reference) and in the final encoding (with the unfiltered and multiple filtered references). The frame-wise RD-cost is calculated by aggregating the MB RD-costs of (4.1) estimated during MB mode decision. We apply ARF to MVC inter-view coding with an IPPPPPPP structure. For multiview data with frame rate 24 fps, inter-view coding is performed only at every 12th frame (0, 12, 24, ...); for data at 30 fps, inter-view coding is performed at every 15th frame (0, 15, 30, ...). These timestamps thus correspond to a half-second interval, and we will call them "anchor timestamps" 0, 1, 2, etc.

Figure 4.3: Frame-wise RD-cost reduction provided by ARF. (RD-cost reduction in % per anchor timestamp, for Race1 at QP 32 (Views 2, 4, 5, 6), Ballroom at QP 22 (Views 2∼6), and Rena at QP 22 (Views 39, 40, 42, 43, 44).)

Fig. 4.3 provides a comparison between the frame-wise RD-cost in the initial and final disparity estimation for different views across anchor timestamps. According to the analytical results in Chapter 2, view pairs with no focus setting differences will not produce the depth-dependent mismatches which ARF is designed to compensate for. However, when there is a focus setting difference, the type and degree of mismatch will depend on the depths of the objects appearing in the scene, which results in variations in ARF performance. The RD-cost reduction in Fig. 4.3 shows behavior consistent with the analytical results. First, the RD-cost reduction achieved by our ARF approach varies significantly from view to view. These variations are consistent with the reported mismatches in the multiview test sequences [17]: for views that exhibit strong focus mismatch with respect to the views used for prediction, for example Views 4 and 5 of Race1, applying ARF can provide more than 10% reduction in RD-cost. On the other hand, for a view with no perceivable focus mismatch with respect to its reference view, encoding using ARF leads to very limited improvement in coding efficiency. Second, views showing higher gains with ARF (i.e., exhibiting focus mismatch) tend to have larger variations in RD-cost reduction across different timestamps, due to the changes in depth-composition. Note that while Fig. 4.3 only depicts results at one QP per sequence, the same behavior (variation across views/time) is observed for all three sequences at QP 22, 27, 32, and 37. However, it is worth mentioning that the RD-cost reductions become smaller as QP increases (low-bitrate scenario).
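A minimal sketch of this bookkeeping (the class and function names are ours; in the actual encoder the per-MB costs come from the JM/JMVM mode decision) aggregates the MB-level costs of (4.1) into the frame-wise RD-cost and computes the percentage reduction between the initial (unfiltered-reference) pass and the final ARF pass; these per-anchor reductions are what the adaptation rule described next averages over the first T anchors.

```python
from dataclasses import dataclass

@dataclass
class MBResult:
    distortion: float   # D_mode of the selected mode
    rate_bits: float    # R_mode of the selected mode

def frame_rd_cost(mb_results, lam):
    """Frame-wise RD-cost: sum of the per-MB costs D + lambda*R of (4.1)."""
    return sum(mb.distortion + lam * mb.rate_bits for mb in mb_results)

def rd_cost_reduction(initial_pass, final_pass, lam):
    """Percentage RD-cost reduction of the final encoding (with filtered
    references) relative to the initial, unfiltered-reference pass."""
    j_init = frame_rd_cost(initial_pass, lam)
    j_final = frame_rd_cost(final_pass, lam)
    return 100.0 * (j_init - j_final) / j_init
```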
However it isworth mentioningthat theRD-cost reductions become smaller as QP increases (low-bitrate scenario). From the analytical results in Section 2.3 and observations above, if for a given view the RD-cost reduction achieved by using ARF is consistently very small over multiple anchor timestamps, it is reasonable to consider not applying ARF for future anchor frames. We propose a predictive ARF adaptation method such that, within a period of N anchor timestamps, the encoder evaluates RD-cost reduction provided by ARF in the firstT anchortimestamps,anddetermineswhethertoapplyARFtotheremaininganchor timestamps. In the next period of N anchors, ARF will be tested again to determine whether it will be efficient. Let μ V (1,T) denote the average RD-cost reduction over anchor 88 timestamps 1 to T (where T ≥1 ) in view V when applying ARF, and σ V (1,T) denote the corresponding standard deviation. This ARF adaptation method can be summarized as: Algorithm 1 ARF adaptation for a given view Divide anchor frames into groups of N anchor frames for Each group of N anchor frames do for 1≤i≤T, i.e. the first T frames in each group do Apply two-pass ARF end for Calculate μ V (1,T) and σ V (1,T) if μ V (1,T) <κ andσ V (1,T) <ǫ then for T +1≤i≤N do Conventional encoding (No ARF for the following frames) end for else for T +1≤i≤N do Apply two-pass ARF end for end if end for We call this approach View-wise ARF adaptation based on RD-cost prediction. In Section 4.3, simulation results will be provided with selected settings of N, T, κ, andǫ. 4.2.2 FiltersCorrelationandFilterUpdatingusingDepthComposition From the analysis in Section 2.3, if multiview setting is time-invariant, the type of mis- match between frames from a given pair of views will depend on the depth-composition within the captured scene. Parts of the scene at depth Z will produce the same type of mismatch at different capturing timestamps, leading to similarity in the correspond- ing estimated ARF filters. To further investigate how filters vary over time, we perform correlation analysis: For a given view, we concatenate filter coefficients from filters es- timated at anchor timestamp t 1 to form a coefficient vector A t 1 , and compared it with 89 2 3 4 5 0.7 0.75 0.8 0.85 0.9 0.95 1 Anchor timestamp Correlation coefficient to Anchor timestamp 1 Ballroom QP27 View 2 View 4 View 6 2 3 4 5 0.7 0.75 0.8 0.85 0.9 0.95 1 Anchor timestamp Correlation coefficient to Anchor timestamp 1 Race1 QP27 View 2 View 4 View 6 2 3 4 5 0.7 0.75 0.8 0.85 0.9 0.95 1 Anchor timestamp Correlation coefficient to Anchor timestamp 1 Rena QP27 View 40 View 42 View 44 Figure 4.4: Correlation of estimated filter coefficients at different timestamps the corresponding filter coefficient vector A t i at another anchor timestamp t i . Fig. 4.4 provides results for some of the analyzed sequences. For Ballroom, there are multiple dancing couples moving in the scene. A couple may appear in some frames at depth Z 0 , while in the preceding frames there is nothing at this particular depth. Due tosuch depth-composition difference, thefilter correlation has larger variation as compared to the other sequences tested. On the other hand, in Rena, the scene composition is consistent across time: A foreground girl and the background remain at same distances to the camera for the entire sequence. The estimated filter coefficients demonstrateveryhighcorrelation evenover 14anchortimestamps(about200 frames). The most interesting case is Race1. 
In this sequence, the content of the scene changes due to the shifting viewing angle of the camera set and the carts driving along the runway. However, frames at different timestamps mostly cover the same range of depths (refer to Fig. 4.7). That is to say, at different timestamps no new "depth level" is introduced (in contrast to the Ballroom case). As a result, the filters still demonstrate very strong similarity. These results are consistent with the depth-dependency property derived in Chapter 2.

When the filters are highly correlated over a certain time interval, e.g., when the depth-composition of the scene remains similar, it is not necessary to estimate them at every single timestamp. Applying the same set of filters over multiple timestamps reduces the effort spent on the initial disparity search and the filter estimation, while the coding efficiency can be preserved. Moreover, when filters are re-used across time, we do not need to transmit filter coefficients. Our analysis suggests that the time intervals during which the filters are re-used, and the instants at which the filter coefficients are re-estimated/updated, can be adaptively determined by comparing the depth-composition at different timestamps. In ARF, disparity vectors are exploited to partition frames into depth-levels by performing classification based on a GMM. To determine whether there has been a change in depth-composition, we compare the GMM classification results at different timestamps. Let μ^{GMM}_{V,i}(m) denote the mean of Gaussian component m in the DV classification for frame i of view V, and let P^{GMM}_{V,i}(m) be the corresponding percentage of blocks classified into that class. A Gaussian component is defined as "not being covered" by a reference timestamp r if its mean is at least D pixels away from every Gaussian mean at timestamp r, μ^{GMM}_{V,r}(n). If the sum of the P^{GMM}_{V,i}(m) of all these "not covered" components exceeds a certain percentage P, we declare a "change in depth composition" and apply the two-pass ARF coding for the current frame to update the filters; otherwise the filters estimated at the reference timestamp are re-used:

W_{V,i}(m) = \begin{cases} 1, & \text{if } \forall n:\ |\mu^{GMM}_{V,i}(m) - \mu^{GMM}_{V,r}(n)| > D \\ 0, & \text{otherwise} \end{cases}        (4.2)

\text{If } \sum_m W_{V,i}(m)\, P^{GMM}_{V,i}(m) > P \ \Rightarrow\ \text{apply two-pass ARF; otherwise} \ \Rightarrow\ \text{re-use filters.}        (4.3)

However, there is an issue preventing us from directly using the above scheme: for the current frame being considered, at the point when the decision to update filters or not is made, we do not yet have its disparity information. Disparity estimation should be performed only after we have decided to apply two-pass ARF; otherwise the filters are re-used and the initial estimation is skipped. To overcome this problem, we refer to a view earlier in the coding order for GMM disparity information. For example, when encoding View 2, the DV GMM for frames in View 1 is used to determine whether there has been a change in depth-composition, and then to decide whether to update the filters. (Note that in this scheme, we cannot apply filter re-use to the first view being inter-view coded, as there is no reference disparity.) This method is most suitable for 1D parallel camera arrangements with equal spacing between cameras, as the disparity values of different views at the same timestamp should then be very similar. We denote this method filter updating based on depth-composition change.
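A minimal sketch of the decision in (4.2)-(4.3), assuming one-dimensional (horizontal) disparity means and using the D = 5 pixel, P = 15% thresholds adopted later; the function name and data layout are illustrative:

```python
def depth_composition_changed(means_cur, weights_cur, means_ref,
                              d_thresh=5.0, p_thresh=0.15):
    """Decision rule of (4.2)-(4.3), sketched for 1-D (horizontal) disparity.

    means_cur, weights_cur : Gaussian means and block percentages of the DV
                             classification for the current timestamp.
    means_ref              : Gaussian means at the reference timestamp r.
    Returns True when the 'not covered' components account for more than
    p_thresh of the blocks, i.e. the two-pass ARF should be run to update
    the filters; False means the reference-timestamp filters are re-used.
    """
    not_covered = 0.0
    for mu, p in zip(means_cur, weights_cur):
        # W_{V,i}(m) = 1 iff mu is more than d_thresh away from every
        # Gaussian mean of the reference timestamp.
        if all(abs(mu - mu_ref) > d_thresh for mu_ref in means_ref):
            not_covered += p
    return not_covered > p_thresh

# Illustrative values: a new disparity cluster around 22 pixels appears and
# holds 20% of the blocks, so the filters would be updated.
print(depth_composition_changed([3.0, 10.0, 22.0], [0.5, 0.3, 0.2],
                                [2.5, 9.0]))
```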
Fig. 4.5 shows some filter re-use results obtained with D = 5 pixels and P = 15% in (4.2)-(4.3). It can be seen that the coding efficiency (reduction in RD-cost) is well preserved while no new filters are estimated. Note that since the RD-cost is calculated by accumulating MB-wise RD-costs, the potential bit saving achieved by not sending filter coefficients is not reflected in Fig. 4.5.

Figure 4.5: Examples of RD performance when filters are re-used. (ARF RD-cost reduction in % per anchor timestamp, comparing two-pass ARF with filter re-use: Race1 View 2 at QP 22 re-using the anchor-4 filters, Race1 View 5 at QP 22 re-using the anchor-14 filters, and Rena View 41 at QP 22 re-using the anchor-4 filters.)

Combining this with the ARF adaptation described in Section 4.2.1, an efficient ARF coding scheme is summarized in Algorithm 2. In this new scheme, within a period of N anchor timestamps, after evaluating ARF for the first T anchor timestamps, there are three possible encoding options for the remaining frames: (i) if the encoder decides not to use ARF, the remaining frames are encoded normally, followed by a GMM classification of the DVs to provide disparity information for the next view; if it instead decides to apply ARF, (4.2) and (4.3) are used to choose between (ii) simply re-using the filters or (iii) performing the two-pass ARF to update the filters. Frames encoded by re-using filters do not need to undergo the initial disparity search and filter estimation; they simply perform the disparity search with the filtered references. Classification of the DVs is then applied to generate disparity information for the next view.

Algorithm 2: Efficient ARF coding scheme for MVC inter-view prediction (V is the total number of views, View 0 ∼ View V−1; inter-view coding is performed on the anchor frames of View 1 ∼ View V−1)
  for 1 ≤ v ≤ V−1 do
    Divide anchor frames into groups of N anchor frames
    for each group of N anchor frames do
      for 1 ≤ i ≤ T, i.e., the first T frames in each group do
        Apply two-pass ARF: initial disparity estimation, classification, filter estimation, final encoding
      end for
      Calculate μ_V(1,T) and σ_V(1,T)
      if μ_V(1,T) < κ and σ_V(1,T) < ε then
        for T+1 ≤ i ≤ N do
          Conventional encoding (no ARF for the following frames)
          GMM classification of disparity vectors
        end for
      else
        if v = 1, i.e., the first inter-view coded view then
          for T+1 ≤ i ≤ N do
            Apply two-pass ARF to update filters
          end for
        else
          Filter reference timestamp r = T
          for T+1 ≤ i ≤ N do
            Calculate W_{V−1,i}(m) as in (4.2)
            if Σ_m W_{V−1,i}(m) · P^{GMM}_{V−1,i}(m) > P then
              Apply two-pass ARF to update filters
              Filter reference timestamp r = i
            else
              Re-use the filters of timestamp r
              Encode with filtered references
              GMM classification of disparity vectors
            end if
          end for
        end if
      end if
    end for
  end for

4.3 Simulation Results

We conduct simulations with the proposed efficient ARF techniques. As described in Section 4.2.1, IPPPPPPP inter-view coding is performed at anchor timestamps with a half-second interval between two timestamps. We implemented the ARF coding scheme on top of the H.264/AVC framework using the reference software JMVM 5.0. We set N = 20 and T = 4, i.e., for a 10-second period (20 anchor timestamps), the anchor timestamps in the first 2 seconds are encoded with ARF to evaluate the RD-cost reduction. To set the thresholds κ and ε, we observed the ARF performance for sequences with virtually no perceivable focus mismatch, and thus very limited improvement, such as Exit and Uli. The RD-cost reductions achieved for these sequences are in the range 0%∼2%.
Thus we set κ = 2% and ε = 1%: the encoder disables ARF coding if it observes that the average RD-cost reduction over the first 4 anchor frames is less than 2% with a variation of less than 1%.

For the remaining frames which require filtering, (4.2) and (4.3) are used to determine whether filters will be re-used or updated (thus performing the entire two-pass ARF). We tested different values of the parameters D and P, which resulted in differences in how often the filters are updated. For example, for views in Ballroom with D = 5 pixels, when changing P from 15% to 10% to 5%, over a period of 20 anchor timestamps the frequency at which the filters are updated increases from twice to four times to five/six times. Updating the filters less frequently leads to larger degradation in coding efficiency compared to performing two-pass ARF at every anchor timestamp. In Fig. 4.6, we present encoding results with D = 5 pixels and P = 15%, for which the filters are only updated at most twice (excluding the initial ARF testing period, i.e., the first 4 anchors of each 20-anchor period) over the timestamps tested, as indicated in Tables 4.1, 4.2, and 4.3. It can be seen from the results that the proposed techniques are very efficient in preserving the ARF coding gains while the complexity is significantly reduced. After zooming in on the RD curves, we observe a degradation of less than 0.05 dB compared to the previously proposed ARF results. (Note that the degradation will be even smaller when the filters are updated more frequently than in this example, e.g., when P < 15%.)

Figure 4.6: Encoding results of the proposed coding scheme (QP = 37, 32, 27, 22). (RD curves, PSNR vs. Kb/frame, for Ballroom V0∼V7, Race1 V0∼V7, and Rena V38∼V45: previously proposed ARF (VCIP 07), the new efficient ARF, and H.264 with 1 reference.)

Table 4.1: Encoding selection of the proposed efficient ARF: Ballroom (20 anchors per view; the first four, a1∼a4, are encoded with two-pass ARF. If the remaining anchors are determined to need filtering, the anchors at which the filters are updated are listed; all other anchors are encoded by re-using the filters.)

  QP 22 — Filtering for a5∼a20? V1: Yes (updated at every anchor), V2: Yes (a5, a10), V3: No, V4: Yes (a5, a10), V5: Yes (a5, a10), V6: Yes (a6, a10), V7: No.
  QP 27 — V1: Yes (every anchor), V2: Yes (a5, a8), V3: No, V4: Yes (a5, a8), V5: Yes (a5, a8), V6: Yes (a5, a8), V7: No.
  QP 32 — V1: Yes (every anchor), V2: Yes (a5, a8), V3: No, V4: No, V5: Yes (a5, a9), V6: No, V7: No.
  QP 37 — V1: Yes (every anchor), V2∼V7: No.

Across the different QPs, the view-wise ARF adaptation method successfully identifies views for which limited coding gain would be achieved if ARF were applied. With higher QP, ARF is applied to fewer views, since the achievable gains tend to be smaller in the low-bitrate scenario. For example, in Table 4.1, for Ballroom View 4, ARF is applied only at QP 22 and 27. This behavior matches well with the coding results provided in Fig. 4.1.
Table 4.2: Encoding selection of the proposed efficient ARF: Rena (20 anchors per view; for the views that require filtering, since the depth composition remains unchanged, the filters are re-used throughout the remaining anchors.)

  QP 22 — Filtering for a5∼a20? V39: No, V40: Yes, V41: Yes, V42: Yes, V43: No, V44: Yes, V45: No.
  QP 27 — V39: No, V40: Yes, V41: Yes, V42: Yes, V43: Yes, V44: Yes, V45: No.
  QP 32 — V39: No, V40: Yes, V41: No, V42: Yes, V43: No, V44: No, V45: No.
  QP 37 — V39: No, V40: Yes, V41: No, V42: Yes, V43: No, V44: No, V45: No.

As for the views that do utilize ARF, most of the frames are encoded using one-pass coding with filtered references constructed from already estimated filters.

Table 4.3: Encoding selection of the proposed efficient ARF: Race1 (35 anchors per view; a1∼a4 and a21∼a24 are encoded with two-pass ARF. If the remaining anchors are determined to need filtering, the anchors at which the filters are updated are listed; all other anchors are encoded by re-using the filters.)

  QP 22 — Filtering for a5∼a20? V1: Yes (every anchor), V2: Yes (a14), V3: Yes (a14), V4: Yes (a14), V5: Yes (a14), V6: No, V7: Yes (a15). Filtering for a25∼a35? V1: Yes (every anchor), V2∼V5 and V7: Yes (filters re-used), V6: No.
  QP 27 — Filtering for a5∼a20? V1∼V5: Yes, V6: No, V7: Yes; filter updates: every anchor for V1, plus updates at a16 for two of the other filtered views. Filtering for a25∼a35? V1∼V5: Yes, V6: No, V7: Yes; updates: every anchor for V1 only.
  QP 32 — Filtering for a5∼a20? V1∼V5: Yes, V6: No, V7: Yes; filter updates: every anchor for V1, plus one update at a16. Filtering for a25∼a35? V1∼V5: Yes, V6: No, V7: Yes; updates: every anchor for V1 only.
  QP 37 — Filtering for a5∼a20? V1∼V5: Yes, V6: No, V7: Yes; filter updates: every anchor for V1, plus updates at a15 for three of the other filtered views. Filtering for a25∼a35? V1∼V5: Yes, V6: No, V7: Yes; updates: every anchor for V1 only.

4.4 Conclusions

In this chapter, by exploiting the focus mismatch characteristics derived from the theoretical analysis in Section 2.3, we propose techniques to efficiently apply ARF to inter-view prediction in MVC. We analyze the performance of ARF in MVC inter-view prediction. For different views, the gains in coding efficiency demonstrate strong view-wise variation. For a given view, the estimated filter coefficients at different timestamps exhibit strong correlation when the depth-composition of the scene remains similar. The observed properties conform with the results of Chapter 2. The two techniques introduced in this chapter are i) view-wise ARF adaptation based on RD-cost prediction, which determines whether ARF is beneficial for a given view, and ii) filter updating based on depth-composition change, in which the same set of filters is used until there is a significant change in the depth-composition of the scene. Simulation results show that significant complexity reduction can be achieved, since the complete two-pass ARF encoding process needs to be applied to only 20%∼35% of the frames, while the coding efficiency is not affected significantly (e.g., around 0.05 dB loss).

Figure 4.7: Images at different anchor timestamps for the sequences tested. (Ballroom at anchors 4, 5, and 10; Rena at anchors 4, 10, and 15; Race1 at anchors 4, 9, and 19.)

Chapter 5

Predictive Fast Motion/disparity Search for Multiview Video Coding

5.1 Introduction

As described in Chapter 1, temporal and inter-view redundancy can be exploited in MVC by applying block-based motion/disparity compensated prediction (joint MCP/DCP). As before, let I_C(x,y) denote the luminance pixel value of the current frame to be encoded, with (x,y) representing the pixel position within the frame, and let I_R(x,y) denote the luminance pixel value of the reconstructed reference frame.
In block-based motion/disparity compensation, for a given block of size M×N at position (x_o, y_o) within I_C, the block matching procedure evaluates different displacement vectors (dx_i, dy_i) to find the best predictor from the reference I_R, i.e., the one minimizing a cost function. The most commonly used cost function is the sum of absolute differences (SAD). The block matching procedure with SAD as the cost function can be summarized as:

\min_{(dx_i, dy_i)} \sum_{y=y_o}^{y_o+M-1}\ \sum_{x=x_o}^{x_o+N-1} \left| I_C(x,y) - I_R(x+dx_i,\, y+dy_i) \right|        (5.1)

For MVC structures with joint MCP/DCP, such as the one depicted in Fig. 1.2 and others in [3,12,34,56], higher coding efficiency is achieved by independently searching for the best predictor within each reference (temporal or inter-view) and selecting the one with the lower matching cost. While consistent coding gain is obtained, these coding schemes require much higher coding complexity than simulcast: for a frame utilizing joint MCP/DCP, both motion estimation (ME) and disparity estimation (DE) have to be performed, leading to additional search complexity.

To reduce the complexity associated with joint MCP/DCP, fast search methods have been proposed which exploit the relationship between the motion and disparity fields. They can be classified into two categories: 1. joint motion and disparity field estimation, and 2. predictive fast search algorithms.

In the first category, a regularization constraint is utilized to jointly estimate the motion and disparity fields. Fig. 5.1 illustrates a possible regularization constraint when a physical point P appears in two views at different times (i.e., P is not occluded).

Figure 5.1: The regularization constraint on motion/disparity fields. (Two views V0 and V1 at times t0 and t1; the pixels P1, P2, P3, P4 corresponding to the same scene point P are linked by the motion vectors MV_V0, MV_V1 and the disparity vectors DV_t0, DV_t1.)

In Fig. 5.1, P1, P2, P3, P4 are the pixels corresponding to the same point P in the scene. MV_V0 and MV_V1 are the displacements from P1 to P3 and from P2 to P4 (motion vectors), and DV_t0 and DV_t1 are the displacements from P1 to P2 and from P3 to P4 (disparity vectors). The depicted regularization constraint can also be represented as:

MV_{V0} + DV_{t1} = DV_{t0} + MV_{V1}        (5.2)

Instead of independently searching for motion and disparity predictors, joint estimation of both fields can be performed by imposing such regularization constraints on the block matching procedure [57,59]. For example, in Fig. 5.1, after obtaining DV_t0 and MV_V0, the remaining two vectors DV_t1 and MV_V1 can be searched jointly by selecting the pair of vectors which provides the minimum matching cost while satisfying (5.2). Although the search complexity can be reduced, these methods have the drawback that, due to the constrained search, the obtained predictors might not be as good as those found by independent motion and disparity search. As a result, applying such joint estimation methods causes some noticeable degradation in coding efficiency.

For the second category (predictive fast search), the correlation between disparity fields at different timestamps, or between motion fields in different views, is exploited. For example, in [11], at certain timestamps a single global disparity vector between V0 and V1 is estimated by finding the most dominant disparity value among blocks corresponding to static background. To perform ME for a current block in a frame of V1, this global disparity is used to first locate the corresponding block in the frame of V0. The MV of this corresponding block is then chosen as a search candidate for the ME of the current block, and a smaller search range is applied.
However, a single global disparity vector may not be accurate enough to provide inter-view correspondence for all blocks within a frame. In particular, objects closer to the cameras have larger disparities than the background. As a result, MV candidates obtained with this method can be less reliable for foreground regions.

In this chapter, we propose predictive fast search algorithms (category 2) that can be used when either the motion or the disparity field is available and we wish to estimate the other field efficiently. For a current frame to be encoded, if, say, its disparity field is available after performing DE, and the motion field is available for frames in other views, then instead of using a global disparity vector as in [11], our method uses the estimated block-wise disparity field to locate the corresponding blocks in other views and exploits their MVs as candidates. Likewise, it is also possible to obtain good disparity candidates when the motion field is available for the current frame after ME and the disparity field is available for frames at other timestamps. Furthermore, we construct a model and analytically demonstrate how mismatches, such as illumination changes, affect the accuracy of the first estimated displacement field (motion or disparity), which in turn affects the reliability of the candidate vectors obtained with our methods. We also investigate search strategies that can better exploit the characteristics of the candidate vectors obtained with our method. The rest of this chapter is organized as follows. In Section 5.2, we introduce the proposed predictive search algorithms with candidate vectors obtained via local motion/disparity information (in lieu of a single global vector as in [11]). The analytical model relating displacement estimation error to illumination change is provided in Section 5.3. In Section 5.4, the simulation design, including the different search strategies, and the coding results across different bitrates are discussed; in that section we also discuss the bit allocation issue [22] and demonstrate that improved overall performance is achievable when simple bit allocation rules are used. Finally, conclusions are presented in Section 5.5.

5.2 Predictive Fast Search with Candidate Vectors Obtained from Local Motion/Disparity Information

In this section, we consider the problem of complexity reduction in motion/disparity estimation for frames using joint MCP/DCP, by proposing search algorithms that, after either the motion field or the disparity field has been estimated, obtain with low complexity a good set of candidate vectors for the other field. From Fig. 5.1, with small temporal and inter-view distances, MV_V0 ≈ MV_V1 and DV_t0 ≈ DV_t1. The main novelty of our method is that we explicitly exploit the fact that MV_V0 and MV_V1 can be related via DV_t1, while DV_t0 and DV_t1 can be related via MV_V1. The proposed scheme performs predictive motion search from view to view and predictive disparity search from one timestamp to the next. Details are provided in the next subsections.

5.2.1 Predictive Search I: Disparity then Motion (DtM)

In standard video coding, fast motion search algorithms can be designed based on the assumption that the MVs of neighboring blocks are highly correlated. We can use the ME results of causal neighboring blocks to obtain a set of MV candidates that are then used to initialize a search with a reduced search range for the current block [7,44].
In an MVC scenario, since there are multiple cameras capturing the same scene, it is possible to predict the motion field of one view using another view's motion as a reference [11]. We propose that, for a given frame, once its disparity field has been estimated, good search candidates for the motion field can be found with very low complexity. Fig. 5.2 depicts the basic procedure for predicting the motion field from view to view by identifying MV candidates. In Fig. 5.2(a), the vector field represented with solid arrows is estimated first. When encoding a non-anchor frame, DE is performed first. For a given block in a non-anchor frame of the current view, Fig. 5.2(b), we track along its disparity vector to its corresponding block in the reference view. Depending on where this block is located, it can overlap with up to 4 blocks in the reference view, each having an MV. These can serve as candidates, denoted MV_1, MV_2, MV_3, MV_4 in Fig. 5.2(b), and provide initialization points for the search. After predicting the motion field of the current view, the same procedure is used to predict the motion field of the next view, using the current view as the reference. We denote this the Disparity then Motion (DtM) scheme.

Figure 5.2: (a) Left: the structure of predicting the motion field from the reference view (DtM). (b) Right: obtaining MV candidates to predict the motion field. (Views V0∼V3 with DtM applied from view to view; the current block's disparity vector points into the reference view, where the overlapped blocks 1∼4 provide the candidate motion vectors, while blocks 5∼8 are the causal neighbors in the current view.)

To assess the performance of the additional MV candidates obtained using DtM, we define two sets of candidates (refer to Fig. 5.2(b)): A = {MV_5, MV_6, MV_7, MV_8, 0}, which contains the motion vectors of blocks in a causal neighborhood (as typically used in many fast motion estimation algorithms) together with the zero vector, and B = A ∪ {MV_1, MV_2, MV_3, MV_4}, which also includes the additional candidates obtained from the neighboring view. For each block in a non-anchor frame, we choose the candidate that provides the minimum SAD from each of the sets. The residual image PSNR values are compared in Table 5.1. For all three test sequences (Aqua, Ballroom, and ST), it can be seen that higher quality is achieved when the additional candidates provided by the DtM scheme are considered. One of the drawbacks of fast motion search using neighboring blocks is that the search may only identify a motion vector corresponding to a local minimum when the current block happens to have an MV that differs from those of its neighbors; a typical example is a block located close to the boundary of a moving object. Our DtM approach helps to alleviate this problem by providing a block correspondence in the other view from which additional MV candidates are obtained.

Table 5.1: Comparison of different sets of MV candidates — PSNR of the residual images (dB)

  Aqua:      V1 — Set A: 29.47, Set B: 30.65;  V0 — Set A: 29.54, Set B: 30.79
  Ballroom:  V1 — Set A: 28.45, Set B: 29.25;  V2 — Set A: 28.29, Set B: 29.16;  V3 — Set A: 27.94, Set B: 28.84
  ST:        V1 — Set A: 30.70, Set B: 32.74;  V2 — Set A: 31.14, Set B: 32.93
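A minimal sketch of the DtM candidate gathering (assuming integer-pel vectors, a regular square block grid, and a dictionary of reference-view motion vectors; the function names are ours), followed by a simple SAD-based selection among the candidates. For the comparison in Table 5.1, the candidate list would be set A (causal-neighbor MVs plus the zero vector) or set B (A plus the DtM candidates returned below).

```python
import numpy as np

def dtm_mv_candidates(block_x, block_y, bsize, dv, ref_view_mvs):
    """Collect the (up to) four MV candidates of the DtM scheme.

    dv           : (dx, dy) disparity vector of the current block, pointing
                   into the reference view.
    ref_view_mvs : dict mapping (block_col, block_row) on the reference
                   view's block grid to that block's motion vector.
    """
    xr, yr = block_x + dv[0], block_y + dv[1]   # disparity-compensated block
    cols = {xr // bsize, (xr + bsize - 1) // bsize}
    rows = {yr // bsize, (yr + bsize - 1) // bsize}
    cands = []
    for r in rows:
        for c in cols:
            mv = ref_view_mvs.get((c, r))
            if mv is not None and mv not in cands:
                cands.append(mv)
    return cands   # at most 4 distinct vectors (MV_1 ... MV_4)

def best_candidate(block, x0, y0, ref_frame, candidates):
    """Pick, among the candidate MVs, the one with minimum SAD; a full search
    can then be replaced by a small refinement around this vector."""
    h, w = block.shape
    best_mv, best_cost = None, np.inf
    for (dx, dy) in candidates:
        x, y = x0 + dx, y0 + dy
        if 0 <= x <= ref_frame.shape[1] - w and 0 <= y <= ref_frame.shape[0] - h:
            cost = np.abs(block.astype(int) -
                          ref_frame[y:y + h, x:x + w].astype(int)).sum()
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```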
5.3(b), we track along its motion vector to its corresponding block at time t−1. This would give us at most 4 different candidates DV 1 ,DV 2 ,DV 3 ,DV 4 to initialize the disparity search. After predicting the disparity field at time t, this disparity field will be used as the reference to predict the disparity field at time t+1. We denote this as the Motion then Disparity (MtD) scheme. 1 2 3 4 V0 V1 V2 V3 time t−1 t 8 5 6 7 disparity vector Candidates for Motion vector Figure 5.3: (a)Left: The structure of predicting the disparity field from a time instance (MtD) (b)Right: Obtaining DV candidates to predict the disparity field 5.3 Displacement Estimation under Illumination Changes The accuracy of the first estimated field will affect the reliability of the obtained candi- dates: If the first field failed to find accurate corresponding blocks, the candidate vectors areless trustworthy asthey arefromregions that donot correspondtotheblock wewant to encode. There are two primary factors that will affect the accuracy of the estimated block correspondence. Firstly, instead of pixel-based estimation, motion/disparity com- pensated prediction is performed in a block-based manner. Different block sizes/shapes andthesignalcharacteristicswithintheblockwillhaveinfluenceonthematchingprocess. 107 Secondly, the accuracy can also be affected by mismatches other than simple displace- ment, For example, it has been observed that frames from different views in multiview systems can suffer from illumination and focus mismatches [17]. Inter-view disparity compensation could fail to obtain reliable correspondence due to these mismatches. In [31], a block-based illumination model is proposed by first decomposing a block signal into its mean and a mean-removed signal. In this section, we adopt such decom- position to demonstrate how the accuracy of the estimated displacement will be affected by theillumination mismatch whenusingSADas cost function, thus justifyingtheuseof mean-removed block matching utilized in [31]. Moreover, we will derive specific proper- ties for planar regions which relate the estimation error to the signal characteristics such as mean and gradient, and to the block matching size. Following from (5.1), we denote the current block to be matched as B C , and the i th candidate reference block with displacement vector (dx i ,dy i ) as B i R . That is: For 0≤x≤N−1, 0≤y≤M−1: B C (x,y) = I C (x o +x,y o +y) B i R (x,y) = I R (x o +x+dx i ,y o +y+dy i ) (5.3) A block of pixels can be written as the sum of its mean μ and a zero-mean signal w [31] (which we denote as a mean-removed structure), i.e., B C (x,y) =μ C +w C (x,y) and B i R (x,y) =μ i R +w i R (x,y), (5.4) The block matching procedure with SAD cost function can then be represented as: 108 min i M−1 X y=0 N−1 X x=0 B C (x,y)−B i R (x,y) min i M−1 X y=0 N−1 X x=0 (μ C −μ i R )+ w C (x,y)−w i R (x,y) (5.5) The reference block which provides the minimum SAD is denoted asB ∗ R . We assume that, when there is no other type of mismatch between I C and I R , and no occlusion for B C inI R , this reference blockB ∗ R gives the most accurate correspondence toB C , i.e. the difference of means is Δμ i R = (μ C −μ ∗ R )≈ 0 and the mean-removed structure difference is (w C −w ∗ R )≈0 (zero matrix). 
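The decomposition in (5.4) and the two matching costs can be reproduced in a few lines. The sketch below is a minimal numerical illustration (the 8×8 block contents and the offset value are synthetic, not taken from our sequences): under a purely additive illumination offset, the plain SAD of (5.5) is inflated while the mean-removed SAD advocated in [31] is essentially unchanged.

```python
import numpy as np

def decompose(block):
    """Split a block into its mean and the zero-mean (mean-removed) structure, as in (5.4)."""
    mu = block.mean()
    return mu, block - mu

def sad(b_c, b_r):
    """Plain SAD cost, as in (5.5)."""
    return np.abs(b_c - b_r).sum()

def mean_removed_sad(b_c, b_r):
    """SAD computed on the mean-removed structures only."""
    _, w_c = decompose(b_c)
    _, w_r = decompose(b_r)
    return np.abs(w_c - w_r).sum()

# A synthetic 8x8 block and its exact correspondence in the reference frame.
rng = np.random.default_rng(0)
b_c = rng.integers(0, 256, (8, 8)).astype(float)
b_match = b_c.copy()

offset = 40.0  # an additive illumination offset C applied to the reference (S = 1)
print(sad(b_c, b_match + offset))               # inflated to |C| * M * N = 2560
print(mean_removed_sad(b_c, b_match + offset))  # ~0: the structure match is unaffected
```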
Now, assume that there are illumination mismatches between the current frame and the reference frame, such that in the neighborhood of the most accurate correspondence $B_R^*$, the reference blocks are affected by a local illumination change:

\[
\hat{B}_R^i(x,y) = S \cdot B_R^i(x,y) + C = (S \cdot \mu_R^i + C) + S \cdot w_R^i(x,y),
\tag{5.6}
\]

where, in (5.6), $S$ is a multiplicative scale factor and $C$ is an additive offset. Substituting (5.6) into (5.5), we get:

\[
\min_i \sum_{y=0}^{M-1}\sum_{x=0}^{N-1} \bigl| \bigl( \mu_C - S\cdot\mu_R^i - C \bigr) + \bigl( w_C(x,y) - S\cdot w_R^i(x,y) \bigr) \bigr|
\tag{5.7}
\]

The block matching with the SAD cost function will be affected by $S$ and $C$. For the special case with only an additive illumination offset, i.e., $S = 1$, (5.7) simplifies to:

\[
\min_i \sum_{y=0}^{M-1}\sum_{x=0}^{N-1} \bigl| \bigl( \mu_C - \mu_R^i - C \bigr) + \bigl( w_C(x,y) - w_R^i(x,y) \bigr) \bigr|
\tag{5.8}
\]

In (5.8), a larger $|C|$ will have a stronger effect on the estimated displacement, leading to less reliable block correspondence. While the matching of the means is affected by the offset $C$, matching of the mean-removed structures remains the same. Thus, for a localized additive illumination offset, we can modify the cost function to a mean-removed SAD to preserve block correspondence when performing block matching [31].

In what follows, we discuss an example in the case of homogeneous/smooth regions in order to quantify how the illumination parameters $S$ and $C$ affect the accuracy of the estimated displacements.

Example: homogeneous/smoothly-varying regions. For a homogeneous/smoothly varying region, we assume that the pixel values fit closely to a planar function such that (1) $B_C(x,y) = ax + by + d$, and (2) the corresponding region in the reference frame, without any illumination change, is a shifted version with displacement vector $(Dx, Dy)$, i.e., $B_R^i(x,y) = a[(x - Dx) + dx_i] + b[(y - Dy) + dy_i] + d$. In this scenario, we have $w_R^i = w_C$. By replacing $\mu_R^i$ with $(\mu_C - \Delta\mu_R^i)$, the block matching of (5.7) can then be approximated as:

\[
\min_i \sum_{y=0}^{M-1}\sum_{x=0}^{N-1} \bigl| (1-S)\mu_C + S\cdot\Delta\mu_R^i - C + \bigl( w_C(x,y) - S\cdot w_C(x,y) \bigr) \bigr|
= \min_i \sum_{y=0}^{M-1}\sum_{x=0}^{N-1} \bigl| (1-S)\bigl( \mu_C + w_C(x,y) \bigr) - C + S\cdot\Delta\mu_R^i \bigr|
\tag{5.9}
\]

Given that for $\sum_k |a_k - X|$ the minimum occurs when $X$ equals the median of the sequence $\{a_k\}$, the block matching in (5.9) will find the block with mean difference closest to the following value:

\[
\Delta\mu_R = -\frac{1-S}{S}\cdot \operatorname{median}\bigl( \mu_C + w_C(x,y) \bigr) + \frac{C}{S}
\tag{5.10}
\]

Since the pixel positions $x$ and $y$ are both equally spaced, $\operatorname{median}\{ax+by+d \mid 0\le x\le N-1,\ 0\le y\le M-1\} = \operatorname{mean}\{ax+by+d \mid 0\le x\le N-1,\ 0\le y\le M-1\} = \mu_C$. Thus, from (5.10), the block matching will find a block with:

\[
\Delta\mu_R = -\frac{1-S}{S}\,\mu_C + \frac{C}{S}, \quad \text{for planar regions}
\tag{5.11}
\]

$\mu_C$ and $\Delta\mu_R$ at a given $(dx, dy)$ can be computed as:

\[
\mu_C = \frac{1}{MN}\sum_{y=0}^{M-1}\sum_{x=0}^{N-1}(ax+by+d) = a\,\frac{N-1}{2} + b\,\frac{M-1}{2} + d
\tag{5.12}
\]

\[
\Delta\mu_R\big|_{(dx,dy)} = \frac{1}{MN}\sum_{y=0}^{M-1}\sum_{x=0}^{N-1}\Bigl\{(ax+by+d) - \bigl[a(x-Dx+dx)+b(y-Dy+dy)+d\bigr]\Bigr\}
= a(Dx-dx) + b(Dy-dy)
\tag{5.13}
\]

Using (5.12) and (5.13) in (5.11), we get:

\[
a(Dx-dx) + b(Dy-dy) = -\frac{1-S}{S}\cdot\Bigl( a\,\frac{N-1}{2} + b\,\frac{M-1}{2} + d \Bigr) + \frac{C}{S}
\tag{5.14}
\]

To illustrate the effect of the illumination parameters $(S, C)$, let us consider the case when the region has zero gradient in the $y$ direction ($b = 0$), in other words, when pixel values change linearly along the $x$-axis only:

\[
a(Dx - dx) = -\frac{1-S}{S}\cdot\Bigl( a\,\frac{N-1}{2} + d \Bigr) + \frac{C}{S}
\quad\Longrightarrow\quad
dx = Dx + \frac{1-S}{S}\cdot\Bigl( \frac{N-1}{2} + \frac{d}{a} \Bigr) - \frac{C}{aS}
\tag{5.15}
\]

We can see that, in the presence of illumination change, the estimated displacement $dx$ deviates from the actual displacement $Dx$ by $\frac{1-S}{S}\cdot\bigl( \frac{N-1}{2} + \frac{d}{a} \bigr) - \frac{C}{aS}$.
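As a sanity check on (5.15), the short simulation below (with made-up parameter values, not data from our test sequences) generates a 1-D planar block, applies the illumination model of (5.6) to the reference, performs a dense exhaustive SAD search, and compares the resulting displacement with the value predicted by (5.15).

```python
import numpy as np

# Planar (1-D) block model of Section 5.3: B_C(x) = a*x + d, true shift Dx,
# reference affected by the illumination change S, C of (5.6).
N, a, d = 15, 2.0, 50.0          # block size and plane parameters (example values)
Dx, S, C = 3.0, 0.9, 6.0         # true displacement and illumination parameters

x = np.arange(N)
b_c = a * x + d                  # current block

def sad(dx):
    """SAD between the current block and the illuminated reference block
    taken at candidate displacement dx (cf. (5.7))."""
    b_r = a * (x - Dx + dx) + d  # reference content, before illumination change
    return np.abs(b_c - (S * b_r + C)).sum()

grid = np.arange(-20, 20, 0.01)  # dense 1-D candidate search
dx_est = grid[np.argmin([sad(dx) for dx in grid])]

dx_pred = Dx + (1 - S) / S * ((N - 1) / 2 + d / a) - C / (a * S)  # equation (5.15)
print(round(dx_est, 2), round(dx_pred, 2))  # both ~3.22 for these parameters
```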
Thus, besides the obvious fact that a larger $|C|$ and an $S$ farther away from 1 result in greater error in the estimated displacement, other interesting properties can be observed from (5.15), which we summarize below:

With only additive illumination change ($S = 1$): $|dx - Dx| = |C/a|$.
• Planar regions with larger gradient are more robust to such an additive offset, since $|dx - Dx|$ is inversely proportional to $a$ (or $b$).

With only multiplicative illumination change ($C = 0$): $|dx - Dx| = \bigl| \frac{1-S}{S}\cdot\bigl( \frac{N-1}{2} + \frac{d}{a} \bigr) \bigr|$.
• When $a$ and $d$ are fixed, block matching with a larger block size (larger $N$) is more vulnerable to multiplicative illumination change, leading to larger error in the estimated displacement.
• Under the same block size and the same gradient ($N$ and $a$ fixed), a block with a larger $d$, hence a larger mean, receives a stronger influence from the multiplicative illumination change, and thus incurs greater error in the estimated displacement.

In conclusion, a stronger illumination mismatch (greater $|C|$ and $|1-S|$ values) causes larger error in the estimated displacement, thus providing less reliable block correspondence. For planar regions, more specific properties can be derived which relate the estimation error to the signal characteristics (mean, gradient) and to the block matching size. A larger error in the estimated displacement indicates that the selected reference block is farther away from the actual corresponding block $B_R^*$ for $B_C$. Consequently, the candidate vectors obtained with the methods proposed in Section 5.2 become less trustworthy. In Section 5.4, simulation results will demonstrate that when the first estimated field is affected by illumination change such that the block correspondence is not accurate, performing predictive search for the other field based on the obtained candidates leads to larger degradation in coding efficiency.

5.4 Experiment Design and Results

In this section, we design experiments to evaluate the proposed predictive fast search methods which obtain new candidate vectors. For frames utilizing joint MCP/DCP, we consider the following three schemes: 1. both ME and DE use full search (we denote this the dual full search scheme); 2. DtM, where the disparity field is estimated by full search and motion is fast searched with the new candidates; and 3. MtD, where motion is full searched and disparity is fast searched with the new candidates. The dual full search scheme serves as a baseline of the potential gain as compared to simulcast. As for MtD and DtM, there are multiple candidate vectors that can be used. In Section 5.4.1, we investigate different search patterns utilizing these candidates.

5.4.1 Investigation of Efficient Search Patterns

As described in Section 5.2, the main novelty in the proposed algorithms is that we track along the first estimated field (motion or disparity) to obtain candidate vectors for the other field. The additional candidates provide improved prediction in cases where the motion/disparity vector of the current block is not similar to that of its neighboring blocks.

In most predictive motion search algorithms for monoscopic video, the mean or median of the candidates is used to initialize the search location. This approach relies on the assumption that the motion field tends to be locally smooth. However, the disparity field is not as homogeneous as the motion field, and disparity vectors can be seen to exhibit significant variation even across neighboring blocks [30,56].
To see why this is true, consider that in disparity estimation an area within an object that is closer to the camera will have larger disparity than an area in the same object that is further away from the camera. However, motion in the two areas is likely to be the same unless the object is rotating. Thus, blocks that belong to the same moving object could have very different disparity even though they have similar motion vectors. This suggests that computing the mean/median of a set of disparity vector candidates may not provide as good a predictor as applying the same technique to a set of motion vector candidates. To tackle this problem, we investigate a search strategy in which multiple searches are performed around each of the candidates, with each of the searches employing a much smaller search range than would typically be used with a single search window centered at the mean of the candidates. Fig. 5.4 illustrates this search pattern. Note that this multiple-search method can be seen as an extension of the enhanced predictive zonal search (EPZS) [44], where we obtain additional candidate vectors via the motion/disparity fields.

Figure 5.4: The search pattern that uses all candidate vectors (small windows around the individual candidates versus one window around the mean of the candidates).

For MtD, there are 9 candidates to be considered: $DV_1$ to $DV_8$ and $\vec{0}$. One approach is to order the candidates, based on their SAD for example, and only perform the search around the top-priority candidates. Alternatively, we can search around all 9 candidates. Since some of the candidates might be the same and their search ranges may overlap, we currently adopt the second approach, so that a search is performed for each of the candidates. To verify the efficiency of the new search pattern, we compared the following two scenarios in the MtD scheme:

• EachC: search with 3×3 windows (±1×±1) centered at each DV candidate
• MeanC: search with a 9×9 window (±4×±4) centered at the mean of the DV candidates

Note that the maximum number of vectors to search for EachC is 81, while MeanC always has to check exactly 81 vectors.

Figure 5.5: Comparison of different search patterns (R-D curves, PSNR vs. bitrate in kb/s, for Ballroom V2 and ST V1 under MtD: EachC 3×3 versus MeanC 9×9).

Fig. 5.5 provides the simulation results for two different sequences using the H.264/AVC-based coding structure described in Fig. 5.2. The search pattern EachC achieves higher coding efficiency when we perform the predictive search on disparity estimation. The non-smooth disparity field is better predicted because our search pattern takes all the candidate vectors into account instead of adopting the mean of the candidates as a single search center. In the following simulations, we adopt this multiple-search method for both the MtD and DtM schemes.

5.4.2 H.264/AVC-based Simulations

Our H.264/AVC-based MVC coding structure is implemented on top of the JM reference software [41]. An immediate benefit of MVC can be observed: the anchor frames are now encoded using inter-view prediction, which provides higher coding efficiency as compared to the simulcast case, where they were intra coded. The quality of these encoded anchor frames is crucial because they serve as the temporal references for the later frames. The work in [22] studied the bit allocation issue in dependent video coding scenarios such as those arising in MVC. Their results demonstrate that if more bits are spent on anchor frames, a better overall coding efficiency will be achieved, because all the frames following the anchor can be encoded more efficiently. Here we simply use a smaller quantization step size (the QP parameter in the H.264/AVC codec) to encode the anchor frames. For example, if the non-anchor frames are coded with QP = 28, then the anchor frames will be coded with QP = 26. Fig. 5.6 shows the effect of this QP change. Gains from this bit allocation are observed on all the test sequences and under all three coding schemes (dual full search, DtM, MtD). In all the following results provided in this section, we adopt this smaller QP for anchor frames.

Figure 5.6: The effect of changing QP for the anchor frames (Aqua V1, starting view V2: dual full search with Anchor QP−2, dual full search with the same QP, and simulcast).

In Figures 5.7, 5.8, and 5.9, we present the rate-distortion curves (R-D curves) from our simulation results. The GOP structure in our simulations has 10 frames in the time direction and 3 to 4 views in the spatial direction. The degradation of MtD and DtM, as compared to the dual full search scheme, comes from the fact that one of the fields is fast searched with the proposed algorithms. When using a 3×3 range (±1×±1) around each candidate, the average number of vectors tested for the fast-searched field is about 40∼65 for disparity and 30∼50 for motion. This is a very low cost for the predictively searched field, as compared to the full search ranges of ±32×±32 (4225 vectors) or ±64×±64 (16641 vectors) used in our simulations (see Table 5.2). Note that during the search we did not use any early termination techniques such as the one proposed in EPZS [44]. Since in Section 5.4.1 we have demonstrated that the candidates obtained with our methods can be more efficiently exploited with multiple search regions, it is reasonable to consider using them in the context of the EPZS scheme, in which its early termination techniques can help further reduce the search complexity.

Table 5.2: Parameters of the sequences and simulation settings for Section 5.4

    Sequence   Dimension   Frame rate   No. of views   GOP in simulation    Full search range
    Aqua       320×240     10 fps       15             V2→V1→V0             ±32×±32
    Ballroom   640×480     25 fps       8              V0→V1→V2→V3          ±64×±64
    ST         640×480     15 fps       6              V0→V1→V2             ±64×±64

The MtD approach achieves very good performance for all three test sequences, even with a small search window (3×3) around each candidate. The R-D curves are very close to the curves obtained with the dual full search coding scheme. The performance of DtM predictive search varies among the different test sequences. The Aqua sequence has the densest camera setting among our three test sequences: 15 cameras with about 1.8 cm spacing. The correlation between motion fields in different views is very high. Fig. 5.7 shows that the DtM approach provides almost the same coding efficiency as MtD. For the Ballroom sequence, the R-D performance of DtM exhibits some degradation (0.1∼0.2 dB) with respect to the dual full search scheme. In our simulations, this is the sequence with the highest gain in coding efficiency (up to 1.5 dB) when comparing MVC with simulcast (Fig. 5.8). DtM preserves much of this gain with a low complexity for predictive motion search. The worst case of DtM appears to be on the ST sequence (Fig. 5.9). As discussed in Section 5.3, the inter-view illumination mismatch reduces the accuracy of the first estimated disparity field. With a 3×3 search range for the MV candidates, the DtM R-D curves lie about halfway between the dual full search scheme and simulcast. We provide one more set of simulation results in which the search range is increased to 5×5. The corresponding R-D curves move to about 0.1∼0.2 dB below the dual full search scheme.

Once again we see that the key to the performance of the proposed predictive algorithm is that the most reliable estimation should be performed first, so that the fast predictive search on the second field can make use of good candidate vectors. For MVC, since the camera view settings vary among different applications, plus the fact that frames from different views are prone to suffer from illumination and focus mismatches [17], it is likely that computing the motion field first will in general be more efficient, so that MtD should in general be chosen over DtM in MVC.

5.5 Conclusions

Higher coding efficiency can be achieved in MVC by exploiting both temporal and inter-view redundancy. In this chapter we propose novel predictive fast search algorithms to reduce the complexity of MVC. After one of the motion/disparity fields is estimated, the proposed algorithm obtains good candidate vectors to perform the estimation of the other field with very low complexity. A more efficient search pattern employing the candidate vectors is also investigated, and the results conform with the findings in EPZS. The new candidate vectors can provide additional prediction information if the first estimated field is accurate. Since motion estimation generally provides better block correspondence than disparity estimation, MtD generates very consistent coding efficiency among different test sequences, as compared to DtM. Simulation results with the H.264/AVC-based MVC structure show that MtD can achieve coding efficiency that is very similar to the dual full search scheme, while the complexity is reduced significantly. The simulations also verify that in general MtD should be chosen over DtM.
Figure 5.7: Predictive search simulation results for the Aqua sequence (R-D curves for V1 and V0: dual full search, MtD/DtM with Each 3×3, and simulcast).

Figure 5.8: Predictive search simulation results for the Ballroom sequence (R-D curves for V1, V2, and V3: dual full search, MtD/DtM with Each 3×3, and simulcast).

Figure 5.9: Predictive search simulation results for the ST sequence (R-D curves for V0, V1, and V2: dual full search, MtD with Each 3×3, DtM with Each 3×3 and Each 5×5, and simulcast).

Chapter 6: Conclusions and Future Work

Multiview video compression is a key technology for efficient storage and transmission of multiview video data. In this dissertation, we exploit special characteristics of multiview video to develop techniques that improve MVC efficiency with reduced complexity.

We consider the problem of encoding video content exhibiting focus mismatch due to focus setting differences. We first use geometrical optics to analyze characteristics of images captured under the effect of focus. It is demonstrated that the focus mismatch can be represented in terms of the focus setting parameters (camera dependency) and the depths of objects (depth dependency). The focus mismatch kernels are circularly symmetric, with their shapes varying across different depths. For 1D parallel camera arrangements in multiview systems, we relate the focus mismatch to the disparity exhibited in frames from different views. The analytical results provide useful properties that can be exploited to design focus mismatch compensation techniques.

Based on the analysis, to compensate for depth-dependent focus mismatch, we propose a novel adaptive reference filtering (ARF) approach. We estimate block-wise parameters as features for classification such that an image is first partitioned into regions suffering from different types of focus mismatch: for inter-view coding in MVC, we exploit the disparity field to partition frames into regions corresponding to different depth levels. As for monoscopic video, with no disparity information, we propose a method to estimate the localized focus changes and partition frames into regions consisting of macroblocks that suffer from a similar type of focus change. After frame partitioning, for each region a 2D filter is designed by minimizing the prediction error. Filtered references are then generated for the encoder to perform rate-distortion optimized coding selection. The ARF approach is also extended to MVC bi-directional inter-view coding, for which we propose a filter design method that incorporates well with conventional bi-directional search. Simulation results demonstrate higher coding efficiency as compared to multiple-reference prediction and adaptive interpolation filtering methods.

Finally, complexity reduction techniques for MVC are presented. We analyze the encoding results of ARF on inter-view prediction. It is observed that the coding gains demonstrate a strong view-wise variation, while at different timestamps the estimated filters exhibit strong correlation when the objects' depths remain similar. Based on these findings and the analytical results, we propose i) view-wise ARF adaptation based on RD-cost prediction, which determines whether ARF is beneficial for a given view, and ii) filter updating based on depth-composition change, in which the same set of filters will be used (i.e., no new filters will be designed) until there is a significant change in the depth composition within the scene. We also propose fast predictive search algorithms, which exploit the relationship between motion and disparity fields, that can be used when one of the fields is available and we wish to estimate the other field efficiently. We construct a model and analytically demonstrate how illumination change will affect the accuracy of our fast predictive search methods. Simulation results show that when applying these techniques, significant complexity reduction is achieved while the coding efficiency can be well preserved.

Based on the results in this thesis, there are some interesting extensions of the work that can be addressed in future research:

• Improving ARF based on camera parameters. In Section 2.4, we listed several properties of focus mismatch based on camera setting parameters and object depth. However, due to the fact that most of these values are not available in the test sequences we have, our ARF approach only exploits qualitative properties such as isotropy and the depth/disparity dependency of the focus mismatch kernels. It will be interesting to further utilize quantitative properties if camera setting parameters are available. For example, in monoscopic video with no disparity information, knowing the setting parameters and the depth range of the scene, we can derive focus mismatch kernels and construct a set of representative filters covering these kernels. To identify regions suffering from different types of mismatch, instead of estimating block-wise filters as in Section 3.3.1, we can compare the representative filters to determine which one provides the best match and then design an MMSE filter for each region.

• Adaptive filtering for other non-translational mismatches. The general methodology of ARF, i.e., partitioning frames into regions based on the properties of the targeted mismatch and then estimating mismatch kernels, can be extended to other types of mismatch. For example, consider motion blur in sports sequences. Instead of being isotropic, the direction of motion blur is expected to be highly correlated with motion information. Thus an ARF approach can be developed by separating an image into regions moving in different directions and designing directional filters. Another example is depth-dependent affine transformations in inter-view prediction due to a convergent camera arrangement. Once again, we can partition a frame into depth levels and estimate affine parameters for each level. Warped reference frames can be generated, after applying the obtained affine parameters, to provide better coding efficiency.

References

[1] T. Aach and A. Kaup. Disparity-based segmentation of stereoscopic foreground/background image sequences. IEEE Trans. Communications, 42, issue 2/3/4, part 1:673–679, Feb.–Apr. 1994.
[2] R. V. Babu, K. R. Ramakrishnan, and S. H. Srinivasan. Video object segmentation: A compressed domain approach. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 14, no. 4:462–474, Apr. 2004.
[3] C. Bilen, A. Aksay, and G. B. Akar. A multi-view video codec based on H.264. In Proc. IEEE 2006 International Conference on Image Processing (ICIP), pages 541–544, Atlanta, GA, USA, Oct. 2006.
[4] C. A. Bouman. Cluster: An unsupervised algorithm for modeling Gaussian mixtures. http://cobweb.ecn.purdue.edu/bouman/software/cluster/, this version was released in Jul. 2005.
[5] R. N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, 3rd edition, 2000.
[6] M. Budagavi. Video compression using blur compensation. In Proc. IEEE 2005 International Conference on Image Processing (ICIP), pages II.882–II.885, Genoa, Italy, Sep. 2005.
[7] J. Chalidabhongse and C.-C. J. Kuo. Fast motion vector estimation using multiresolution-spatio-temporal correlations. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 7, issue 3:477–488, Jun. 1997.
[8] X. Chen and A. Luthra. MPEG-2 multiview profile and its application in 3DTV. In Proc. SPIE 1997 Multimedia Hardware Architectures, volume 3021, pages 212–223, 1997.
[9] T. Dekker, S. T. de Zwart, O. H. Willemsen, M. G. H. Hiddink, and W. L. IJzerman. 2D/3D switchable displays. In Proc. SPIE Liquid Crystal Materials, Devices, and Applications XI, San Jose, CA, USA, Feb. 2006.
[10] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, no. 1:1–38, 1977.
[11] L.-F. Ding, S.-Y. Chien, Y.-W. Huang, Y.-L. Chang, and L.-G. Chen. Stereo video coding system with hybrid coding based on joint prediction scheme. In Proc. IEEE 2005 International Symposium on Circuits and Systems (ISCAS), volume 6, pages 23–26, May 2005.
[12] U. Fecker and A. Kaup. H.264/AVC-compatible coding of dynamic light fields using transposed picture ordering. In Proc. 13th European Signal Processing Conference (EUSIPCO 2005), pages II.882–II.885, Antalya, Turkey, 2005.
[13] U. Fecker and A. Kaup. Statistical analysis of multi-reference block matching for dynamic light field coding. In Proc. 10th International Fall Workshop Vision, Modeling, and Visualization (VMV 2005), pages 445–452, Erlangen, Germany, 2005.
[14] M. Flierl, T. Wiegand, and B. Girod. A locally optimal design algorithm for block-based multi-hypothesis motion-compensated prediction. In Proc. IEEE Data Compression Conference 1998 (DCC), pages 239–248, Mar. 1998.
[15] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.
[16] E. Francois and B. Chupeau. Depth-based segmentation. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 7:237–239, Jun. 1997.
[17] Y.-S. Ho, K.-J. Oh, C. Lee, B. Choi, and J.-H. Park. Observations of multi-view test sequences. JVT Document W084, Apr. 2007.
[18] ISO/IEC JTC1/SC29/WG11. Call for proposals on multi-view video coding. MPEG Document N7327, Jul. 2005.
[19] ISO/IEC JTC1/SC29/WG11. Submissions received in CfP on multi-view video coding. MPEG Document M12969, Jan. 2006.
[20] E. Izquierdo. Disparity/segmentation analysis: Matching with an adaptive window and depth-driven segmentation. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 9, no. 4:589–607, Jun. 1999.
[21] M. L. Jamrozik and M. H. Hayes. A compressed domain video object segmentation system. In Proc. IEEE 2002 International Conference on Image Processing (ICIP), pages I.113–I.116, Rochester, NY, Sep. 2002.
[22] J.-H. Kim, J. Garcia, and A. Ortega. Dependent bit allocation in multiview video coding. In Proc. IEEE 2005 International Conference on Image Processing (ICIP), pages II.293–II.296, Genoa, Italy, Sep. 2005.
[23] J.-H. Kim, P. Lai, J. Lopez, A. Ortega, Y. Su, P. Yin, and C. Gomila. New coding tools for illumination and focus mismatch compensation in multi-view video coding. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 17, no. 11:1519–1535, Nov. 2007.
[24] P. Lai and A. Ortega. Predictive fast motion/disparity search for multiview video coding. In Proc. SPIE 2006 Visual Communications and Image Processing (VCIP), volume 6077, Jan. 2006.
[25] P. Lai, A. Ortega, P. Pandit, P. Yin, and C. Gomila. Adaptive reference filtering for bidirectional disparity compensation with focus mismatches. In Proc. IEEE 2008 International Conference on Image Processing (ICIP), pages 2456–2459, San Diego, CA, USA, Oct. 2008.
[26] P. Lai, A. Ortega, P. Pandit, P. Yin, and C. Gomila. Focus mismatches in multiview systems and efficient adaptive reference filtering for multiview video coding. In Proc. SPIE 2008 Visual Communications and Image Processing (VCIP), Jan. 2008.
[27] P. Lai, Y. Su, P. Yin, C. Gomila, and A. Ortega. Adaptive filtering for cross-view prediction in multi-view video coding. In Proc. SPIE 2007 Visual Communications and Image Processing (VCIP), Jan. 2007.
[28] P. Lai, Y. Su, P. Yin, C. Gomila, and A. Ortega. Adaptive filtering for video coding with focus change. In Proc. IEEE 2007 ICASSP, volume I, pages 661–664, Apr. 2007.
[29] H.-C. Lee. Review of image-blur models in a photographic system using the principles of optics. SPIE Optical Engineering, 20, issue 5:405–421, May 1990.
[30] G. Li and Y. He. A novel multi-view video coding scheme based on H.264. In Proc. IEEE Joint Conference of the 4th International Conference on Information, Communications and Signal Processing, and the 4th Pacific Rim Conference on Multimedia, volume 1, pages 493–497, Dec. 2003.
[31] J. Lopez, J.-H. Kim, A. Ortega, and G. Chen. Block-based illumination compensation and search techniques for multiview video coding. In Proc. 2004 Picture Coding Symposium (PCS), San Francisco, CA, USA, Dec. 2004.
[32] E. Martinian, A. Behrens, J. Xin, A. Vetro, and H.-F. Sun. Extensions of H.264/AVC for multiview video compression. In Proc. IEEE 2006 International Conference on Image Processing (ICIP), pages 2981–2984, Atlanta, GA, USA, Oct. 2006.
[33] W. Matusik and H. Pfister. 3D TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes. In Proc. ACM 2004 International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 814–824, Los Angeles, CA, USA, Aug. 2004.
[34] P. Merkle, K. Muller, A. Smolic, and T. Wiegand. Efficient compression of multi-view video exploiting inter-view dependencies based on H.264/MPEG4-AVC. In Proc. IEEE 2006 International Conference on Multimedia and Expo (ICME), pages 1717–1720, Toronto, Canada, Jul. 2006.
[35] R. P. Millane and J. L. Eads. Polynomial approximations to Bessel functions. IEEE Trans. Antennas and Propagation, 51, no. 6:1398–1400, Jun. 2003.
[36] P. Mouroulis and J. Macdonald. Geometrical Optics and Optical Design. Oxford Series in Optical and Imaging Sciences, 1996.
[37] S. Oka, T. Fujii, and M. Tanimoto. Dynamic ray-space coding using inter-view prediction. In Proc. International Workshop on Advanced Image Technology (IWAIT) 2005, pages 19–24, Jeju Island, Korea, Jan. 2005.
[38] E. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, no. 2, Apr. 1984.
[39] J. Rissanen. A universal prior for integers and estimation by Minimum Description Length. Institute of Mathematical Statistics Journal: Annals of Statistics, 11, no. 2:417–431, 1983.
[40] A. Smolic, K. Muller, P. Merkle, C. Fehn, P. Kauff, P. Eisert, and T. Wiegand. 3D video and free viewpoint video - technologies, applications and MPEG standards. In Proc. IEEE 2006 International Conference on Multimedia and Expo (ICME), pages 2161–2164, Toronto, Canada, Jul. 2006.
[41] K. Suhring. Software implementation of H.264: JM version 9.6. http://iphome.hhi.de/suehring/tml/index.htm, this version was released in Jul. 2005.
[42] K. Suhring. Software implementation of H.264: JM version 10.2. http://iphome.hhi.de/suehring/tml/index.htm, this version was released in Jul. 2006.
[43] G. J. Sullivan and T. Wiegand. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15, issue 6:74–90, Nov. 1998.
[44] A. M. Tourapis, O. C. Au, and M. L. Liou. Highly efficient predictive zonal algorithms for fast block-matching motion estimation. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 12, issue 10:934–947, Oct. 2002.
[45] Y. Vatis. Software implementation of adaptive interpolation filter. http://iphome.hhi.de/suehring/tml/download/KTA/, this software was released in Nov. 2005.
[46] Y. Vatis, B. Edler, D. T. Nguyen, and J. Ostermann. Motion- and aliasing-compensated prediction using a two-dimensional non-separable adaptive Wiener interpolation filter. In Proc. IEEE 2005 International Conference on Image Processing (ICIP), pages II.894–II.897, Genoa, Italy, Sep. 2005.
[47] Y. Vatis, B. Edler, I. Wassermann, D. T. Nguyen, and J. Ostermann. Coding of coefficients of two-dimensional non-separable adaptive Wiener interpolation filter. In Proc. SPIE 2005 Visual Communications and Image Processing (VCIP), volume 5960, pages 623–631, San Jose, CA, Jul. 2005.
[48] Y. Vatis and J. Ostermann. Comparison of complexity between two-dimensional non-separable adaptive interpolation filter and standard Wiener filter. ITU-T SG16/Q.6 Doc. MPEG05/VCEG-AA11, Apr. 2005.
[49] Y. Vatis and J. Ostermann. Prediction of P- and B-frames using a two-dimensional non-separable adaptive Wiener interpolation filter for H.264/AVC. ISO/IEC JTC1/SC29/WG11 MPEG Document M13313, Apr. 2006.
[50] Z. Wang, G. Liu, and L. Liu. A fast and accurate video object detection and segmentation method in the compressed domain. In Proc. IEEE International Conference on Neural Networks and Signal Processing, pages II.1209–II.1212, Dec. 2003.
[51] T. Wedi. Adaptive interpolation filter for motion compensated prediction. In Proc. IEEE 2002 International Conference on Image Processing (ICIP), pages II.509–II.512, Rochester, NY, Sep. 2002.
[52] T. Wedi. Adaptive interpolation filters and high-resolution displacements for video coding. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 16, no. 4:484–491, Apr. 2006.
[53] T. Wiegand and B. Girod. Lagrange multiplier selection in hybrid video coder control. In Proc. IEEE 2001 International Conference on Image Processing (ICIP), pages III.542–III.545, Thessaloniki, Greece, Oct. 2001.
[54] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits and Systems for Video Technology (CSVT), 13, no. 7:560–576, Jul. 2003.
[55] K. Y. Wong and M. E. Spetsakis. Motion segmentation by EM clustering of good features. In Proc. IEEE 2004 Computer Vision and Pattern Recognition Workshop (CVPR), pages 166–173, Jun. 2004.
[56] W. Yang, K. N. Ngan, and J. Cai. MPEG-4 based stereoscopic and multiview video coding. In Proc. IEEE 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, pages 61–64, Hong Kong, China, Oct. 2004.
[57] W. Yang, K. N. Ngan, J. Lim, and K. Sohn. Joint motion and disparity fields estimation for stereoscopic video sequences. Elsevier Journal of Signal Processing: Image Communication, 20, issue 3:265–276, Mar. 2005.
[58] D. Zhang and G. Lu. Segmentation of moving objects in image sequence: A review. Springer Journal of Circuits, Systems and Signal Processing, 20, no. 2:143–183, Mar. 2001.
[59] Z. Zhu, G. Jiang, M. Yu, and X. Wu. Fast disparity estimation algorithm for stereoscopic image sequence coding. In Proc. 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering (TENCON), pages I.285–I.288, Beijing, China, Oct. 2002.
Abstract
Multiview video systems utilize multiple cameras to simultaneously capture the scene from different viewpoints. They provide video data for new applications such as 3D television and free-viewpoint video. The amount of data in multiview video is very large as compared to monoscopic video. Multiview video coding (MVC) is an emerging research field that focuses on compression of multiview video data. In this dissertation, by exploiting special characteristics of multiview video, we develop techniques that improve MVC efficiency while also taking complexity into account.