PREDICTIVE CODING TOOLS IN MULTI-VIEW VIDEO COMPRESSION

by Jae Hoon Kim

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2008

Copyright 2008 Jae Hoon Kim

Dedication

To my family

Acknowledgements

First of all, I would like to thank Prof. Antonio Ortega for his advice, guidance, support and patience throughout the years I have been pursuing my Ph.D. degree at the University of Southern California. I am also grateful to Prof. C.-C. Jay Kuo and Prof. Ulrich Neumann for their advice and comments on my defense. I would like to extend my gratitude to Prof. Alexander A. Sawchuk, Prof. Ramakant Nevatia and Prof. Karen Liu for serving on my Qualification Exam Committee.

I would like to thank Yeping Su, Peng Ying, Purvin Pandit, Dong Tian and Cristina Gomila for their advice and support during my invaluable internships with Thomson.

I would like to thank all the colleagues in the Compression group for their friendship and useful discussions about life and research. I would like to thank Joaquin Lopez and Po-Lin Lai for enjoyable discussions and collaborations. I thank all my friends for the moments I shared with them, which gave me rest, passion and energy to overcome all difficulties in my life and research. I will not forget the moments and coffees that I had with Wonseok Baek, Hyukjune Chung and In Suk Chong. And my special thanks to Young Gyun Koh for his lifetime friendship.

To my family, I cannot express my thanks enough. Without their belief, support, endurance and love, this work could not have been even started, let alone finished. And Jung Yeun, my wife, thank you and I love you.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  1.1 Multi-view Video
  1.2 Applications of Multi-view Video System
  1.3 Contributions of the Research
  1.4 Organization of Dissertation
Chapter 2 Dependent Bit Allocation in Multi-view Video Coding
  2.1 Preliminaries
  2.2 2-D Dependent Bit Allocation
    2.2.1 Monotonicity
    2.2.2 Reduced Search Range
    2.2.3 Search Algorithm for Non-anchor Frames
  2.3 Simulation Results
  2.4 Conclusions
Chapter 3 Illumination Compensation in Multi-View Video Coding
  3.1 Preliminaries
  3.2 Related Work
  3.3 Illumination Compensation Model
  3.4 Illumination Mismatch Parameter Coding
  3.5 Complexity of IC
  3.6 Simulation Results
    3.6.1 Combined Solution with ARF
  3.7 Conclusions
Chapter 4 Implicit Block Segmentation
  4.1 Preliminaries
  4.2 Implicit Block Segmentation
    4.2.1 Motivation from Block Motion Compensation
    4.2.2 Block Based Segmentation
    4.2.3 Weighted Sum of Predictors
    4.2.4 Joint Search of Base and Enhancement Predictors
    4.2.5 Three Error Metrics in Joint Search
    4.2.6 IBS Algorithm in H.264/AVC
  4.3 Complexity of IBS
  4.4 Simulation Results
    4.4.1 Implementation within an H.264/AVC Architecture
    4.4.2 Simulation Results
  4.5 Conclusions
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
Bibliography
Appendices
  Appendix A Comparison between MSD and MAD in Motion/Disparity Search
  Appendix B Additional Weight Selection in IBS
  Appendix C Comparison between MSD and MAD for Weight Selection

List of Tables

3.1 Unary binarization and assigned probability for the index of the quantized differential offset
3.2 Initialization of the context for the IC activation bit with different most probable symbols (MPS)
3.3 Number of additions/subtractions for SAD and SADAC. N is the number of pixels in a macroblock and S is the number of search points.
3.4 Complexity of SAD calculation in different block modes. N is the number of pixels in a macroblock and S is the number of search points. Note that in Fast IC mode, the predictor mean $\bar p_i$ of an 8×8 block is saved for reuse in the larger block modes, so that 4NS operations are required in 8×8 block mode and 3NS in the 8×16, 16×8 and 16×16 block modes.
3.5 Complexity when SAD and SADAC are calculated at the same time. N is the number of pixels in a macroblock and S is the number of search points.
3.6 Percentage of non-intra selection in cross-view prediction (% in H.264 → % in H.264+IC). Note that larger PSNR increases are observed for those sequences where the increase in the number of inter-coded blocks is greater.
3.7 Temporal partitioning of test data sets
4.1 Definition of symbols in the complexity analysis. The integers in parentheses beside $N_x$ are the values used in the simulation.
4.2 Complexity analysis of K-means clustering
4.3 Complexity analysis of the weight index decision for each $(\bar p_0, \bar p_1)$ pair
4.4 Comparison of IBS and GEO complexity. M is the number of base predictor candidates.
4.5 Percentage of times that different motion vector predictors (mvp) are selected for the enhancement predictor in the current macroblock. Data is collected by encoding 15 frames of the Foreman sequence with QP 24 (IPPP).
4.6 Comparison of signaling bits for the motion vector of the enhancement predictor. Data is collected by encoding 15 frames of the Foreman sequence (IPPP). In (A → B), A is the average number of signaling bits for the motion vector (mv) when the mvp of the enhancement predictor is set to the mvp of INTER16×16, and B is the average number of signaling bits for the mv when the mvp of the enhancement predictor is chosen from 6 mvp schemes.
4.7 Comparison of IBS results when the mvp of the enhancement predictor is (a) set to the mvp of the INTER16×16 QT block mode and (b) chosen from 6 mvp schemes (upper bound).
4.8 Comparison of data by QT and IBS from MERL Ballroom and Foreman with QP 20. Data is averaged over the macroblocks where IBS is the best mode in 14 P-frames of each sequence. A → B means `data by QT' → `data by IBS'. $SSD_p$ and $SSD_r$ are the SSD between the original and the predictor, and between the original and the reconstruction, respectively. $Bit_{res}$, $Bit_{mv}$ and $Bit_w$ are the bits for the residual, the motion/disparity vectors and the weight indices, respectively.
B.1 The probability in (B.10) is calculated while varying three parameters: (i) m, (ii) $\kappa_0^2/\kappa_1^2$, and (iii) $\alpha_0$. The average of the probabilities for m = {10, 50, 100, 150, 200, 250} is shown with respect to different $\kappa_0^2/\kappa_1^2$ and $\alpha_0$. The last row shows the average over $\kappa_0^2/\kappa_1^2$.
C.1 Sub-optimality of MAD with respect to MSD

List of Figures

1.1 End-to-end multi-view system
1.2 Multi-view video coding structure
2.1 Diagram for multiview video coding. N is the number of views and M is the anchor frame interval. Anchor frames are encoded only by cross-view prediction.
2.2 MVC examples where the number of views is 4. The number in parentheses is the order of encoding in the trellis expansion.
2.3 Trellis expansion in anchor frames. The thick line shows one of the anchor frame quantizer allocations.
2.4 Trellis expansion in View 1. For each anchor frame quantizer q, a non-anchor frame quantizer $\bar q$ with minimum cost can be chosen (thick line) and the total cost for each quantizer allocation can be calculated.
2.5 Relationship between q ($QP_a$) and $\bar q$ ($QP_{na}$) for the Aquarium sequence.
2.6 Example of R-D curves for reduced search range
2.7 Aqua sequence
2.8 SC vs. MVC with fixed QP vs. MVC by proposed Algorithm 1.
The average number of bits for 21 frames is used and PSNR is calculated as $10\log_{10}(255^2/(\text{average MSE for 21 frames}))$. λ is 200, 500 and 900 with C1, and 300, 700 and 2500 with C2, in the proposed algorithm.
3.1 Camera arrangement that causes local mismatches
3.2 Illumination mismatches in ST sequence
3.3 Modified search loop for the current block
3.4 Context of current block c = a + b, where a, b ∈ {0, 1}.
3.5 MVC sequences: 1D/parallel. Sequences are captured by an array of cameras located horizontally with viewing directions in parallel.
3.6 Cross-view coding with IC, at time stamps 0, 10, 20, 30, 40
3.7 Example of multiple references from different time stamps and views
3.8 Prediction structure for multi-view video coding with 8 views and GOP length 8
3.9 Multi-view coding with IBPBPBPP cross-view, hierarchical B temporal [2]
3.10 Comparison of IC with WP in MVC with IBPBPBPP cross-view, hierarchical B temporal [2]
3.11 Cross-view coding with H.264/AVC, IC, ARF and ARF+IC at time stamps 0, 10, 20, 30, 40
4.1 Inter block modes in H.264/AVC. Each 8×8 sub-block in 4.1(a) can be split into different sub-block sizes as in 4.1(b).
4.2 A straight line in GEO is defined by slope s and distance d from the center of a 16×16 macroblock.
4.3 Example of block motion compensation. The best match of the current macroblock can be found at two locations for different objects. However, in region b of the matches by QT and GEO, significant prediction error exists.
4.4 Definition of the predictor difference $\bar p_d$. The predictor patterns are from Fig. 4.3.
4.5 Example of two-step post-processing after 1-D K-means clustering. First, the disconnected segment 2 is classified as a different segment, increasing the number of segments N from 3 to 4. Second, segment 4 is merged into segment 1, decreasing N to 3 again.
4.6 After segmentation of the original macroblock from the MERL Ballroom sequence, the best matches for the segments are added to the set of base predictor candidates.
4.7 Search loop of the enhancement predictor for a given base predictor
4.8 Example of predictor difference and segmentation from the Foreman sequence. The segment indices shown are decided by raster scanning from the top left corner to the bottom right corner of the macroblock.
4.9 MERL Ballroom with 1 and 3 references
4.10 Foreman with 1 and 3 references
4.11 AC increases caused by 4×4 or 8×8 block DCT due to the unequal DC residual between different segments in IBS.
A.1 Comparison of normal and Laplace distributions with real data obtained by encoding the Foreman sequence (CIF). Data is collected from coding 7 P-frames using QP 20, ±32 search range and quarter-pixel precision with JSVM 8.4. The differences between original and predictor data are obtained only for luminance. Mean and variance are -0.25 and 10.66, respectively.
C.1 Distribution of $\sigma_0^2/\sigma_1^2$ from IBS coding results of Foreman with QP 24

Abstract

Multi-view video sequences consist of a set of monoscopic video sequences captured at the same time by cameras at different locations and angles. These sequences contain 3-D information that can be used to deliver new 3-D multimedia services. Because of the amount of data involved, it is important to compress these multi-view sequences efficiently in order to deliver more accurate 3-D information.

Since the frames captured by adjacent cameras have similar contents, cross-view redundancy can be exploited for disparity compensation. Typically, both temporal and cross-view correlations are exploited in multi-view video coding (MVC), so that a frame can use as a reference the previous frame in time in the same view and/or a frame at the same time from an adjacent view, thus leading to a 2-D dependency problem. The disparity of an object depends primarily on its depth in the scene, which can lead to a lack of smoothness in the disparity field. These complex disparity fields are further corrupted by brightness variations between views captured by different cameras. We propose several solutions to these problems in block-based predictive coding in MVC.

Firstly, the 2-D dependency problem is addressed in Chapter 2. We use the monotonicity property and the correlation between anchor and non-anchor quantizers to reduce the complexity of data collection in an optimization based on the Viterbi algorithm. The proposed bit allocation achieves 0.5 dB coding gains as compared to MVC with fixed QP.

In Chapter 3, we propose an illumination compensation (IC) model to compensate local illumination mismatches. With about 64% additional complexity for IC, 0.3-0.8 dB gains are achieved in cross-view prediction. IC techniques are extended to compensate illumination mismatches in both temporal and cross-view prediction.

In Chapter 4, we seek to enable compensation based on arbitrarily-shaped regions, while preserving an essentially block-based compensation architecture. To do so, we propose tools for implicit block segmentation and predictor selection. Given two candidate block predictors, segmentation is applied to the difference of the predictors. Then a weighted sum of the predictors is selected for prediction in each segment. Simulation results show 0.1-0.4 dB gains as compared to the standard quad-tree approach in H.264/AVC.

Chapter 1
Introduction

1.1 Multi-view Video

Since the early 20th century, when the first generation of television came into being, many novel technologies have been introduced (e.g., color, new types of displays, etc.). However, the main framework of video service has not changed significantly, in that the frames captured by the camera are edited to generate a single sequence that is delivered and displayed on a 2-dimensional (2-D) screen. With this monoscopic video system, 2-D scenes are regenerated and shown to users from the fixed viewpoint provided by the camera at each time instant.

The visual information of an object can be defined by its intensity and its location. The object intensities are represented by three color channels: R, G and B. The locations of objects are defined by 3-D information: horizontal and vertical location, and depth. In the captured 2-D frame, onto which the 3-D location of an object is projected, only horizontal and vertical information is delivered to users with the three color channels, while depth information is not delivered explicitly.

In the real world, the depth of an object can be estimated from various depth cues.
For example, when the head moves, objects that are closer move farther across the field of view, and different scenes are observed according to the displacement of the head. This is called motion parallax, a monocular depth cue. Occlusions and exposed areas can also give information about which objects are closer. When both eyes are open and the head is not moving, each eye sees a different image of the same scene. Stereo images from this binocular parallax are used to measure depth. Binocular parallax is the most important cue: depth information can be obtained from it even if all other depth cues are removed.

Fig. 1.1: End-to-end multi-view system (a scene is captured by cameras 0 through N as Views 0 through N, processed and stored, transmitted, received and buffered, and displayed).

In the conventional monoscopic video system, only limited cues are available for depth estimation (e.g., from the occluded or disclosed regions of scenes). However, in order to deliver complete information to users and enable 3-D multimedia services, depth information needs to be transmitted or estimated accurately. Multi-view video systems are used to simultaneously capture scenes or objects with multiple cameras from different viewpoints. In multi-view video, the different perspectives of the cameras on the same scenes or objects provide binocular parallax, from which depth information can be extracted, thus enabling 3-D multimedia services.

An end-to-end multi-view video system is depicted in Fig. 1.1. Multi-view sequences captured by an array of cameras are stored, processed and transmitted. Received sequences are processed and displayed on 2-D or 3-D display devices. After sequences are reconstructed at the receiver, intermediate views can be interpolated to provide smooth transitions and improved display quality. Sequences can be displayed on conventional 2-D displays with view switching capability [16], or specially designed 3-D display devices can be used [28] for better 3-D perception.

In the sequence acquisition step, the number of views determines the range of the 3-D scene and the quality of service: e.g., the more cameras are used, the more accurate the depth information will be, enabling improvements in the quality of interpolated views. However, the amount of data for the captured sequences also increases proportionally to the number of views. For example, transmitting uncompressed multi-view sequences with 8 views, 1280×720 resolution and 24 bits per pixel at 30 frames/sec requires 5.3 Gbps. Because of this increased amount of data, efficient coding of multi-view sequences is essential for the widespread use of such services.
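As a quick check, the raw-rate figure above is just the product of the sequence parameters:

\[
8 \ \text{views} \times (1280 \times 720) \ \tfrac{\text{pixels}}{\text{frame}} \times 24 \ \tfrac{\text{bits}}{\text{pixel}} \times 30 \ \tfrac{\text{frames}}{\text{sec}} \approx 5.3 \times 10^{9} \ \text{bits/sec}.
\]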
1.2 Applications of Multi-view Video System

There have been active research efforts on applications of multi-view video, especially 3-D TV and free-viewpoint video. In [28], a 3-D TV prototype system is proposed which uses an array of 16 cameras, clusters of network-connected PCs and a multi-projector 3-D display. Two types of display, a rear-projection screen and a front-projection screen, are implemented according to the location of the projectors. Although blur is prominent on both types of display, due to crosstalk between subpixels of different projectors and to light diffusion, the display reflects the user's viewpoint and shows different images accordingly. In [10], a video-plus-depth data representation is proposed as a flexible solution for diverse 3-D display technologies. A depth map is created from frames captured using stereo cameras or multiple monocular cameras, and streams including N video sequences and a depth sequence are used to render M views. Depth image based rendering (DIBR) is proposed as a solution for 3-D reproduction.

For free-viewpoint video, view generation methods using a ray-space approach [11] are explained in [20]. For coding of multi-view video sequences, a group-of-GOP (GoGOP) structure is proposed in order to enable low-delay random access; it is an extension of the group of pictures (GOP) structure in standard video coding. In [16], a color segmentation-based stereo algorithm is used to generate high quality, photo-consistent correspondences across views captured by high-resolution cameras. Scene depth is recovered from disparities, and matting information is used at object boundaries to compensate for depth discontinuities. A real-time rendering system is described which interactively synthesizes intermediate views.

In [31], panoramic video capturing, stitching and display is proposed to give users individual control of the viewing direction. A camera array is used to capture sequences; the captured scenes are stitched and then displayed on a head-mounted display with an orientation tracker, so that different scenes are displayed according to the user's orientation. This system assumes that the user is at a fixed location but that his/her viewing direction can rotate, so that the scene around the user can be viewed through 360 degrees. This approach differs from a multi-view system in that the viewing direction of the user rotates at a fixed location.

With the recognition that multi-view video coding is a key technology for a wide variety of applications including 3-D TV, free-viewpoint TV and surveillance, various topics related to multi-view video are covered in [12]. In [39], an overview of 3-D TV and free-viewpoint video is given together with the related standardization activities in MPEG.

1.3 Contributions of the Research

The design of a multi-view video system involves multiple disciplines, such as video coding, optics, computer vision, computer graphics, stereoscopic displays, multi-projector displays, virtual reality and psychology, in order to enable services that can bridge the gap between 2-D and the full 3-D experience, e.g., holographic display. In this work, we focus on providing efficient compression methods for multi-view video coding.

A straightforward approach for compression of multi-view video sequences would be to apply standard video coding techniques to each view independently. This simulcast (SC) approach allows temporal redundancy to be exploited using block-based motion compensation techniques, as shown in Fig. 1.2(a). Since the frames captured by adjacent cameras have objects in common, cross-view redundancy can also be exploited in the form of disparity compensation, as shown in Fig. 1.2(b). To achieve high coding gains, multi-view video coding exploits both temporal and cross-view redundancies. In Fig. 1.2(c), a multi-view video coding structure using both temporal and cross-view correlation is depicted. To facilitate random access, anchor frames are inserted at predefined time intervals. These anchor frames are encoded using only cross-view prediction.

In block-based predictive coding, the block most correlated to the current block is searched for in the previously encoded frame. Therefore, the gains in coding efficiency come mainly from finding highly correlated blocks, leading to residuals with high degrees of energy compaction. When block-based predictive coding is applied to both temporal and cross-view prediction in MVC, the following problems are observed.
Fig. 1.2: Multi-view video coding structure (time vs. view): (a) simulcast coding, (b) cross-view coding, (c) an example of temporal and cross-view prediction in MVC with anchor and non-anchor frames.

To avoid drift between the encoder and decoder, the same frame should be used for prediction at the encoder and for reconstruction at the decoder. Therefore, reconstructed frames are used for the prediction of current frames, so that the encoding of the current frame depends on the quality of the reconstructed frames. This dependency problem was addressed in [34] for monoscopic video coding. This one-dimensional (time) dependency problem expands into a two-dimensional (time and view) problem in multi-view systems.

In cross-view prediction of multi-view video, imperfectly calibrated cameras give different brightness and focus to the sequences at different views. Even if the cameras are perfectly calibrated, differences in camera positions and orientations lead to differences in how certain objects appear in different views. The accuracy of the disparity search is corrupted by these mismatches between views, which can lead to irregular disparity fields and degrade cross-view coding efficiency.

In block-based motion/disparity compensation, the block sizes used for compensation can be chosen to achieve a good trade-off between signaling overhead and prediction accuracy. However, current quad-tree based motion compensation leads to motion boundaries that are not necessarily aligned with arbitrary object boundaries, which limits the accuracy of block-based compensation even when small block sizes are chosen. Therefore, moving objects in motion compensation, and objects at different depths in disparity compensation, result in significant distortion in places where the object boundary is not aligned with the rectangular grid that can be represented by a quad-tree.

The main contribution of this research is to provide new predictive coding tools that solve the problems described above and improve overall coding efficiency in multi-view video coding. In the proposed bit allocation scheme, we use the monotonicity property and the correlation between anchor and non-anchor quantizers to reduce the complexity of data collection for the Viterbi algorithm, which was proposed in [34] for solving the dependency problem in standard video coding. To improve the accuracy of disparity search under brightness variation between views captured by different cameras, a local illumination compensation technique is proposed. An implicit block segmentation algorithm is proposed to find matches corresponding to arbitrary object boundaries while preserving a block-based compensation architecture.

1.4 Organization of Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we consider the bit allocation problem in MVC. A dependent coding technique using trellis expansion and the Viterbi algorithm (VA) is proposed, which takes into account dependencies across time and views. We note that, typically, optimal quantizer choices have the following properties: i) quantization choices tend to be similar for frames that are consecutive (in time or in view), and ii) finer quantization tends to be used for frames closer to the root of the dependency tree. We propose a search algorithm to speed up the optimization of quantization choices. Our results indicate that 0.5 dB coding gains can be achieved by an appropriate selection of bit allocation across frames.
In Chapter 3, we propose a block-based illumination compensation (IC) technique for cross-view prediction in MVC. Models for illumination (brightness) mismatches across views are proposed, and new coding tools are developed from these models. In IC, the disparity field and illumination changes are jointly computed as part of the disparity estimation search. IC can be applied adaptively by taking into account the rate-distortion characteristics of each block. By compensating for the effect of mismatches, we improve the quality of the references obtained via disparity search, which leads to coding gains of up to 0.8 dB.

In Chapter 4, we propose an implicit block-based segmentation method that improves prediction quality by using multiple predictors for each block. Given two candidate block predictors, segmentation is applied to the difference of the predictors, and the optimal predictor is selected in each segment. Implicit block segmentation is implemented in H.264/AVC as an additional inter block mode and achieves 0.1-0.4 dB gains as compared to the results obtained with only a hierarchical quad-tree.

In Chapter 5, conclusions and future work are discussed.

Chapter 2
Dependent Bit Allocation in Multi-view Video Coding

2.1 Preliminaries

To achieve coding gains in multi-view video coding (MVC), both temporal and cross-view correlation can be exploited using block-based predictive coding. Any such block-based predictive coding technique leads to dependencies, as the quantization choices for one frame affect the achievable rate-distortion points for the frames that depend on it [34]. In Fig. 2.1, an MVC coding structure is shown with temporal and view indices. Note that different types of coding dependencies arise depending on the coding scheme being used. In the simulcast case of Fig. 2.2(a), each view is coded independently, so only temporal dependency (1-D) within each view can be observed, similar to the monoscopic video case. Instead, Figs. 2.2(b) and 2.2(c) represent cases where the set of anchor frames is encoded in IPPP or IBBP modes. This introduces additional dependencies across views (2-D). For example, when encoding frame V2T2, reconstructed frame V2T1 is used as a reference, and in turn V2T1 uses frame V1T1 as a reference (see Fig. 2.1).

Fig. 2.1: Diagram for multiview video coding, with frames ViTj for views V1..VN and times T1..TM. N is the number of views and M is the anchor frame interval. Anchor frames (at T1) are encoded only by cross-view prediction.

Fig. 2.2: MVC examples where the number of views is 4: (a) simulcast only, (b) IPPP in anchor, (c) IBBP in anchor. The number in parentheses is the order of encoding in the trellis expansion.

While the problem of dependent bit allocation has been considered in several contexts, including standard video [24,34,36,41] and stereo image coding [45], its potential impact in multiview video coding has not been considered yet. In this chapter, we extend previously proposed frame-wise dependent bit allocation techniques [34] (using a trellis representation and the Viterbi algorithm) to a multi-view video coding scenario where cross-view prediction is used. This leads to a complex 2-D dependency problem, where the total number of video frames and candidate quantization choices involved can be very large.
Moreover, a suboptimal choice of quantizer for a given frame may affect many other frames (if the frame in question is close to the root of the dependency tree). This suggests that a proper quantizer allocation may be even more important in an MVC environment than for standard video. Indeed, this work was initially motivated by the observation that in an H.264/AVC encoder, which we modified for MVC, coding results were very sensitive to bit allocation [8].

In order to reduce the complexity of searching for the optimal solution in our MVC environment, we make use of the monotonicity property observed in [34]. To further reduce complexity, we show that the number of solutions to be searched can be reduced by considering only candidate solutions in which anchor and non-anchor frames are allocated similar quantizers.

2.2 2-D Dependent Bit Allocation

In what follows, distortion (D) is measured as frame-wise mean square error (MSE). The quantization parameter, rate, distortion and Lagrangian cost of the anchor frame in View i are denoted $q_i$, $R_i$, $D_i$ and $J_i$, respectively. We denote by $\bar q_i$ the quantization choice for the non-anchor frames in View i; $\bar R_i$, $\bar D_i$ and $\bar J_i$ denote the total rate, distortion and Lagrangian cost of all non-anchor frames in View i. (To begin with, we assume that the same quantizer is used for all non-anchor frames in a view; this quantizer selection for anchor and non-anchor frames is thus a sub-optimal solution for frame-level bit allocation. In Section 2.2.3, a search algorithm for non-anchor frames is proposed.) In our notation, $q < q'$ means that quantizer q is finer, i.e., gives better quality, than q'.

A solution to the dependent bit allocation problem was proposed in prior work [34], based on a trellis expansion and the VA. Our problem, which includes dependency across views, can be seen as an extension of this 1-D problem. A constrained 2-D dependent coding problem can then be formulated as follows (for the 2-view case):

\[
\min_{q_1,q_2,\bar q_1,\bar q_2} \left[ D_1(q_1) + \bar D_1(q_1,\bar q_1) + D_2(q_1,q_2) + \bar D_2(q_1,q_2,\bar q_2) \right]
\]
such that
\[
R_1(q_1) + \bar R_1(q_1,\bar q_1) + R_2(q_1,q_2) + \bar R_2(q_1,q_2,\bar q_2) \le R_{budget}. \tag{2.1}
\]

Note that, because of the dependency on previously coded frames, some of the R and D values include multiple q's. For example, because the non-anchor frames use the anchor frames as references, the values of $\bar R_1$ and $\bar D_1$ depend on $q_1$ and $\bar q_1$. Also, since the anchor frame in View 2 uses the anchor frame in View 1 as a reference, the values of $R_2$ and $D_2$ depend on $q_1$ and $q_2$. This problem can be solved by considering an unconstrained problem with Lagrange multiplier $\lambda \ge 0$ and cost $J = D + \lambda R$ [37]:

\[
\min_{q_1,q_2,\bar q_1,\bar q_2} \left[ J_1(q_1) + \bar J_1(q_1,\bar q_1) + J_2(q_1,q_2) + \bar J_2(q_1,q_2,\bar q_2) \right], \tag{2.2}
\]

where

\[
J_1(q_1) = D_1(q_1) + \lambda R_1(q_1) \tag{2.3a}
\]
\[
\bar J_1(q_1,\bar q_1) = \bar D_1(q_1,\bar q_1) + \lambda \bar R_1(q_1,\bar q_1) \tag{2.3b}
\]
\[
J_2(q_1,q_2) = D_2(q_1,q_2) + \lambda R_2(q_1,q_2) \tag{2.3c}
\]
\[
\bar J_2(q_1,q_2,\bar q_2) = \bar D_2(q_1,q_2,\bar q_2) + \lambda \bar R_2(q_1,q_2,\bar q_2) \tag{2.3d}
\]

In a system with N views, assume that our bit allocation requires evaluating, on average, $n_a$ coding choices for each anchor frame and $n_b$ for each set of non-anchor frames in a view. The main complexity of the bit allocation comes from the encoding/decoding steps needed to determine the R-D values. Because non-anchor frames in a view are not further referenced by frames in other views, the maximum dependency is $n_b n_a^N$, to encode the non-anchor frames in View N. Thus, the bit allocation complexity is $O(n_b n_a^N)$.
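To make the structure of (2.2) and this complexity estimate concrete, the following minimal sketch evaluates the unconstrained two-view objective by brute force. The `encode` helper is hypothetical (it stands for an actual encode/decode pass returning rate and distortion for the given quantizer dependencies); only the nesting of the quantizer choices is the point here.

```python
from itertools import product

def best_allocation_2views(quantizers, encode, lam):
    """Brute-force minimization of the two-view Lagrangian cost (2.2).

    `encode(frame, deps)` is a hypothetical helper returning (rate,
    distortion) for 'anchor1', 'nonanchor1', 'anchor2' or 'nonanchor2',
    given the tuple `deps` of quantizers the frame(s) depend on.
    """
    best_cost, best_q = float("inf"), None
    for q1, qb1, q2, qb2 in product(quantizers, repeat=4):
        terms = [
            encode("anchor1", (q1,)),             # J1(q1)
            encode("nonanchor1", (q1, qb1)),      # J1bar(q1, qbar1)
            encode("anchor2", (q1, q2)),          # J2(q1, q2)
            encode("nonanchor2", (q1, q2, qb2)),  # J2bar(q1, q2, qbar2)
        ]
        cost = sum(d + lam * r for r, d in terms)
        if cost < best_cost:
            best_cost, best_q = cost, (q1, qb1, q2, qb2)
    return best_q, best_cost
```

The four nested choices already require on the order of $n^4$ encoder passes for n quantizer candidates; the monotonicity-based pruning described next cuts this down.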
We achieve a reduction in complexity based on two methods. First, as in [34], we exploit the monotonicity property of dependent coding, which helps us reduce $n_a$. Second, we choose the non-anchor frame quantizers to be coarser than the quantizers chosen for the corresponding anchor frame, i.e., $\bar q_i \ge q_i$ for View i, so that fewer quantization choices need to be evaluated for the non-anchor frames (smaller $n_b$).

2.2.1 Monotonicity

The monotonicity property observed in [34] for a temporal dependency scenario states that, for two dependent frames (the second frame is motion/disparity predicted from the first one), we usually have

\[
J_2(q_1,q_2) \le J_2(q'_1,q_2) \quad \text{for } q_1 \le q'_1, \tag{2.4}
\]

i.e., for a given quantizer $q_2$ applied to the predicted frame, finer quantization of the predictor tends to lead to better R-D characteristics for the predicted frame. This property usually holds when the frames in (2.4) are anchor frames. Similar properties can also be observed for the dependency within a view,

\[
\bar J_1(q_1,\bar q_1) \le \bar J_1(q'_1,\bar q_1) \quad \text{for } q_1 \le q'_1, \tag{2.5}
\]

as well as when various levels of dependencies, across both views and time, are present, so that, for example:

\[
\bar J_2(q_1,q_2,\bar q_2) \le \bar J_2(q'_1,q_2,\bar q_2) \quad \text{for } q_1 \le q'_1 \tag{2.6a}
\]
\[
\bar J_2(q_1,q_2,\bar q_2) \le \bar J_2(q_1,q'_2,\bar q_2) \quad \text{for } q_2 \le q'_2 \tag{2.6b}
\]

From these monotonicity properties, the following lemma can be derived.

Lemma 1: If

\[
J_1(q_1) + \bar J_1(q_1,\bar q_1) + J_2(q_1,q_2) < J_1(q'_1) + \bar J_1(q'_1,\bar q_1) + J_2(q'_1,q_2) \quad \text{for } q_1 < q'_1, \tag{2.7}
\]

then $q'_1$ is not in the optimal path set and can be pruned out.

Proof: Similar to the proof in [34], we prove the lemma by contradiction. Assume that $q'_1$, with some $q_1 < q'_1$ satisfying (2.7), is part of the optimal path, and let the optimal quantizer sequence be $(q'_1,\bar q_1,q_2,\bar q_2,\dots,q_N,\bar q_N)$. However, for $q_1 < q'_1$, by the monotonicity from (2.4),

\[
J_3(q_1,q_2,q_3) < J_3(q'_1,q_2,q_3) \tag{2.8}
\]
\[
\cdots
\]
\[
J_N(q_1,q_2,\dots,q_N) < J_N(q'_1,q_2,\dots,q_N) \tag{2.9}
\]

and by the monotonicity from (2.5) and (2.6),

\[
\bar J_2(q_1,q_2,\bar q_2) < \bar J_2(q'_1,q_2,\bar q_2) \tag{2.10}
\]
\[
\bar J_3(q_1,q_2,q_3,\bar q_3) < \bar J_3(q'_1,q_2,q_3,\bar q_3) \tag{2.11}
\]
\[
\cdots
\]
\[
\bar J_N(q_1,q_2,\dots,q_N,\bar q_N) < \bar J_N(q'_1,q_2,\dots,q_N,\bar q_N) \tag{2.12}
\]

Summing up (2.7), (2.8), ..., (2.12), we reach the contradiction that the Lagrangian cost with $(q_1,\bar q_1,q_2,\bar q_2,\dots,q_N,\bar q_N)$ is smaller than the one with $(q'_1,\bar q_1,q_2,\bar q_2,\dots,q_N,\bar q_N)$; thus, $q'_1$ is not in the optimal path. ∎

Lemma 1 above and Lemma 2 in [34] are used in the pruning steps of our proposed algorithm. The algorithm is based on an IPPP anchor frame coding scheme, as shown in Fig. 2.2(b). For the trellis expansion in anchor and non-anchor frames, refer to Figs. 2.3 and 2.4. In the following algorithm, $q_1^i = \{q_1,q_2,\dots,q_i\}$ is an anchor frame quantizer allocation for views 1 through i. $J_i(q_1^{i-1},q_i)$ is the Lagrangian cost of the anchor frame in View i for a surviving anchor frame quantizer allocation $q_1^{i-1}$ and anchor frame quantizer $q_i$ in View i. $\bar J_i(q_1^i,\bar q_i)$ is the Lagrangian cost of the non-anchor frames in View i for anchor frame quantizer allocation $q_1^i$ and non-anchor frame quantizer $\bar q_i$ in View i. $J(q_1^i,\bar q_1^i)$ is the total cost with quantizer allocations $q_1^i$ and $\bar q_1^i$ for views 1 through i.

Algorithm 1:

1. For View i > 1, generate the Lagrangian cost of the anchor frame, $J_i(q_1^{i-1},q_i)$, for all surviving quantizer allocations $q_1^{i-1}$ and for all choices of $q_i$.
The anchor frame of View 1 is coded independently for all possible quantizer allocations $q_1$, with Lagrangian cost $J_1(q_1)$.

2. Compute $J(q_1^{i-1},\bar q_1^{i-1}) + J_i(q_1^{i-1},q_i)$ and use the pruning conditions of Lemma 1 and of Lemma 2 in [34] to eliminate suboptimal paths up to View i.

3. For View i, generate the non-anchor frame cost $\bar J_i(q_1^i,\bar q_i)$ for each $\bar q_i$, for all surviving allocations $q_1^{i-1}$ and all surviving anchor frame quantizers $q_i$.

4. Find the minimum non-anchor frame cost $\bar J_i(q_1^i,\bar q_i)$ for each $q_1^i$.

5. For View i > 1, compute the total cost $J(q_1^i,\bar q_1^i) = J(q_1^{i-1},\bar q_1^{i-1}) + J_i(q_1^{i-1},q_i) + \bar J_i(q_1^i,\bar q_i)$ for each anchor frame quantizer $q_i$. For View 1, the total cost is $J(q_1,\bar q_1) = J_1(q_1) + \bar J_1(q_1,\bar q_1)$ for each anchor frame quantizer $q_1$.

6. With every surviving path $q_1,q_2,\dots,q_i$, proceed to View i+1 and go to Step 1.

Note that for each anchor frame quantizer in each surviving allocation $q_1^i$, there is a corresponding non-anchor frame quantizer with minimum cost, shown as a thick line in Fig. 2.4. The above algorithm can easily be modified for either IBBP or IBP coding of anchor frames. An additional step required to search for a solution under IBP coding of anchor frames would be to populate the branches between I and P1 with costs $J_{B1}(q_I,q_{P1},q_{B1})$ and $\bar J_{B1}(q_I,q_{P1},q_{B1},\bar q_{B1})$.

Fig. 2.3: Trellis expansion in anchor frames, from View 1 to View N, with quantizers ranging from small q (better quality) to large q (worse quality). The thick line shows one of the anchor frame quantizer allocations.

Fig. 2.4: Trellis expansion in View 1, between anchor and non-anchor frames. For each anchor frame quantizer q, a non-anchor frame quantizer $\bar q$ with minimum cost can be chosen (thick line) and the total cost for each quantizer allocation can be calculated.
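As an illustration of how the Lemma 1 test is applied during the sweep, the sketch below prunes view-1 anchor quantizers from tables of collected costs. The table layout ($J_1$, $\bar J_1$, $J_2$ indexed by quantizer) is our own illustrative choice, not the thesis implementation; the inequality checked is exactly (2.7).

```python
def lemma1_prune(q_levels, J1, J1bar, J2):
    """Prune view-1 anchor quantizers using Lemma 1.

    q_levels: anchor quantizer candidates, where a smaller value means
    finer quantization. J1[q] is the view-1 anchor cost, J1bar[q][qb] the
    view-1 non-anchor cost, and J2[q][q2] the view-2 anchor cost, all
    gathered during the data-collection (encoding) step.
    """
    survivors = set(q_levels)
    for q in sorted(q_levels):           # candidate finer quantizer q1
        for qp in sorted(q_levels):      # candidate coarser quantizer q1'
            if q >= qp or q not in survivors or qp not in survivors:
                continue
            # Condition (2.7), checked for every collected (qbar1, q2) pair:
            # if the finer path is cheaper in all of them, q1' cannot lie on
            # the optimal path and is removed from the trellis.
            if all(J1[q] + J1bar[q][qb] + J2[q][q2]
                   < J1[qp] + J1bar[qp][qb] + J2[qp][q2]
                   for qb in J1bar[q] for q2 in J2[q]):
                survivors.discard(qp)
    return survivors
```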
Fig. 2.5: Relationship between q ($QP_a$) and $\bar q$ ($QP_{na}$) for the Aquarium sequence (Views 1 and 2).

2.2.2 Reduced Search Range

Even though complexity is reduced by taking advantage of the monotonicity property, further reductions are achievable by considering the relationship between the anchor and non-anchor quantizers chosen in an optimal solution. According to our experiments, optimal bit allocations are such that there exists a strong correlation between q and $\bar q$. This is shown in Fig. 2.5, where we plot, for different values of λ, the pair of quantization values for anchor and non-anchor frames that minimizes the Lagrangian cost for the given λ. The exact slope in Fig. 2.5 depends in general on the number of non-anchor frames and on how the anchor frame is encoded.

In what follows we provide an analysis that supports the type of relationship between quantizers that we observe in optimal solutions. Let $Q_1$ and $Q_2$ be the quantization choices made for the anchor frame in a view and for the non-anchor frames in the same view, respectively, where smaller Q means finer quantization. The Lagrangian cost J for that view is then

\[
J = D_1(Q_1) + D_2(Q_1,Q_2) + \lambda\,\big(R_1(Q_1) + R_2(Q_1,Q_2)\big). \tag{2.13}
\]

In order to better understand the properties of the optimal solution, we take the derivatives of J with respect to $Q_1$ and $Q_2$ and set them to zero:

\[
\frac{\partial J}{\partial Q_1} = \frac{\partial D_1}{\partial Q_1} + \frac{\partial D_2}{\partial Q_1} + \lambda\left(\frac{\partial R_1}{\partial Q_1} + \frac{\partial R_2}{\partial Q_1}\right) = 0 \tag{2.14}
\]
\[
\frac{\partial J}{\partial Q_2} = 0 \;\Leftrightarrow\; \lambda = -\frac{\partial D_2}{\partial Q_2} \Big/ \frac{\partial R_2}{\partial Q_2} = -\frac{d_2}{r_2}, \tag{2.15}
\]

where we define $d_i = \partial D_i/\partial Q_i$ and $r_i = \partial R_i/\partial Q_i$. Then, from (2.14) and (2.15),

\[
d_1 r_2 - d_2 r_1 = d_2 \frac{\partial R_2}{\partial Q_1} - \frac{\partial D_2}{\partial Q_1} r_2 \tag{2.16}
\]
\[
\frac{d_1}{r_1} - \frac{d_2}{r_2} = -\frac{1}{r_1}\left(\frac{\partial D_2}{\partial Q_1} + \lambda \frac{\partial R_2}{\partial Q_1}\right) \tag{2.17}
\]

Note that, by the monotonicity property, if $Q_1$ increases while $Q_2$ remains constant, then both $D_2$ and $R_2$ will tend to increase. Thus, $\partial D_2/\partial Q_1 \ge 0$ and $\partial R_2/\partial Q_1 \ge 0$. Because $d_i \ge 0$ and $r_i \le 0$, from (2.17),

\[
\frac{d_1}{r_1} \ge \frac{d_2}{r_2}, \tag{2.18}
\]

so that we can say that

\[
\left|\frac{\Delta D_1}{\Delta R_1}\right| \le \left|\frac{\Delta D_2}{\Delta R_2}\right|. \tag{2.19}
\]

In words, at optimality, the slope of the operating point on the $R_1$-$D_1$ characteristic is smaller than the slope of the operating point on the $R_2$-$D_2$ characteristic. Note that to derive (2.18) we only had to make one assumption, namely that the monotonicity property holds.

Fig. 2.6: Example of R-D curves for the reduced search range: (a) the $R_1$-$D_1$ curve with operating point $(r_1, d_1)$; (b) the $R_2$-$D_2$ curve with operating points $(r_1, d_1)$ and $(r_2, d_2)$.

Given that the slope $|\Delta D/\Delta R|$ of a convex R-D curve decreases as Q decreases (i.e., as the coding quality improves), we can conclude that if $R_1$-$D_1$ and $R_2$-$D_2$ have similar shapes, then from (2.18), at optimality $Q_2 > Q_1$. In our case of interest, $Q_2$ is the quantizer used to encode several non-anchor frames; in this case $|\Delta D_2/\Delta R_2|$ is the slope of the aggregate R-D characteristic. While the absolute values of $R_2$ and $D_2$ are likely to be larger than those of $R_1$ and $D_1$ at a given Q, the shapes of the curves, and the corresponding slopes, can still be assumed to be similar. This approximation agrees well with our observed experimental behavior and provides a tool for complexity reduction. For example, in Fig. 2.6, it is assumed that the $R_1$-$D_1$ and $R_2$-$D_2$ curves are similar. To have $|\Delta D_1/\Delta R_1| \le |\Delta D_2/\Delta R_2|$ as shown in (2.19), the operating point should lie to the left of $r_1$ (i.e., at $r_2$) in Fig. 2.6(b). Because rate decreases and distortion increases as Q increases, the Q for $r_2$ ($Q_2$) should be larger than the Q for $r_1$ ($Q_1$), i.e., $Q_1 \le Q_2$. Based on this observation, Step 3 in Algorithm 1 can be modified as:

3. For View i, generate the non-anchor frame cost $\bar J_i(q_1^i,\bar q_i)$ for each $\bar q_i$, for all surviving allocations $q_1^{i-1}$ and all surviving anchor frame quantizers $q_i$, such that $\bar q_i \ge q_i$.

2.2.3 Search Algorithm for Non-anchor Frames

Up to now, for simplicity, we have assumed that the same quantizer is used for all non-anchor frames. We now propose a non-anchor frame quantizer search algorithm which operates for a given anchor frame quantizer and, to reduce complexity, uses the following property (based on the discussion of the previous section): a frame close to the root of the dependency tree has more influence on the cost, and therefore a finer quantizer should be applied to it. Thus we begin the search with the frame closest to the root. In the following algorithm, $\bar q = \{Q_2,Q_3,\dots,Q_M\}$ is the vector of quantizers allocated to the non-anchor frames in a given view.

Algorithm 2: Dependent coding in each view

1. Given λ and the QP of the anchor frame, $q_0$, initialize $\bar q = \{Q_2,Q_3,\dots,Q_M\} = \{q_0,q_0,\dots,q_0\}$.

2. For frames $i = 2,3,\dots,M$, find $\alpha_i = \dfrac{\partial J}{\partial Q_i} = \left(\sum_{j=i}^{M} \dfrac{\partial D_j}{\partial Q_i}\right) + \lambda\left(\sum_{j=i}^{M} \dfrac{\partial R_j}{\partial Q_i}\right)$.
3. If $\alpha_i < 0$, set $Q_i = Q_i + 1$ and increase any $Q_j$ that is less than $Q_i$, for $j = i+1,\dots,M$. If $\bar J$ decreases, update $\bar q$. Proceed to the next frame.

4. Repeat Steps 2-3 until there is no update of any $Q_i$.

In this algorithm, $\alpha_i$ is calculated for the current $\bar q$. Then, in order to bring $\alpha_i$ closer to 0, we increase $Q_i$ by 1 if $\alpha_i < 0$. Using the property motivated in the previous section, we then also increase any $Q_j$ such that $Q_j < Q_i$ for $j > i$.
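Algorithm 2 translates almost line for line into code. In this sketch, `cost(qbar)` stands for an encoder pass returning the total non-anchor Lagrangian cost $\bar J$ for the quantizer vector $\bar q$, and `alpha(i, qbar)` for the derivative estimate $\alpha_i$ of Step 2; both are hypothetical stand-ins for the data-collection step.

```python
def nonanchor_search(M, q0, cost, alpha):
    """Greedy non-anchor quantizer search of Algorithm 2 for one view."""
    qbar = [q0] * (M - 1)              # Step 1: Q_2..Q_M start at the anchor QP
    best = cost(qbar)
    updated = True
    while updated:                     # Step 4: sweep until nothing changes
        updated = False
        for i in range(M - 1):         # frames 2..M, closest to the root first
            if alpha(i, qbar) >= 0:    # Steps 2-3: coarsen only when alpha_i < 0
                continue
            cand = list(qbar)
            cand[i] += 1
            for j in range(i + 1, M - 1):
                if cand[j] < cand[i]:  # keep later frames at least as coarse
                    cand[j] = cand[i]
            new_cost = cost(cand)
            if new_cost < best:        # accept the move only if Jbar decreases
                qbar, best, updated = cand, new_cost, True
    return qbar
```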
2.3 Simulation Results

Using the H.264/AVC reference codec, we encoded the Aquarium multiview sequence from Tanimoto Lab, shown in Fig. 2.7, with three different coding schemes: SC as in Fig. 2.2(a), MVC as in Fig. 2.2(b) with fixed QP, and MVC with QP optimized using the proposed Algorithm 1. In the experiment, all non-anchor frames in a view were assigned the same quantizer. Two different coding conditions are used: first, all possible block sizes can be used and intra coding is enabled (C1); second, only the 8×8 block size is used and intra coding is disabled except for I frames (C2). The first 7 frames of Views 1, 2 and 3 are used in the experiment.

As can be seen in Fig. 2.8, the proposed algorithm provides a gain of 0.5 dB as compared to MVC with fixed QP. In the trellis expansion, six quantizers are selected as candidates for anchor frames, and only three quantizers are selected for non-anchor frames, using the correlation between anchor and non-anchor quantizers.

Note that in C2, intra coding is disabled except for I frames; thus, the dependencies between frames are stronger than in C1. In C2, the proposed algorithm achieves higher coding gains (e.g., up to 1 dB compared to MVC) than in C1.

2.4 Conclusions

In this chapter, a 2-D bit allocation scheme was proposed. The complexity of data generation in the trellis expansion is significant due to the increased dimensionality in MVC. We extend the monotonicity property from [34] and use it to prune suboptimal quantizers. Complexity can be reduced further using the fact that optimal solutions tend to show correlation between the quantizers of anchor and non-anchor frames. The proposed algorithm with reduced complexity achieves 0.5 dB gains as compared to MVC.

Fig. 2.7: Aqua sequence.

Fig. 2.8: SC vs. MVC with fixed QP vs. MVC by proposed Algorithm 1 (bits per frame vs. PSNR by average MSE, for Proposed, MVC and SC under conditions C1 and C2). The average number of bits for 21 frames is used and PSNR is calculated as $10\log_{10}(255^2/(\text{average MSE for 21 frames}))$. λ is 200, 500 and 900 with C1, and 300, 700 and 2500 with C2, in the proposed algorithm.

Chapter 3
Illumination Compensation in Multi-View Video Coding

3.1 Preliminaries

In Chapter 2, we addressed the dependency problem in multi-view video coding (MVC) and proposed a quantizer search method. That optimization is performed at the frame level, according to the multi-view sequence structure and coding scheme. In this chapter and in Chapter 4, we move down to the block level within a frame and propose methods that improve the quality of the estimate of the original block, in order to improve coding efficiency in MVC.

In block-based predictive coding, a frame is first divided into blocks; then, for each block, the most correlated match (the predictor) is searched for in the reference frames, and the residual error between the original and the best match is encoded and transmitted together with signaling information (i.e., the motion vector). (By predictor, we mean the selected estimate of the current block after the motion/disparity search in the references. Note that a prediction is also used to select the center of the motion/disparity search; it is obtained from the motion/disparity vectors of neighboring blocks using spatial correlation, and we refer to it as the motion/disparity vector predictor in this work.) These block-based approaches exploiting the correlation between frames are also applied to disparity estimation and compensation in cross-view prediction, e.g., [3].

Fig. 3.1: Camera arrangement that causes local mismatches: objects A and B are seen by cameras 1-3 at different angles and scene depths ($z_1$, $z_3$).

While motion in temporal prediction is caused by the displacement of objects, the disparity in cross-view prediction is determined by the depths of objects in the scene and by the displacement and orientation of the cameras. Generally, disparity in cross-view prediction is known to be more difficult to compensate than motion, because of the irregularity of the disparity field [5] and the severe occlusion effects caused by different object depths. In contrast, in temporal prediction, most of the background is static and only moving objects need to be motion compensated. Furthermore, frames from different views are prone to suffer from mismatches other than disparity. We now consider these other mismatch cases.

Firstly, in a generic multi-view video capturing system, we cannot assume that perfect calibration is achieved among the different cameras, because there are too many variables to be adjusted, including the intrinsic camera parameters. These heterogeneous cameras can cause global (frame-wise) mismatches among the different views, which can manifest themselves in both luminance and chrominance channels. For example, frames in one view may appear brighter and/or out of focus as compared to frames from another view, due to mis-calibration.

Secondly, even if camera calibration is perfect, objects may appear differently in each view due to the camera locations and orientations. Consider the camera arrangement in Fig. 3.1: object A is projected onto camera 1 and camera 3 at different angles, and therefore causes different reflection effects with respect to the cameras. In such cases, different portions of a video frame can undergo different illumination changes with respect to the corresponding areas in frames from the other views.

Fig. 3.2: Illumination mismatches in the ST sequence: (a) first frame of view 3; (b) first frame of view 4; (c) histogram of the frame in (a), with mean grayscale value 131; (d) histogram of the frame in (b), with mean grayscale value 122.

Fig. 3.2 demonstrates illumination mismatches between two views from the ST sequence. In Figs. 3.2(a) and 3.2(b), severe illumination mismatches can be observed in the background, which correspond to the different maximum pixel intensities in Figs. 3.2(c) and 3.2(d). However, the minimum pixel intensities of the different views are similar (e.g., the person's clothes). The two histograms in Figs. 3.2(c) and 3.2(d) show similar shapes with local variations, which are caused by global and local illumination mismatches. The average pixel intensities of View 3 and View 4 are 131 and 122, respectively.
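Statistics like those in Fig. 3.2 are straightforward to compute; a small sketch, assuming 8-bit grayscale frames held in NumPy arrays:

```python
import numpy as np

def illumination_stats(frame_a, frame_b):
    """Frame-level luminance comparison between two views (cf. Fig. 3.2)."""
    hist_a, _ = np.histogram(frame_a, bins=256, range=(0, 256))
    hist_b, _ = np.histogram(frame_b, bins=256, range=(0, 256))
    # A global mismatch appears as a shift between the mean intensities;
    # local mismatches remain as shape differences between the histograms.
    return frame_a.mean(), frame_b.mean(), hist_a, hist_b
```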
In addition to illumination mismatches between cross-view frames, focus may change from one view to another [21]. In Fig. 3.1, Object A is at a greater scene depth ($z_1$) in View 1 than in View 3 ($z_3$). Even if all cameras are perfectly calibrated with the same focus at scene depth $z_1$, Object A appears focused in View 1 while it is de-focused (blurred) in View 3. On the other hand, Object B will appear sharpened in View 3 as compared to View 1.

All these factors lead to discrepancies among the video sequences in different views. The efficiency of cross-view disparity compensation can deteriorate in the presence of these mismatches. In this work, we focus on techniques for illumination compensation, in order to improve coding efficiency in the presence of illumination mismatches between views.

3.2 Related Work

Various approaches have been proposed for monoscopic video coding to address illumination changes in temporal prediction. In [19], illumination is compensated in two steps. First, illumination mismatch is compensated globally using a decimated image (which contains the DC coefficients of all blocks); then block-wise compensation is applied. In both steps, multiplicative and additive terms are used. This two-step compensation is applied only to frames classified as having large illumination mismatches, which do not occur as frequently in monoscopic temporal prediction as in cross-view prediction in MVC. Note also that the local compensation is not fully integrated into the search step, and that efficient coding of the mismatch parameters is not provided. In [32], an illumination component and a reflectance component are both compensated using scale factors that are quantized and Huffman coded. This illumination model is useful for contrast adjustment but cannot properly model the severe mismatches in MVC. In [15], brightness variation is modeled by two parameters, for the multiplier field and the offset term, respectively. These parameters are used globally for whole frames. To reduce the impact of local brightness variation, a set of parameters is collected and a pair is chosen based on the relative frequency of all parameter pairs. Illumination compensation is deactivated for those blocks for which the selected parameters are not efficient. This global approach cannot adapt to some of the large luminance variations in MVC, which depend on the relative positions of the cameras and objects. Recently, in [23], illumination mismatches are compensated using scale and offset parameters, which is similar to the approach proposed in this work. Mismatch parameters are computed as part of the motion search, and are differentially coded and selectively activated. However, this approach mainly targets illumination compensation in video sequences where luminance changes progressively or due to abrupt changes in lighting, e.g., a flash, which can be compensated by a global model (e.g., weighted prediction). In cross-view frames, illumination mismatches are caused by heterogeneous cameras and by different depths and perspectives, which leads to both local and global mismatches.

Weighted prediction (WP) methods have been proposed and adopted in H.264/AVC [6]. Multiplicative weighting factors and additive offsets are applied to the motion compensated prediction. Depending on whether the two parameters are coded for each reference picture index or derived from the relative picture order count, these techniques are categorized as explicit or implicit, respectively. This global approach provides significant bitrate reduction when coding fades in monoscopic video. However, for multi-view video, where severe local variations are present, WP does not provide efficient compensation.

Next, block-based illumination compensation (IC) techniques [17,25,26] are presented. These were originally addressed in [25]. In [26], vector quantization of the two parameters was proposed.
In [17], we proposed using only an additive term, considering the trade-off between computational complexity and coding efficiency. We start by defining an illumination model, and then derive a coding scheme that efficiently compensates for illumination changes across views.

3.3 Illumination Compensation Model

Block-wise disparity search aims to find the block in the reference frame that best matches a block in the current frame, leading to minimum residual error after prediction. Under severe illumination mismatch conditions, coding efficiency will suffer because i) the residual energy for the best match candidate will generally be higher, and ii) the true disparity is less likely to be found, leading to a more irregular disparity field and likely increases in the rate needed to encode the disparity field.

As described previously, illumination mismatches can be local in nature. Thus, we adopt a local IC model to compensate both global and local luminance variation in a frame. The IC parameters are estimated as part of the disparity vector search, and these parameters are differentially encoded for transmission to the decoder, in order to exploit the spatial correlation of the illumination mismatch. Finally, a decision is made to activate IC on a block-per-block basis, using a rate-distortion criterion.

When considering pixels corresponding to a given object captured by different cameras, the observed illumination mismatches need not be the same for all pixels, and will depend in general on the continuous plenoptic and radiance functions [4]. However, since the goal is to transmit explicit illumination mismatch information to the decoder, block-wise IC models are adopted, with the optimal block size decided based on R-D cost. As an initial step, we evaluate a simple block-wise affine model, with an additive offset term C and a multiplicative scale factor S, leading to a mismatch model Ψ = {S, C} as proposed in [15]. For the original block signal to be encoded, x, the i-th predictor candidate $p_i$ in the reference frames can be decomposed into the sum of its mean $\bar p_i$ and a zero-mean signal $p_i^0$: $p_i(x,y) = \bar p_i + p_i^0(x,y)$, where (x, y) is the pixel location within the block. Then the illumination compensated predictor $\hat p_i(x,y)$ with IC model $\Psi_i$ is

\[
\hat p_i(x,y) = \left[\bar p_i + C_i\right] + S_i \cdot p_i^0(x,y). \tag{3.1}
\]

This formulation allows us to separate the effect of each parameter, so that DC and AC mismatches are compensated separately. Furthermore, by applying the multiplicative compensation to the mean-removed prediction in (3.1), we avoid the propagation of quantization error from scale to offset [26].

As shown in Fig. 3.3, for the original block signal we look for the best matching predictor within the search range in the reference frame, using a modified matching metric that incorporates an IC model between the original block and a predictor candidate.

Fig. 3.3: Modified search loop for the current block: for each predictor candidate $p_i$ in the references, the optimal illumination model $(S_i, C_i)$ is found and the matching cost $SADAC_i$ is computed; whenever the cost is the lowest so far, the current best match (p, S, C) is updated, and the loop moves to the next candidate.

This new metric, the sum of absolute differences after compensation (SADAC), essentially computes the SAD between the original block and the predictor to which IC has been applied. Thus, for each predictor candidate, the optimal IC parameters have to be computed.
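In outline, the modified search loop of Fig. 3.3 looks as follows; `optimal_ic` and `sadac` correspond to the per-candidate steps derived next, and the candidate generation is left abstract (a sketch, not the reference software):

```python
def ic_disparity_search(x, candidates, optimal_ic, sadac):
    """Pick the predictor with the lowest SAD after compensation (Fig. 3.3)."""
    best = None
    for p in candidates:          # predictor candidates in the references
        s, c = optimal_ic(x, p)   # fit the IC model {S_i, C_i} to this candidate
        cost = sadac(x, p, s, c)  # SADAC: SAD between x and the compensated p
        if best is None or cost < best[0]:
            best = (cost, p, s, c)   # update the current best match (p, S, C)
    return best
```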
While SADAC is used for the search with IC, similarly to how SAD is used in H.264/AVC, a quadratic metric, namely the sum of squared differences after compensation (SSDAC), is used to find the IC parameters.² For the original signal $\bar{x}$ and the illumination compensated $i$-th predictor candidate $\hat{p}_i$, the SSDAC is defined as

$$\mathrm{SSDAC}_i \equiv \sum_{\forall(x,y)} |\bar{x}(x,y) - \hat{p}_i(x,y)|^2. \qquad (3.2)$$

Replacing $\hat{p}_i$ using (3.1) and separating the mean from $\bar{x}$, we have

$$\mathrm{SSDAC}_i = \sum_{\forall(x,y)} \big|\,[\mu_{\bar{x}} - \mu_{\bar{p}_i} - C_i] + [\bar{x}'(x,y) - S_i \cdot \bar{p}'_i(x,y)]\,\big|^2. \qquad (3.3)$$

² To reduce the computational complexity of the search step, SADAC is used instead of SSDAC, which is only used to find the optimal IC parameters. However, for normal and Laplace distribution models of the residual error, the same search results will be obtained with the two metrics under the conditions discussed in Appendix A.

Then the optimal IC parameters $\Psi_i = \arg\min_{\{S_i, C_i\}} \mathrm{SSDAC}_i$ can be obtained by setting the gradient of (3.3) to zero:

$$S_i = \frac{\sigma^2_{\bar{x}\bar{p}_i}}{\sigma^2_{\bar{p}_i\bar{p}_i}}, \qquad (3.4)$$

$$C_i = \mu_{\bar{x}} - \mu_{\bar{p}_i}, \qquad (3.5)$$

where

$$\sigma^2_{AB} = \frac{1}{N} \sum_{\forall(x,y)} [A(x,y) - \mu_A][B(x,y) - \mu_B], \qquad (3.6)$$

with $A, B \in \{\bar{x}, \bar{p}_i\}$ and $N$ the number of pixels in the block.

This solution shows that the additive parameter directly removes the offset mismatch, while the multiplicative parameter compensates zero-mean variations according to the block statistics. If the mean-removed current and reference blocks are not highly correlated with each other, this scale factor will be small, and thus only the additive offset compensation will affect the reference block.

Among all candidates within the search range, the predictor $\bar{p}$ minimizing SADAC with IC parameters is selected as the best match, and the minimum SSDAC is given as follows:

$$\widehat{\mathrm{SSDAC}} = N \cdot \Big(\sigma^2_{\bar{x}\bar{x}} - \frac{\sigma^4_{\bar{x}\bar{p}}}{\sigma^2_{\bar{p}\bar{p}}}\Big) = N \cdot \sigma^2_{\bar{x}\bar{x}} \cdot (1 - \rho^2), \qquad (3.7)$$

where $\rho$ is the correlation coefficient between the original block signal $\bar{x}$ and the predictor $\bar{p}$.

3.4 Illumination Mismatch Parameter Coding

Using both scale and offset parameters gives more flexibility in compensating for illumination mismatches, but may not be efficient for coding, given the overhead required to represent both IC parameters. In our observation, the scale parameter is also sensitive to quantization noise because of its multiplicative nature, given that

$$\widetilde{\mathrm{SSDAC}} = \widehat{\mathrm{SSDAC}} + N \cdot \Delta C^2 + N \cdot \Delta S^2 \cdot \sigma^2_{\bar{p}\bar{p}}, \qquad (3.8)$$

where $\widetilde{\mathrm{SSDAC}}$ is the SSDAC after quantization of the IC parameters, $\widehat{\mathrm{SSDAC}}$ is the minimum SSDAC in (3.7), $N$ is the number of pixels in the block, and $\Delta C$ and $\Delta S$ are the quantization noise of the offset and scale parameters, respectively. The quantization noise of the scale parameter is multiplied by the variance of the predictor, $\sigma^2_{\bar{p}\bar{p}}$; thus, even small quantization errors in the scale parameter can lead to fairly large differences in the compensated reference block. Taking this into account, as well as the complexity involved in calculating this parameter within the disparity search step, in the rest of this work we use only the offset parameter for IC.

To encode the offset parameter, we exploit the correlations between the illumination compensation parameters of neighboring blocks. As a predictor of the IC parameter of a block, we use the IC parameter of the block to its left; this allows prediction to be performed in a causal manner. If the left block was not encoded using IC, the block above is used instead as a predictor. If IC is disabled for both of these blocks, then no prediction is used to encode the IC parameter for the current block (equivalently, the predictor is set to zero). The prediction residue is quantized and then encoded. We use a simple uniform quantizer, which offers good performance and low complexity.
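As an illustration of the closed-form solution (3.4)-(3.6), the sketch below estimates the optimal scale and offset for one predictor candidate and evaluates the resulting SSDAC. It is a simplified NumPy sketch under our own naming, not the codec's search code; with these parameters, the returned residual energy matches $N\sigma^2_{\bar{x}\bar{x}}(1-\rho^2)$ of (3.7) up to floating-point error.

```python
import numpy as np

def estimate_ic_params(x, p):
    """Closed-form IC parameters of (3.4)-(3.5) for the original block x
    and a predictor candidate p (both 2-D float arrays)."""
    cov_xp = np.mean((x - x.mean()) * (p - p.mean()))  # sigma^2_{xp}, eq. (3.6)
    var_p = np.mean((p - p.mean()) ** 2)               # sigma^2_{pp}
    S = cov_xp / var_p if var_p > 0 else 0.0           # eq. (3.4)
    C = x.mean() - p.mean()                            # eq. (3.5)
    return S, C

def ssdac(x, p, S, C):
    """SSDAC of (3.2) for the compensated predictor of (3.1)."""
    p_hat = (p.mean() + C) + S * (p - p.mean())
    return np.sum((x - p_hat) ** 2)
```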
This quantized differential offset is encoded using the context adaptive binary arithmetic coder (CABAC) [27], which consists of (i) binarization, (ii) context modeling and (iii) binary arithmetic coding. We first separate the absolute value (val) and the sign of these quantized differential offsets. Then, the absolute values of the quantized offsets are binarized using a unary representation as in Tab. 3.1. These representations of the symbols (IC parameters) reduce the alphabet size and enable context modeling at a sub-symbol level [27].

Tab. 3.1: Unary binarization and assigned probability for the index of the quantized differential offset

  Absolute value (val) | Bin 1 | Bin 2 | Bin 3 | Bin 4 | ...
  0                    |   0   |       |       |       |
  1                    |   1   |   0   |       |       |
  2                    |   1   |   1   |   0   |       |
  3                    |   1   |   1   |   1   |   0   |
  ...                  |  ...  |  ...  |  ...  |  ...  |
  Assigned probability |  P1   |  P2   |  P3   |  P4   |

The differential offset parameters are prediction residues, which tend to be small and exhibit a symmetric distribution around zero, with very limited spatial correlation. Therefore, different probability models are used for the different binary symbol positions of val, as shown in Tab. 3.1. The number of different probability models for the binary symbols in val is chosen to be four, experimentally; bins corresponding to val greater than 3 use the same probability model. All probability models are initialized with equal symbol probability and updated according to the binary symbols to be coded. Arithmetic coding is also used for the sign, with a probability model initialized with equal symbol probability.

Clearly, different blocks suffer from different levels of illumination mismatch, so that the potential R-D benefits of using IC differ from block to block. Thus we allow the encoder to decide whether or not the IC parameters are used on a block-by-block basis. This is achieved by computing the R-D values associated with coding each block with and without IC, and then letting the Lagrangian optimization tools in the H.264/AVC codec make an R-D optimal decision. There is an added overhead needed to indicate for each block whether IC is used, but this is more efficient overall than sending IC parameters for all blocks. This IC activation bit is also entropy-encoded using CABAC. The context is defined based on the activation choices made for the left and upper blocks. If IC is enabled or disabled in both of these blocks, it is highly probable that the same choice will be made for the current block. However, if only one of these two neighboring blocks uses IC, the probability of the current block using IC should be close to 1/2. Based on this observation, three contexts are assigned and initialized for the activation switch, similarly to the context setup for the Skip flag or the transform size in H.264/AVC. Fig. 3.4 illustrates how the context of the current block is defined by the IC activation bits of the left and upper blocks, and Tab. 3.2 shows the initialization of the context for the IC activation bit.

Fig. 3.4: Context of the current block, $c = a + b$, where $a, b \in \{0, 1\}$ are the IC activation bits of the left and upper blocks.

Tab. 3.2: Initialization of the context for the IC activation bit with different most probable symbol (MPS)

  Context c | MPS | Probability
  0         |  0  | 1
  1         |  0  | 1/2
  2         |  1  | 1
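The binarization of Tab. 3.1 and the context rule of Fig. 3.4 are simple to state in code; the sketch below reproduces both (our own function names; the arithmetic-coding engine itself is omitted).

```python
def binarize_offset(val):
    """Unary binarization of Tab. 3.1 for the absolute value of the
    quantized differential offset: val ones followed by a terminating zero."""
    return [1] * val + [0]

def ic_context(left_ic, upper_ic):
    """Context index of Fig. 3.4: c = a + b, where a and b are the IC
    activation bits of the left and upper neighboring blocks."""
    return int(left_ic) + int(upper_ic)

assert binarize_offset(3) == [1, 1, 1, 0]
assert ic_context(True, False) == 1
```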
Tab. 3.3: Number of additions/subtractions for SAD and SADAC. N is the number of pixels in a macroblock and S is the number of search points.

  SAD (original)
    Σ_{∀(x,y)} |x̄(x,y) − p̄_i(x,y)|           → 2N
    For S search points                         → 2NS
    Total                                       → 2NS

  SADAC (IC enabled)
    μ_x̄ (once per block)                       → N
    μ_p̄_i                                      → N (or 0 in Fast IC mode)
    C_i = μ_x̄ − μ_p̄_i                         → 1
    Σ_{∀(x,y)} |x̄(x,y) − p̄_i(x,y) − C_i|     → 3N
    For S search points                         → 4NS + S
    Total                                       → N + S + 4NS ≈ 4NS

3.5 Complexity of IC

The impact of IC on encoding complexity is mostly due to changes in the disparity estimation metric computation (other changes to the encoder, such as the encoding of the IC parameter and the R-D based IC activation, have a negligible effect on overall complexity). Thus, in what follows, the additional complexity of IC is analyzed in terms of the number of addition/subtraction operations in the SAD calculation.

As can be seen in Tab. 3.3, for N pixels in a macroblock and S search points, in each block mode IC requires 4NS operations for SADAC, while 2NS are required for SAD. For SAD, the differences between current and reference pixels (N of them) are calculated first. After the absolute value operation, the N absolute differences are added to compute the SAD, which requires a total of 2N operations per search point. Similarly, for SADAC a total of 4N operations is required per search point, including the μ_x̄, μ_p̄_i and C_i calculations. For the mean calculation, we need to sum N pixels, which requires N additions. Throughout the analysis, the shift operation in the mean calculation and the absolute value operation are not counted. Assuming the center of the search for different block modes does not deviate significantly, the μ_p̄_i computed for small blocks can be reused in larger blocks, avoiding redundant calculations. For example, by storing μ_p̄_i for the 8×8 blocks, the calculation of μ_p̄_i in a larger block (e.g., 16×16, 16×8 and 8×16) can be simplified to the sum of the μ_p̄_i of its 8×8 blocks (4, 2 and 2 of them, respectively) when a predictor candidate comes from the same location. Thus, the SADAC complexity can be lowered from 4NS to 3NS (Fast IC mode). Considering the different block modes supported in H.264/AVC, the complexity of the SAD calculation is summarized in Tab. 3.4.

Tab. 3.4: Complexity of the SAD calculation in different block modes. N is the number of pixels in a macroblock and S is the number of search points. Note that in Fast IC mode, μ_p̄_i in each 8×8 block is saved to be reused in the larger block modes, so that 4NS operations are required in the 8×8 block mode and 3NS in the 8×16, 16×8 and 16×16 block modes.

  Block mode | Original | IC   | Fast IC
  4×4        | 2NS      | -    | -
  4×8        | 2NS      | -    | -
  8×4        | 2NS      | -    | -
  8×8        | 2NS      | 4NS  | 4NS
  8×16       | 2NS      | 4NS  | 3NS
  16×8       | 2NS      | 4NS  | 3NS
  16×16      | 2NS      | 4NS  | 3NS
  TOTAL      | 14NS     | 16NS | 13NS

For IC, both SAD and SADAC need to be calculated for IC activation; thus, the total complexity with IC would be the sum 14NS + 16NS (or 13NS in Fast IC mode). Therefore the total complexity with IC is about 2.1 (or 1.9 in Fast IC mode) times that of H.264/AVC without IC. However, this complexity can be reduced further by noting that the same search range is used for SAD and SADAC with IC. In the calculation of SADAC in Tab. 3.3, the pixel differences x̄(x,y) − p̄_i(x,y) can be reused to calculate SAD, so that SAD and SADAC are calculated at the same time for the same search point, which requires only N operations for SAD instead of 2N. This leads to a total complexity with IC in fast mode of about 1.64 times that of H.264/AVC without IC, as can be seen in Tab. 3.5.

Tab. 3.5: Complexity when SAD and SADAC are calculated at the same time. N is the number of pixels in a macroblock and S is the number of search points.

  Block mode | SAD+SADAC | SAD+SADAC in Fast IC mode
  4×4        | 2NS       | 2NS
  4×8        | 2NS       | 2NS
  8×4        | 2NS       | 2NS
  8×8        | 5NS       | 5NS
  8×16       | 5NS       | 4NS
  16×8       | 5NS       | 4NS
  16×16      | 5NS       | 4NS
  TOTAL      | 26NS      | 23NS
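The reuse of the pixel differences is easy to see in code: the sketch below evaluates SAD and SADAC together in a single pass over the block, computing the differences x̄ − p̄_i once. It is a minimal NumPy illustration of the saving summarized in Tab. 3.5 (our own naming, offset-only IC as used in this work).

```python
import numpy as np

def sad_and_sadac(x, p):
    """Evaluate SAD and SADAC jointly for one search point, reusing the
    pixel differences d = x - p (N subtractions instead of 2N)."""
    d = x - p                    # computed once, shared by both metrics
    sad = np.abs(d).sum()        # SAD over the raw differences
    C = d.mean()                 # offset C_i = mean(x) - mean(p), eq. (3.5)
    sadac = np.abs(d - C).sum()  # SAD after offset-only compensation
    return sad, sadac
```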
When a fast mode decision algorithm is used, not all block sizes are tested. For example, in [46], the 16×16, 8×8 and 4×4 block modes are examined first and their R-D costs are used to decide which block modes are tested further. The complexity of IC in this case can be evaluated by adding the complexities of those block modes that are tested (obtained from Tab. 3.5).

3.6 Simulation Results

The three sequences used in our experiments, Ballroom, Race1 and Rena, have different characteristics [1]. All test sequences are 640(w)x480(h) with 8 views, captured by an array of 8 cameras located horizontally (1-D) with viewing directions in parallel. Sample frames of the test sequences are shown in Fig. 3.5.

Fig. 3.5: MVC sequences, 1D/parallel: (a) Ballroom, 8 cameras with 20 cm spacing; (b) Race1, 8 cameras with 20 cm spacing; (c) Rena, 8 cameras with 5 cm spacing. Sequences are captured by an array of cameras located horizontally with viewing directions in parallel.

Ballroom has the most complicated background and fast moving objects. Objects are located at multiple depths, and the distance from the camera to the front objects is small, so the disparity of the front objects is large. In Race1, a mounted and fixed camera array is used to follow racing carts, so there is global motion. Significant luminance and focus changes between views are observed due to imperfect camera calibration, and illumination changes are also observed over time because of the global motion of the camera. In Rena, a gymnast moves fast in front of curtains. The distance between cameras is smaller than in the other sequences, and luminance and focus changes between views are clearly observed.

Our proposed IC technique is combined with standard H.264/AVC [14] coding tools. IC is enabled only for 16×16, 16×8, 8×16 and 8×8 blocks. While the encoder could be given the option to use IC on smaller blocks as well, we observed that this choice was rarely made and thus, for complexity reasons, we choose 8×8 to be the smallest block size. IC can also be applied in Skip/Direct mode, so that model parameters are predicted from neighboring blocks using spatial correlation [22].

Using the reference codec JM-10.2 [14] as a starting point, we encode frames in the cross-view direction only, i.e., we take a sequence of frames captured at the same time from different cameras and feed this to the encoder as if it were a temporal sequence. The 8 frames at time stamp 0 are concatenated with the 8 frames at time stamp 10. These 16 frames are concatenated again with the 8 frames at time stamp 20. By repeating this procedure, we generate a sample sequence with 40 frames from time stamps 0, 10, 20, 30 and 40. By setting the intra period to the number of views, the sample sequences are encoded with cross-view prediction.

Fig. 3.6: Cross-view coding with IC at time stamps 0, 10, 20, 30, 40. [R-D curves (PSNR in dB vs. Kb/frame, IPPPPPPP) for IC, WP and H.264/AVC with 1 reference: (a) Ballroom (views 0-7), (b) Race1 (views 0-7), (c) Rena (views 38-45).]

Tab. 3.6: Percentage of non-intra selection in cross-view prediction (% in H.264 → % in H.264+IC). Note that more significant PSNR increases can be observed for those sequences where the increase in the number of inter-coded blocks is greater.

  Sequence | QP24        | QP28        | QP32        | QP36
  Ballroom | 68.8 → 75.2 | 72.3 → 79.9 | 73.3 → 81.8 | 77.2 → 85.6
  Race1    | 53.1 → 71.1 | 53.4 → 71.2 | 53.6 → 72.6 | 54.6 → 73.9
  Rena     | 53.0 → 66.9 | 54.0 → 70.3 | 56.0 → 72.3 | 62.5 → 72.8
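The construction of the cross-view sample sequence described above amounts to a simple reordering of the captured frames. The following snippet sketches the frame order fed to the encoder; it only produces (view, time stamp) indices and the helper name is hypothetical.

```python
def cross_view_order(num_views=8, time_stamps=(0, 10, 20, 30, 40)):
    """Frame order for cross-view coding: for each time stamp, the frames
    of all views are concatenated, so each group of num_views frames is
    coded as one intra period (IPPPPPPP across views)."""
    return [(view, t) for t in time_stamps for view in range(num_views)]

# 40 frames: views 0..7 at time 0, then views 0..7 at time 10, and so on.
print(cross_view_order()[:10])
```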
We performed simulations with full search, a range equal to ±64 pixels, quarter-pixel precision and 1 reference frame, and tested four different QP values (24, 28, 32, 36) to obtain the rate points in Fig. 3.6. It can be seen that for Race1 and Rena there is a significant improvement from using IC (0.8 dB) compared to the H.264/AVC results, because of the severe illumination mismatch across views. Ballroom, instead, showed a small improvement (0.2 dB). Ballroom is the most difficult sequence to encode because of its complicated background and irregular disparity field, due to the large variance in object depths; in addition, its illumination mismatch is not significant compared to the other sequences. It can be seen that WP does not provide significant coding gains because it cannot compensate the severe local mismatches in cross-view prediction. From Tab. 3.6, we can see that the number of blocks in Inter and Skip mode increases once IC is used, which means that the disparity search finds more correct matches after compensation. Note that IC gains can be observed even at low bit rates because the selection of IC in each block is optimized based on R-D criteria.

In MVC, multiple references from different times and views are available. For example, if the current frame is at View 2 and time stamp 1 (V2T1) as shown in Fig. 3.7, 4 references are available for the current B slice: (V2T0), (V2T2), (V1T1) and (V3T1).

Fig. 3.7: Example of multiple references from different time stamps and views. [Frames VxTy are arranged on a view/time grid; the current frame V2T1 can be predicted from V2T0 and V2T2 in time, and from V1T1 and V3T1 across views.]

In [40], in addition to the two reference lists (L0 and L1) of H.264/AVC, two view reference lists (VL0 and VL1) are proposed to enable both temporal and cross-view prediction. In [42], a prediction structure using the hierarchical B pictures proposed in [30] is adopted as a reference encoder for multi-view video coding. The size of the decoded picture buffer (DPB) is increased to store additional reference frames from the other views. The coding structure can be specified using the configuration file of H.264/AVC. An example of this prediction structure is shown in Fig. 3.8, with 8 views and GOP length 8. IBPBPBPP is used for cross-view prediction in the anchor frames, and hierarchical B is used in temporal prediction. In the even-numbered views of the non-anchor frames, only temporal prediction is used; in the odd-numbered views of the non-anchor frames, both temporal and cross-view prediction are used. Note that all B frames in Fig. 3.8 are encoded as B-store frames, i.e., they can be used as references.

Fig. 3.8: Prediction structure for multi-view video coding with 8 views and GOP length 8. [Anchor frames at T0 and T8 are coded I0/P0/B1 across views V0-V7 (IBPBPBPP); non-anchor frames at T1-T7 use hierarchical B pictures B1-B4 in time.]

Although the IC techniques primarily aim at compensating illumination mismatches in cross-view prediction, they can easily be used to compensate illumination mismatches in temporal prediction, which occur with moving objects and abrupt scene changes. With the prediction structure described in Fig. 3.8, IC is implemented so that it can be applied in both temporal and cross-view prediction [18].
Tab. 3.7: Temporal partitioning of the test data sets

  Data set | Temporal partitioning
  Ballroom | 250 frames = 20 × GOP 12 + GOP 9
  Race1    | 532 frames = 35 × GOP 15 + GOP 6
  Rena     | 300 frames = 19 × GOP 15 + GOP 14

Fig. 3.9 provides coding results for the parameters (GOP length and total number of frames) of Tab. 3.7. For Ballroom, Race1 and Rena, IC achieves 0.1-0.5 dB gains. The overall gains from using IC (as compared to using the same temporal/cross-view prediction but no IC) are lower than in the case where only cross-view prediction is used (Fig. 3.6), because illumination mismatches between frames in time are not as severe as those across views, and most of the static background can be efficiently encoded by Skip/Direct mode in temporal prediction. Complete simulation results of the proposed IC in MVC for various multi-view test sequences can be found in [18].

Fig. 3.9: Multi-view coding with IBPBPBPP cross-view, hierarchical B temporal [2]. [R-D curves (PSNR in dB vs. Kbps) with and without IC: (a) Ballroom, (b) Race1, (c) Rena.]

Fig. 3.10 demonstrates coding results of IC and WP in MVC. In this comparison, 73, 76 and 31 frames/view (rather than the complete sequences of Tab. 3.7, to lower encoding complexity) are encoded for Ballroom, Race1 and Rena, respectively. IC achieves higher coding efficiency than WP; in particular, for Race1, IC achieves a 0.5 dB gain over WP. More detailed comparisons of IC with WP in MVC for various multi-view test sequences can be found in [22].

Fig. 3.10: Comparison of IC with WP in MVC with IBPBPBPP cross-view, hierarchical B temporal [2]. [R-D curves (PSNR in dB vs. Kbps) for IC, WP and H.264/AVC: (a) Ballroom (73 frames/view), (b) Race1 (76 frames/view), (c) Rena (31 frames/view).]

3.6.1 Combined Solution with ARF

In [21], adaptive reference filtering (ARF) is proposed to compensate focus mismatch in cross-view prediction. To compensate both illumination and focus mismatches in cross-view prediction, the IC and ARF techniques are combined [17]. Mean-removed search (MRS) is adopted to remove redundancies in the combined system, so that DC and AC compensation are performed by IC and ARF, respectively. From the matches found by MRS (first search), the ARF filter coefficients are calculated and additional reference frames are generated by filtering the original reference frame. Finally, IC is applied in the disparity search (second search) with the original reference frame and the reference frames generated by ARF. Since the different filtered references created by ARF come from the same original reference frame, the disparity fields obtained from the first (MRS) and second (IC) searches should not be very different. Complexity reduction can be achieved by taking the disparity field obtained from the MRS search as a predictor for the second search with a much reduced search range.

Under the same coding conditions used for Fig. 3.6, we encode frames in cross-view prediction using the combined system. The simulation results are shown in Fig. 3.11.
For Ballroom, block-wise IC alone provides very limited gain, so the combined system also barely outperforms ARF coding. On the other hand, ARF and IC each achieve 0.5~0.8 dB gains for Race1 and Rena, and the combined system produces an additional 0.5 dB gain over either IC only or ARF only. The overall coding gain, as compared to using H.264/AVC with 1 reference for cross-view coding, is about 0.5 dB for Ballroom, about 1.3 dB for Race1 and about 1 dB for Rena.

Fig. 3.11: Cross-view coding with H.264/AVC, IC, ARF and ARF+IC at time stamps 0, 10, 20, 30, 40. [R-D curves (PSNR in dB vs. Kb/frame, IPPPPPPP): (a) Ballroom (views 0-7), (b) Race1 (views 0-7), (c) Rena (views 38-45).]

3.7 Conclusions

To compensate localized illumination mismatches across different views in multi-view systems, block-wise illumination compensation techniques are proposed. IC coding tools are developed from the corresponding mismatch models and show significant gains over standard H.264/AVC in cross-view prediction. The proposed techniques are applied to a general multi-view video coding system where both temporal and cross-view prediction are used, and to more general prediction structures. Simulation results show that, when performing predictive coding across different views in multi-view systems and in general multi-view video coding, our proposed methods provide higher coding efficiency than other advanced coding tools. The joint coding benefit and complexity of the combined system are discussed, and an improved coding algorithm is presented.

Chapter 4
Implicit Block Segmentation

4.1 Preliminaries

In Chapter 3, a predictor from each reference is compensated using the IC model in order to minimize the residual error with respect to the original block signal. This additional IC model was introduced because the true match is corrupted by brightness variations in the multi-view system. In this chapter, assuming that the references are not corrupted, we propose a technique to improve the predictor quality for the original macroblock.

Exploiting inter-frame correlation via motion estimation is key to achieving high video compression efficiency. Block-based motion estimation and compensation provides a good balance between prediction accuracy and rate overhead. Clearly, blocks of pixels are not guaranteed to have uniform displacement across frames. For video sequences this is the case if an object boundary exists in a block and pixels belonging to different objects move in different ways. In stereo or multi-view sequences this is the case if a boundary formed by objects at different depths exists in a block and pixels belonging to different objects are occluded or uncovered due to disparity.

Fig. 4.1: Inter block modes in H.264/AVC: (a) block modes INTER16×16, INTER16×8, INTER8×16 and INTER8×8; (b) sub-block modes (8×8, 8×4, 4×8, 4×4) for INTER8×8. Each 8×8 sub-block in (a) can be split into different sub-block sizes as in (b).

Numerous approaches have been proposed to provide more accurate motion compensation by providing different predictions for different regions in a macroblock.
Examples include the techniques used in the H.264/AVC video coding standard [43] and the hierarchical quad-tree (QT) approach [38]. In these methods a macroblock is split into smaller blocks and the best match for each block is searched. As the number of blocks in a macroblock increases, the overhead increases while the distortion between the original and the match decreases. Therefore, there is an optimal point in terms of rate-distortion behavior, and the best block mode can be decided based on Lagrangian techniques. Fig. 4.1 depicts the different block modes available in H.264/AVC. The R-D costs of all candidate block modes are computed in inter-frame prediction, and the block mode with minimum R-D cost is chosen. In the 8×8 block mode of Fig. 4.1(a), a block can be further split into 4 sub-blocks as shown in Fig. 4.1(b).

To improve on the quality of matching achievable with the square or rectangular block shapes available in QT, a geometry based approach (GEO) is proposed in [9,13]. A block is split into two smaller regions, called wedges, by a straight line described by a slope ($s$) and a translation parameter ($d$), as shown in Fig. 4.2.

Fig. 4.2: A straight line in GEO is defined by its slope s and its distance d from the center of the 16×16 macroblock.

These parameters and the matching wedges are jointly estimated for each candidate within the motion search. Although GEO captures object boundaries better than QT, it is still limited in that the boundary has to be a straight line. Furthermore, the search for the best slope and translation parameters combined with the motion search increases the complexity significantly.

In [33], an object based motion segmentation method is proposed to solve the occlusion problem. To estimate the different motions in a block, motion vectors from neighboring blocks are copied after block segmentation. To avoid transmitting segmentation information, the previously encoded frames at (t−1) and (t−2) are used to estimate the segmentation for the current frame at (t). However, since only motion vectors from neighboring blocks are used to estimate motion, the accuracy of this estimation may suffer.

In this chapter, we present a framework for implicit block segmentation to improve prediction quality. The implicit block segmentation is obtained based on the predictors from previously encoded frames, as in [33]. However, segmentation is applied to the difference of two predictors, rather than directly to a predictor itself. Also, unlike in [33], motion vectors are explicitly transmitted to signal the locations of the chosen predictors, and the encoder searches for the best combination of predictors. We use 16×16 macroblocks, which are assumed to be small relative to typical objects in the scene, so that in many cases at most two objects¹ move with different displacements at the boundaries [33]. Although distortion can be reduced as the number of predictors increases, the overhead required for motion/disparity vectors and for identifying the selected predictor for each segment also increases with the number of predictors. While the number of predictors could be chosen optimally based on R-D cost (as is done in the hierarchical quad-tree case), in this work for simplicity we choose the maximum number of predictors to be two.

4.2 Implicit Block Segmentation

4.2.1 Motivation from Block Motion Compensation

Fig. 4.3 shows an example of block motion estimation between a current and a reference frame. In the current block, we have two objects which are separated by a smooth boundary.
Let us assume that the correct matches of each object can be found as a base predictor ($\bar{p}_0$) and an enhancement predictor ($\bar{p}_1$), as shown in the reference frame. For the current macroblock signal $\bar{x}$,² QT and GEO find the best predictors by selecting the best match for regions as defined by those algorithms (i.e., constrained to be rectangular regions or to have a straight-line boundary). Therefore, although the correct matches for each object are given as $\bar{p}_0$ and $\bar{p}_1$, neither QT nor GEO finds a correct match without significant prediction error in some regions (e.g., those labeled b in Fig. 4.3), because object boundaries are not necessarily well described by a straight line.

¹ Thus, the initial number of segments, $N_c$, in the K-means clustering algorithm of Section 4.2.2 is set to 2.

² The vector notations $\bar{x}$, $\bar{p}_0$ and $\bar{p}_1$ are used to represent block signals. For pixel data or random variables, the terms $x$, $p_0$ and $p_1$ are used.

Fig. 4.3: Example of block motion compensation. The best match of the current macroblock $\bar{x}$ can be found at two locations for the different objects. However, in region b of the matches found by QT (8×16) and by a GEO line, significant prediction error remains.

Fig. 4.4: Definition of the predictor difference $\bar{p}_d = \bar{p}_0 - \bar{p}_1$ (regions 1-3; the predictor patterns are from Fig. 4.3).

Following the same example, in Fig. 4.4 we depict the difference between the two predictors, $\bar{p}_d = \bar{p}_0 - \bar{p}_1$. In region 1 of $\bar{p}_d$, the absolute difference of pixel values is small because $\bar{p}_0$ and $\bar{p}_1$ come from the same object, and both $\bar{p}_0$ and $\bar{p}_1$ will estimate the original block with small error. Therefore, the difference in residual error when using the two predictors (i.e., $|\bar{x}-\bar{p}_0| - |\bar{x}-\bar{p}_1|$) will tend to be small, which means either predictor would be a good estimate of the original signal. In regions 2 and 3 of $\bar{p}_d$, $\bar{p}_1$ and $\bar{p}_0$ provide the best match, respectively. Thus, the absolute difference between the two predictors will tend to be large, and we similarly would expect the differences in residual error after prediction to be large.

For each region several scenarios are possible. In an area where $|\bar{p}_d|$ is small, because the two predictors are similar, either i) both predictors provide a good match, or ii) the residual error is large with respect to both predictors and choosing one predictor over the other will not lead to significant improvements. Instead, in areas where $|\bar{p}_d|$ is large, either i) only one of the two predictors provides a good match, or ii) a combination of both predictors may lead to better matching performance. Clearly, choosing the "right" predictor among the two available choices is more important in regions where $|\bar{p}_d|$ is large; it is in these regions that signaling a predictor choice can lead to a more significant gain in prediction performance.

Based on these observations, we propose implicit block segmentation (IBS), where each macroblock is first segmented. For each segment, weights are chosen so that the prediction generated by the weighted sum of predictors minimizes the residual error. An estimate of the original macroblock is obtained by combining the predictions of all segments. Next, a block based segmentation method is proposed.

4.2.2 Block Based Segmentation

Assume two predictors are available for a given macroblock (i.e., two 16×16 blocks from neighboring frames). These two predictors have been chosen by the encoder, and their positions will be signaled to the decoder. The optimal segmentation for the purpose of prediction would be such that each pixel in the original macroblock is assigned to whichever predictor, $\bar{p}_0$ or $\bar{p}_1$, provides the best approximation.
The optimal segmentation for the purpose of prediction would be such that each pixel in the original macroblock is assigned to whichever predictor, ¹ p 0 or ¹ p 1 , provides the best approximation. 58 However this cannot be done implicitly (without sending side information) since the decision depends on the original block itself. In [33], MAP estimation of block segmentation is proposed based on a Markov random ¯eld (MRF) model. This Bayesian image segmentation method provides optimized segmentation results for given probability models. If the original block signal to be encoded (x), base predictor (¹ p 0 ) and enhancement predictor (¹ p 1 ) are given, MAP segmentation (^ s) can be found as ^ s=argmin ¹ s P(¹ sjx;¹ p 0 ;¹ p 1 ) (4.1) However, it is di±cult to ¯nd the segmentation minimizing (4.1) with reasonable computational complexity considering that ¹ p 0 and ¹ p 1 have to be jointly searched. Thus, this approach is not applied in this work. When the depth information of a frame is available, object boundaries can be extracted by segmenting the depth map. Because occlusions and uncovered regions are caused by moving objects in di®erent depths, better matches can be found for the segments using the depth map. However in our work, we assume that auxiliary information, such as depth maps, is not available. Based on our previous observations about the expected gains depending on the di®erences between predictors, we apply segmentation to the block of predic- tor di®erences, ¹ p d . Due to the noisy characteristics of predictor di®erences, edge based segmentation methods do not detect simple boundaries e±ciently in 16£16 macroblocks. Inthiswork, K-meansclustering[29]isusedasabasicsegmentation algorithm. To take the spatial information of pixels into account with the pixel valueofpredictordi®erence, 3-DK-meansclusteringalgorithmcanbeusedtaking horizontal (x), vertical (y) location and predictor di®erence (p d ) as three inputs. 59 Because of di®erent ranges for (x;y) and p d , p d needs to be scaled before K-means clustering. However,thesegmentationresultsarequitesensitivetothisscalingfac- tor and an accurate scaling factor is hard to ¯nd because the range of p d changes depending on the disparities between base and enhancement predictors. There- fore, instead of 3-D K-means clustering, 1-D K-means clustering followed by two step post-processing is adopted. The input to the K-means clustering is the pixel value of predictor di®erence ¹ p d in 16£16 macroblock. N c centroids are initialized uniformly spaced between maximum and minimum value of ¹ p d . The maximum number of iterations N it is set to 20. According to the minimum distance to N c centroids, pixels are classi¯ed into the N c segments. After 1-D K-means cluster- ing, disconnected pixels exist within each segment because spatial connectivity is not considered in 1-D K-means clustering. A two step post-processing is ap- plied to take spatial information into account. First, using connected component labeling [7], disconnected pixels assigned to the same segment are classi¯ed into di®erent segments. Second, to prevent the occurrence of segments due to noise, if the number of pixels in a segment is smaller than a threshold, N th , it is merged intotheneighboringsegmentthathastheminimumsegment-meandi®erencewith current segment. Fig. 4.5 depicts this post-processing. Note that the number of segments depends on the disparities between base and enhancement predictors. In this work, N c and N th are set to be 2 and 10, experimentally. 
Fig. 4.5: Example of the two-step post-processing after 1-D K-means clustering. First, a disconnected part of segment 2 is classified as a different segment, increasing the number of segments N from 3 to 4 (separation of disconnected areas). Second, segment 4 is merged into segment 1, decreasing N to 3 again (removal of noisy small segments).

4.2.3 Weighted Sum of Predictors

For each segment $k$ in $\bar{p}_d$, the optimal predictor $\hat{x}^k$ can be calculated as a weighted sum of the base and enhancement predictors when the original $\bar{x}$ is known. If scalar weights $\alpha_0^k$ and $\alpha_1^k$ are applied to all pixels in segment $k$ of $\bar{p}_0$ and $\bar{p}_1$, the sum of squared differences (SSD) for segment $k$ is

$$\mathrm{SSD}_k = \|\bar{x}^k - \hat{x}^k\|^2 = \|\bar{x}^k - (\alpha_0^k \bar{p}_0^k + \alpha_1^k \bar{p}_1^k)\|^2, \qquad (4.2)$$

where $\bar{p}_0^k$ and $\bar{p}_1^k$ denote the pixels of $\bar{p}_0$ and $\bar{p}_1$ belonging to segment $k$. By setting the gradient of (4.2) to zero with $\alpha_0^k + \alpha_1^k = 1$, the optimal weights are found as

$$\alpha_0^k = \frac{-(\bar{p}_1^k - \bar{x}^k) \cdot \bar{p}_d^k}{\|\bar{p}_d^k\|^2}, \qquad \alpha_1^k = \frac{(\bar{p}_0^k - \bar{x}^k) \cdot \bar{p}_d^k}{\|\bar{p}_d^k\|^2}. \qquad (4.3)$$

Because the optimal $\alpha_0^k$ is calculated using information from the block to be encoded, the chosen value has to be signaled. For 16×16 blocks, this signaling overhead may not be justified by the overall reductions in residual error. The complexity of finding the optimal weight is also significant, due to the multiplications and divisions in the calculation, and it grows as the predictors $\bar{p}_0$ and $\bar{p}_1$ are jointly searched during the motion search step. Therefore, instead of finding the optimal weight for each block and signaling it, we define W, a set of the most frequently chosen weights, and select the one with minimum distortion in each segment. In this work, W is chosen to be {(1,0), (0,1), (1/2,1/2)}, which corresponds to the predictors {$\bar{p}_0$, $\bar{p}_1$, $\frac{1}{2}(\bar{p}_0+\bar{p}_1)$}, respectively. The additional weight (1/2,1/2) was selected because it is the one most frequently chosen, as shown in Appendix B. Thus a weight index with only three values {0,1,2} has to be signaled. Note that it is easy to extend this framework by including additional weights in W. With binary arithmetic coding or variable length coding of the weight indices, a given weight will be chosen only if it leads to gains in an R-D sense.

In summary, prediction for the block to be encoded is achieved by signaling the two predictors, $\bar{p}_0$ and $\bar{p}_1$, and the weight $w_k$ to be used for each segment. The segmentation itself is generated by the encoder and the decoder in the same manner from the decoded predictors, so that no side information needs to be sent.
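The per-segment weight selection is then a small search over the three candidate predictors. The sketch below (our naming, NumPy) picks the weight index for each segment by minimum SAD and assembles the combined IBS prediction; it assumes a segment map such as the one produced by segment_pd above.

```python
import numpy as np

def ibs_predict(x, p0, p1, segments):
    """For each segment, choose among {p0, p1, (p0+p1)/2} the candidate
    with minimum SAD against the original x (weight set W of Section
    4.2.3) and assemble the combined prediction."""
    candidates = [p0, p1, (p0 + p1) / 2.0]
    pred = np.empty_like(x, dtype=float)
    weight_idx = {}
    for s in np.unique(segments):
        mask = segments == s
        sads = [np.abs(x[mask] - c[mask]).sum() for c in candidates]
        w = int(np.argmin(sads))          # weight index in {0, 1, 2}
        weight_idx[int(s)] = w            # signaled to the decoder
        pred[mask] = candidates[w][mask]
    return pred, weight_idx
```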
For example, for 32£32 full search window, N S is 1024 62 (a) Original mac- roblock (b) All segments (c) Segment-0 (d) Segment-1 (e) Segment-2 (f) Segment-3 (g) Segment-4 (h) Segment-5 Fig. 4.6: After segmentation of the original macroblock from MERL ballroom sequence, the best matches for the segments are added to the set of base predictor can- didates. thus, N 2 S = 1048576. As an alternative solution to individual search or pair-wise search, we start by obtaining a set of base predictor candidates. First, the original macroblock is segmented as shown in Fig. 4.6 and the best matches for the seg- ments are collected as good base predictor candidates by SAD distortion measure. Then, for each base predictor candidate in the set, we perform the joint search for enhancement predictor. For the example of Fig. 4.6, a total of 6 pairs of base and enhancement predictors will be found. Fig. 4.7 illustrates the IBS search loop of the enhancement predictor ¹ p 1 for given base predictor ¹ p 0 . To decide the best pair of base and enhancement pre- dictors, three decisions should be made. First, for each segment, the best weight index should be decided. Second, for each base predictor, the best complementary enhancementpredictorshouldbechosen. Third, for the given macroblock,thebest pairofbaseandenhancementpredictorshouldbedecidedforIBS.Thesedecisions are made based on three di®erent error metrics explained next. 63 2. Segment 1. Calculate p d = p 0 - p 1 3. Find the best weight index in each segment 4. Lowest cost? org x 5. Update weight index and enhP information Yes For each p 0 No Weight set W baseP candidates {p 0 } enhP candidates {p 1 } Fig. 4.7: Search loop of enhancement predictor for given base predictor 4.2.5 Three Error Metrics in Joint Search In our proposed approach a predictor for the original block is generated by com- bining the best prediction in eachsegment from ¹ p 0 ,¹ p 1 and¹ p a , where¹ p a is de¯ned as an average of ¹ p 0 and ¹ p 1 . Therefore, the ¯rst error metric is used to decide which weight index or predictor is used in each segment. In the comparison of three predictors, SSD can be used as a distortion measure, but due to the mul- tiplication complexity in SSD, SAD is adopted instead. In Appendix C, it is shown that there is no penalty for SAD if the residuals are normally distributed. If the residuals follow a Laplace distribution, the residual distortion by SAD can increase up to 11% when ¾ 2 0 ¾ 2 1 2 ( 1 3 ;0:382) or (2:618;3). However, the probability, P ³ ¾ 2 0 ¾ 2 1 2( 1 3 ;0:382) or (2:618;3) ´ is relatively low (for example, less than 7% from coding results of Foreman with QP 24) thus, on average this penalty is negligible. SAD for k th segment is de¯ned as SAD k =min ¹ p j X i2SEG k j¹ x(i)¡¹ p j (i)j (4.4) 64 and the associated weight index is w k =argmin j X i2SEG k j¹ x(i)¡¹ p j (i)j (4.5) where¹ x(i)denotestheoriginalsignalatthepixellocationiandSEG k denotesthe k th segment. The total distortion for the macroblock would be the sum of SAD k over all the segments, P Nseg k=1 SAD k . Now,thetotalSADforagivenenhancementpredictorcandidate, P N seg k=1 SAD k , iscalculatedforeveryenhancementpredictorcandidatewithinsearchrange. These values should be compared to decide what is the best complementary pair for the given base predictor candidate. Because the location of base and enhancement predictor is signaled, the motion vector cost of enhancement predictor candidate needs to be added to J enh , the total cost by the enhancement predictor candidate. 
Because the enhancement predictor candidates are compared for the given base predictor candidate, the motion vector cost of the base predictor is not added to J enh . Also di®erent enhancement predictors would lead to a di®erent segmenta- tion and as the number of segments increases, total distortion decreases and the signaling cost of weight indices increases. Thus, in order to consider enhancement predictor selection based on a rate and distortion trade-o®, we de¯ne a new cost metric as: J enh = N seg X k=1 SAD k + p ¸N seg dlog 2 N w e+ p ¸C mv (¹ p 1 ) (4.6) where N seg is the number of segments, N w is the number of weight indices and C mv (¢) is the signaling cost of motion vector. In (4.6), N seg dlog 2 N w e+C mv (¹ p 1 ) corresponds to the signaling bits for weight index and motion vector. Considering 65 P N seg k=1 SAD k istheSADdistortionmeasure,notSSD, p ¸isusedasascalingfactor instead of ¸ [44]. In the implementation of IBS, C mv and ¸ follow the de¯nition in H.264/AVC reference codec. If we pick the best enhancement predictor for the given base predictor using J enh , for M base predictor candidates, equal numbers of matching enhancement predictors will be found. Finally, R-D cost of M base and enhancement predictor pairs are calculated to decide the best pair for IBS. 4.2.6 IBS algorithm in H.264/AVC We summarize the IBS algorithm when it is implemented as an additional block mode (INTER16£16 IBS) in the H.264/AVC. IBS Algorithm: 1. Collect base predictor candidates (a) Apply 1-D K-means clustering to the original macroblock followed by two step post-processing and ¯nd segments (b) During motion/disparity search for INTER16£16, ¯nd a match for each segment of the original macroblock from Step-(1a) and form W, a set of base predictor candidates 2. (INTER16£16 IBS block mode) For each base predictor candidate (¹ p 0 ) fromW,thecomplementaryenhancementpredictorissearchedwithinsearch window. Eachenhancementpredictorcandidateinsearchwindowisdenoted as ¹ p 1 . (a) Calculate predictor di®erence, ¹ p d =¹ p 0 ¡¹ p 1 66 (b) Apply1-DK-meansclusteringto¹ p d followedbytwosteppost-processing and ¯nd segments (c) Foreachsegmentof¹ p d fromStep-(2b),¯ndtheweightindexminimizing SAD k as shown in (4.4) and (4.5) and generate new prediction for the original macroblock (d) Calculate J enh in (4.6). If J enh is the minimum, save ¹ p 1 as the best enhancement predictor to ¹ p 0 . (e) Repeat Step-(2a) - Step-(2d) until there is no more enhancement pre- dictor candidate in search window 3. Calculate R-D costs of the pairs found in Step-(2) and ¯nd the pair with minimum R-D cost 4. Compare R-D cost with other QT block modes and choose the one with minimum R-D cost as the best block mode (R-D mode decision) 4.3 Complexity of IBS The impact of IBS on encoding complexity is mostly due to joint search of base and enhancement predictor, where for each pair of base and enhancement predic- tor candidates, segmentation based on the predictor di®erence is applied and in each segment, the weight with minimum distortion is selected (other changes to encoder such as encoding of weight index and R-D based IBS mode decision have a negligible e®ect on overall complexity). Therefore, in what follows, the com- plexity of motion/disparity estimation in IBS is analyzed in terms of arithmetic operations, e.g., addition and multiplication. Tab. 4.1 explains the symbols used in this analysis. Assuming that n-bit integers are used to represent pixel values, 67 Tab. 4.1: De¯nition of symbols in complexity analysis. 
4.3 Complexity of IBS

The impact of IBS on encoding complexity is mostly due to the joint search of base and enhancement predictors, where for each pair of base and enhancement predictor candidates segmentation is applied to the predictor difference and, in each segment, the weight with minimum distortion is selected (other changes to the encoder, such as the encoding of the weight indices and the R-D based IBS mode decision, have a negligible effect on overall complexity). Therefore, in what follows, the complexity of motion/disparity estimation in IBS is analyzed in terms of arithmetic operations, e.g., additions and multiplications. Tab. 4.1 explains the symbols used in this analysis.

Tab. 4.1: Definition of symbols in the complexity analysis. The integers in parentheses beside each N variable are the values used in the simulation.

  Variable  | Meaning                     | Complexity   | Meaning
  N_p (256) | # of pixels in a macroblock | C_+ : O(n)   | addition or subtraction
  N_it (20) | # of iterations (K-means)   | C_× : O(m)   | integer multiplication or division
  N_c (2)   | # of centroids (K-means)    | C_|.| : O(1) | absolute value operation
  N_w (3)   | # of weights                | C_s : O(1)   | shift operation

Assuming that n-bit integers are used to represent pixel values, an addition/subtraction can be done in O(n) and a multiplication/division in O(n²) in the worst case. Depending on the algorithm used for multiplication, the complexity of multiplication/division can differ, and thus it is denoted O(m), as in Tab. 4.1. Because absolute value and shift operations are applied to the whole number, C_|.| and C_s are O(1).

We start by analyzing the complexity of segmentation by the 1-D K-means clustering algorithm. Tab. 4.2 summarizes the complexity of each step of the algorithm. To find the predictor difference $\bar{p}_d = \bar{p}_0 - \bar{p}_1$, $N_p$ subtractions are required. Then, 1-D K-means clustering is applied to $\bar{p}_d$ with a maximum number of iterations $N_{it}$. At each iteration, pixels are classified into bins according to their distance to the centroids. Let $c_k$ and $d_k(i)$ denote the $k$-th centroid and the distance of pixel $i$ to $c_k$; then $d_k(i) = |\bar{p}_d(i) - c_k|$. This distance is calculated for all pixels in the macroblock with respect to all centroids, so the complexity is $N_c N_p (C_+ + C_{|.|})$. After pixel classification, the centroids are updated based on the pixels in the corresponding bins ($\mathrm{BIN}_k$) as $c_k = \sum_{i\in \mathrm{BIN}_k} \bar{p}_d(i) \,/\, \sum_{i\in \mathrm{BIN}_k} 1$, with $N_p C_+ + N_c C_×$ complexity.

Tab. 4.2: Complexity analysis of K-means clustering

  Predictor difference: p̄_d(i) = p̄_0(i) − p̄_1(i)              → N_p C_+
  Per iteration of 1-D K-means:
    Pixel classification: d_k(i) = |p̄_d(i) − c_k|               → N_c N_p (C_+ + C_|.|)
    Centroid update: c_k = Σ_{i∈BIN_k} p̄_d(i) / Σ_{i∈BIN_k} 1   → N_p C_+ + N_c C_×
  TOTAL: N_p C_+ + N_it (N_c N_p (C_+ + C_|.|) + N_p C_+ + N_c C_×)
       = N_p (1 + N_it N_c + N_it) C_+ + N_it N_c C_× + N_p N_it N_c C_|.|

Secondly, in each segment the best weight is chosen by comparing the distortions of all weight configurations. With the SAD distortion measure, Tab. 4.3 summarizes the complexity of the weight index decision for each $\bar{p}_0$ and $\bar{p}_1$ pair. Because $\bar{p}_0$ and $\bar{p}_1$ are given, the number of additional predictors is $N_w - 2$, which are generated as weighted sums of $\bar{p}_0$ and $\bar{p}_1$. It is assumed that multiplication by a weight with floating point precision can be replaced by multiplication by an integer weight $\alpha_j$ followed by a shift by $r$. To find the weight index, SADs for the $N_w$ weights are calculated in each segment, which corresponds to $N_w N_p (2C_+ + C_{|.|})$. If the only additional predictor is $\bar{p}_a$, the average of $\bar{p}_0$ and $\bar{p}_1$ computed with the weights (1/2, 1/2), then $N_w = 3$ and the multiplications can be skipped in the calculation of $\bar{p}_a$, so that the complexity of predictor generation is $N_p (C_+ + C_s)$.

Tab. 4.3: Complexity analysis of the weight index decision for each p̄_0 and p̄_1 pair

  New predictor generation: p̄_j(i) = (α_j p̄_0(i) + (1 − α_j) p̄_1(i)) >> r
    for N_p pixels → N_p (2C_× + C_+ + C_s); for the j ≥ 2 predictors → (N_w − 2) N_p (2C_× + C_+ + C_s)
  Weight index selection (SAD): w_k = argmin_j Σ_{i∈SEG_k} |x̄(i) − p̄_j(i)| → N_w N_p (2C_+ + C_|.|)
  TOTAL: N_p (3N_w − 2) C_+ + 2N_p (N_w − 2) C_× + N_p (N_w − 2) C_s + N_p N_w C_|.|
  For N_w = 3 with the weight (1/2, 1/2): 7N_p C_+ + N_p C_s + 3N_p C_|.| ≈ 7N_p C_+

Using the definitions in Tab. 4.1, the complexity of the K-means clustering algorithm is approximated as $O(n)N_p(1 + N_{it}N_c + N_{it}) + O(m)(N_{it}N_c) = O(61 n N_p) + O(40 m)$, and the complexity of the weight index decision is approximated as $O(7 n N_p)$.
In GEO [13], 2012 or 77 (fast mode) wedge partitions are compared to find the slope and the displacement for a 16×16 macroblock. If SAD is used as the distortion measure, this corresponds to $2012 N_p (2C_+ + C_{|.|}) \approx O(4024 n N_p)$, or $77 N_p (2C_+ + C_{|.|}) \approx O(154 n N_p)$ in fast mode. In Tab. 4.4, the complexities of IBS and GEO are compared for a number of base predictor candidates M = 10. The complexity of IBS lies between those of the original and the fast mode of GEO.

Tab. 4.4: Comparison of IBS and GEO complexity. M is the number of base predictor candidates.

                               | IBS                                 | GEO
  Complexity                   | M(O(68nN_p) + O(40n²)) ≈ O(68nN_pM) | O(4024nN_p), or O(154nN_p) in fast mode
  For n = 8, M = 10, N_p = 256 | O(5440N_p) + O(25600) ≈ O(5440N_p)  | O(32192N_p), or O(1232N_p) in fast mode

4.4 Simulation Results

4.4.1 Implementation within an H.264/AVC Architecture

Implicit block segmentation is implemented in the H.264/AVC reference codec JSVM 8.4. The current inter block modes are extended by inserting INTER16×16_IBS between INTER16×16 and INTER16×8. The R-D optimization tool of H.264/AVC is applied to choose the best mode for each macroblock.

To find the base predictor candidates, the original macroblock is segmented first. If $N_{org}$ segments are obtained after post-processing, the $N_{org}$ best matches for the segments, found during the INTER16×16 motion search, are taken as reliable base predictor candidates. Because the matches from the INTER16×16, INTER16×8, INTER8×16 and INTER8×8 motion searches can also be good candidates, these are added, and $M = N_{org} + 9$ is the maximum number of base predictor candidates (duplicate candidates are removed). In INTER16×16_IBS block mode, base and enhancement predictors are jointly searched within the search range as described in Section 4.2.4. Thus, for $M$ base predictor candidates, an equal number of matching enhancement predictors is found. Finally, the R-D costs of the $M$ base and enhancement predictor pairs are calculated and compared with the R-D costs of the other block modes in H.264/AVC (R-D mode decision). The encoded information in INTER16×16_IBS includes the reference indices and motion vectors for the base and enhancement predictors, as well as the weight indices for each segment. The encoding of the reference indices and motion vectors for the base and enhancement predictors follows the H.264/AVC standard.

To exploit the correlation between the motion vectors of neighboring blocks, different motion vector predictors (mvp) are used for the QT block modes in H.264/AVC. Because INTER16×16_IBS is inserted as an additional block mode, we follow the mvp definition of H.264/AVC and modify it only when INTER16×16_IBS has been chosen in a neighboring block or is being tested in the current macroblock. First, assume that a QT block mode is tested in the current macroblock. If the neighboring blocks do not use INTER16×16_IBS, the original mvp definition from H.264/AVC is used to find the mvp of the current macroblock. If INTER16×16_IBS is used in a neighboring block, it is regarded as INTER16×16 with the motion vector of its base predictor, and the mvp of the current macroblock follows H.264/AVC. Second, assume that INTER16×16_IBS is tested in the current macroblock. The base predictor uses the same mvp as INTER16×16. For the enhancement predictor, to investigate which mvp improves coding efficiency most, 6 different mvp's are defined and tested. Tab. 4.5 shows the relative frequencies of occurrence of these 6 mvp schemes.

Tab. 4.5: Percentage of times that different motion vector predictors (mvp) are selected for the enhancement predictor in the current macroblock. Data is collected by encoding 15 frames of the Foreman sequence with QP 24 (IPPP).

  mvp selected                  | Percentage (%)
  mvp of INTER16×16             | 51.4
  mv of baseP in the same MB    | 20.4
  mv of baseP from the left MB  | 5.7
  mv of enhP from the left MB   | 10.1
  mv of baseP from the upper MB | 3.2
  mv of enhP from the upper MB  | 9.1

When searching for the best enhancement predictor for a given base predictor, all mvp candidates are tested and the one with minimum cost $J_{enh}$ is chosen as the best mvp for the enhancement predictor. In this experiment, the bits signaling the mvp selection are not counted, so the simulation results can be regarded as an upper bound. As a comparison to this upper bound, the mvp of the enhancement predictor is fixed to the mvp of INTER16×16, which is the one selected most often, as shown in Tab. 4.5.
B is the average number of signaling bits for mv when the mvp of the enhancement predictor is chosen from 6 mvp schemes. QP Average bits for Average bits for Average bits for mv of QT block mode mv of IBS BaseP mv of IBS EnhP 20 20.8! 20.5 7.0! 6.7 7.6! 6.5 24 14.9! 14.8 6.8! 6.6 7.1! 6.1 29 9.4! 9.5 6.4! 6.3 6.2! 5.3 Firstly, assume that the QT block mode is tested in the current macroblock. If neighboring blocks do not use INTER16£16 IBS, the original mvp de¯nition from H.264/AVC is used to ¯nd the mvp of current macroblock. If INTER16£ 16 IBS is used in the neighboring blocks, it is regarded as INTER16£16 with a motion vector from base predictor and the mvp of the current macroblock follows H.264/AVC. Secondly, assume that INTER16£16 IBS is tested in the current macroblock. Base predictor uses the same mvp as INTER16£16. For enhance- ment predictor, to investigate which mvp improves the coding e±ciency most, 6 di®erent mvp's are de¯ned and tested. Tab. 4.5 shows the relative frequencies of occurrenceofthese6mvpschemes. Whensearchingforthebestenhancementpre- dictor for a given base predictor, all mvp candidates are tested and the one with minimum distortion J enh is chosen as the best mvp for enhancement predictor. In this experiment, bits signaling mvp selection are not counted so the simulation results can be regarded as an upper bound. As a comparison to this upper bound, the mvp for the enhancement predictor is ¯xed as the mvp of INTER16£ 16, which is selected most as shown in Tab. 4.5. Tab. 4.6 shows that on average about 72 Tab. 4.7: Comparison of IBS results when the mvp of the enhancement predictor is set to (a) the mvp of INTER16£16 QT block mode and chosen from (b) 6 mvp schemes (upper bound). QP PSNR: (a)! (b) Bit rate: (a)! (b) 20 42.9123! 42.9234 63367! 62979 24 40.0678! 40.0587 33053! 32763 29 36.8543! 36.8702 14534! 14338 1 bit is reduced in signaling the mv of enhancement predictor. However, this re- duction is not enough to be re°ected into the overall coding gains. As can be seen in Tab. 4.7, less than 0.05 dB gains are achieved by the proposed upper bound, where the same data in Tab. 4.6 is used. Therefore, we conclude that there is no signi¯cant improvement in rate-distortion sense and the mvp of enhancement pre- dictor is ¯xed as the mvp of INTER16£16. In summary, if INTER16£16 IBS is used in neighboring blocks, it is treated the same as if it were INTER16£16 with the motion vector used as base predictor. If INTER16£16 IBS is tested in the current macroblock, the mvp of INTER16£16 is used for both base and enhancement predictor. Weight indices f0;1;2g which correspond to base, enhancement and average predictor respectively, are binarized and encoded by variable length code in R-D modedecisionandbinaryarithmeticcodeinbitstreamcoding. Theweightindices are signalled following the order of the segment indices that is de¯ned by raster scanning from the top left corner to the bottom right corner of macroblock. When a pixel is found during the raster scanning, which does not belong to the segment already found, the segment of that pixel is assigned the next index. This segment numbering is repeated until all segments are covered in a macroblock. 
73 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 (a) Predictor di®erence of base and en- hancement predictor 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 2 4 6 8 10 12 14 16 2 4 6 8 10 12 14 16 Seg - 1 Seg - 0 Seg - 2 Seg - 3 (b) Segmentation from predictor di®er- ence and chosen weight indices Fig. 4.8: Exampleofpredictordi®erenceandsegmentationfromForemansequence. The segment indices are shown, which are decided by raster scanning from the top left corner to bottom right corner of the macroblock. In Fig. 4.8, an example of predictor di®erence between base and enhancement predictor is shown, with its corresponding segment information. Predictor dif- ference shown in Fig. 4.8 (a) is scaled to show the di®erence clearly. Note that the segmentation shown in Fig. 4.8 (b) captures large predictor di®erences e±- ciently. Segment 0, 2 and 3 choose the weight index 0, base predictor and segment 1 chooses weight index 1, enhancement predictor. For the macroblock of this ex- ample, we signal INTER16£16 IBS block mode ¯rst. Then, the reference index andmotionvectorforbaseandenhancementpredictorissent. Finally,fourweight indices for each segment are sent. Note that the number of segments and the seg- ments themselves are not transmitted but extracted at the decoder using base and enhancement predictor information. Prediction by IBS achieves 30% SSD reduc- tion as compared with the best predictor based on a quad-tree for the example of Fig. 4.8. 74 0.4 0.6 0.8 1 1.2 1.4 1.6 x 10 5 35 36 37 38 39 40 41 42 43 Bit/Frame PSNR (dB) MERL_Ballroom: 320x240, IPPP, Cross−view H.264/AVC REF1 H.264/AVC+IBS REF1 H.264/AVC REF3 H.264/AVC+IBS REF3 Fig. 4.9: MERL Ballroom with 1 and 3 reference 4.4.2 Simulation Results Both multi-view video (MERL Ballroom, 320(w)x240(h)) and standard video se- quences (Foreman, 352(w)x288(h)) are tested. In MERL Ballroom, each anchor has 8 views coded IPPP PPPP and 2 anchors at di®erent time stamps (0, 10) are tested. In Foreman, 15 frames are coded as IPPP. Encoding conditions of H.264/AVC and H.264/AVC+IBS are the same except that in H.264/AVC+IBS, INTER16£16 IBS is tested as an additional inter block mode. QP 20, 24, 29 are used with§32 search range with quarter-pel and CABAC enabled. As can be seen in Figs. 4.9 and 4.10, 0.1-0.2 dB gains are achieved in MERL Ballroom and 0.2-0.4 dB gains from Foreman. Note that gains by IBS increase with the number of references. To see how the prediction gains achieved by IBS are re°ected into R-D gains, in Tab. 4.8, average distortions and bits are shown for blocks best predicted by IBS in R-D mode decision. Improvements in prediction quality by IBS shown in the reduction of SSD p are translated into reduction in residual coding bits and 75 1 2 3 4 5 6 7 x 10 4 36 37 38 39 40 41 42 43 44 Bit/Frame PSNR (dB) Foreman: 352x288, IPPP, Temporal H.264/AVC REF1 H.264/AVC+IBS REF1 H.264/AVC REF3 H.264/AVC+IBS REF3 Fig. 4.10: Foreman with 1 and 3 references SSD in reconstructed frame, SSD r . 
To see how the prediction gains achieved by IBS are reflected in R-D gains, Tab. 4.8 shows average distortions and bits for the blocks best predicted by IBS in R-D mode decision. Improvements in prediction quality by IBS, visible in the reduction of SSD_p, are translated into reductions in residual coding bits and in the SSD of the reconstructed frame, SSD_r. Note that the bits needed to signal motion vectors typically decrease because only two predictors are used in IBS (while a QT approach may use more than two vectors), whereas extra bits are needed to signal the weights when using IBS.

Tab. 4.8: Comparison of data by QT and IBS from MERL Ballroom and Foreman with QP 20. Data is averaged over the macroblocks where IBS is the best mode, from 14 P-frames in each sequence. A → B means 'data by QT' → 'data by IBS'. SSD_p and SSD_r are the SSD between the original and the predictor and between the original and the reconstruction, respectively. Bit_res, Bit_mv and Bit_w are the bits for the residual, the motion/disparity vectors and the weight indices, respectively.

    Sequence      | SSD_p              | SSD_r            | Bit_res        | Bit_mv        | Bit_w
    MERL Ballroom | 12403 → 9885 (20%) | 1463 → 1464 (0%) | 364 → 333 (9%) | 19 → 17 (10%) | 0 → 11.5
    Foreman       | 3209 → 2817 (12%)  | 1077 → 1052 (2%) | 149 → 135 (9%) | 23 → 16 (32%) | 0 → 7.6

The gains in MERL Ballroom are not encouraging, for two reasons. Firstly, due to the noisy background of MERL Ballroom, the predictor difference results in noisy segments, which increases the signaling bits for weight indices, as shown in Tab. 4.8. Secondly, implicit block segmentation assumes that the references are not corrupted and that mismatches, including illumination and focus mismatches, do not exist between frames. As shown in [17], illumination mismatches do exist between frames in different views. When two segments with unequal non-zero DC levels fall inside one 4x4 or 8x8 DCT block, as illustrated in Fig. 4.11, the high frequency components increase, so the residual coding bits increase; this may also create artificial boundaries within a block. Note that in Tab. 4.8 the 20% reduction in SSD_p is translated into only a 9% reduction in residual bits in MERL Ballroom, while in Foreman a 12% reduction in SSD_p is translated into a 9% reduction in residual bits and a 2% reduction in SSD_r. Combined with illumination compensation [17], the performance of IBS for cross-view prediction could therefore be improved.

[Figure] Fig. 4.11: AC coefficients increase under the 4x4 or 8x8 block DCT due to unequal DC residual levels on the two sides of a segment boundary in IBS (sketch of a block split by a segment boundary into segment 0 and segment 1).
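The effect sketched in Fig. 4.11 can be checked numerically. The sketch below is a check we add here (the residual values are made up): it builds an orthonormal 4x4 DCT-II and compares a residual block that is piecewise constant with unequal DC levels across a segment boundary against a flat residual. The former spreads energy into AC coefficients; the latter does not.

```python
import numpy as np

def dct_matrix(n=4):
    """Rows are the orthonormal DCT-II basis vectors of length n."""
    c = np.array([[np.cos(np.pi * k * (2 * i + 1) / (2 * n))
                   for i in range(n)] for k in range(n)])
    c[0, :] *= np.sqrt(1.0 / n)
    c[1:, :] *= np.sqrt(2.0 / n)
    return c

C = dct_matrix(4)
# Two segments with unequal DC residual inside one 4x4 block:
two_level = np.array([[5.0, 5.0, -3.0, -3.0]] * 4)
flat = np.ones((4, 4))          # a single DC level

for name, blk in (("two-level", two_level), ("flat", flat)):
    coef = C @ blk @ C.T        # separable 2-D DCT
    ac_energy = (coef ** 2).sum() - coef[0, 0] ** 2
    print(name, "AC energy:", round(float(ac_energy), 3))
# Output: two-level AC energy 256.0, flat AC energy 0.0.
```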
4.5 Conclusions

In this chapter, implicit block segmentation based on the predictors available at the decoder was proposed. Given two candidate block predictors, segmentation is applied to the predictor difference. Different weighted sums of the predictors are selected for each segment and signaled to the decoder. Implementation in H.264/AVC shows encouraging results on Foreman, where illumination mismatches are not present. Combining IBS with mismatch compensation tools would increase the coding efficiency in cross-view prediction. Areas of future work include improvements to the segmentation strategy, which accounts for most of the computational complexity of IBS, and efficient search techniques to allow searching for pairs of predictors.

Chapter 5

Conclusions and Future Work

5.1 Conclusions

Firstly, the 2-D dependency problem that arises in MVC was addressed in Chapter 2. Because both cross-view and temporal correlations are exploited to improve coding efficiency, 2-D dependencies are present in MVC. Optimal bit allocation is possible based on a 3-D trellis expansion, but with significant complexity in the data generation process. To reduce this complexity, the monotonicity property is extended to the 3-D trellis expansion and, exploiting the correlation between the quantizers of anchor and non-anchor frames, the number of quantizer candidates for non-anchor frames is limited. With the proposed bit allocation scheme, 0.5 - 1 dB gains are achieved.

Next, the illumination mismatch problem in multi-view video was covered in Chapter 3. Even with sophisticated calibration, it is not possible to ensure that all cameras in an array are calibrated perfectly, which causes global brightness mismatches among different views. Moreover, even with perfect camera calibration, an object may appear differently in each view due to its different depth and perspective with respect to each camera, causing local mismatches. The accuracy of the disparity search is degraded by these brightness variations between frames, leading to a degradation of coding efficiency. To compensate both global and local mismatches, a block-level illumination compensation (IC) model was proposed. Because different portions of a video frame can undergo different illumination changes, block-by-block activation of the IC model was proposed. For efficient transmission, the IC parameters are quantized and binary arithmetic coded. It was shown that IC requires about 64% additional computation within the motion/disparity search. Simulation results for cross-view prediction show 0.2 - 0.8 dB gains. The IC techniques were applied to both temporal and cross-view prediction in MVC and achieve higher coding efficiency than WP. It was also shown how IC and ARF can be combined to compensate both illumination and focus mismatches in MVC; the combined system achieves 0.5 - 1.3 dB gains in cross-view prediction on three test sequences.
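For concreteness, a minimal sketch of a block-level scale-plus-offset fit in this spirit is given below. This is a generic least-squares formulation we add for illustration, not necessarily the exact parameter estimation, quantization or signaling of Chapter 3; all names are our own. The use of SAD after compensation as the matching cost follows the remark in Appendix A.

```python
import numpy as np

def fit_scale_offset(ref_block, cur_block):
    """Least-squares (scale, offset) so that scale*ref + offset
    approximates cur; a generic block-level IC fit."""
    r = ref_block.astype(float).ravel()
    c = cur_block.astype(float).ravel()
    var_r = r.var()
    scale = ((r - r.mean()) * (c - c.mean())).mean() / var_r if var_r > 0 else 1.0
    offset = c.mean() - scale * r.mean()
    return scale, offset

def sad_after_ic(ref_block, cur_block):
    """SAD after compensation, the matching cost inside the search."""
    s, o = fit_scale_offset(ref_block, cur_block)
    return float(np.abs(cur_block - (s * ref_block + o)).sum())
```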
In Chapter 4, an implicit block segmentation (IBS) method was proposed in order to improve the quality of prediction. Block-based motion/disparity estimation and compensation provides a good balance between prediction accuracy and rate overhead. However, most object boundaries are not perfectly aligned with block boundaries, which makes the motion/disparity search difficult and reduces coding efficiency. Given two candidate block predictors, and from the observation that distortion can be further reduced where the two predictors differ most, segmentation is applied to the block of predictor differences. For each segment, the weighted sum of predictors with minimum distortion is chosen. The additional overhead for each block comprises the locations of the two predictors and a weight index for each segment; the segment information itself is retrieved implicitly by repeating the segmentation of the predictor difference at the decoder. IBS is implemented as an additional block mode in the H.264/AVC reference codec and achieves 0.1 - 0.4 dB gains in the coding of Ballroom (cross-view) and Foreman (temporal). The more references are available, the more the coding efficiency of IBS improves.

5.2 Future Work

Although each chapter addresses a different problem of predictive coding in MVC, these techniques can be combined to provide a unified solution. For example, IBS was proposed under the assumption that there are no mismatches between frames; in cross-view prediction, however, there are illumination mismatches. Therefore, applying IC within each segment produced by IBS would help find the correct match and improve overall coding gains. On top of the block-level compensations by IC and IBS, 2-D dependent bit allocation can be applied in order to optimize the available resources at the frame level.

In this work, it is assumed that only the multi-view sequences themselves are available, without any other information. However, when auxiliary information, e.g., camera parameters or depth information, is available, the efficiency of multi-view video coding can be further improved. For example, instead of sending all the views, only the video sequences corresponding to a subset of views could be transmitted along with depth information, from which intermediate views can be interpolated. Because the disparities in cross-view prediction are caused by the different depths of the objects and the camera perspectives, known camera parameters and object depths can be used to make disparity estimation/compensation faster and more accurate.

Bibliography

[1] "Call for proposals on multi-view video coding," ISO/IEC JTC1/SC29/WG11 MPEG Document N7327, Jul. 2005.

[2] "Description of core experiments in MVC," ISO/IEC JTC1/SC29/WG11 MPEG Document W8019, Montreux, Switzerland, Apr. 2006.

[3] M. Accame, F. D. Natale, and D. Giusto, "Hierarchical block matching for disparity estimation in stereo sequences," in Proc. IEEE International Conference on Image Processing (ICIP), vol. 2, Washington, USA, Oct. 1995, pp. 23-26.

[4] E. H. Adelson and J. R. Bergen, Computational Models of Visual Processing. Cambridge, MA: MIT Press, 1991.

[5] H. Aydinoglu and M. Hayes, "Compression of multi-view images," in Proc. IEEE International Conference on Image Processing (ICIP), vol. 2, Austin, TX, Nov. 1994, pp. 385-389.

[6] J. M. Boyce, "Weighted prediction in the H.264/MPEG AVC video coding standard," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), vol. 3, Vancouver, Canada, May 2004, pp. 789-792.

[7] F. Chang, C. Chen, and C. Lu, "A linear-time component-labeling algorithm using contour tracing technique," Computer Vision and Image Understanding (CVIU), vol. 93, no. 2, pp. 206-220, Feb. 2004.

[8] G. Chen, J. H. Kim, J. Lopez, and A. Ortega, "Response to call for evidence on multi-view video coding," ISO/IEC JTC1/SC29/WG11 MPEG Document M11731, Hong Kong, China, Jan. 2005.

[9] O. D. Escoda, P. Yin, D. Congxia, and L. Xin, "Geometry-adaptive block partitioning for video coding," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, Apr. 2007, pp. 657-660.

[10] C. Fehn, N. Atzpadin, M. Muller, O. Schreer, A. Smolic, R. Tanger, and P. Kauff, "An advanced 3DTV concept providing interoperability and scalability for a wide range of multi-baseline geometries," in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2006, pp. 2961-2964.

[11] T. Fujii, T. Kimoto, and M. Tanimoto, "Ray space coding for 3-D visual communication," in Picture Coding Symposium (PCS), Melbourne, Australia, Mar. 1996.

[12] Y. He, J. Ostermann, M. Tanimoto, and A. Smolic, "Introduction to the special section on multiview video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1433-1435, Nov. 2007.

[13] E. Hung and R. D. Queiroz, "On macroblock partition for motion compensation," in Proc. IEEE International Conference on Image Processing (ICIP), Oct. 2006, pp. 1697-1700.

[14] (2006, Jul.) Software implementation of H.264: JM Version 10.2. The Image Communication Group at Heinrich Hertz Institute, Germany. [Online]. Available: http://iphome.hhi.de/suehring/tml/index.htm

[15] K. Kamikura, H. Watanabe, H. Jozawa, H. Kotera, and S. Ichinose, "Global brightness-variation compensation for video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 8, pp. 988-1000, Dec. 1998.

[16] S. Kang, C. Zitnick, M. Uyttendaele, S. Winder, and R. Szeliski, "Free-viewpoint video with stereo and matting," in Picture Coding Symposium (PCS), San Francisco, USA, Dec. 2004.
[17] J. H. Kim, P. Lai, J. Lopez, A. Ortega, Y. Su, P. Yin, and C. Gomila, "New coding tools for illumination and focus mismatch compensation in multiview video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1519-1535, Nov. 2007.

[18] J. H. Kim, P. Lai, A. Ortega, Y. Su, P. Yin, and C. Gomila, "Results of CE2 on multi-view video coding," ISO/IEC JTC1/SC29/WG11 MPEG Document M13720, Klagenfurt, Austria, Jul. 2006.

[19] S. Kim and R. Park, "Fast local motion-compensation algorithm for video sequences with brightness variations," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 4, pp. 289-299, Apr. 2003.

[20] H. Kimata, M. Kitahara, K. Kamikura, and Y. Yashima, "Free-viewpoint video communication using multi-view video coding," NTT, Yokosuka-shi, Japan, Tech. Rep. F0282C, 2-8, 2004.

[21] P. Lai, Y. Su, P. Yin, C. Gomila, and A. Ortega, "Adaptive filtering for cross-view prediction in multi-view video coding," in Proc. SPIE Visual Communication and Image Processing (VCIP), vol. 6508, San Jose, CA, Jan. 30-Feb. 1, 2007.

[22] Y. Lee, J. Hur, Y. Lee, K. Han, S. Cho, N. Hur, J. Kim, J. H. Kim, P. Lai, A. Ortega, Y. Su, P. Yin, and C. Gomila, "CE11: Illumination compensation," Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-U052, Hangzhou, China, Oct. 2006.

[23] D. Liu, Y. He, S. Li, Q. Huang, and W. Gao, "Linear transform based motion compensated prediction for luminance intensity changes," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, Beijing, China, May 2005, pp. 304-307.

[24] S. Liu and C.-C. J. Kuo, "Joint temporal-spatial bit allocation for video coding with dependency," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 15-26, Jan. 2005.

[25] J. Lopez, J. H. Kim, A. Ortega, and G. Chen, "Block-based illumination compensation and search techniques for multiview video coding," in Picture Coding Symposium (PCS), San Francisco, USA, Dec. 2004.

[26] J. Lopez, "Block-based compression techniques for multiview video coding," Master's thesis, Universitat Politecnica de Catalunya, Barcelona, Spain, 2005.

[27] D. Marpe, H. Schwarz, and T. Wiegand, "Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 620-636, Jul. 2003.

[28] W. Matusik and H. Pfister, "3D TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes," ACM Transactions on Graphics, vol. 24, no. 3, pp. 814-824, Aug. 2004.

[29] D. Mount. KMlocal: A testbed for K-means clustering algorithms based on local search, Version 1.7.1. Dept. of Computer Science, University of Maryland. [Online]. Available: http://www.cs.umd.edu/~mount/Projects/KMeans/

[30] K. Mueller, P. Merkle, A. Smolic, and T. Wiegand, "Multiview coding using AVC," ISO/IEC JTC1/SC29/WG11 MPEG Document M12945, Bangkok, Thailand, Jan. 2006.

[31] U. Neumann, T. Pintaric, and A. Rizzo, "Immersive panoramic video," in MULTIMEDIA '00: Proceedings of the Eighth ACM International Conference on Multimedia, Marina del Rey, California, USA, 2000, pp. 493-494.

[32] W. Niehsen and S. Simon, "Block motion estimation using orthogonal projection," in Proc. IEEE International Conference on Image Processing (ICIP), Lausanne, Switzerland, Sep. 1996.

[33] M. Orchard, "Predictive motion-field segmentation for image sequence coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, no. 1, pp. 54-70, Feb. 1993.
[34] K. Ramchandran, A. Ortega, and M. Vetterli, "Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 533-545, Sep. 1994.

[35] N. Sebe, M. S. Lew, and D. P. Huijsmans, "Toward improved ranking metrics," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1132-1143, Oct. 2000.

[36] Y. Sermadevi and S. S. Hemami, "Efficient bit allocation for dependent video coding," in Proc. Data Compression Conference (DCC), Mar. 2004, pp. 232-241.

[37] Y. Shoham and A. Gersho, "Efficient bit allocation for an arbitrary set of quantizers," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 9, pp. 1445-1453, Sep. 1988.

[38] R. Shukla, P. Dragotti, M. Do, and M. Vetterli, "Rate-distortion optimized tree-structured compression algorithms for piecewise polynomial images," IEEE Transactions on Image Processing, vol. 14, no. 3, pp. 343-359, Mar. 2005.

[39] A. Smolic, K. Mueller, P. Merkle, C. Fehn, P. Kauff, P. Eisert, and T. Wiegand, "3D video and free viewpoint video - technologies, applications and MPEG standards," in Proc. IEEE International Conference on Multimedia and Expo (ICME), Jul. 2006, pp. 2161-2164.

[40] Y. Su, P. Yin, C. Gomila, J. H. Kim, P. Lai, and A. Ortega, "Thomson's response to MVC CfP," ISO/IEC JTC1/SC29/WG11 MPEG Document M12969/2, Bangkok, Thailand, Jan. 2006.

[41] K. M. Uz, J. M. Shapiro, and M. Czigler, "Optimal bit allocation in the presence of quantizer feedback," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Apr. 1993, pp. 385-388.

[42] A. Vetro, Y. Su, H. Kimata, and A. Smolic, "Joint multiview video model (JMVM) 2.0," ISO/IEC JTC1/SC29/WG11 MPEG Document N8459, Hangzhou, China, Oct. 2006.

[43] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003.

[44] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan, "Rate-constrained coder control and comparison of video coding standards," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 688-703, Jul. 2003.

[45] W. Woo and A. Ortega, "Optimal blockwise dependent quantization for stereo image coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 6, pp. 861-867, Sep. 1999.

[46] P. Yin, H.-Y. C. Tourapis, A. Tourapis, and J. Boyce, "Fast mode decision and motion estimation for JVT/H.264," in Proc. IEEE International Conference on Image Processing (ICIP), vol. 3, Sep. 2003, pp. 853-856.

Appendix A

Comparison between MSD and MAD in Motion/Disparity Search

Let $\bar{P} = [P_0, P_1, \ldots, P_{N-1}]^T$ be a candidate predictor for the original signal $\bar{X} = [X_0, X_1, \ldots, X_{N-1}]^T$. For the residual error $(X_i - P_i)$, the sum of squared differences (SSD) and the sum of absolute differences (SAD) metrics are defined as

$$ \mathrm{SSD} = \frac{1}{N}\sum_{i=0}^{N-1}(X_i - P_i)^2 \qquad (\mathrm{A.1}) $$
$$ \mathrm{SAD} = \frac{1}{N}\sum_{i=0}^{N-1}|X_i - P_i| \qquad (\mathrm{A.2}) $$

It is known that SSD and SAD are justified as error metrics from a maximum likelihood perspective when the error follows a normal or a Laplace distribution, respectively [35]. In block motion search, due to the complexity of the multiplications in SSD, SAD is commonly used as the search metric.
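A direct transcription of (A.1) and (A.2), added here for reference (the function names are ours):

```python
import numpy as np

def ssd(x, p):
    """Eq. (A.1): mean of squared residuals over the N block pixels."""
    d = x.astype(float) - p.astype(float)
    return float((d * d).mean())

def sad(x, p):
    """Eq. (A.2): mean of absolute residuals over the N block pixels."""
    return float(np.abs(x.astype(float) - p.astype(float)).mean())
```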
For example, in the illumination compensation (IC) of Chapter 3, the scale and offset parameters are calculated using SSD, but for the motion/disparity search, SAD after compensation is adopted. Also, in the implicit block segmentation (IBS) of Chapter 4, SAD is adopted instead of SSD during the motion/disparity search. In this appendix, starting from a statistical model of the residual error $(X_i - P_i)$, we evaluate conditions under which the motion search results obtained with SAD equal those obtained with SSD.¹

¹ Similar evaluations to the one in this appendix likely exist, but the concepts and terms used in this analysis help in understanding Appendices B and C; thus, we start from scratch.

Let $p$ be the prediction of the original signal $x$; then the mean squared difference (MSD) and the mean absolute difference (MAD) are defined as

$$ \mathrm{MSD} = E\{(x-p)^2\} \qquad (\mathrm{A.3}) $$
$$ \mathrm{MAD} = E\{|x-p|\}, \qquad (\mathrm{A.4}) $$

which are the statistical counterparts of SSD and SAD, respectively. Let $p_0$ and $p_1$ be two candidate predictors for the original signal $x$. Their respective residual errors are denoted $n_0$ and $n_1$:

$$ n_0 = x - p_0, \qquad n_1 = x - p_1. \qquad (\mathrm{A.5}) $$

Let $\mathrm{MSD}_i$ and $\mathrm{MAD}_i$ denote the MSD and MAD of $p_i$, respectively. If the mean and the variance of $n_i$ are denoted $\mu_i$ and $\sigma_i^2$, from (A.3)

$$ \mathrm{MSD}_i = E\{(x-p_i)^2\} = E\{n_i^2\} = \mu_i^2 + \sigma_i^2. \qquad (\mathrm{A.6}) $$

If $\mu_i \approx 0$ or $\mu_i/\sigma_i \approx 0$, from (A.6)

$$ \mathrm{MSD}_i = \mu_i^2 + \sigma_i^2 \approx \sigma_i^2. \qquad (\mathrm{A.7}) $$

Thus, if $\sigma_0^2 < \sigma_1^2$, $p_0$ will be chosen under the MSD distortion measure. Note that this result follows from the second-order statistics of $n_0$ and $n_1$ without any assumption of a specific probability model.

Due to the absolute value operation, MAD cannot be found directly from second-order statistics. In this appendix, MAD is derived for (i) normal and (ii) Laplace distribution models. In Fig. A.1, the distribution of the residual errors ($n = x - p$) from coding results of the Foreman sequence is compared with normal and Laplace distributions whose mean and variance are taken from the coding results; the comparison verifies that both distributions are good approximations of the real data. Note that during the motion search the quality of the predictor improves under the error metric and converges to the best predictor; thus the predictors used in Fig. A.1 are the best matches to the original signal in each block.

[Figure] Fig. A.1: Comparison of (a) normal and (b) Laplace distributions with real data obtained by encoding the Foreman sequence (CIF). Data is collected from 7 P-frames coded with QP 20, a ±32 search range and quarter-pixel precision using JSVM 8.4. The differences between original and predictor are computed for luminance only; their mean and variance are -0.25 and 10.66, respectively.

(i) For the normal distribution model $n \sim N(\mu, \sigma^2)$,

$$ E\{|n|\} = \int_{-\infty}^{\infty} \frac{|n|}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(n-\mu)^2}{2\sigma^2}}\,dn = \sqrt{\frac{2\sigma^2}{\pi}}\, e^{-\frac{\mu^2}{2\sigma^2}} - \mu\left(1 - 2Q\!\left(-\frac{\mu}{\sigma}\right)\right) \qquad (\mathrm{A.8}) $$

where $Q(c) = \int_c^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\,dt$. Therefore,

$$ \mathrm{MAD}_i = E\{|n_i|\} = \sqrt{\frac{2\sigma_i^2}{\pi}}\, e^{-\frac{\mu_i^2}{2\sigma_i^2}} - \mu_i\left(1 - 2Q\!\left(-\frac{\mu_i}{\sigma_i}\right)\right). \qquad (\mathrm{A.9}) $$

If $\mu_i \approx 0$ or $\mu_i/\sigma_i \approx 0$,

$$ \mathrm{MAD}_i \approx \sqrt{\frac{2\sigma_i^2}{\pi}}. \qquad (\mathrm{A.10}) $$

For signals with $\sigma_0^2 < \sigma_1^2$, from (A.10), $p_0$ is selected as a better estimate of $x$ than $p_1$. Therefore, when the residual errors $n_0$ and $n_1$ follow normal distributions with $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$, MSD and MAD give the same result.
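The closed form (A.8)-(A.9) is easy to verify by simulation. The sketch below is our own check (the moments are those reported for Fig. A.1); it compares a Monte Carlo estimate of $E\{|n|\}$ with the formula, using $Q(c) = 1 - \Phi(c)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = -0.25, np.sqrt(10.66)        # moments reported for Fig. A.1

n = rng.normal(mu, sigma, 1_000_000)
mc = float(np.abs(n).mean())             # Monte Carlo estimate of E{|n|}

Q = lambda c: 1.0 - norm.cdf(c)          # Gaussian tail probability
formula = (np.sqrt(2 * sigma**2 / np.pi) * np.exp(-mu**2 / (2 * sigma**2))
           - mu * (1 - 2 * Q(-mu / sigma)))   # Eq. (A.8)

print(mc, formula)                       # the two values agree closely
```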
(ii) If $n$ follows a Laplace distribution, the probability density function (pdf) is

$$ f_n(n) = \frac{1}{2a}\, e^{-\frac{|n-\mu|}{a}} \qquad (\mathrm{A.11}) $$

where $\mu = E\{n\}$ and $\sigma^2 = 2a^2 = E\{(n-\mu)^2\}$, and

$$ E\{|n|\} = \int_{-\infty}^{\infty} \frac{|n|}{2a}\, e^{-\frac{|n-\mu|}{a}}\,dn = a\, e^{-\frac{|\mu|}{a}} + |\mu|. \qquad (\mathrm{A.12}) $$

Therefore, using $a = \sqrt{\sigma^2/2}$,

$$ \mathrm{MAD}_i = \sqrt{\frac{\sigma_i^2}{2}}\, e^{-\frac{\sqrt{2}\,|\mu_i|}{\sigma_i}} + |\mu_i|. \qquad (\mathrm{A.13}) $$

If $\mu_i \approx 0$ or $\mu_i/\sigma_i \approx 0$,

$$ \mathrm{MAD}_i \approx \frac{\sigma_i}{\sqrt{2}}. \qquad (\mathrm{A.14}) $$

For signals with $\sigma_0^2 < \sigma_1^2$, from (A.14), $p_0$ is selected as a better estimate of $x$ than $p_1$. Therefore, when the residual errors $n_0$ and $n_1$ follow Laplace distributions with $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$, MSD and MAD give the same result.

In conclusion, for normal and Laplace distributions, if $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$, SSD and SAD provide the same search capability. As can be seen in Fig. A.1, the conditions $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$ are satisfied in most video sequences.²

² The above analysis is statistical; in block motion/disparity search there may be blocks for which an accurate predictor of the original signal is hard to find (e.g., occluded or uncovered regions), so the conditions $\mu_i \approx 0$ and $\mu_i/\sigma_i \approx 0$ are not satisfied. In such cases, if $|\mu_i| \gg 0$ or $|\mu_i/\sigma_i| \gg 0$, MSD ($= \mu_i^2 + \sigma_i^2$) tends to be large; thus, instead of an INTER block mode, in which a motion/disparity search is performed, an INTRA block mode would be used, and the comparison between SSD and SAD in motion search becomes moot.

Appendix B

Additional Weight Selection in IBS

In this appendix, it is shown statistically why $(\frac{1}{2}, \frac{1}{2})$ has been included in $W$ in addition to $(1,0)$ and $(0,1)$, which correspond to $p_0$ and $p_1$, respectively.

Let $x$ be the original pixel, predicted by the two pixel predictors $p_0$ and $p_1$ with corresponding residual errors $n_0$ and $n_1$:

$$ n_0 = x - p_0, \qquad n_1 = x - p_1. $$

Let the mean and variance of $n_i$ be denoted $\mu_i = E\{n_i\}$ and $\sigma_i^2 = E\{(n_i-\mu_i)^2\}$, respectively. An additional predictor $p_a$ is defined as a weighted sum of $p_0$ and $p_1$: $p_a = \alpha_0 p_0 + \alpha_1 p_1$. Let $n_a = x - p_a$ denote the residual error of $p_a$. With the constraint $\alpha_0 + \alpha_1 = 1$, so that $p_a = p_0$ when $\alpha_0 = 1$ and $p_a = p_1$ when $\alpha_1 = 1$,

$$ n_a = x - p_a = (\alpha_0+\alpha_1)x - (\alpha_0 p_0 + \alpha_1 p_1) = \alpha_0(x-p_0) + \alpha_1(x-p_1) = \alpha_0 n_0 + \alpha_1 n_1. $$

Therefore, the mean and the variance of $n_a$ are

$$ \mu_a = E\{n_a\} = \alpha_0\mu_0 + \alpha_1\mu_1, \qquad \sigma_a^2 = E\{(n_a-\mu_a)^2\} = \alpha_0^2\sigma_0^2 + 2\alpha_0\alpha_1\sigma_c^2 + \alpha_1^2\sigma_1^2 \qquad (\mathrm{B.1}) $$

where $\sigma_c^2 = E\{(n_0-\mu_0)(n_1-\mu_1)\}$. The residual energy corresponding to $p_a$ is quantified as

$$ \mathrm{MSE}_a = E\{n_a^2\} = \mu_a^2 + \sigma_a^2 = (\alpha_0(\mu_0-\mu_1)+\mu_1)^2 + \alpha_0^2\sigma_0^2 + (1-\alpha_0)^2\sigma_1^2 + 2\alpha_0(1-\alpha_0)\sigma_c^2 $$
$$ = \alpha_0^2(\tilde{\sigma}_0^2+\tilde{\sigma}_1^2) - 2\alpha_0\tilde{\sigma}_1^2 + \sigma_1^2 + \mu_1^2 = (\tilde{\sigma}_0^2+\tilde{\sigma}_1^2)\left(\alpha_0 - \frac{\tilde{\sigma}_1^2}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2}\right)^2 + \sigma_1^2 + \mu_1^2 - \frac{\tilde{\sigma}_1^4}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2} \qquad (\mathrm{B.2}) $$

where $\tilde{\sigma}_i^2 = \sigma_i^2 + \mu_i^2 - \sigma_c^2 - \mu_0\mu_1$ for $i \in \{0,1\}$. Setting to zero the gradient of $\mathrm{MSE}_a$ with respect to $\alpha_0$ in (B.2), the optimal $\alpha_0$ and $\alpha_1$ are found to be

$$ \alpha_0 = \frac{\tilde{\sigma}_1^2}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2} = \frac{E\{n_1(n_1-n_0)\}}{E\{(n_0-n_1)^2\}}, \qquad \alpha_1 = \frac{\tilde{\sigma}_0^2}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2}, \qquad (\mathrm{B.3}) $$

and the minimum $\mathrm{MSE}_a$ is

$$ \mathrm{MMSE}_a = \sigma_1^2 + \mu_1^2 - \frac{\tilde{\sigma}_1^4}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2} = \sigma_c^2 + \mu_0\mu_1 + \frac{\tilde{\sigma}_0^2\tilde{\sigma}_1^2}{\tilde{\sigma}_0^2+\tilde{\sigma}_1^2}. \qquad (\mathrm{B.4}) $$
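The optimal weight (B.3) has a direct sample analogue; the sketch below is an illustration we add (the data is synthetic) that estimates $\alpha_0$ from residual samples within a segment:

```python
import numpy as np

def optimal_alpha0(n0, n1):
    """Sample version of Eq. (B.3): alpha0 = E{n1(n1-n0)} / E{(n0-n1)^2}."""
    den = float(((n0 - n1) ** 2).mean())
    return float((n1 * (n1 - n0)).mean()) / den if den > 0 else 0.5

rng = np.random.default_rng(1)
n0 = rng.normal(0, 2, 5000)    # residual of p0 (the better predictor)
n1 = rng.normal(0, 4, 5000)    # residual of p1, independent of n0
print(optimal_alpha0(n0, n1))  # close to 16/(4+16) = 0.8: p0 is weighted more
```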
Due to the computational complexity of the multiplication and division in (B.3) and the signaling overhead of $\alpha_0$, a weight can instead be pre-selected, with only the weight index minimizing the distortion signaled. From the constraints that the weights are non-negative and $\alpha_0 + \alpha_1 = 1$, $\alpha_0$ should lie in $(0,1)$, and a straightforward selection is $\frac{1}{2}$, which corresponds to the average of $p_0$ and $p_1$. With respect to computational complexity, $\frac{1}{2}$ is the most efficient weight in $(0,1)$, because the new predictor $p_a$ then requires only the sum of $p_0$ and $p_1$ followed by a shift operation, as can be seen in Section 4.3.

If $\alpha$ is defined as the weight most frequently chosen, it can be found as

$$ \alpha = \arg\max_{0<\alpha_0<1} P\{\mathrm{MSE}_a < \mathrm{MSE}_0 \ \&\ \mathrm{MSE}_a < \mathrm{MSE}_1\}. \qquad (\mathrm{B.5}) $$

From (B.2),

$$ \mathrm{MSE}_a < \mathrm{MSE}_0 \iff \alpha_0^2(\tilde{\sigma}_0^2+\tilde{\sigma}_1^2) - 2\alpha_0\tilde{\sigma}_1^2 + \sigma_1^2 + \mu_1^2 < \sigma_0^2 + \mu_0^2 \iff \frac{1-\alpha_0}{1+\alpha_0} < \frac{\tilde{\sigma}_0^2}{\tilde{\sigma}_1^2} \qquad (\mathrm{B.6}) $$

$$ \mathrm{MSE}_a < \mathrm{MSE}_1 \iff \alpha_0^2(\tilde{\sigma}_0^2+\tilde{\sigma}_1^2) - 2\alpha_0\tilde{\sigma}_1^2 + \sigma_1^2 + \mu_1^2 < \sigma_1^2 + \mu_1^2 \iff \frac{\tilde{\sigma}_0^2}{\tilde{\sigma}_1^2} < \frac{2-\alpha_0}{\alpha_0}. \qquad (\mathrm{B.7}) $$

From (B.6) and (B.7), (B.5) is equal to

$$ \alpha = \arg\max_{0<\alpha_0<1} P\left\{\frac{1-\alpha_0}{1+\alpha_0} < \frac{\tilde{\sigma}_0^2}{\tilde{\sigma}_1^2} < \frac{2-\alpha_0}{\alpha_0}\right\}. \qquad (\mathrm{B.8}) $$

If the $m$ pixel residuals in a segment are regarded as $m$ sample observations from independent normal random variables $n_i \sim N(0, \kappa_i^2)$, $i \in \{0,1\}$, the sum of the residual energy of $p_i$ in the segment is $(m-1)s_i^2$, where $s_i^2$ is the sample variance of $n_i$. Replacing $\tilde{\sigma}_i^2$ with $s_i^2$ in (B.8), and noting that $\chi_i^2 = (m-1)s_i^2/\kappa_i^2$ has a chi-square density with $\nu_i = m-1$ degrees of freedom,

$$ \alpha = \arg\max_{0<\alpha_0<1} P\left\{\frac{1-\alpha_0}{1+\alpha_0} < \frac{\kappa_0^2\,\chi_0^2}{\kappa_1^2\,\chi_1^2} < \frac{2-\alpha_0}{\alpha_0}\right\}. \qquad (\mathrm{B.9}) $$

Because $n_0$ and $n_1$ are assumed independent, $\chi_0^2$ and $\chi_1^2$ are independent chi-square random variables with $\nu_0$ and $\nu_1$ degrees of freedom, respectively, so $F = \frac{\chi_0^2/\nu_0}{\chi_1^2/\nu_1}$ has an F-distribution with $\nu_0$ numerator and $\nu_1$ denominator degrees of freedom. With $\nu_0 = \nu_1 = m-1$,

$$ \alpha = \arg\max_{0<\alpha_0<1} P\left\{\frac{1-\alpha_0}{1+\alpha_0} < \frac{\kappa_0^2}{\kappa_1^2}\,F < \frac{2-\alpha_0}{\alpha_0}\right\}. \qquad (\mathrm{B.10}) $$

The probability in (B.10) is calculated for various values of the three parameters. First, for six different values of $m \in \{10, 50, 100, 150, 200, 250\}$, the probabilities are calculated with $\kappa_0^2/\kappa_1^2$ and $\alpha_0$ fixed; their average over $m$ is shown in Tab. B.1 for each $\kappa_0^2/\kappa_1^2$ and $\alpha_0$. The range of $\kappa_0^2/\kappa_1^2$ is limited to $[\frac{1}{3}, 3]$ because most values of $\kappa_0^2/\kappa_1^2$ lie in $[\frac{1}{3}, 3]$, as can be seen in the example of Fig. C.1. As shown in the last row of Tab. B.1, the average probability over $m$ and $\kappa_0^2/\kappa_1^2$ is highest for $\alpha_0 = \frac{1}{2}$. If the probability of each $\kappa_0^2/\kappa_1^2$ value in Fig. C.1 is also taken into account, the probabilities corresponding to $\kappa_0^2/\kappa_1^2 = 1$ are weighted most, which favors the weight $\frac{1}{2}$ even more.

Tab. B.1: The probability in (B.10), calculated while varying three parameters: (i) $m$, (ii) $\kappa_0^2/\kappa_1^2$, and (iii) $\alpha_0$. The average of the probabilities for $m \in \{10, 50, 100, 150, 200, 250\}$ is shown for each $\kappa_0^2/\kappa_1^2$ and $\alpha_0$; the last row shows the average over $\kappa_0^2/\kappa_1^2$.

    κ0²/κ1² \ α0 |  0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
    3            | 0.994  0.987  0.966  0.892  0.500  0.130  0.050  0.028  0.017
    2            | 0.983  0.987  0.983  0.970  0.934  0.774  0.353  0.108  0.048
    3/2          | 0.965  0.978  0.983  0.980  0.969  0.942  0.840  0.500  0.168
    1            | 0.828  0.935  0.965  0.977  0.980  0.977  0.965  0.935  0.828
    2/3          | 0.168  0.500  0.840  0.942  0.969  0.980  0.983  0.978  0.965
    1/2          | 0.048  0.108  0.353  0.774  0.934  0.970  0.983  0.987  0.983
    1/3          | 0.017  0.028  0.050  0.130  0.500  0.892  0.966  0.987  0.994
    average      | 0.572  0.646  0.734  0.809  0.827  0.809  0.734  0.646  0.572

Although the weight $\frac{1}{2}$ is derived under the assumption that $n_0$ and $n_1$ are independent and follow zero-mean normal distributions, it is also the weight with the least computational complexity for $p_a$. Therefore, in IBS the additional weight is defined to be $\frac{1}{2}$.
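The entries of Tab. B.1 can be reproduced from (B.10) using the F-distribution; a sketch of the computation, assuming scipy is available:

```python
import numpy as np
from scipy.stats import f

def prob_average_wins(alpha0, kappa_ratio, m):
    """Eq. (B.10): P{(1-a)/(1+a) < kappa_ratio * F < (2-a)/a} with
    F ~ F(m-1, m-1) and kappa_ratio = kappa0^2 / kappa1^2."""
    lo = (1 - alpha0) / (1 + alpha0) / kappa_ratio
    hi = (2 - alpha0) / alpha0 / kappa_ratio
    dist = f(m - 1, m - 1)
    return dist.cdf(hi) - dist.cdf(lo)

ms = [10, 50, 100, 150, 200, 250]
val = np.mean([prob_average_wins(0.5, 1.0, m) for m in ms])
print(round(float(val), 3))   # about 0.98, matching the Tab. B.1 entry
```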
Appendix C

Comparison between MSD and MAD for Weight Selection

In order to avoid the complexity of the multiplications in the SSD distortion measure, SAD is adopted in the weight selection for each segment in IBS. In this appendix, starting from a statistical model of the residual error, we study the penalty incurred by using SAD instead of SSD in the IBS weight selection. Let $p_a$ be the weighted sum of $p_0$ and $p_1$,

$$ p_a = \alpha_0 p_0 + \alpha_1 p_1 $$

with $\alpha_0 + \alpha_1 = 1$ and $0 < \alpha_0 < 1$. The associated residual error is then

$$ n_a = \alpha_0 n_0 + \alpha_1 n_1. $$

Let $\mu_i$ and $\sigma_i^2$ be the mean and variance of $n_i$. From (B.1),

$$ \mu_a = \alpha_0\mu_0 + \alpha_1\mu_1, \qquad \sigma_a^2 = \alpha_0^2\sigma_0^2 + 2\alpha_0\alpha_1\sigma_c^2 + \alpha_1^2\sigma_1^2 \qquad (\mathrm{C.1}) $$

where $\sigma_c^2 = E\{(n_0-\mu_0)(n_1-\mu_1)\}$.

With three predictors ($p_0$, $p_1$ and $p_a$) available in each segment of the macroblock, $p_a$ will be selected when it gives the minimum distortion. For the MSD distortion measure, from (B.6) and (B.7),

$$ \mathrm{MSD}_a < \mathrm{MSD}_0 \ \&\ \mathrm{MSD}_a < \mathrm{MSD}_1 \iff E\{n_a^2\} < E\{n_0^2\} \ \&\ E\{n_a^2\} < E\{n_1^2\} \iff \frac{\alpha_1^2}{1-\alpha_0^2} < \frac{\tilde{\sigma}_0^2}{\tilde{\sigma}_1^2} < \frac{1-\alpha_1^2}{\alpha_0^2} \qquad (\mathrm{C.2}) $$

where $\tilde{\sigma}_i^2 = \sigma_i^2 + \mu_i^2 - \sigma_c^2 - \mu_0\mu_1$ for $i \in \{0,1\}$. If (i) $n_0$ and $n_1$ are uncorrelated and (ii) $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$, then $\tilde{\sigma}_i^2 = \sigma_i^2 + \mu_i^2 - \sigma_c^2 - \mu_0\mu_1 \approx \sigma_i^2$, so (C.2) becomes

$$ \frac{\alpha_1^2}{1-\alpha_0^2} < \frac{\sigma_0^2}{\sigma_1^2} < \frac{1-\alpha_1^2}{\alpha_0^2}. \qquad (\mathrm{C.3}) $$

For the MAD distortion measure,

$$ \mathrm{MAD}_a < \mathrm{MAD}_0 \ \&\ \mathrm{MAD}_a < \mathrm{MAD}_1 \iff E\{|\alpha_0 n_0 + \alpha_1 n_1|\} < E\{|n_0|\} \ \&\ E\{|\alpha_0 n_0 + \alpha_1 n_1|\} < E\{|n_1|\}. \qquad (\mathrm{C.4}) $$

Due to the absolute value operation, it is not straightforward to find a generic closed-form solution of (C.4). Therefore, we solve (C.4) assuming (i) normal and (ii) Laplace distribution models for $n_0$ and $n_1$.

(i) For the normal distribution model $n_i \sim N(\mu_i, \sigma_i^2)$, from (A.8) with $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$,

$$ E\{|n_i|\} \approx \sqrt{\frac{2\sigma_i^2}{\pi}}. \qquad (\mathrm{C.5}) $$

If $n_0$ and $n_1$ are uncorrelated, from (C.1), $\sigma_a^2 = \alpha_0^2\sigma_0^2 + \alpha_1^2\sigma_1^2$, so $E\{|n_a|\} \approx \sqrt{2\sigma_a^2/\pi} = \sqrt{2(\alpha_0^2\sigma_0^2 + \alpha_1^2\sigma_1^2)/\pi}$. Thus (C.4) becomes

$$ \mathrm{MAD}_a < \mathrm{MAD}_0 \ \&\ \mathrm{MAD}_a < \mathrm{MAD}_1 \iff \sqrt{\frac{2\sigma_a^2}{\pi}} < \sqrt{\frac{2\sigma_0^2}{\pi}} \ \&\ \sqrt{\frac{2\sigma_a^2}{\pi}} < \sqrt{\frac{2\sigma_1^2}{\pi}} \iff \frac{\alpha_1^2}{1-\alpha_0^2} < \frac{\sigma_0^2}{\sigma_1^2} < \frac{1-\alpha_1^2}{\alpha_0^2}. \qquad (\mathrm{C.6}) $$

From (C.3) and (C.6), if the residual errors of $p_0$ and $p_1$ are uncorrelated and follow normal distributions with $\mu_i \approx 0$ and/or $\mu_i/\sigma_i \approx 0$, the same weight index is selected by both MSD and MAD.
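This equivalence can be spot-checked by simulation. The sketch below is our own check on synthetic data: it draws independent zero-mean normal residuals and compares the predictor chosen among $p_0$, $p_1$ and the average predictor by the two measures.

```python
import numpy as np

rng = np.random.default_rng(2)

def chosen(n0, n1):
    """Index (0: p0, 1: p1, 2: average) picked by MSD and by MAD."""
    resid = [n0, n1, 0.5 * n0 + 0.5 * n1]
    by_msd = int(np.argmin([(r ** 2).mean() for r in resid]))
    by_mad = int(np.argmin([np.abs(r).mean() for r in resid]))
    return by_msd, by_mad

trials, agree = 2000, 0
for _ in range(trials):
    s0, s1 = rng.uniform(1.0, 3.0, size=2)   # residual standard deviations
    n0 = rng.normal(0.0, s0, 256)            # independent zero-mean residuals
    n1 = rng.normal(0.0, s1, 256)
    m, a = chosen(n0, n1)
    agree += (m == a)
print(agree / trials)   # close to 1: the two measures almost always agree
```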
(ii) If $n_0$ and $n_1$ follow Laplace distributions, from (A.11) and (A.12), $\mathrm{MAD}_0$ and $\mathrm{MAD}_1$ are found as

$$ \mathrm{MAD}_0 = E\{|n_0|\} = \int_{-\infty}^{\infty} \frac{|n_0|}{2a}\, e^{-\frac{|n_0-\mu_0|}{a}}\,dn_0 = a\, e^{-\frac{|\mu_0|}{a}} + |\mu_0| = \sqrt{\frac{\sigma_0^2}{2}}\, e^{-\frac{\sqrt{2}\,|\mu_0|}{\sigma_0}} + |\mu_0| \qquad (\mathrm{C.7}) $$

$$ \mathrm{MAD}_1 = E\{|n_1|\} = \int_{-\infty}^{\infty} \frac{|n_1|}{2b}\, e^{-\frac{|n_1-\mu_1|}{b}}\,dn_1 = b\, e^{-\frac{|\mu_1|}{b}} + |\mu_1| = \sqrt{\frac{\sigma_1^2}{2}}\, e^{-\frac{\sqrt{2}\,|\mu_1|}{\sigma_1}} + |\mu_1|. \qquad (\mathrm{C.8}) $$

For $n_0 \perp n_1$, the probability density function (pdf) of $n_a$ is

$$ f_{n_a}(n_a) = \int_{-\infty}^{\infty} \frac{1}{\alpha_0}\, f_{n_0}\!\left(\frac{n_a-\alpha_1 n_1}{\alpha_0}\right) f_{n_1}(n_1)\,dn_1 = \frac{1}{2\big((a\alpha_0)^2-(b\alpha_1)^2\big)}\left(a\alpha_0\, e^{-\frac{|n_a-\mu_a|}{a\alpha_0}} - b\alpha_1\, e^{-\frac{|n_a-\mu_a|}{b\alpha_1}}\right) \qquad (\mathrm{C.9}) $$

thus $\mathrm{MAD}_a$ is given by

$$ \mathrm{MAD}_a = E\{|n_a|\} = \frac{(a\alpha_0)^3\, e^{-\frac{|\mu_a|}{a\alpha_0}} - (b\alpha_1)^3\, e^{-\frac{|\mu_a|}{b\alpha_1}}}{(a\alpha_0)^2-(b\alpha_1)^2} + |\mu_a|. \qquad (\mathrm{C.10}) $$

Under the assumption that $\mu_i \approx 0$ or $\mu_i/\sigma_i \approx 0$, and $\mu_0 \approx \mu_a \approx \mu_1$, from (C.7) and (C.10),

$$ \mathrm{MAD}_a < \mathrm{MAD}_0 \iff \frac{(a\alpha_0)^2+(b\alpha_1)^2+a\alpha_0 b\alpha_1}{a\alpha_0+b\alpha_1} < a \iff \frac{a}{b} < \frac{-\alpha_1-\sqrt{\alpha_1^2+4\alpha_0\alpha_1}}{2\alpha_0} \ \text{or}\ \frac{a}{b} > \frac{-\alpha_1+\sqrt{\alpha_1^2+4\alpha_0\alpha_1}}{2\alpha_0} \qquad (\mathrm{C.11}) $$

and from (C.8) and (C.10),

$$ \mathrm{MAD}_a < \mathrm{MAD}_1 \iff \frac{(a\alpha_0)^2+(b\alpha_1)^2+a\alpha_0 b\alpha_1}{a\alpha_0+b\alpha_1} < b \iff \frac{\alpha_0-\sqrt{\alpha_0^2+4\alpha_0\alpha_1}}{2\alpha_0} < \frac{a}{b} < \frac{\alpha_0+\sqrt{\alpha_0^2+4\alpha_0\alpha_1}}{2\alpha_0}. \qquad (\mathrm{C.12}) $$

Thus, from (C.11) and (C.12),

$$ \mathrm{MAD}_a < \mathrm{MAD}_0 \ \&\ \mathrm{MAD}_a < \mathrm{MAD}_1 \iff \frac{-\alpha_1+\sqrt{\alpha_1^2+4\alpha_0\alpha_1}}{2\alpha_0} < \frac{a}{b} < \frac{\alpha_0+\sqrt{\alpha_0^2+4\alpha_0\alpha_1}}{2\alpha_0} \iff \left(\frac{-\alpha_1+\sqrt{\alpha_1^2+4\alpha_0\alpha_1}}{2\alpha_0}\right)^2 < \frac{\sigma_0^2}{\sigma_1^2} < \left(\frac{\alpha_0+\sqrt{\alpha_0^2+4\alpha_0\alpha_1}}{2\alpha_0}\right)^2. \qquad (\mathrm{C.13}) $$

For the additional weight used in Chapter 4, $\alpha_0 = \alpha_1 = \frac{1}{2}$,

$$ \mathrm{MSD}_a < \mathrm{MSD}_0 \ \&\ \mathrm{MSD}_a < \mathrm{MSD}_1 \iff \frac{\alpha_1^2}{1-\alpha_0^2} < \frac{\sigma_0^2}{\sigma_1^2} < \frac{1-\alpha_1^2}{\alpha_0^2} \iff \frac{1}{3} < \frac{\sigma_0^2}{\sigma_1^2} < 3 \qquad (\mathrm{C.14}) $$

and

$$ \mathrm{MAD}_a < \mathrm{MAD}_0 \ \&\ \mathrm{MAD}_a < \mathrm{MAD}_1 \iff \left(-\frac{1}{2}+\sqrt{\frac{5}{4}}\right)^2 < \frac{\sigma_0^2}{\sigma_1^2} < \left(\frac{1}{2}+\sqrt{\frac{5}{4}}\right)^2 \iff 0.382 < \frac{\sigma_0^2}{\sigma_1^2} < 2.618. \qquad (\mathrm{C.15}) $$

Therefore, MSD and MAD select different predictors in two separate intervals, as summarized in Tab. C.1.

Tab. C.1: Sub-optimality of MAD with respect to MSD.

    Range of σ0²/σ1²   | Predictor by MSD | Predictor by MAD
    (a) (1/3, 0.382)   | p_a              | p_0
    (b) (2.618, 3)     | p_a              | p_1

For (a), $p_a$ is the predictor with minimum distortion by MSD, but $p_0$ is chosen by MAD. Similarly, for (b), $p_a$ is the predictor with minimum distortion by MSD, but $p_1$ is chosen by MAD. The sub-optimal choice by MAD increases the distortion by

$$ \text{(a)}\quad \Delta D_0 = \mathrm{MSD}_0 - \mathrm{MSD}_a = (1-\alpha_0^2)\sigma_0^2 - \alpha_1^2\sigma_1^2 = \tfrac{3}{4}\sigma_0^2 - \tfrac{1}{4}\sigma_1^2 $$
$$ \text{(b)}\quad \Delta D_1 = \mathrm{MSD}_1 - \mathrm{MSD}_a = (1-\alpha_1^2)\sigma_1^2 - \alpha_0^2\sigma_0^2 = \tfrac{3}{4}\sigma_1^2 - \tfrac{1}{4}\sigma_0^2. \qquad (\mathrm{C.16}) $$

From the two intervals in Tab. C.1,

$$ \text{(a)}\quad 2.618\,\sigma_0^2 < \sigma_1^2 < 3\,\sigma_0^2, \qquad \text{(b)}\quad 2.618\,\sigma_1^2 < \sigma_0^2 < 3\,\sigma_1^2 \qquad (\mathrm{C.17}) $$

thus

$$ \text{(a)}\quad 0 < \Delta D_0 < 0.10\,\sigma_0^2 \approx 0.11\,\mathrm{MSD}_a, \qquad \text{(b)}\quad 0 < \Delta D_1 < 0.10\,\sigma_1^2 \approx 0.11\,\mathrm{MSD}_a. \qquad (\mathrm{C.18}) $$

Therefore, when the residuals follow Laplace distributions, MAD may choose $p_0$ or $p_1$, which is not the minimum-distortion predictor ($p_a$), and the distortion of this sub-optimal choice can exceed the optimal distortion $\mathrm{MSD}_a$ by up to 11%. Fig. C.1 shows the distribution of $\sigma_0^2/\sigma_1^2$ from the IBS coding results of Foreman with QP 24. In this example, $\sigma_0^2/\sigma_1^2$ is clustered around 1, and about 68% and 82% of the values lie in $(\frac{1}{2}, 2)$ and $(\frac{1}{3}, 3)$, respectively. The probability that $\sigma_0^2/\sigma_1^2$ falls in interval (a) or (b) of Tab. C.1 is $P(\frac{1}{3} < \sigma_0^2/\sigma_1^2 < 0.382) + P(2.618 < \sigma_0^2/\sigma_1^2 < 3) = 6.7\%$. Thus, the penalty of using MAD instead of MSD is negligible in practice.

[Figure] Fig. C.1: Distribution of σ0²/σ1² from the IBS coding results of Foreman with QP 24 (histogram concentrated around 1).
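The constants in (C.15) and the bound in (C.18) follow from elementary arithmetic; a short check we add for completeness:

```python
import numpy as np

# MAD selection interval for alpha0 = alpha1 = 1/2, Eq. (C.15)
lo = (-0.5 + np.sqrt(1.25)) ** 2   # 0.38197
hi = ( 0.5 + np.sqrt(1.25)) ** 2   # 2.61803
print(lo, hi)

# Worst case of interval (a): sigma1^2 = 2.618 * sigma0^2 (take sigma0^2 = 1)
s0sq, s1sq = 1.0, hi
delta_d0 = 0.75 * s0sq - 0.25 * s1sq    # Delta D_0 of Eq. (C.16), ~0.0955
msd_a = 0.25 * (s0sq + s1sq)            # distortion of the average predictor
print(delta_d0 / msd_a)                 # ~0.106, i.e. the "up to 11%" penalty
```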
Abstract
Multi-view video sequences consist of a set of monoscopic video sequences captured at the same time by cameras at different locations and angles. These sequences contain 3-D information that can be used to deliver new 3-D multimedia services. Due to the amount of data, it is important to efficiently compress these multi-view sequences to deliver more accurate 3-D information.