LOW COMPLEXITY MOSAICKING AND UP-SAMPLING TECHNIQUES FOR HIGH RESOLUTION VIDEO DISPLAY

by

Ming-Sui Lee

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2006

Copyright 2006 Ming-Sui Lee

Dedication

To my beloved family

Acknowledgements

I am grateful to my advisor, C.-C. Jay Kuo, for his guidance and encouragement throughout the whole work of this dissertation. I would also like to thank Akio Yoneyama san and Tomoyuki Shimizu san from KDDI Laboratories, Inc., Japan, for their precious comments and collaboration. I am very grateful to Meiyin Shen, Yu Hu and May Kuo for their kind help. Also, I would like to give special thanks to Chia-Hao Chiang for his full support and encouragement throughout the whole process.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Significance of the Research
  1.2 Comparison of Raw and Coded Image/Video Mosaic Techniques
  1.3 Comparison of Image-based and Block-based Super Resolution Techniques
  1.4 Contributions of the Research
  1.5 Outline of the Dissertation

2 Research Background: Raw and Coded Video Mosaicking
  2.1 Problems in Image/Video Mosaicking
  2.2 Review of Traditional Image Registration Techniques
    2.2.1 Feature Detection
    2.2.2 Matching of Image Areas and Features
    2.2.3 Geometric Image Transforms
    2.2.4 Optical Flow
  2.3 Review of Traditional Super Resolution and Image Enhancement
    2.3.1 Spatial-domain Algorithms
    2.3.2 Frequency-domain Algorithms
  2.4 Challenges of Coded Video Mosaicking and Research Objectives

3 Color Matching and Compensation of Coded Images
  3.1 Pre-processing via White Balancing
    3.1.1 Fundamentals of Gray World Assumption (GWA)
    3.1.2 White Balancing in DCT Domain
    3.1.3 Experimental Results
  3.2 Histogram Matching
    3.2.1 Fundamentals of Histogram Matching Technique
    3.2.2 Pixel-domain Color Adjustment
    3.2.3 DCT-domain Color Adjustment
  3.3 Polynomial Approximation
    3.3.1 Pixel-domain Contrast Stretching
    3.3.2 DCT-domain Contrast Stretching
  3.4 Post-processing via Linear Filtering
    3.4.1 Pixel-domain Post-processing
    3.4.2 DCT-domain Post-processing
  3.5 Experimental Results
    3.5.1 Stitched Images After Color Matching
    3.5.2 Performance Comparison
    3.5.3 Other Considerations
  3.6 Conclusion

4 Fast and Accurate Block-Level Registration of Coded Images
  4.1 Block-level Image Registration with Edge Estimation
    4.1.1 Image Segmentation for Foreground Extraction
    4.1.2 Edge Estimation
    4.1.3 Displacement Parameter Estimation
    4.1.4 Experimental Results
  4.2 Block-level Image Registration based on Edge Extraction
    4.2.1 Edge Detection on DC Maps
    4.2.2 Thresholding
    4.2.3 Displacement Parameter Estimation
    4.2.4 Experimental Results
  4.3 Robustness of the Proposed Alignment Method

5 Advanced Mosaic Techniques for Coded Video
  5.1 Hybrid Block/Pixel Registration
    5.1.1 Alignment of Projected Boundary Blocks
    5.1.2 Alignment of Selected 2D Blocks in the Pixel Domain
    5.1.3 Experimental Results
  5.2 Block-based Video Registration
    5.2.1 Static Background Alignment
    5.2.2 Moving Object Alignment
    5.2.3 Displacement Parameter Estimation
    5.2.4 Experimental Results
  5.3 DCT Block Analysis and Classification
    5.3.1 DCT-Domain Block Classifications
    5.3.2 Experimental Results

6 Block-Adaptive Image Upsampling and Enhancement Techniques
  6.1 Block-Adaptive Image Up-Sampling Techniques
    6.1.1 Complexity Comparison
    6.1.2 Visual Quality Comparison
    6.1.3 Image Re-sizing
    6.1.4 Initialization for Block MAP Iteration
  6.2 Image Up-Sampling with Adaptive Enhancement
    6.2.1 Facet Modeling
    6.2.2 Unsharp Masking
    6.2.3 Experimental Results

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work
Reference List

List of Tables

3.1 Comparison of processing time (seconds) between two different domains and two different approaches.
3.2 MSE for different order polynomial approaches.
4.1 Comparison between the proposed and the traditional approaches in processing time.
4.2 Comparison between the proposed and the traditional approaches in processing time (sec): (a) the traditional method and (b) the proposed method; (c) and (d) are the savings in terms of seconds and percentages.
4.3 Comparison between the displacement parameters $(d_i, d_j)$ derived based on the proposed approach and the actual displacement parameters $(d_i', d_j')$.
5.1 The fine-tuning parameters $(f_{i,corner}, f_{j,corner})$ determined by the pixel information of each detected corner block pair.
5.2 The fine-tuning parameters $(f_{i,edge}, f_{j,edge})$ determined by the edge information of each detected corner block pair.
5.3 The fine-tuning parameters $(f_{i,corner}, f_{j,corner})$ determined by the pixel information of each detected corner block pair.
5.4 The fine-tuning parameters $(f_{i,edge}, f_{j,edge})$ determined by the edge information of each detected corner block pair.
5.5 Execution time comparison (in seconds) of (a) the traditional method, (b) the proposed DCT-domain algorithm and (c) the proposed hybrid method.
5.6 Comparison of displacement vectors: $(d_i, d_j)$ is obtained by the block-level alignment, $(d_{i,hybrid}, d_{j,hybrid})$ is obtained by the hybrid block/pixel alignment and $(d_{i,actual}, d_{j,actual})$ is the actual one.
6.1 Comparison of processing time (sec.) of different interpolation methods, including the zero-order-hold (ZOH), bilinear interpolation (BLI), block-adaptive super resolution (BSR) and traditional MAP estimation (MAP).

List of Figures

2.1 Temporal synchronization required for video mosaicking.
2.2 Need of focal length compensation: (a) illustration of the focal length and (b) barrel distortion effects.
2.3 Illustration of image registration and mosaicking.
2.4 Illustration of color matching and compensation: (a) two input color images and (b) the image mosaic without color adjustment.
2.5 The observation model for super resolution.
2.6 Illustration of the proposed video mosaicking system.
3.1 Experimental results of applying the Gray World Assumption to all image pixels: (a) the input image and (b) the output image.
3.2 Experimental results of applying GWA to an image with low saturation and low intensity components.
3.3 Experimental results of applying GWA in the DCT domain.
3.4 The histograms of three components of an image: (a) before and (b) after histogram matching.
3.5 The block diagram of histogram matching in the pixel domain.
3.6 The DCT blocks from images 1 and 2 have (a) the exact matched location and (b) an offset of (m, n).
3.7 The bilinear interpolation of four DCT blocks to synthesize the pseudo DCT blocks and vice versa.
3.8 The curves of two weighting parameters used in the linear combination.
3.9 The original two input images.
3.10 The stitched image without color matching.
3.11 The output image after color adjustment in the pixel domain with histogram matching.
3.12 The output image after color adjustment in the DCT domain with histogram matching.
3.13 The output image after color adjustment in the pixel domain with polynomial approximation.
3.14 The output image after color adjustment in the DCT domain with polynomial approximation.
3.15 Stitched images with polynomial approximation in the DCT domain (a) without block displacement and (b) with block displacement.
3.16 Image mosaic with the 2nd order polynomial in the DCT domain: (a) DC plus the first three AC values and (b) DC plus the first five values.
4.1 Conversion from a DC map to a binary activity map: (a) the DC map and (b) the binary activity map.
4.2 (a) The relationship between original images and binary maps and (b) the geometrical representation for displacement.
4.3 The sign patterns of weighted pixel values for the first few AC values.
4.4 The eight quantized levels for coarse-scale edge orientation estimation.
4.5 The four test image pairs.
4.6 Performance comparison in processing time.
4.7 Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.
4.8 Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.
4.9 Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.
4.10 Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.
4.11 Comparison between the first- and the second-order derivative filters.
4.12 The difference maps of (a) image 1 and (b) image 2 using filter $H_1$ and the corresponding binary activity maps of (c) image 1 and (d) image 2.
4.13 A detailed overview of the proposed method.
4.14 The original test images.
4.15 Performance comparison in processing time.
4.16 The stitched images.
4.17 Figure 4.16 continued.
4.18 The composition results with different levels of Gaussian noise: 12.5%, 25%, and 37.5%.
5.1 Projection of two-dimensional data to one dimension.
5.2 The 3×3 block patterns that have a higher probability to contain one or multiple corners at the central block.
5.3 The flow chart of the proposed system.
5.4 The first frames of two input sequences for the 1st experiment.
5.5 The portion around the boundaries of stitched frames: (a) the 15th frame, (b) the 30th frame, and (c) the 45th frame.
5.6 The first frames of two input sequences for the 2nd experiment.
5.7 The portion around the boundaries of stitched frames: (a) the 15th frame, (b) the 30th frame and (c) the 45th frame.
5.8 The 8×8 array of basis images for the 2D DCT.
5.9 Area grouping of DCT coefficients for defining the ratios for block classification.
5.10 The block classification diagram.
5.11 The block classification results for the 1st test image.
5.12 The block classification results for the 2nd test image.
6.1 The complexity for different methods measured in terms of the processing time (in seconds) as a function of the image size (in pixels).
6.2 Visual quality comparison of different image-upsampling methods for blocks of size 8×8 in the RGB domain ((a) and (b)) and in the YCbCr domain ((c) and (d)).
6.3 Visual quality comparison of different image-upsampling methods for blocks of size 16×16 in the RGB domain ((a) and (b)) and in the YCbCr domain ((d) and (e)); (c) and (f) are difference maps.
6.4 Visual quality comparison of different image-upsampling methods for blocks of size 32×32 in the RGB domain ((a) and (b)) and in the YCbCr domain ((d) and (e)); (c) and (f) are difference maps.
6.5 Visual comparison of output images with image sizes of 64×64 ((a), (d)), 128×128 ((b), (e)), and 256×256 ((c), (f)), respectively.
6.6 Comparison between the original and the degraded images due to image resizing.
6.7 Detecting the difference between the original and the resized images using bilinear interpolation.
6.8 The block-based MAP estimator with different initialization methods: (a) zero-order-hold and (b) bilinear interpolation.
6.9 Comparison of differences between two initialization methods.
6.10 The coordinates of a facet model.
6.11 Experimental results of (a) bilinear interpolation, (b) the facet model and (c) 1D directional unsharp masking.
6.12 Comparison of pixel intensity before and after applying an unsharp mask.
6.13 Experimental results of unsharp masked texture patterns: (a) the original texture patterns and (b) the unsharp masked texture patterns.
6.14 Experimental results of the first two test patterns: (a) bilinear interpolation and (b) the proposed content-adaptive upsampling method.
6.15 Experimental results of the other two test patterns: (a) bilinear interpolation and (b) the proposed content-adaptive upsampling method.
6.16 The 1D image data across an edge.

Abstract

Several challenging issues in image/video mosaicking and upsampling for high resolution display are addressed here. All of them are treated mainly in the DCT (Discrete Cosine Transform) domain so that lower computational complexity can be achieved.

First, color matching and compensation techniques are proposed to remove the seam lines along image boundaries caused by the different color tones of the inputs. The color deviation of each input image is corrected first, and the color differences between input images are then compensated using a polynomial-based contrast stretching technique. The proposed approach is attractive for its low computational complexity. Experimental results demonstrate that the color-matching problem can be satisfactorily solved in the compressed domain even when the DCT blocks of the original input images are not aligned.

Two block-level image registration techniques for compressed video, such as motion JPEG or the I-pictures of MPEG, are investigated. The proposed methods are based on edge estimation and extraction in the DCT domain, so that the computational cost of image registration is reduced dramatically as compared with pixel-domain edge-based registration techniques while a comparable quality of composition is achieved. To reach higher registration accuracy, a post-processing technique, hybrid block/pixel-level alignment, is proposed so that the displacement vector resolution can be enhanced from the block level to the pixel level. In contrast with traditional spatial-domain processing, we do not apply the inverse DCT to the whole image but only to selected blocks. Experiments show that the proposed algorithm saves around 40% of the computational complexity while achieving the same quality.

In the last part, a content-adaptive technique is proposed to upsample an image to an output image of higher resolution. The proposed technique is also a block-based processing algorithm, which offers the flexibility of choosing the most suitable up-sampling method for each particular block type. Block classification is first conducted in the DCT domain to categorize each image block into one of several types: smooth areas, textures, edges and others. For the plain background and smooth surfaces, simple patches are used to enlarge the image size without degrading the resultant visual quality. The unsharp masking method is applied to textured regions to preserve high frequency components. Since human eyes are more sensitive to edges, we adopt a more sophisticated technique to process edge blocks; that is, they are approximated by a facet model so that the image data at subpixel positions can be generated accordingly. A post-processing technique such as 1D directional unsharp masking can be used to further enhance edge sharpness. Experimental results are given to demonstrate the efficiency of the proposed techniques.
Chapter 1
Introduction

1.1 Significance of the Research

Since the first camera was invented in 1816, there has been great interest in developing more advanced image/video capturing devices and technologies. Many innovative ideas have been brought forth: from analog to digital signals, from monochromatic to color systems, from images to video clips, and from low resolution to high resolution video. Digital video has become popular recently, as evidenced by the quick growth of the consumer electronics markets, including DVD (digital versatile disc), DTV (digital television) and other related entertainment products and services.

DTV is a new broadcasting technology that supports multiple digital television formats. Among all formats, high definition television (HDTV) offers the highest quality. HDTV uses a wide screen format and supports very high resolution that contains more than twice as many lines as current analog TVs. Although HDTV display devices have become affordable to the household these days, HD video capturing devices are used predominantly by professionals due to the high cost of the equipment. There is a growing demand from general consumers to generate their own high quality multimedia content at a low price. One way to create high resolution video is through video authoring using the image/video mosaic technique.

Image/video mosaicking is the process of stitching together two or more images/videos taken by different cameras from different viewpoints. Applications of image/video mosaic techniques can be found in computer vision, pattern recognition and remotely sensed data processing. When input image/video contents are taken from different viewpoints, sampling times and sensors, image registration is needed to integrate these image/video tiles together. Over the past few decades, a lot of research has been done on obtaining an image/video mosaic. For an extensive survey of previous work, we refer to [7], [56]. Generally speaking, the image registration technique consists of two major steps: feature detection and feature matching. They will be reviewed in Chapter 2.

Even though image/video mosaicking has been studied for years, most techniques were primarily developed using the information of raw video (also called uncoded video). They are implemented in the space-time domain (or the image pixel domain). In this research, we consider mosaicking of coded video, since each individually captured video content is often coded before its transmission. If we perform image/video mosaicking in the raw video domain, we have to perform image/video decoding first. This involves the inverse DCT when the input video is in a compressed format such as motion JPEG and MPEG. That approach may not be suitable for implementation in real-time embedded systems due to a much larger memory requirement and the extra decoding procedure demanded.

In today's applications, multimedia capturing and display devices of different resolutions can be easily connected by networks, and there is a great need to develop techniques that facilitate flexible image/video format conversion and content adaptation among heterogeneous terminals. Quality degradation due to down-sampling, up-sampling, blurring, coding/decoding and content adaptation mechanisms in the transmission process is inevitable. Thus, techniques for super resolution and image enhancement are required to generate high quality multimedia outputs. Super resolution and image enhancement have both been investigated for a long time and are able to produce high quality content. The challenge in the current context is to strike a balance between low computational complexity and high quality of the resultant image/video.

The above observation has motivated us to study image/video mosaicking and super resolution enhancement directly in the compressed domain. Since the motion JPEG, MPEG and H.26x coding standards all adopt the DCT representation in the coding process, the goal of this research is to conduct the registration process in the DCT (Discrete Cosine Transform) domain to generate the corresponding high-quality compressed image/video mosaic from multiple compressed video inputs.

1.2 Comparison of Raw and Coded Image/Video Mosaic Techniques

Most traditional image/video mosaic techniques are conducted in the raw image/video domain, where mosaicking is essentially an image registration problem. Image registration usually consists of two steps: feature extraction (or detection) and feature matching. They are briefly discussed below.

Feature detection can be done either manually or automatically. Since human eyes are sensitive to geometric patterns, it is straightforward for people to choose matched patterns. However, it is desirable to develop an automatic feature selection process based on the particular application context. Feature detection techniques can be classified into two categories: the feature point-based and the area-based approaches. The feature point-based approach extracts salient points such as corners, line intersections, line ends and centroids of closed-boundary regions. For example, the wavelet transform was used in [19] to extract the local maxima, and the partial derivatives of image pixel values were proposed in [25] for corner detection. However, this process is time consuming and sensitive to noise. The area-based approach uses a correlation function to determine the degree of closeness of two regions. For example, it computes the cross-correlation of the intensities of regions of the input images to find the best match. This approach is suitable for images that do not have many details; however, its computational complexity is still high. Once the feature information is available, the next step is to find the optimal correspondence between the extracted features. Feature matching is a process to determine the relationship between similar objects contained in different images. This can be achieved by finding the spatial relations between the extracted features of the input images.

Although existing methods lead to good results in a high SNR (Signal-to-Noise Ratio) environment, they are only applicable in the raw image/video domain, where the operations are applied to image pixels. This process is computationally expensive in general. In practice, raw video content is seldom stored and/or transmitted in real world applications. Once some image/video content is captured, it is compressed (or coded) for storage and/or transmission. Commonly used video coding standards such as motion JPEG, MPEG and H.26x all adopt the DCT representation in the coding process. Thus, it is desirable to conduct the registration process directly in the DCT domain for multiple coded video inputs to synthesize an image/video mosaic, which is the main difference between this research and traditional image/video mosaic techniques.

1.3 Comparison of Image-based and Block-based Super Resolution Techniques

With concerns similar to those for raw image/video mosaic techniques, image-based super resolution techniques suffer from intensive computational complexity, although high performance is guaranteed.
Block-based processing is preferred due to several advantages. First of all, block-based algorithms reduce the degrees of freedom dramatically: the dimension of the block under consideration at any one time is much smaller than the original image size. Also, the block-based method provides a mechanism for segmenting the image into several content types so that content-adaptive image processing can be chosen accordingly. Moreover, since each block can be treated individually as a smaller image, parallel processing is applicable to speed up the computation even more. The flexibility and low computational complexity of the block-based algorithm make it more attractive than traditional image-based algorithms.

1.4 Contributions of the Research

In this proposal, we first consider the color matching problem of two input image/video contents. Then, we study the image registration of two arbitrarily translated images/videos in the DCT domain. The specific contributions of this research are highlighted below.

• Development of DCT-domain color adjustment techniques
Several color matching algorithms that compensate for color differences between two input image sequences captured by different cameras are proposed. Some of them are conducted in the pixel domain while others are carried out in the DCT domain. The two proposed techniques, i.e., histogram matching and polynomial contrast stretching, can eliminate the seam lines successfully at a much lower computational complexity as compared with color adjustment in the pixel domain. Moreover, when comparing the two proposed approaches, the polynomial-based contrast stretching method outperforms the histogram matching method in terms of processing time and memory requirements, since solving a second order linear system is faster than performing histogram adjustment, and only three matching coefficients need to be stored (rather than the whole histogram matching table).

• Development of DCT-domain registration techniques
DCT-domain registration techniques for MPEG video are developed for video mosaic authoring with indoor and outdoor scenes. Both of them can achieve a satisfactory quality of composition while the computational cost is reduced significantly in comparison with pixel-domain techniques. Furthermore, a post-processing technique called "hybrid block/pixel level alignment", which is conducted partially in the pixel domain, is introduced to enhance the accuracy of the alignment. For hybrid block/pixel level alignment, an algorithm is proposed to detect corner blocks based on the DCT coefficients. Only corner blocks detected in the DCT domain are converted back to the spatial domain for alignment fine-tuning. This hybrid technique can achieve excellent alignment results at the cost of slightly increased complexity.

• Robustness of DCT-domain registration in a low SNR environment
It is observed that the proposed DCT-domain registration techniques are robust in the presence of noise. This phenomenon is studied and explained. Rather than dealing with pixel intensities directly, the proposed DCT-domain methods adopt the DC component of the DCT coefficients for block alignment. The DC coefficient can be viewed as a down-sized version of the original image since it is the average energy of the whole 8×8 DCT block. The DCT-domain algorithms are robust in a low SNR environment since noise is largely removed by this averaging process.

• Development of DCT-domain block classification techniques
Properties of 8×8 DCT blocks are investigated for the purpose of block classification. A decision tree is formed to decide whether a block contains background, texture or edges. The decision is made using thresholds defined based on the energy distribution of the DCT coefficients. By following the decision tree, the blocks of an image can be classified into several categories, which helps reduce the computational complexity of further processing such as super resolution. For example, a simple interpolation scheme can be used in the background region and on smooth surfaces, and texture synthesis techniques can be performed in the texture area. A more sophisticated algorithm with high performance should be adopted for blocks that contain important visual information, such as corners and edges, since human eyes are more sensitive to these features.

• Development of DCT-domain super resolution and image enhancement techniques
A content adaptive technique is proposed in this work to upsample an image to an output image of higher resolution. The proposed technique is a block-based processing algorithm that offers the flexibility of choosing the most suitable up-sampling method for each particular block type. Block classification is first conducted in the DCT domain to categorize each image block into one of several types: smooth areas, textures, edges and others. For the plain background and smooth surfaces, simple patches are used to enlarge the image size without degrading the resultant visual quality. The unsharp masking method is applied to textured regions to preserve high frequency components. Since human eyes are more sensitive to edges, we adopt a more sophisticated technique to process edge blocks; that is, they are approximated by a facet model so that the image data at subpixel positions can be generated accordingly. A post-processing technique such as 1D directional unsharp masking can be used to further enhance edge sharpness.

1.5 Outline of the Dissertation

This dissertation is organized as follows. The problem of raw and coded image/video mosaicking is explained in Chapter 2. Several DCT-domain algorithms for color matching and adjustment are presented in Chapter 3. DCT-based image registration techniques are proposed in Chapter 4. Properties of DCT-based image/video registration techniques are investigated in Chapter 5. A content-adaptive up-sampling technique for image resolution enhancement is proposed in Chapter 6. Finally, concluding remarks and future research directions are given in Chapter 7.

Chapter 2
Research Background: Raw and Coded Video Mosaicking

Image/video mosaicking, which combines several image/video inputs into a panorama output, has been widely used in image processing, computer graphics, computer vision, and remotely sensed data processing. In a generic scenario, we may consider multiple video sources captured by an arbitrary number of cameras with different parameter settings. The discrepancies among the smaller video tiles have to be resolved for seamless composition.

2.1 Problems in Image/Video Mosaicking

Due to different camera calibrations, special attention has to be paid to compensating disparities such as temporal synchronization, focal length reparation, image registration and color difference adjustment. These issues are described in detail below.

• Temporal Synchronization
Consider two input image sequences used to form a video mosaic. The first problem encountered is that the temporal sampling points of these two sequences are different. As shown in Fig. 2.1, there is a gap between the sampled frames of the two sequences. The goal of temporal synchronization is to perform temporal alignment between the two sequences, which can be achieved by camera calibration or temporal frame interpolation, so that the time difference between the sequences is significantly reduced.

Figure 2.1: Temporal synchronization required for video mosaicking.

• Focal Length Compensation
The focal length is the distance along the optical axis from the lens center to its focus (or focal point), as shown in Fig. 2.2(a). The longer the focal length, the smaller the field of view and the smaller the radial distortion. Radial distortion is a lens aberration in which the focal length varies radially outward from the center. It makes a straight line curved around the border of an image, which is also called barrel distortion. An example of the distortion effects of different cameras is shown in Fig. 2.2(b) [14]. Since the distortion affects the quality of the mosaic output, it has to be corrected as well.

Figure 2.2: Need of focal length compensation: (a) illustration of the focal length and (b) barrel distortion effects.

• Image/Video Registration
Image registration is a technique that aligns several partly overlapped images properly so as to create a panoramic view. An example is shown in Fig. 2.3. In this step, the critical features of each image have to be detected and their correspondences found to determine the disparities of all images. The disparities between images may include translation, rotation and scaling effects. Translation means that there exists a displacement vector along the vertical, horizontal or both directions between a pair of images. By rotation, we refer to an angle difference between the axis systems of the two capturing systems. The scaling effect, also known as the zoom-in and zoom-out effect, is a result of focal length change. Once the disparities are determined, the input images can be aligned so as to form an image with a larger field of view.

Figure 2.3: Illustration of image registration and mosaicking.

Figure 2.4: Illustration of color matching and compensation: (a) two input color images and (b) the image mosaic without color adjustment.

• Color Matching and Compensation
The calibration of different cameras may differ, so their color preferences may vary. This can result in different color tones between images. As shown in Fig. 2.4(b), the stitched image mosaic looks unpleasant since it contains two apparent seam lines around the image boundary. The objective of this process is to adjust the pixel values of the two images so that their color tones become similar to each other. As a consequence, the seam lines in the stitched image are eliminated.

2.2 Review of Traditional Image Registration Techniques

When input image/video contents are taken from different cameras with different viewpoints, sampling times and sensors, image registration is needed to integrate these image/video tiles together. Most traditional image registration techniques are developed in the pixel domain. They consist of two major steps: feature detection and feature matching, which are discussed in detail in this section.

2.2.1 Feature Detection

The main task of feature detection is to extract salient features such as region, line and point features, as explained below.

• Region features
Examples include lakes [18], forests [44], or any closed-boundary areas. They are detected by segmentation, which is done iteratively along the registration process.
• Line features
Examples include object contours [34] and line segments [53], which can be extracted by many methods. The standard ones are the Canny edge detector and an edge detector based on the Laplacian of Gaussian.

• Point features
Examples include line intersections [52], line ends and centroids of closed-boundary regions [19], [35], and corners [54], [55]. They are usually located at positions of high variance, such as local extrema of the wavelet transform or of the curvature.

Some feature extraction methods are based on the information provided by the first- or second-order derivatives, while others investigate the image behavior around corners. The specific features to use may vary according to the image contents and applications. However, all of them have something in common: they are locally unique, distributed over the image, and easy to detect. Once the feature information is available, the next step is to find the optimal correspondence of features between image tiles.

2.2.2 Matching of Image Areas and Features

Feature (or image) matching is a process to determine the relationship between similar objects contained in different images (or between individual images) by finding spatial relations among extracted features. There are several ways to define the similarity or difference measure for image/feature pairs.

• Cross-correlation
The cross-correlation between images $I_1$ and $I_2$ is defined as
$$CC(i,j)=\frac{\sum\big(I_1(i,j)-E(I_1)\big)\big(I_2(i,j)-E(I_2)\big)}{\sqrt{\sum\big(I_1(i,j)-E(I_1)\big)^2}\,\sqrt{\sum\big(I_2(i,j)-E(I_2)\big)^2}},\qquad(2.1)$$
where $I_1(i,j)$ and $I_2(i,j)$ are the intensity values of the two areas under alignment. The correlation function is used to determine the degree of closeness. To be more specific, it computes the cross-correlation of the intensities of a certain region of the input images to find the best match.

• Fourier transform
The Fourier transform converts an image from the space domain to the frequency domain. The cross-power spectrum of two images is defined by
$$\frac{F_1(\omega_x,\omega_y)\,F_2^{*}(\omega_x,\omega_y)}{\big|F_1(\omega_x,\omega_y)\,F_2^{*}(\omega_x,\omega_y)\big|}=e^{\,j(\omega_x d_x+\omega_y d_y)},\qquad(2.2)$$
where $d_x$ and $d_y$ are the displacement parameters. The displacement is determined by the peak of the inverse transform of the cross-power spectrum.

• Mutual information
The mutual information is defined as
$$I(I_1,I_2)=H(I_2)-H(I_2|I_1)=H(I_1)+H(I_2)-H(I_1,I_2),\qquad(2.3)$$
where $H(I)=-\sum_{i\in I}p(i)\log p(i)$ is the entropy of source $I$ and $p(i)$ is the probability function of $i$. The goal here is to maximize the value of the mutual information $I(I_1,I_2)$.

• Norms of image difference
The sum of absolute differences (SAD) and the sum of squared differences (SSD) of image pixels are two commonly used metrics. They are defined as
$$SAD=\sum_{i,j}\big|I_1(i,j)-I_2(i,j)\big|,\qquad(2.4)$$
$$SSD=\sum_{i,j}\big(I_1(i,j)-I_2(i,j)\big)^2.\qquad(2.5)$$

• Hausdorff distance
The Hausdorff distance is defined as
$$H(A,B)=\max\{h(A,B),\,h(B,A)\},\qquad(2.6)$$
where $A$ and $B$ are two point sets, $h(A,B)=\sup_{a\in A}\inf_{b\in B}\|a-b\|$, and $\|\cdot\|$ is the Euclidean norm. It was reported in [22] that this measure outperforms the cross-correlation method.

The feature matching process can also be treated as an optimization problem, which maximizes the similarity measures (e.g., cross correlation, Fourier transform and mutual information) or minimizes the difference measures (e.g., norms of image difference and the Hausdorff distance). Several solutions have been proposed for this optimization problem, including Gauss-Newton minimization, gradient descent optimization, and Levenberg-Marquardt optimization.
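The measures above translate almost directly into code. The following sketch (a minimal NumPy illustration of my own, not from the dissertation; the function names are hypothetical) computes the normalized cross-correlation of Eq. (2.1), the SAD/SSD metrics of Eqs. (2.4)-(2.5), and a translation estimate from the cross-power spectrum of Eq. (2.2):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-sized patches, cf. Eq. (2.1)."""
    a = a - a.mean()
    b = b - b.mean()
    return np.sum(a * b) / (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2)))

def sad(a, b):
    """Sum of absolute differences, Eq. (2.4)."""
    return np.sum(np.abs(a - b))

def ssd(a, b):
    """Sum of squared differences, Eq. (2.5)."""
    return np.sum((a - b)**2)

def phase_correlation(img1, img2):
    """Estimate a global translation from the peak of the inverse transform
    of the normalized cross-power spectrum, cf. Eq. (2.2)."""
    F1 = np.fft.fft2(img1)
    F2 = np.fft.fft2(img2)
    cps = F1 * np.conj(F2)
    cps /= np.abs(cps) + 1e-12            # normalize; epsilon guards zero division
    corr = np.real(np.fft.ifft2(cps))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # fold shifts larger than half the image size back to negative displacements
    if dy > img1.shape[0] // 2: dy -= img1.shape[0]
    if dx > img1.shape[1] // 2: dx -= img1.shape[1]
    return dy, dx
```

The sign of the returned displacement depends on which image is taken as the reference, and sub-pixel accuracy would require interpolating around the correlation peak; both are refinements the sketch leaves out.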
Besides the measures mentioned above, some researchers adopt a multi-scale approach, which registers images from the coarse to the fine scale. The wavelet decomposition is a representative of this hierarchical method, where the image is divided iteratively into four subbands of different frequencies.

2.2.3 Geometric Image Transforms

Another approach to solving the registration problem is to find a geometric transform between the two images. Geometric transforms include the rigid, affine, projective, perspective and polynomial transforms. They are explained below.

• Affine transform
An affine transform is usually composed of translation, rotation and scaling. It can be expressed in a general form as
$$\begin{bmatrix}x_2\\ y_2\end{bmatrix}=\begin{bmatrix}t_x\\ t_y\end{bmatrix}+\begin{bmatrix}\cos\theta & \sin\theta\\ -\sin\theta & \cos\theta\end{bmatrix}\begin{bmatrix}s_x & 0\\ 0 & s_y\end{bmatrix}\begin{bmatrix}x_1\\ y_1\end{bmatrix}.\qquad(2.7)$$

• Perspective transform
The perspective transform is used to model the effect of projecting a 3D scene onto a 2D image plane. Consider Cartesian coordinates in the 3D space, denoted by $(x,y,z)$. The corresponding coordinates in the image plane can then be expressed as
$$x'=\frac{-fx}{z-f},\qquad y'=\frac{-fy}{z-f},\qquad(2.8)$$
where $f$ is the focal length of the camera.

• Projective transform
If a scene plane is not parallel to the image plane, the scene is mapped onto the image plane through the following projective transform:
$$x'=\frac{a_{11}x+a_{12}y+a_{13}}{a_{31}x+a_{32}y+a_{33}},\qquad y'=\frac{a_{21}x+a_{22}y+a_{23}}{a_{31}x+a_{32}y+a_{33}},\qquad(2.9)$$
where the $a_{ij}$ are constants.

• Polynomial transform
The polynomial transform is adopted when the geometric model of the camera is unknown. The transformation can be written in the following form:
$$x'=\sum_{i=0}^{m}\sum_{j=0}^{i}a_{ij}\,x^{j}y^{\,i-j},\qquad y'=\sum_{i=0}^{m}\sum_{j=0}^{i}b_{ij}\,x^{j}y^{\,i-j}.\qquad(2.10)$$

2.2.4 Optical Flow

The optical flow approach has recently been proposed for video registration with good performance [3]. Let $I(x,y,t)$ be a function of the image intensity. Then, the behavior of the neighborhood centered at position $(x,y)$ over a short period of time can be expressed as
$$I(x+dx,\,y+dy,\,t+dt)=I(x,y,t)+\frac{\partial I}{\partial x}dx+\frac{\partial I}{\partial y}dy+\frac{\partial I}{\partial t}dt+\cdots,\qquad(2.11)$$
where the higher order derivatives are assumed to be negligible. If the point $(x,y)$ moves to a new position $(x+dx,\,y+dy)$ over a period of time $dt$, we have $I(x+dx,\,y+dy,\,t+dt)=I(x,y,t)$. Then, (2.11) can be rewritten as
$$\frac{\partial I}{\partial x}dx+\frac{\partial I}{\partial y}dy+\frac{\partial I}{\partial t}dt\approx 0.\qquad(2.12)$$
Dividing both sides of Eq. (2.12) by $dt$ and letting $\frac{dx}{dt}=u$, $\frac{dy}{dt}=v$ leads to
$$-\frac{\partial I}{\partial t}=\frac{\partial I}{\partial x}u+\frac{\partial I}{\partial y}v,\qquad(2.13)$$
which is called the optical flow equation. It relates the intensity changes along the $x$ and $y$ directions to the change along time $t$. The problem is ill-posed, and its solution demands additional constraints. Once the constraints are added, the problem can be solved by the Lagrange multiplier method.

Generally speaking, automatic image registration is still an open problem. Researchers continue to look for better algorithms with robust performance in various application environments.

2.3 Review of Traditional Super Resolution and Image Enhancement

The super resolution problem is formulated below. For a more detailed discussion of the super resolution problem, we refer to [4] and [49].

Let $\{Y_k\}_{k=1}^{N}$ and $X$ be the set of $N$ low resolution input images and the desired high resolution image, respectively. Then, by taking various degradation effects into consideration, the relationship between $Y_k$ and $X$ can be written as
$$Y_k=D_kB_kW_kX+N_k,\qquad k=1,\cdots,N,\qquad(2.14)$$
where $W_k$ is the warping matrix, $B_k$ the blur matrix, $D_k$ the subsampling matrix and $N_k$ the noise. This is called the image observation model [39], shown in Fig. 2.5. Note that, in most cases, $N_k$ is assumed to be white Gaussian noise with correlation function $E\{N_kN_k^{T}\}=\sigma^2 I$. The super-resolution problem is to recover $X$ based on the observations $Y_k$ with $1\le k\le N$.
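To make the observation model concrete, the sketch below (an illustrative toy of my own, not the author's code; `observe` and its parameters are hypothetical names) synthesizes one low-resolution observation $Y_k$ from a high-resolution image $X$ by applying a translational warp, a Gaussian blur, decimation and additive white Gaussian noise, mirroring the factors $W_k$, $B_k$, $D_k$ and $N_k$ of Eq. (2.14):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def observe(X, dx, dy, sigma=1.0, factor=2, noise_std=2.0, rng=None):
    """Generate one low-resolution observation Y_k = D_k B_k W_k X + N_k."""
    rng = rng or np.random.default_rng(0)
    warped = shift(X.astype(float), (dy, dx), order=1, mode='nearest')  # W_k
    blurred = gaussian_filter(warped, sigma)                            # B_k
    sub = blurred[::factor, ::factor]                                   # D_k
    return sub + rng.normal(0.0, noise_std, sub.shape)                  # + N_k

# A stack of such observations with different sub-pixel shifts (dx, dy)
# is the input that any super-resolution method tries to invert.
```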
This is called the image observation model [39] and shown in Fig. 2.5. Note that, for most cases, N k is assumed to be white Gaussian noise with correlation function EfN k N k T g = ¾ 2 I. The super-resolution problem is to recover X based on observations Y k with 1·k·N. 20 Figure 2.5: The observation model for super resolution. Although many existing super-resolution techniques can provide high quality output, mostofthemdealwithrawvideointhepixeldomainratherthancompressedvideosuchas MPEG in the DCT domain. Even some of them adopt the frequency-domain information, they still demand the manipulation in the pixel (or space) domain. 2.3.1 Spatial-domain Algorithms Generally speaking, spatial-domain algorithms can be categorized into following types basedontheunderlyingtechniques: interpolation,iteratedbackprojection(IBP),stochas- ticreconstructionmethods,settheoreticreconstructionmethods,hybridML/MAP/POCS methods and optimal adaptive ¯ltering. Each of them will be discussed in the following. ² Interpolation Interpolation is the most intuitive method to enhance the resolution of an image. The most commonly used one is the bilinear interpolation. There are however other interpolationschemesavailable. Forexample, Landweberalgorithm[27]wasusedby Komatsu et al. [26] and Shah et al. [45] and a weighted nearest neighbor interpola- tion was adopted by Alam et al. [1]. A wavelet-based algorithm was introduced by Nguyenet al. [36]todealwithinterlacedtwo-dimensionaldata. Generallyspeaking, interpolationoperationsareeasytoimplement. However, theyarenotrelatedtothe 21 observation model. Important issues such as image degradation due to optical blur, motion blur and noise, cannot be well treated by this approach. ² Iterated Backprojection Let ~ x, ~ Y and H be the estimates of the desired super-resolution image, the esti- mated LR image and the image observation model. Then, we have ~ Y = H~ x. The iterated backprojection (IBP) is a process that backprojects the error between the k th estimated LR image ~ Y (k) and the observed LR imageY by a matrix denoted by H BP . The generalized iterative equation can be written as ~ x (k+1) = ~ x (k) +H BP (Y¡ ~ Y (j) ) = ~ x (k) +H BP (Y¡H~ x (j) ) (2.15) The above procedure is performed iteratively until the error between the estimated and the observed ones converges. Note that this method is not applicable if there is some a priori information on x that has to be taken into account. ² Stochastic Reconstruction When there is a priori information on x, a stochastic method called the Bayesian approach can be adopted. As mentioned before, the image observation model is of the form: Y = Hx+N. The Maximum A-Posterior (MAP) estimate of x can be derived by maximizing the power density function (PDF) PxjY as ~ x (k+1) = argmaxP(xjY) = argmaxPflnP(Yjx)+lnP(x)g; (2.16) 22 where lnP(Yjx) is the log-likelihood function and P(x) is the a priori density ofx. NotethatP(x)isoftenrepresentedbytheMarkovrandom¯eld(MRF)accordingto thelocalneighboringinteractionmodel. TheHuberMRFwasadoptedbyStevenson in [43]. The Gaussian MRF was used by Hanson et al. [9] and Hardie et al. [17], where the latter takes not only the global motion information but also the PSF of the sensor/optical system into consideration. Another class of stochastic reconstruction methods is based on the maximum likeli- hood (ML) formulation for registration, interpolation and restoration [48], [49], [11]. 
Katsaggelos [49] used this scheme to estimate the e®ect of sub-pixel shifts and noise variance and the super-resolution image at the same time. The problem was solved by a method called the expectation-maximization (EM) algorithm. Since the a prior information plays an important role in the ill-posed super-resolution problem, the ML algorithm is less preferable than the MAP estimation since the ML algorithm may not incorporate all prior knowledge properly. Generally speaking, stochastic methods including both MAP and ML estimates pro- vide a powerful suit of tools in the modeling of noise and the stochastic nature of underlying image and video. ² Set Theoretic Reconstruction The projection onto the convex set (POCS) is one of the prominent methods for solving the super-resolution problem. This idea was ¯rst introduced by Stark and Oskoui [37]. The a priori knowledge about the solution can be treated as imposing constraints on the solution so that it is an element of the intersection of several 23 convex sets denoted by C i , i=1;¢¢¢k, where each C i consists of vectors that satisfy certain properties. Given point x in the space, P i is the projection that projects x onto the closest point of set C i . After applying the process iteratively, i.e. projecting point x onto all constraint sets, we have x (n+1) = P k P k¡1 ¢¢¢P 2 P 1 x to fall in the intersection set, C I = \ m i=1 C i , which meets all constraints. Note that the closedness and the convexity of the constraint sets only guarantee the convergence of the iteration but nottheuniquenessofthesolution. Actually, the¯nalsolutionhighlydependsonthe initial guess. The POCS method is popular due to its simplicity, °exibility of the spatialdomainobservationmodelandtheeaseofincorporatinga prioriinformation. ² Hybrid ML/MAP/POCS Methods TocombinetheadvantagesofstochasticreconstructionmethodsandPOCS,ahybrid method was proposed in [43], [12]. If there are M constraints, the optimization can be modi¯ed as Minimize ² 2 =[y k ¡H k x] T R n ¡1 [y k ¡H k x]+®[S x ] T V[S x ] subject to x2C k ; 1·k·M; (2.17) where R n is the autocorrelation matrix of noise, S is the Laplacian operator, V is the weighting matrix to control the smoothing strength of each pixel, and C k is the constraint set. This hybrid method bene¯ts from the optimal estimates of stochastic reconstruction methods and the °exibility of including linear or nonlinear a priori 24 information of POCS. Thus, it is applicable under a more generic setting with good performance. ² Adaptive Filtering Approach The inverse ¯ltering technique can also be used in solving the super-resolution prob- lem. Jacquemod et al. [23] proposed a deconvolution process for observed images obtained through sub-pixel translation motion. A linear minimum mean squared er- ror(LMMSE)algorithm, whichcanbeviewedasamotioncompensatedmulti-frame Wiener ¯lter, was proposed by Erdem et al. [13] to process images with a spatial blur and additive noise. In addition to the Wiener ¯lter, the Kalman ¯lter was also adopted for super-resolution reconstruction in [40], [10]. This computationally e±- cient scheme can deal with images degraded by the spatially-varying blur. However, it cannot handle nonlinear modeling constraints e®ectively. 
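Before moving on, the IBP update of Eq. (2.15) can be illustrated concretely. The sketch below (my own simplified example, not from the dissertation) uses average-pooling as the forward model $H$ and pixel replication as the backprojection operator $H^{BP}$; any consistent pair of operators could be substituted:

```python
import numpy as np

def downsample(x, f=2):
    """Forward model H: average-pool by factor f (a crude blur plus decimation)."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def backproject(e, f=2):
    """Backprojection H_BP: spread each low-res error pixel over its f x f patch."""
    return np.kron(e, np.ones((f, f))) / (f * f)

def ibp(y, f=2, n_iter=50):
    """Iterated backprojection, Eq. (2.15), for a single observation y."""
    x = np.kron(y, np.ones((f, f)))      # zero-order-hold initial guess
    for _ in range(n_iter):
        err = y - downsample(x, f)       # residual in the low-resolution domain
        x = x + backproject(err, f)      # update the high-resolution estimate
    return x
```

With these operators each iteration shrinks the low-resolution residual by a constant factor, so the estimate converges to an image whose simulated observation matches $y$; with several shifted observations the updates would be accumulated over all of them.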
2.3.2 Frequency-domain Algorithms

To solve the super-resolution problem, the frequency-domain method [50] utilizes the shift property of the Fourier transform, the relationship between the continuous Fourier transform (CFT) and the discrete Fourier transform (DFT), and the assumption that the underlying image is band-limited. Although there are disadvantages associated with the frequency-domain approach, it is still computationally attractive when the degraded images only exhibit sub-pixel global translational motion. The frequency-domain approach [5] applied to the super-resolution problem is reviewed below.

Let $f(x,y)$ denote a continuous scene. Consider the following $R$ shifted images:
$$f_r(x,y)=f(x+\Delta x_r,\,y+\Delta y_r),\qquad r=1,2,\cdots,R.\qquad(2.18)$$
Their continuous Fourier transforms are denoted by $F(u,v)$ and $F_r(u,v)$, respectively. Applying the Fourier transform to (2.18), we obtain
$$F_r(u,v)=\exp\!\big[j2\pi(\Delta x_r u+\Delta y_r v)\big]\,F(u,v).\qquad(2.19)$$
The observed images $y_r[m,n]$ are obtained by sampling the original image. That is, we have
$$y_r[m,n]=f(mT_x+\Delta x_r,\,nT_y+\Delta y_r),\qquad m,n=0,1,\cdots,M-1.$$
Let $Y_r[k,l]$ be the DFT of $y_r$, $r=1,2,\cdots,R$. Then, we have
$$Y_r[k,l]=\alpha\sum_{p=-\infty}^{\infty}\sum_{q=-\infty}^{\infty}F_r\!\left(\frac{k}{MT_x}+pf_{sx},\;\frac{l}{NT_y}+qf_{sy}\right),\qquad r=1,2,\cdots,R,\qquad(2.20)$$
where $f_{sx}=1/T_x$ and $f_{sy}=1/T_y$ are the sampling rates along the horizontal and vertical directions, respectively. Based on (2.19), Eq. (2.20) can be rewritten in the form
$$\mathbf{Y}=\Phi\mathbf{F},\qquad(2.21)$$
where $\mathbf{Y}$ is the DFT coefficient vector with elements $Y_r[k,l]$, $r=1,2,\cdots,R$, $\mathbf{F}$ is the vector consisting of samples of the CFT of the high resolution image, and $\Phi$ is the matrix that models the relationship between $\mathbf{Y}$ and $\mathbf{F}$. The solution to the super-resolution problem can then be obtained by solving the linear system (2.21) for $\mathbf{F}$ and applying the inverse DFT to the resulting $\mathbf{F}$ to reconstruct the space-domain image $f$.

This frequency domain solution procedure saves a lot of computation. However, the assumptions made here are not realistic, since the optical point spread function (PSF) as well as the observation noise are not considered. Extensions of [4] were made by Kim et al. [24] and Tekalp et al. [47], who took the PSF and noise into consideration and solved the problem by the least squares method. Later on, a recursive least squares solution for (2.21) was proposed by Bose et al. [6], where the problem is modified to minimize
$$\|\Phi\mathbf{F}-\mathbf{Y}\|^2+\lambda\|\mathbf{F}-\mathbf{c}\|^2,\qquad(2.22)$$
where $\mathbf{c}$ is an approximation of the desired solution. The solution to this problem becomes
$$\tilde{\mathbf{F}}=(\Phi^{T}\Phi+\lambda I)^{-1}(\Phi^{T}\mathbf{Y}+\lambda\mathbf{c}),\qquad(2.23)$$
which can be computed iteratively rather than directly via matrix inversion. With a fast convergence rate, the computational complexity of an iterative method can be reduced dramatically. Kim et al. [6] proposed a recursive total least squares method to solve the problem where errors appear in the observations as well as in the system matrix. In that case, the observation can be expressed as
$$\mathbf{Y}=[\Phi+E]\mathbf{F}+\mathbf{N},\qquad(2.24)$$
where $E$ is the motion estimation error in $\Phi$ and $\mathbf{N}$ is additive noise. The problem can be further converted into a constrained optimization problem:
$$\text{Minimize }\big\|[\,\mathbf{N}\;\vdots\;E\,]\big\|_F,\qquad\text{subject to }\mathbf{Y}-\mathbf{N}=[\Phi+E]\mathbf{F}.\qquad(2.25)$$

If there exists a simple expression for the relationship between the low-resolution and high-resolution images, frequency-domain methods are computationally attractive. However, these methods can only handle global translational motion and spatially invariant degradation. It is difficult for them to deal with generic degradation models and to incorporate a priori information. These are their main limitations.
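The regularized solution (2.23) is a standard ridge-type least squares solve and can be prototyped in a few lines. The sketch below is illustrative only; for the complex-valued $\Phi$ of the frequency-domain formulation the transposes should be conjugate transposes:

```python
import numpy as np

def regularized_ls(Phi, Y, lam, c=None):
    """Regularized least squares solution of Eq. (2.23):
    F = (Phi^T Phi + lam I)^(-1) (Phi^T Y + lam c)."""
    n = Phi.shape[1]
    c = np.zeros(n) if c is None else c
    A = Phi.T @ Phi + lam * np.eye(n)    # use Phi.conj().T for complex Phi
    b = Phi.T @ Y + lam * c
    # direct solve; an iterative method such as CG avoids forming A explicitly
    return np.linalg.solve(A, b)
```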
2.4 Challenges of Coded Video Mosaicking and Research Objectives Even though image/video mosaicking has been studied for decades, most techniques have been developed based on raw video data (or uncoded video). Thus, most algorithms are implemented in the space-time domain (or the image pixel domain). In this research, we considermosaickingofcodedvideosinceeachindividualcapturedvideoiscodedbeforeits transmission. If we perform image/video mosaicking in the raw video domain, it demands image/video decoding ¯rst. For example, it will involve tedious inverse/forward DCT when the input video is in the compressed format such as motion JPEG and MPEG. They are not suitable for implementation in embedded real-time systems due to the heavy computation and large memory needed in this process. Thus, it is desirable to study 28 image/video mosaicking in the compressed domain directly, which is one of the main tasks of this research. The super resolution problem has been studied for a long while. However, all existing algorithms have been designed for raw video inputs. For multiple coded image sequences, how to get super resolution video based on coded video in the DCT domain (without de- coding them back to the raw video domain fully) to save computation as well as storage is clearly a great challenge. Some information in the coded video domain, such as DCT co- e±cients, quantization step sizes, motion vectors or even the residuals, could help improve the performance of super-resolution outputs. However, a super resolution method fully in the DCT domain could be very di±cult due to the limitation of the frequency domain methods as discussed in the last subsection. Thus, to develop the super-resolution tech- nique for multiple MPEG video sequences, our initial goal is to develop a hybrid method that utilizes raw as well as coded video data to produce a high-resolution MPEG video output. However, in order to save the computational cost, we prefer to perform opera- tions in the compressed domain as much as possible. The pixel domain process will be considered only when it is absolutely needed. DCT provides a powerful tool for energy compaction (by removing spatial redundancy of the underlying image), and it is widely used in image coding standards such as JPEG andMPEG.Thus,westudyimage/videoregistration,colormatching,andsuperresolution techniques based on multiple coded video clips in the DCT domain. Besides, motion vectors can provide auxiliary temporal information for image alignment. The proposed system is illustrated in Fig. 2.6. 29 Figure 2.6: Illustration of the proposed video mosaicking system. The ultimate object of this research is to develop an e±cient system to generate a mosaic image/video using several images/videos captured by di®erent sensors under dif- ferent conditions. Suppose there are few input video sequences of SD level produced by di®erent cameras and then passing through the MPEG encorders to generate compressed MPEGformatstreams. Astitchingschemeisperformedmostlyinthecompresseddomain without going through the conversion to the spatial domain. Note that there are some assumptions made in the proposed system. For example, the temporal synchronization parameters are well calibrated and the radial distortions caused by various focal length are also compensated in advance. Then, there are two major research issues remaining: namely, color matching and image/video registration. 
They need to be addressed to compensate the discrepancy between two image/video inputs and make one high quality stitched output. In our research, most of the processing is performed in the DCT domain so that a high resolution image sequence can be obtained from several low resolution image sequences via video mosaicking. Furthermore, a post-processing technique called super resolution can be applied to any region-of-interest or even the whole image for highlighting purposes.

Chapter 3
Color Matching and Compensation of Coded Images

A video source captured by a camera usually has its unique color preference and response to light. Consequently, the brightness and color of different video sources may vary significantly. Even though images are perfectly stitched geometrically, apparent seam lines may still exist between video tiles [35]. To make the image/video mosaic look more natural, it is important to find a way to adjust the color tones of the image/video inputs to be as close as possible.

Several ideas for comparing the color similarity between two images have been studied. For example, a structured light approach was proposed by Tsukada and Tajima [51]. More recently, a new technique was proposed by Hu and Mojsilovic [21] to extract relevant colors, where a new color distance measure was used to assure the optimal matching of different colors from two images. Most previous color matching algorithms are conducted in the spatial domain, which demands a larger amount of computation. Since many video sources are encoded using motion-compensated predictive coding techniques such as the MPEG and H.26x coding standards, it is preferred that the task of color matching and correction be done in the DCT transform domain (rather than in the pixel domain) in the video mosaic authoring process.

Under the simplifying assumption that all other differences have been compensated, we examine and compare various color-matching techniques in the pixel and the DCT (Discrete Cosine Transform) domains in this chapter. Algorithms in the pixel domain as well as the transform domain for color difference compensation and seam line removal among video tiles are proposed. They are compared in terms of visual quality. It will be demonstrated by experimental results that the transform-domain algorithm can achieve good composition quality while dramatically reducing the computational burden of decoding/encoding.

The overlapping region of two images can be determined in the image/video registration process, which will be described in Chapter 4. Here, it is assumed that the overlapping region of two adjacent images has I columns. For these I columns, we perform some manipulation to make each color component of the two images share a similar range of values. Two approaches are considered: the histogram-based approach and the polynomial-based contrast stretching approach, which will be described in Sec. 3.2 and Sec. 3.3, respectively.

3.1 Pre-processing via White Balancing

Before the compensation of the color difference between two images, a pre-processing step called white balancing may be needed to correct the color tone of each image individually. White balancing is the process of adjusting image colors under different illumination conditions so that the color bias of each image can be removed. In other words, an object that appears white to the human eye is rendered white in the photo. It is not difficult for our eyes to determine the white color under a different light condition automatically, but it is a difficult task for cameras to do so.
A technique called the Gray World Assumption (GWA) has been developed to achieve this goal and is reviewed below.

3.1.1 Fundamentals of Gray World Assumption (GWA)

The GWA states that a given image is assumed to have a sufficient amount of color variation. In other words, the averages of the R, G and B components of an image should average out to a common gray value. Under this assumption, the three channels are adjusted individually, but the adjustment ratio for each channel is kept the same for all pixels. This procedure is detailed as follows. Consider the three color components R, G, and B. First, the mean of each component as well as the overall mean are calculated and denoted by m_R, m_G, m_B and m_{RGB}, respectively. The ratio for each component is then defined as

r_R = \frac{m_{RGB}}{m_R}, \quad r_G = \frac{m_{RGB}}{m_G}, \quad r_B = \frac{m_{RGB}}{m_B}. \quad (3.1)

According to these ratios, the color components R, G, and B are adjusted by

R_{GWA} = r_R \cdot R, \quad G_{GWA} = r_G \cdot G, \quad B_{GWA} = r_B \cdot B. \quad (3.2)

An example of GWA is shown in Fig. 3.1. We see that the color tone has been compensated and the output image quality has been improved.

Figure 3.1: Experimental results of applying the Gray World Assumption to all image pixels: (a) the input image and (b) the output image.

3.1.2 White Balancing in DCT Domain

By examining the HSI components and the corresponding histograms of the two images in Fig. 3.1, we see that the main difference lies in the H component while there is not much difference in the S and I components. This indicates that GWA focuses more on the adjustment of the H component and less on that of the S and I components. If we examine GWA in the YCbCr space, we see that it has more impact on the chrominance components, i.e., Cb and Cr. Two different cases are considered below to verify the idea of performing GWA on the Cb and Cr components. For the first case, an image with lower saturation and lower intensity is shown in Fig. 3.2(a) and the corresponding GWA-enhanced output image is shown in Fig. 3.2(b). We see that the yellowish color tone has been corrected to some degree, but the output image quality is poor due to the low intensity. For the second case, an image with a different hue component but the same two other channels is shown in Fig. 3.1(a) and the corresponding GWA-enhanced output image is shown in Fig. 3.1(b). The quality of the output image is significantly enhanced. Finally, by comparing Fig. 3.1(b) and Fig. 3.2(b), we see that GWA works better for the hue adjustment (color tone manipulation) but cannot do much for the adjustment of the saturation and intensity components.

Figure 3.2: Experimental results of applying GWA to an image with low saturation and low intensity components.

3.1.3 Experimental Results

A preliminary experiment of performing GWA in the DCT domain has been conducted to verify the proposed idea of improving the color quality of an image. As shown in Fig. 3.3, we find that the result is similar to the one in Fig. 3.1(b). In other words, GWA is applicable to the chrominance components to compensate the color tone of an image successfully. Note that GWA works well under the assumption that the underlying image has a sufficient amount of color variation. Thus, if the image content is dominated by a certain color, the algorithm may fail to provide a high quality output image. Another color processing approach is proposed in the next section to deal with this situation.

Figure 3.3: Experimental results of applying GWA in the DCT domain.
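As a concrete illustration of Eqs. (3.1) and (3.2), a minimal pixel-domain GWA sketch in Python is given below; the function name and the clipping to [0, 255] are our own assumptions for illustration.

import numpy as np

def gray_world_balance(img):
    """Gray World Assumption (Eqs. 3.1-3.2) on an RGB image stored as a
    float array of shape (H, W, 3) with values in [0, 255]."""
    means = img.reshape(-1, 3).mean(axis=0)   # m_R, m_G, m_B
    m_rgb = means.mean()                      # common gray value m_RGB
    ratios = m_rgb / means                    # r_R, r_G, r_B  (Eq. 3.1)
    return np.clip(img * ratios, 0.0, 255.0)  # per-channel scaling (Eq. 3.2)

In the DCT-domain variant of Sec. 3.1.2, the analogous scaling would be applied to the DC values of the chrominance components rather than to individual pixels.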
3.2 Histogram Matching

In this section, we consider the problem of adjusting the histograms of two images to eliminate the seam lines around boundaries in an image mosaic using the histogram matching method. The color associated with each pixel can be represented using the RGB or the YCbCr color coordinates. In the following discussion, we focus on the histogram adjustment of any single color component (i.e., a monochrome image). The same process can be applied to the three color components separately.

3.2.1 Fundamentals of Histogram Matching Technique

The histogram of an image is obtained by choosing a bin size, which is usually 256 or fewer since we adopt the 8-bit representation for a monochrome image, and counting the number of pixels with values belonging to each bin. Finally, we may divide the count of each bin by the total number of pixels for normalization purposes. If we treat an image as a source and its gray level value as a random variable, then the histogram can be viewed as the probability density function (pdf).

By histogram matching, we refer to a process that adjusts the histogram of an image to be similar to a desired one. We use $I_1$, $H_{I_1}$ and $H_d$ to denote an input monochrome image, its histogram and the histogram of the desired output monochrome image, respectively. Our goal is to find the mapping F so that $H_{F(I_1)} = H_d$. To perform histogram matching, we need the cascade of the following two steps.

• Step 1: Map from $I_1$ to $I_0$ using the cumulative distribution function (cdf) of input image $I_1$, which is denoted by $F_1$, and

• Step 2: Map from $I_0$ to $F(I_1)$ using the inverse cumulative distribution function (icdf) with respect to the desired histogram $H_d$, which is denoted by $F_2$.

Please note that the cdf of an image is computed from its histogram by summing up successive bin counts from 0. The cdf is a function that maps the interval [0, 255] to [0, 1] while the inverse cdf maps [0, 1] back to [0, 255]. Thus, the histogram matching operation can be written as

F(I_1) = F_2[F_1(I_1)].

Fig. 3.4 shows an example, where the histograms of the three components (RGB) of an image are adjusted by applying the histogram matching technique. The three sub-figures in the top row of (a) are the histograms of the R, G, B color components of the first original input, and those in the bottom row are of the second original input. The histograms of the first and second output images after histogram matching are shown in the top and bottom rows of (b), respectively.

Figure 3.4: The histograms of three components of an image: (a) before and (b) after histogram matching.
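The cascade F(I_1) = F_2[F_1(I_1)] can be realized with a 256-entry lookup table. The Python sketch below assumes 8-bit monochrome data; the function name and the searchsorted-based inverse cdf are illustrative choices.

import numpy as np

def match_histogram(src, desired_hist):
    """Map the 8-bit image `src` so that its histogram approximates
    `desired_hist` (a normalized length-256 array)."""
    src_hist = np.bincount(src.ravel(), minlength=256) / src.size
    cdf_src = np.cumsum(src_hist)       # F1: gray level -> [0, 1]
    cdf_des = np.cumsum(desired_hist)   # cdf of the desired histogram
    # F2 (inverse cdf): smallest target level whose cdf reaches cdf_src
    lut = np.searchsorted(cdf_des, cdf_src).clip(0, 255).astype(np.uint8)
    return lut[src]                     # F(I1) = F2[F1(I1)]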
3.2.2 Pixel-domain Color Adjustment

In the pixel domain, the image color is typically represented by its red (R), green (G), and blue (B) components. Take the R component for example. Let $\tilde{R}_1(x,y)$ and $\tilde{R}_2(x,y)$ be the two 2-D sequences of the red component in the overlapping region of image 1 and image 2, respectively. We first compute the mean $m_R(x,y)$ of the corresponding pixels in these two sequences. Then, we generate histograms for the three sequences separately: $\tilde{R}_1(x,y)$, $\tilde{R}_2(x,y)$ and the mean sequence $m_R(x,y)$. Afterwards, we define a mapping table, $L_{R_1}$, for the pixels of image 1 in the overlapping region so that the histogram of $\tilde{R}_1(x,y)$ alone can be converted to that of the mean sequence. Then, by adopting the same mapping rule, we are able to update the R components of all other pixel values of image 1. The same procedure can be applied to image 2. After the R, G, and B components of both images are properly updated with the above procedure, the two new images should have the same color tone in the common (i.e., overlapping) region as well as in the two disjoint regions.

Figure 3.5: The block diagram of histogram matching in the pixel domain.

3.2.3 DCT-domain Color Adjustment

Instead of dealing with RGB components, we manipulate the DC values of the YCbCr components when performing color adjustment in the DCT domain, since image/video coding is usually done in the YCbCr domain.

1. Basic scheme for exactly matched block locations

For the DCT-domain processing, we first consider a simplified case where the DCT blocks from image 1 and image 2 match exactly, as shown in Fig. 3.6(a). Then, we perform the color matching process by adjusting the Y, Cb, Cr values of the DC coefficient of each DCT block. For each color component of the DC coefficient, we adopt the histogram matching method described in Sec. 3.2 for color adjustment.

2. Modified scheme for blocks with displacement

Next, let us consider a more complicated case, where the DCT blocks from image 1 and image 2 are offset by m and n pixels along the horizontal and vertical directions, respectively, where m and n are between 0 and 8. Then, we proceed with the following three steps.

• Step 1: Once the spatial displacement (m, n) is known, we use bilinear interpolation to interpolate the DC components of the four DCT blocks from image 1, i.e., a, b, c, and d shown in Fig. 3.7, that surround the target DCT block of image 2 so that the interpolated block has the same spatial location as the target DCT block. The interpolated DC value is called the DC component of the pseudo DCT block of image 1.

• Step 2: We adjust the color values of the DC components of the DCT blocks of image 2 and the pseudo DCT blocks of image 1 using the algorithm presented in Sec. 3.2.

• Step 3: We use bilinear interpolation again to recover the adjusted DC components of the DCT blocks of image 1 from the adjusted DC components of the surrounding pseudo DCT blocks. This process is illustrated in Fig. 3.7.

Figure 3.6: The DCT blocks from images 1 and 2 have (a) the exact matched location and (b) an offset of (m, n).

Figure 3.7: The bilinear interpolation of four DCT blocks to synthesize the pseudo DCT blocks and vice versa.

3.3 Polynomial Approximation

3.3.1 Pixel-domain Contrast Stretching

The color tone adjustment can also be achieved by contrast stretching, as detailed below. First, we compute the mean value of each color component of the pixels in the overlapping region. Then, for one input image, we approximate the mapping from the original color component to this newly calculated color component with a polynomial of a chosen order. We have n+1 coefficients for an n-th-order polynomial. These coefficients can be determined using the least squares method over all pixels in the overlapping region. Once these coefficients are found, they are used to update the values in the non-overlapping region of this image. The same procedure can be applied to the other input image. In our experiments, we found that a low-order polynomial such as n = 2 is sufficient; the improvement from a larger n is very limited. More details will be given in Sec. 3.5.2.
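The least-squares polynomial fit of Sec. 3.3.1 reduces to a single call in Python. In the sketch below, `target` stands for the per-pixel mean of the two overlapping regions; all names and the commented usage are illustrative assumptions.

import numpy as np

def fit_contrast_polynomial(channel, target, order=2):
    """Fit an order-n polynomial mapping the overlapping-region values of
    one color channel onto `target` (Sec. 3.3.1); returns a callable
    that can update the whole image."""
    coeffs = np.polyfit(channel.ravel().astype(float),
                        target.ravel().astype(float), order)
    return lambda x: np.clip(np.polyval(coeffs, x), 0, 255)

# Hypothetical usage on the R channel of image 1:
#   stretch = fit_contrast_polynomial(R1_overlap, 0.5*(R1_overlap + R2_overlap))
#   R1_new = stretch(R1_full)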
3.3.2 DCT-domain Contrast Stretching

Here, we again consider two cases: a basic scheme with exactly matched block locations and a modified scheme for blocks with displacement. Note that the data sequences here are the DC values of the YCbCr components in the overlapping region instead of RGB components. After solving the linear system for the coefficients a and b, we are able to update all the other DC values in the non-overlapping region. Note that the data size is 1/64 of that in the pixel domain since each 8×8 block has only one DC value.

3.4 Post-processing via Linear Filtering

Consider two data sequences, $x = [x_0\; x_1\; \cdots\; x_n]$ and $y = [y_0\; y_1\; \cdots\; y_n]$. We would like to combine them into a new data sequence that contains information from both original sequences with different weights according to their spatial positions. Mathematically, the relationship between the new sequence and the original two data sequences can be expressed as

z(i) = \alpha \times x(i) + (1 - \alpha) \times y(i), \quad i = 0, \ldots, n, \quad (3.3)

where $0 \le \alpha \le 1$ is the weighting parameter whose values are shown in Fig. 3.8.

Figure 3.8: The curves of two weighting parameters used in the linear combination.

3.4.1 Pixel-domain Post-processing

After the above step, we still see two seam lines between the overlapping and the non-overlapping regions. To remove the seam lines, we weight the values from images 1 and 2 and combine the weighted values into one final value. Mathematically, the relationship between the new image and the original two images can be expressed as

\tilde{R}'(i, j) = \alpha \times \tilde{R}'_1(i, j) + (1 - \alpha) \times \tilde{R}'_2(i, j), \quad (3.4)

where $\tilde{R}'(i,j)$ is the new value at position (i, j) in the overlapping region of the output stitched image, $\tilde{R}'_1(i,j)$ is the updated value at position (i, j) in the overlapping region of image 1, $\tilde{R}'_2(i,j)$ is the updated value at position (i, j) in the overlapping region of image 2, and $0 \le \alpha \le 1$ is the weighting parameter varying according to the pixel position. In our experiments, we let the value of $\alpha$ increase linearly from 0 to 1 as (i, j) moves from the boundary of image 1 to the boundary of image 2 across the overlapping region.

3.4.2 DCT-domain Post-processing

After updating all DC values, seam lines between image boundaries may still exist. To remove them, we convert the color space from the YCbCr values back to the RGB values and apply the same method as described in Sec. 3.4. A sketch of this linear blending is given below.
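The following minimal Python sketch implements the feathering of Eqs. (3.3)-(3.4); the horizontal ramp direction follows the convention stated in Sec. 3.4.1 and may need to be flipped for a mirrored overlap geometry.

import numpy as np

def feather_blend(region1, region2):
    """Blend the updated overlapping regions of images 1 and 2 (Eq. 3.4)
    with a weight that ramps linearly across the overlap width."""
    h, w = region1.shape[:2]
    alpha = np.linspace(0.0, 1.0, w).reshape(1, w)
    if region1.ndim == 3:               # broadcast over color channels
        alpha = alpha[..., None]
    return alpha * region1 + (1.0 - alpha) * region2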
3.5 Experimental Results

In this section, we present some preliminary experimental results with the two input images shown in Figs. 3.9(a) and (b). These two test images have very different color tones. Besides, there is a significant overlapping region between them. It is assumed that the registration part has been done so that the two images are well aligned in the common area. Fig. 3.10 shows the stitched image without color matching, where a simple color component averaging operation is applied in the overlapping part of the two images. As we can see from the output image, there exist three regions of different color tones and two apparent seam lines between adjacent regions.

Figure 3.9: The original two input images.

Figure 3.10: The stitched image without color matching.

3.5.1 Stitched Images After Color Matching

3.5.1.1 Histogram Matching

The stitched image using histogram matching in the pixel domain is shown in Fig. 3.11. As we can see, the color tone of the output image has been adjusted to lie in the middle of the two input images. Furthermore, the seam line has been eliminated satisfactorily. The output image looks just like a natural image with a wider angle of view.

Figure 3.11: The output image after color adjustment in the pixel domain with histogram matching.

Image stitching using the DCT-domain processing based on histogram matching of the DC coefficients of DCT blocks is given in Fig. 3.12. We see a color deviation phenomenon in the non-overlapping regions since most of the pixel values fall outside the dynamic range of [0, 255] after converting the updated values of Y, Cb, and Cr back to the RGB color coordinates.

Figure 3.12: The output image after color adjustment in the DCT domain with histogram matching.

3.5.1.2 Polynomial Approximation

The result of using the polynomial-based contrast stretching in the pixel domain is shown in Fig. 3.13. The performance is similar to that of histogram matching. There are no seam lines and no color transitions inside it. The result using the polynomial-based contrast stretching in the DCT domain, shown in Fig. 3.14, looks significantly better compared to Fig. 3.12. The two input images have similar color tones, which are close to the mean color values of the overlapping region. This phenomenon can be explained as follows.

Since the histogram describes the overall distribution of a random variable, the more data we have, the more accurate the description is. The polynomial approximation, in contrast, can be viewed as a point-wise approximation method. In our experiments, the overlapping part covers more than half of the original image. Thus, there is plenty of information for establishing the relationship between the two images in the pixel domain, so both the histogram matching method and the polynomial approximation method work well. In the transform domain, however, we have only one DC value per 8×8 DCT block from which to establish the relationship, which is significantly less data than in the pixel domain. Thus, the histogram-based result is less robust. On the other hand, the degree of freedom of the polynomial approximation (i.e., the number of polynomial coefficients) is usually 3 or 4, which is still much smaller than the number of constraints in either the pixel domain or the DCT domain. Thus, its result is more robust.

Figure 3.13: The output image after color adjustment in the pixel domain with polynomial approximation.

Figure 3.14: The output image after color adjustment in the DCT domain with polynomial approximation.

Figs. 3.15(a) and (b) show the stitched images using the polynomial-based contrast stretching technique in the DCT domain with exactly matched blocks and with block displacement, respectively. To make the results more visible, we zoom in on specific regions of the output image. As we can see, the color tone of the output image has been adjusted to lie in the middle of the two input images. Furthermore, the seam lines in Fig. 3.10 have been eliminated satisfactorily. The output image appears to be a natural image with a wider angle of view. The DCT-domain technique has similar performance to the pixel-domain one whether the blocks are aligned or not.

3.5.2 Performance Comparison

1. Comparison of processing time

The comparison of processing time between the pixel and the DCT domains is shown in Table 3.1. The computation time for polynomial approximation is less than that for histogram matching. As far as memory is concerned, the polynomial approach only needs to store three coefficients for each pair of sequences while the histogram approach needs an array of size at least 256 by 1.
Polynomial approximation is therefore superior to histogram matching in both computational cost and memory requirement.

2. Comparison of MSE for polynomial approaches of different orders

Figure 3.15: Stitched images with polynomial approximation in the DCT domain (a) without block displacement and (b) with block displacement.

As mentioned before, polynomial-based contrast stretching has better performance in image quality, computational complexity and memory requirement. Note that the polynomial adopted for approximation is of order 2. Theoretically, the higher the order, the more accurate the approximation. However, a higher-order polynomial incurs more computational complexity. Table 3.2 shows how the order affects the performance in terms of MSE (mean squared error). As shown in this table, higher-order approximation does improve the performance, but only marginally. Therefore, the second-order polynomial approximation is sufficient to provide satisfactory outputs.

Table 3.1: Comparison of processing time (seconds) between two different domains and two different approaches.

                           Pixel Domain   DCT Domain   Save
Histogram Matching         71.243 (sec)   8.623 (sec)  87.90%
Polynomial Approximation   52.135 (sec)   6.810 (sec)  86.94%
Save                       26.82%         21.02%

Table 3.2: MSE for polynomial approaches of different orders.

order              Y        Cb       Cr
2   image 1    5574.6    1132.4    664.8
    image 2    7001.3    1343.8   2035.5
3   image 1    5449.1    1131.9    646.9
    image 2    6980.4    1285.6   2037.3
4   image 1    5298.0    1123.8    649.5
    image 2    7011.7    1282.0   2045.6

3.5.3 Other Considerations

For each 8×8 DCT block, each coefficient represents the weight of its corresponding spatial frequency, and there are 64 basis patterns. Typically, the DC value contains most of the information about the 8×8 block, so we only adjust the DC values of the two images in the DCT domain. Figs. 3.16(a) and 3.16(b) show the output images where we try to match not only the DC values but also the first three and the first five AC values, respectively. The color tone is similar to the one shown in Fig. 3.14. However, we observe some artificial patterns appearing all over the image which degrade the quality of the output image significantly. This is not surprising since modifying the DCT coefficients of each block is equivalent to changing the weights of the basis patterns. Once the proportional relationships between spatial frequencies have been modified, the output image will not be the same. This explains the occurrence of artificial patterns when we attempt to match AC values as well.

3.6 Conclusion

Several color matching algorithms that compensate the color differences between two input image sequences captured by different cameras were studied in this chapter. Some of them are conducted in the pixel domain while others are carried out in the DCT domain. It was demonstrated by experimental results that the DCT-domain technique saves more than 80% of the computational cost as compared to the pixel-domain technique while the quality of the resulting image mosaic is about the same for both approaches. It was also observed that the polynomial-based contrast stretching method has better performance than the histogram matching method in terms of processing time and memory requirement. The experimental results presented in this chapter are restricted to image mosaics only. It is worthwhile to generalize this technique to the video mosaic in the near future.
Figure 3.16: Image mosaic with the 2nd-order polynomial in the DCT domain: (a) DC plus the first three AC values and (b) DC plus the first five AC values.

Chapter 4
Fast and Accurate Block-Level Registration of Coded Images

4.1 Block-level Image Registration with Edge Estimation

In this section, we study the registration of two images that contain only translational displacement in the horizontal and vertical directions. We address the problem in the DCT domain with the DC and AC values of the luminance component of each block available. We attempt to align the images to a satisfactory degree using these DCT coefficients. The proposed algorithm contains three steps: image segmentation for foreground extraction, edge estimation and parameter estimation.

4.1.1 Image Segmentation for Foreground Extraction

In the first step, a DC map, which contains only the DC values of the Y component of the DCT blocks, is formed for each input image. Note that the size of the DC map is 1/64 of that of the original image. To simplify the alignment process in the later stages, we first perform image segmentation on the DC map to extract the regions of interest. Otsu [38] proposed a method to divide the light intensity histogram into two distinct parts automatically. It is stated below and will be used for our image segmentation task.

Given a histogram, we can compute its statistical properties. In particular, we use $\omega(k)$ and $\mu(k)$ to represent the zeroth and first cumulative moments of the histogram up to bin index k, and $\mu_T$ to represent the total mean. The histogram can be split using the threshold $k^*$ such that

\sigma^2(k^*) = \max_{1 \le k < L} \sigma^2(k), \quad (4.1)

where

\sigma^2(k) = \frac{[\mu_T \, \omega(k) - \mu(k)]^2}{\omega(k)[1 - \omega(k)]}.

Once the optimal value $k^*$ is determined, we can obtain a binary activity map B from the DC map. Ideally, the foreground (set to 1) and the background (set to 0) are separated in the binary activity map. One example of converting the DC map to the binary activity map is shown in Fig. 4.1. In Fig. 4.1(a), we show the DC maps of two images that we intend to align. Their corresponding binary activity maps are shown in Fig. 4.1(b). Even with only the DC values, the DC maps shown in Fig. 4.1(a) still contain many fine details that make the alignment task difficult.

Figure 4.1: Conversion from a DC map to a binary activity map: (a) the DC map and (b) the binary activity map.

Suppose that the two original input images are of size $P_i \times P_j$. Then, their DC maps are of size $N_i \times N_j$, where $N_i = P_i/8$ and $N_j = P_j/8$ (as shown in Fig. 4.2(a)). Note that $d_i$ and $d_j$ in Fig. 4.2(b) are the displacement parameters.

Figure 4.2: (a) The relationship between original images and binary maps and (b) the geometrical representation of displacement.

To search for the optimal alignment with integer block accuracy, we can compute the sum of absolute differences between the left and right DC maps with a displacement $(d_i, d_j)$:

\min_{d_i, d_j} \sum_{b_i, b_j} \left| L_{DC}(b_i, b_j) - R_{DC}(b_i + d_i, b_j + d_j) \right|, \quad (4.2)

where the summation runs over all $b_i$'s and $b_j$'s belonging to the overlapping region. If we do an exhaustive search based on either the DC maps or the binary activity maps, the complexity C will be proportional to

C \propto N_i \times N_j. \quad (4.3)

However, the computation with the binary activity maps is much faster since they take only the two values 0 and 1. It is possible to further simplify the search if we consider the edge information present in the foreground part of the binary activity map. This is done in the next step.
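A compact Python sketch of Otsu's rule (Eq. (4.1)) applied to a DC map is shown below; the 256-bin histogram and the foreground convention (values above the threshold set to 1) are our own assumptions.

import numpy as np

def otsu_binary_map(dc_map, n_bins=256):
    """Pick k* maximizing the between-class variance of Eq. (4.1) and
    return the binary activity map B (1 = foreground)."""
    hist, edges = np.histogram(dc_map, bins=n_bins)
    p = hist / hist.sum()                    # normalized histogram
    omega = np.cumsum(p)                     # zeroth cumulative moment
    centers = 0.5 * (edges[:-1] + edges[1:])
    mu = np.cumsum(p * centers)              # first cumulative moment
    mu_T = mu[-1]                            # total mean
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.inf               # guard empty classes
    sigma2 = (mu_T * omega - mu) ** 2 / denom
    k_star = np.argmax(sigma2)
    return (dc_map > centers[k_star]).astype(np.uint8)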
4.1.2 Edge Estimation

In this subsection, we only consider 8×8 DCT blocks located in the foreground region. Our objective is to extract the edge strength and orientation information. This can be done by examining the definition of the discrete cosine transform:

F_{uv} = \frac{C_u C_v}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \, f(i, j). \quad (4.4)

It is easy to verify that the first few AC coefficients represent specific ways of summing up the pixels in an 8×8 block, as shown in Fig. 4.3. Take $F_{10}$ as an example. The coefficient is obtained by a weighted summation of 64 pixels, where the top 4 rows take the positive sign while the bottom 4 rows take the negative sign. If $F_{10}$ has a large magnitude, there is a good chance that we have a horizontal edge. Similarly, if $F_{20}$ has a large magnitude, we may have two horizontal edges. The same argument applies to other low-frequency AC coefficients as indicated in Fig. 4.3. Shen et al. [46] proposed a rough way to estimate the edge orientation in a block as

\tan\theta = \sum_{v=1}^{7} F_{0v} \Big/ \sum_{u=1}^{7} F_{u0}. \quad (4.5)

This formula is reasonable since the numerator and the denominator indicate the strengths of the horizontal and vertical edges, respectively. Although the obtained edge orientation information is somewhat coarse, this DCT-domain edge detection technique was shown to be about 20 times faster than traditional edge detectors in the pixel domain. Furthermore, the edge strength can be computed according to the following formula:

h_{vertical} = \frac{F_{10} + \alpha F_{20}}{9.111 \tan\theta} \quad \text{and} \quad h_{horizontal} = \frac{F_{01} + \alpha F_{02}}{9.111 \tan\theta}. \quad (4.6)

Figure 4.3: The sign patterns of weighted pixel values for the first few AC values.

Here, we exploit this fast edge orientation estimation algorithm with some modification. That is, we only consider edges that are aligned with straight lines without any offset from the center of each 8×8 block, as shown in Fig. 4.4(a). Furthermore, since this is a block-based estimation, it is difficult to represent a large number of possible edge orientations with good accuracy. Thus, we restrict the estimated edge orientation to eight quantization levels as shown in Fig. 4.4(b). To summarize, we use Eq. (4.5) to compute the edge orientation in foreground blocks and then quantize the orientation into 8 levels for further processing.

Figure 4.4: The eight quantized levels for coarse-scale edge orientation estimation.

4.1.3 Displacement Parameter Estimation

Based on the results obtained in Section 4.1.2, we apply an edge-based image registration technique to determine the displacement parameters that align the left and right images. One simple way to achieve the alignment is to compute the cross-correlation between the two edge maps, where the edge strength is set to zero in the background region:

r(i, j) = \sum_{i_1 < N_i/2} \; \sum_{j_1 < N_j/2} \left( \tilde{E}_1\!\left( i_1 + \frac{N_i}{2}, \; j_1 + \frac{N_j}{2} \right) - m_{\tilde{E}_1} \right) \left( \tilde{E}_2(i + i_1, \; j + j_1) - m_{\tilde{E}_2} \right), \quad (4.7)

where $N_i$ and $N_j$ represent the width and height of the overlapping region, and $m_{\tilde{E}_1}$ and $m_{\tilde{E}_2}$ are the mean values of the overlapping regions of $\tilde{E}_1$ and $\tilde{E}_2$, respectively. The offset (i, j) that leads to the maximal correlation value gives the optimal displacement in the horizontal and vertical directions. Since the horizontal and vertical sizes of the edge map are 1/8 of those of the original image, the actual displacement should be scaled up by a factor of 8 with respect to the coordinates of the original input images.
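The orientation estimate of Eq. (4.5) and its quantization into eight levels can be sketched as follows; the arctan2-based angle recovery and the uniform bin layout are our own simplifications of the scheme in Fig. 4.4, not the exact quantizer used above.

import numpy as np

def block_edge_orientation(F):
    """Coarse edge orientation of one 8x8 DCT coefficient block F
    (Eq. 4.5), quantized to 8 levels as in Fig. 4.4(b)."""
    num = F[0, 1:8].sum()         # sum of F_{0v}, v = 1..7
    den = F[1:8, 0].sum()         # sum of F_{u0}, u = 1..7
    theta = np.arctan2(num, den)  # tan(theta) = num / den
    # uniform quantization of [-pi/2, pi/2) into 8 orientation bins
    level = int(np.floor((theta + np.pi / 2) / (np.pi / 8))) % 8
    return theta, level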
4.1.4 Experimental Results

In this section, we present some experimental results with the four test image pairs shown in Fig. 4.5, where (a) and (b) are indoor scenes and (c) and (d) are outdoor scenes (600×448). Generally speaking, images of outdoor scenes usually contain a higher noise level, so it is more difficult to extract accurate edge information and the image registration results are usually poorer.

Figure 4.5: The four test image pairs: (a) the first, (b) the second, (c) the third and (d) the fourth.

4.1.4.1 Performance Comparison in Processing Time

Here we compare the two image registration systems using the traditional pixel-domain approach and the proposed DCT-domain approach. The computational saving comes from two sources. First, the DCT-domain approach avoids the inverse DCT and forward DCT processes required by the space-domain approach. Second, the resolution of the space-domain image pair is much finer than that of the DCT-domain image pair, i.e., 64 versus 1, so the search for the displacement vector demands much more time. The execution times for the alignment of the 4 test image pairs are compared in Table 4.1 and Fig. 4.6. We see that the time saving ranges from 95% to 97% as compared with the traditional space-domain approach.

Table 4.1: Comparison between the proposed and the traditional approaches in processing time (seconds).

              1st       2nd       3rd       4th
traditional   34.2340   36.1410   35.8910   34.1100
proposed       1.0310    1.7030    1.7500    1.6880
save (sec)    33.2030   34.4380   34.1410   32.4220
save (%)      96.9884   95.2879   95.1241   95.0513

Figure 4.6: Performance comparison in processing time.

4.1.4.2 Comparison of Output Image Quality

We compare the registered output images in Figs. 4.7-4.10, where (a) shows the results based on the proposed DCT-domain technique and (b) gives the results based on Canny's edge detection. As shown in Fig. 4.7(a), although the estimated edge maps are not as accurate as the ones obtained using Canny's edge detector, the proposed method does capture the general trend well. Besides, the stitched output image in Fig. 4.7(a) looks very similar to the corresponding one in Fig. 4.7(b). Please note that the input image pair shown in Fig. 4.5(a) has a color-mismatch problem. However, the color mismatch does not affect the registration results since the proposed approach does not rely much on the color information. Instead, it is based on the extracted edges of the luminance component only.

Figure 4.7: Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.

As shown in Fig. 4.8, the proposed DCT-domain approach misses many edges in the facial and clothing areas. Only the edge information in the hair region is captured. This is partly due to the poor performance of the image segmentation step for foreground extraction and partly due to the limitation of edge detection in the DCT domain. However, since both the left and right input images are processed with the same technique, the missed edges do not hurt the image alignment at the later stage. Actually, the hair information is sufficient for the alignment. As a result, the proposed method can provide almost the same performance as the traditional space-domain approach at a much lower computational complexity.

Figure 4.8: Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.

Fig. 4.9 provides the edge detection and image registration results for an image pair of an outdoor scene. The edge maps appear to be quite complicated.
However, the proposed technique still provides a simpler edge map that makes the later alignment task easier. Furthermore, the two output image mosaics are indistinguishable by human eyes.

Figure 4.9: Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.

Finally, for the image pair shown in Fig. 4.5(d), we show the edge detection and image registration results in Fig. 4.10. By examining Fig. 4.10(a) carefully, we find that the output image is not perfectly registered in terms of DCT blocks using the proposed method. We see from the windows of the white building that there exists a misalignment between the two images of around 3 DCT blocks (or 24 pixels). This can be explained by the periodic pattern of the roofs, which occupy quite a large area of the two input images. Thus, there exist multiple local maxima that make the global maximum selection difficult. One way to fix this problem is to consider multiple thresholds so that we can weight the edges in the window area more heavily to avoid the confusion caused by the edges of the roof region. Note that, since the proposed approach is block-based, a single error in the edge map will lead to a block-level error in the image domain.

Figure 4.10: Image registration results of (a) the proposed DCT-domain and (b) the space-domain approaches.

4.2 Block-level Image Registration based on Edge Extraction

A multi-scale DCT-domain image registration technique for two MPEG video inputs is proposed in this section. Several edge detectors are first applied to the luminance component of the DC coefficients to generate so-called difference maps for each input image. Then, a threshold is selected for each difference map to filter out regions of lower activity. Following that, we estimate the displacement parameters by examining the difference maps of the two input images associated with the same edge detector. Finally, the ultimate displacement vector is calculated by averaging the parameters from all detectors. It is shown that the proposed method reduces the computational complexity dramatically as compared to pixel-based image registration techniques while reaching a satisfactory composition result. The four major parts of the algorithm are detailed in the following sections.

4.2.1 Edge Detection on DC Maps

A DC map that contains the DC values of the luminance (Y) component of all blocks is formed for each input image. Since only the DC value is considered for each 8×8 block, the size of the DC map is 1/64 of that of the original image. This means that the amount of data we are dealing with is much less than that in the traditional pixel-domain approach. Four different edge detectors ($H_1$, $H_2$, $H_3$ and $H_4$) are applied to each DC map:

H_1 = \begin{bmatrix} -1 & 2 & -1 \\ -1 & 2 & -1 \\ -1 & 2 & -1 \end{bmatrix}, \quad
H_2 = \begin{bmatrix} -1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1 \end{bmatrix}, \quad
H_3 = \begin{bmatrix} -1 & -1 & 2 \\ -1 & 2 & -1 \\ 2 & -1 & -1 \end{bmatrix}, \quad
H_4 = \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix}.

They measure the variation of the image in the vertical, horizontal, 45-degree and 135-degree directions, respectively. Each detector produces one difference map, so there are four difference maps for each input image. We use $D_{ij}$ to denote the difference map of image i = 1, 2 with edge detector $H_j$, j = 1, 2, 3, 4.
Difference maps are normalized so that all of their values fall between 0 and 1 for further processing. Note that $H_1$, $H_2$, $H_3$ and $H_4$ are second-order derivative filters. The second-order and the first-order derivative filters in the horizontal direction are given below:

\begin{bmatrix} -1 & 2 & -1 \\ -1 & 2 & -1 \\ -1 & 2 & -1 \end{bmatrix}, \quad
\begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}.

The reason to adopt the second-order derivative filter rather than the first-order one can be explained using Fig. 4.11. As we can see, the line detector (i.e., the 2nd-order detector) is able to extract more features than the gradient detector (i.e., the 1st-order detector). Since the detectors are applied to the DC map, we should choose the one that can extract the regions of interest with more active features. Thus, the difference maps generated by the 2nd-order derivative filters are more suitable for the alignment task in the next stage.

Figure 4.11: Comparison between the first- and the second-order derivative filters.

4.2.2 Thresholding

In this step, a content-adaptive threshold is set up for each pair of difference maps to generate the corresponding binary maps. The main purpose of this step is to filter out minor changes. It reduces confusion and speeds up the following alignment step. The difference maps obtained using filter $H_1$ for two original DC maps are shown in Figs. 4.12(a) and (b), while their corresponding binary activity maps, $D_{11}$ and $D_{21}$, are shown in Figs. 4.12(c) and (d), respectively. As we see from $D_{11}$ and $D_{21}$, only the vertical differences are preserved for displacement parameter estimation. Similarly, horizontal, 45-degree and 135-degree features are extracted after applying $H_2$, $H_3$ and $H_4$, respectively. These features help determine the displacement parameters more accurately and reduce the processing time since unnecessary detail information has been eliminated.

Figure 4.12: The difference maps of (a) image 1 and (b) image 2 using filter $H_1$ and the corresponding binary activity maps of (c) image 1 and (d) image 2.

4.2.3 Displacement Parameter Estimation

Let $P_i \times P_j$ and $N_i \times N_j$ be the sizes of the two original input images and their DC maps, respectively, as shown in Fig. 4.2. Then, we have

N_i = \frac{P_i}{8} \quad \text{and} \quad N_j = \frac{P_j}{8}. \quad (4.8)

Based on the obtained $D_{ij}$, i = 1, 2 and j = 1, 2, 3, 4, our task is to determine the alignment parameters for the four sets of binary images. The two-dimensional normalized cross-correlation is computed, and the optimal displacement parameter is determined at the position where the maximum value occurs in both the vertical and horizontal directions. Let $(d_{i1}, d_{j1})$ be the parameter pair obtained from the binary images produced by detector $H_1$. Similarly, we have $(d_{i2}, d_{j2})$, $(d_{i3}, d_{j3})$ and $(d_{i4}, d_{j4})$ by following the same procedure. Once those four sets of parameters are available, the final estimated displacement can be acquired either by averaging them or by simply choosing the best one of the four vectors. Then, a coordinate conversion, scaled up by a factor of 8, is performed due to the size difference between the original and binary images. The first three steps of the proposed procedure are described in the flow chart shown in Fig. 4.13, and a sketch of the first two steps is given below.

Figure 4.13: A detailed overview of the proposed method.
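The sketch below generates the four difference maps of Sec. 4.2.1 and thresholds them as in Sec. 4.2.2; the quantile-based threshold is only a stand-in for the content-adaptive rule, which is not fully specified here.

import numpy as np
from scipy.signal import convolve2d

# The four second-order directional detectors H1..H4 of Sec. 4.2.1.
H = [np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]),   # vertical
     np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]),   # horizontal
     np.array([[-1, -1, 2], [-1, 2, -1], [2, -1, -1]]),   # 45 degrees
     np.array([[2, -1, -1], [-1, 2, -1], [-1, -1, 2]])]   # 135 degrees

def binary_difference_maps(dc_map, quantile=0.8):
    """Normalized difference maps D_ij thresholded into binary maps."""
    maps = []
    for h in H:
        d = np.abs(convolve2d(dc_map.astype(float), h, mode='same'))
        d /= d.max() if d.max() > 0 else 1.0   # normalize to [0, 1]
        maps.append((d > np.quantile(d, quantile)).astype(np.uint8))
    return maps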
4.2.4 Experimental Results

Experimental results with six test image pairs are shown in this section. The test images are shown in Fig. 4.14, where (a) and (b) are indoor scenes while (c) to (f) are outdoor scenes with different content complexity and different amounts and types of displacement, all of the same size (600×448). Note that the experimental results show that color mismatch does not affect the quality of registration since the proposed method is not color dependent. Thus, all test images presented here are under the same light conditions to reveal the exact quality of the output composition.

Figure 4.14: The original test images.

4.2.4.1 Performance Comparison in Processing Time

The comparison of the execution times of the traditional pixel-domain process and the proposed DCT-domain technique is shown in Table 4.2 and Fig. 4.15. The major computational saving of the proposed method comes from two parts: the pixel-DCT domain conversion and information reduction. For the DCT-based method, the time-consuming steps, such as inverse DCT and forward DCT, are avoided. Also, as mentioned before, the data being manipulated in the DCT domain have been cut down to 1/64 of the original images, so much less time is required for displacement searching. These two factors reduce the processing time by over 95% as compared to the traditional pixel-domain approach.

Table 4.2: Comparison between the proposed and the traditional approaches in processing time (sec): (a) traditional and (b) proposed method; (c) and (d) are the savings in seconds and percentages.

        1       2       3       4       5       6
(a)   35.22   35.28   39.13   35.45   36.11   37.64
(b)   1.407   1.406   1.437   1.516   1.407   1.422
(c)   34.19   34.25   38.13   34.43   35.08   36.58
(d)   97.07   97.07   97.44   97.13   97.14   97.18

Figure 4.15: Performance comparison in processing time.

4.2.4.2 Comparison of Output Image Quality

The final estimated displacement parameters and the composite outputs are shown in Table 4.3 and Figs. 4.16-4.17, respectively. Since we know the exact amount of displacement in advance, the estimation errors can be calculated by subtracting the parameters determined by the proposed method from the actual ones. As we see from Table 4.3, the estimation errors are within two pixels of the actual displacements. In other words, sub-block accuracy can be reached. The reason this accuracy can be attained is that several displacement parameters generated by different detectors are considered. Each detector targets a specific directional feature, so the bias among the displacement parameters can be compensated by averaging all of them. In other words, the more detectors applied to the images, the more robust the resulting parameters. Thus, pixel accuracy or even sub-pixel accuracy is possible once appropriate detectors with good features can be designed. However, applying more filters to the images requires more processing time, so we have to find a balance between the processing time and the alignment accuracy.

4.2.4.3 Discussion

Theoretically speaking, the proportion of the overlapping area to the original size affects the quality of the composition, since a larger area provides more information such as corners, lines and other useful features while a small area does not. Our experimental results show that, for the same image content, a larger overlapping area results in more accurate alignment. However, it also depends on how many useful features lie within the overlapping parts.

Figure 4.16: The stitched images.

Figure 4.17: Figure 4.16 continued.
Table 4.3: Comparison between the displacement parameters $(d_i, d_j)$ derived by the proposed approach and the actual displacement parameters $(d_i', d_j')$.

                    1     2     3     4     5     6
8·d_j1            296   304   304   200   400   504
8·d_i1            448   600   304   448   520   496
8·d_j2            296   304   304   200   408   504
8·d_i2            448   600   296   448   520   496
8·d_j3            304   296   296   200   400   504
8·d_i3            448   600   296   448   528   496
8·d_j4            296   296   296   200   400   504
8·d_i4            448   600   296   448   528   496
8·d_j             298   300   300   200   402   504
8·d_i             448   600   298   448   524   496
d_j' (actual)     300   300   300   200   400   503
d_i' (actual)     448   600   300   448   524   495
8·(d_j − d_j')     -2     0     0     0    +2    +1
8·(d_i − d_i')      0     0    -2     0     0    +1

If there are only a few feature points in the original images, then no matter how large the overlapping area is, the performance is similar since the number of useful features is the same. Also note that whether the overlapping area is a multiple of eight affects the quality of the composition: if the DCT blocks are not well aligned, the corresponding DC values of the two images at the same position represent different information. In this case, a suitable process, such as interpolation, must be performed in advance.

4.3 Robustness of the Proposed Alignment Method

In this section, the robustness of the proposed alignment algorithm is examined by feeding noisy images into the system. The input images, of size 480×640, are corrupted by three different levels of Gaussian noise: 12.5%, 25% and 37.5%. The DCT-domain registration technique is applied to those images, and the experimental results for noise levels of 12.5%, 25% and 37.5% are shown in Figs. 4.18(a), (b) and (c), respectively. Note that the results shown here are locally enlarged portions of the output images so that the quality of the alignment is easy to see.

Figure 4.18: The composition results with different levels of Gaussian noise: 12.5%, 25%, and 37.5%.

As revealed in Fig. 4.18, the quality looks good for all cases, whereas the traditional pixel-domain approach can only reach high accuracy for noise levels up to 12.5%. Our block-based method takes advantage of dealing with the DC maps of the original images. Since the DC map can be treated as a down-sized version of the original image, many features are averaged out during this process. Therefore, the effect caused by noise is attenuated and confusion is reduced during the alignment. This verifies the robustness of the proposed block-based alignment algorithm when noise is involved.

Chapter 5
Advanced Mosaic Techniques for Coded Video

In this chapter, we describe three advanced topics for coded video mosaicking: hybrid block/pixel registration, block-based video registration, and block DCT analysis and classification.

5.1 Hybrid Block/Pixel Registration

It was shown in Chapter 4 that the block-based registration algorithms save a lot of computation using either DC or AC coefficients. However, the accuracy of block-based registration is limited by the resolution of a block, which is of size 8×8. It is desirable to enhance the resolution of the displacement vector to pixel-level accuracy. To obtain such a result, some pixel-domain registration can be performed after the block-level registration. In other words, the block-level registration can be viewed as a coarse-level alignment while the pixel-level registration yields the fine-level alignment. However, we do not have to perform the inverse transform on all blocks but only on some selected blocks to save computation. The reason is simple. Some blocks correspond to the flat background and thus do not carry much information.
On the other hand, there are blocks that contain valuable spatial domain information such as edges and corners. Thus, we may perform inverse DCT on these blocks and use the spatial domain information to enhance the registration accuracy. In this case, the computational complexity of this hybrid approach will still be signi¯cantly lower than that of the traditional pixel-domain approach. 5.1.1 Alignment of Projected Boundary Blocks To enhance the accuracy of the alignment, one method is to convert the boundary blocks of two overlapping images back to the pixel domain, add two-dimensional pixel values along horizontal or vertical directions followed by a normalization procedure to get one- dimensional data vectors of both images at the same data point. Then, we can perform the 1D alignment for projected lines. This concept is illustrated in Fig. 5.1. The best match would happen at the position where the maximum correlation value occurs. Figure 5.1: Project two-dimensional data to one-dimensional. It is observed in our experiments that the alignment can be ¯ne-tuned to reach the pixel-level accuracy and the estimation errors are within two pixels. This process is fast 80 and easy to implement. However, it is not robust in dealing with all kinds of di®erent images since some information may be lost by projecting data from the 2-D domain to the 1-D domain. Also, if the areas of two input images to be transformed back to the pixel domain are not exactly the same, the relevant and irrelevant data could be mixed togethertoconfusethealignmenttask. Moreover,iftheimagecontentconsistsofrepeated patterns, thenormalizeddatawouldcontainseveralpeakssothattheexactmatchismore di±cult to achieve. The performance of this processing highly depends on the quality of composition obtained from the block-level registration and the image content. 5.1.2 Alignment of Selected 2D Blocks in the Pixel Domain Itisknownthatsalientfeaturesofimagesplayanimportantroleintheregistrationprocess. Areas without salient features such as the plain background or smooth surfaces contribute little to the ¯nal registration result. Thus, we may identify those blocks that contain the salient features in the DCT domain and then transform them back to the pixel domain to ¯ne-tune the coarse alignment result obtained at the block level. 5.1.2.1 Corner Block Detection As presented in Sec. 4.2, line detectors H 1 and H 2 can ¯lter out simple vertical and horizontal edges. Based on these two types of edge information, corner blocks can be roughly determined by the following procedure. 1. Computing Horizontal and Vertical Edge Maps By applying H 1 and H 2 to the DC map of an input image, we get two normlized edge magnitude maps, i.e., D H1 and D H2 , which takes values between 0 and 1. To 81 eliminate areas with minor activities such as the background, an adaptive threshold methodisappliedtothemagnitudemapstocreatebinaryimagesB H1 andB H2 . This threshold valuedetermines thenumberof blocksof higher activities to be selected in thenextstep. Ononehand,themoreblocksselected,themoreinformationprovided. On the other hand, the more blocks selected, the higher the computational cost. For example, if higher accuracy is required and a little sacri¯ce at the complexity is acceptable, the threshold should be set to a lower value. On the contrary, if the computational speed is the main concern, the threshold value should be raised. 2. 
2. Computing the Weighted Edge Map

Given the two binary images, $B_{H1}$ and $B_{H2}$, from the previous step, we combine them into a new map using the following weighting scheme:

C(i, j) = \omega_1 \times B_{H1}(i, j) + \omega_2 \times B_{H2}(i, j), \quad \omega_1 \ne \omega_2, \quad (5.1)

where C(i, j) has four possible values: 0, $\omega_1$, $\omega_2$ and $\omega_1 + \omega_2$, which indicate that the block is a flat block, a block with a strong horizontal edge, a block with a strong vertical edge, or a block with both strong horizontal and vertical edges, respectively.

3. Decision Making for Corner Blocks

Once the weighted map C is formed, the next step is to determine which blocks C(i, j) have a higher possibility of being corner blocks by examining the activities of their eight neighboring blocks. Several patterns have been observed that may contain one or multiple corners at position (i, j). For C(i, j) ≠ 0, if its neighboring activities match one of the patterns shown in Fig. 5.2, we declare this block a corner block and set its corner map flag, $B_{corner}(i, j)$, to 1. Otherwise, its corner map flag is set to zero.

Figure 5.2: The 3×3 block patterns that have a higher probability to contain one or multiple corners at the central block.
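A simplified Python sketch of Steps 2 and 3 is given below. Since the exact neighborhood patterns of Fig. 5.2 are not reproduced here, the decision rule (a nonzero block whose 3×3 neighborhood contains both edge types) is a hypothetical stand-in for them.

import numpy as np

def corner_block_map(B_H1, B_H2, w1=1, w2=2):
    """Weighted edge map C of Eq. (5.1) plus a simplified corner test."""
    C = w1 * B_H1.astype(int) + w2 * B_H2.astype(int)
    corners = np.zeros_like(C, dtype=np.uint8)
    for i in range(1, C.shape[0] - 1):
        for j in range(1, C.shape[1] - 1):
            if C[i, j] == 0:
                continue
            nb = C[i-1:i+2, j-1:j+2]
            # both horizontal- and vertical-edge activity in the window
            if ((nb == w1) | (nb == w1 + w2)).any() and \
               ((nb == w2) | (nb == w1 + w2)).any():
                corners[i, j] = 1
    return C, corners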
Note that there is no "N/A" for all blocks which means more information can be provided by edges. 84 Table5.2: The¯ne-tuningparameters(f i edge ;f j edge )determinedbytheedgeinformation of each detected corner block pair. 1 2 3 4 5 6 7 8 9 10 f i corner 0 0 0 0.5 0 0 0 0 2 -2 f j corner 0 0 0 0 0 0 0.75 0 -1.5 -0.5 From the above two tables, we see that most matching pairs can yield the correct results while some cannot. The ¯nal displacement parameters can be determined by the best three pairs (rather than averaging the displacement vectors of all 10 sets). Case 2: Inverse DCT applied to selected corner blocks and their eight neigh- bors Intuitively, a larger area should contain more useful information for registration re¯ne- ment. Thus, a better result is expected in this case. 1. Pixel-wise Matching First, the corner information is extracted out for each 24£24 area. Then, the ¯ne- tuning parameters are determined by the pixel-wise comparison. The results are given in Table 5.3. Table 5.3: The ¯ne-tuning parameters (f i corner ;f j corner ) determined by the pixel infor- mation of each detected corner block pairs. 1 2 3 4 5 6 7 8 9 10 f i corner 0 0 0 0 0 0 0 0 6 -4 f j corner 0 0 0 0 0 0 -2 0 2 -2 2. Detecting Edges - Line-wise Matching 85 The edge information is ¯rst extracted in the same area. Then, the ¯ne-tuning parameters are determined by the edge information. The resulting parameters are given in Table 5.4. By comparing results in Table 5.3 and 5.4, we see that the alignmentbasedontheedgeinformationismoreaccuratesinceithassmallererrors. The pixel-wise alignment is usually more sensitive to noise. This explains the reason why the line-wise alignment based on the edge information provides a better choice. Table5.4: The¯ne-tuningparameters(f i edge ;f j edge )determinedbytheedgeinformation of each detected corner block pairs. 1 2 3 4 5 6 7 8 9 10 f i edge 0 0 0 0 0 0.5 0 0 3.75 -1.25 f j edge 0 0 0 0 0 0 -0.5 0 4.5 -0.5 Since the corner blocks are determined in a downsized image. Sometimes salient fea- tures are split between two adjacent blocks. Thus, it is not very reliable to perform the inverse DCT to an isolated block without considering its neighboring blocks. We see from experimental results that the useful information will not be missed if 3£3 blocks centered at the corner block are transformed back to the pixel domain. In summary, the alignment based on the information of a larger area is more robust, and it is better to compare the edge information in these two blocks to determine the ¯ne-scale displacement vector. 5.1.3 Experimental Results The performance of the hybrid block/pixel registration technique is demonstrated in this section. The execution time comparison of the traditional pixel-domain edge-based method, the proposed DCT-domain alignment technique, and the proposed hybrid block 86 or pixel alignment method is shown in Table 5.5, where the size of test images 1 to 8 is 448£600 and that for test images 9 and 10 is 480£640. As compared with the pure DCT-domain alignment method, the hybrid method has to do some extra work, including corner block detection, inverse DCT transform, as well as edge detection in the pixel do- main. This is the reason why that the processing time is longer than that of the proposed block-based algorithm. The ¯nal displacement parameters are shown in Table 5.6. It is clear that the hybrid block/pixel method improves the accuracy to the pixel level for all test images at the price of increased complexity. 
Table 5.5: Execution time comparison (in seconds) of (a) the traditional method, (b) the proposed DCT-domain algorithm and (c) the proposed hybrid method.

Image   1      2      3      4      5      6      7      8      9      10
(a)    19.11  19.14  19.53  19.44  19.53  19.45  19.42  20.39  23.02  22.94
(b)     1.36   1.34   1.36   1.34   1.34   1.34   1.36   1.34   1.41   1.36
(c)     9.22  10.61   8.97   9.02   9.44  10.23   9.69  12.03  11.64  12.47

Table 5.6: Comparison of displacement vectors: (d_i, d_j) is obtained by the block-level alignment, (d_i^hybrid, d_j^hybrid) is obtained by the hybrid block/pixel alignment, and (d_i^actual, d_j^actual) is the actual one.

Image                      1    2    3    4    5    6    7    8    9    10
d_i                       600  448  448  520  496  448  296  522  480  480
d_j                       298  200  296  400  504  306  302  398  408  408
d_i^hybrid                600  448  448  524  495  448  296  524  480  480
d_j^hybrid                300  200  298  400  503  300  300  400  408  408
d_i^actual                600  448  448  524  495  448  296  524  480  480
d_j^actual                300  200  298  400  503  300  300  400  408  408
d_i^actual - d_i            0    0    0    4   -1    0    0    2    0    0
d_j^actual - d_j            2    0    2    0   -1   -6   -2    2    0    0
d_i^actual - d_i^hybrid     0    0    0    0    0    0    0    0    0    0
d_j^actual - d_j^hybrid     0    0    0    0    0    0    0    0    0    0

5.2 Block-based Video Registration

We assume that the inputs to the system are two synchronized MPEG videos at a frame rate of 30 fps, that there are only translational differences between them, and that both contain some moving objects. In order to avoid the ambiguity caused by pure image-to-image alignment or pure trajectory-based alignment, the input sequences are first segmented into two parts: the static background and the moving objects. They are then registered separately based on their associated spatial and temporal information. The flow chart of the proposed algorithm is given in Fig. 5.3. The unit of the process is one GOP (15 frames in our experiments); in other words, the displacement parameters are updated for each GOP. Take the first GOP as an example. After applying four edge detectors (4.2) to the DC map of the I frame, the first set of alignment parameters is determined. Since this is a block-based alignment, a further refinement process is required to reach higher accuracy. In the second pass, the motion information of objects from each frame within the same GOP is used to obtain several other sets of refinement parameters. Based on the alignment and refinement parameters, we estimate the final displacement parameter as a weighted average of them.

Figure 5.3: The flow chart of the proposed system.

5.2.1 Static Background Alignment

Given the I frames of the two input sequences, DC maps are obtained by extracting the DC coefficients of all blocks in the luminance (Y) component. Since only the DC value is kept for each 8×8 block, the size of the DC map is 1/64 of that of the original image. This means that the amount of data we deal with is much smaller than in the traditional pixel-domain approach. Based on the information provided by the two DC maps, a rough alignment can be done by applying the DCT-domain registration algorithm described in Section 4.2. Four edge detectors are applied to the DC map, producing four difference maps for each input I frame. For each pair of difference maps, a content-adaptive threshold is determined to generate the corresponding binary maps. Based on the four sets of binary images, we determine the alignment parameter by computing the two-dimensional normalized cross-correlation, followed by a coordinate conversion due to the size difference between the original and binary images. Then, the final estimated alignment parameter, (a_i, a_j), can be acquired by either averaging the four resulting vectors or simply choosing the best one.
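The following sketch outlines the static-background alignment just described, assuming the DC map of an I frame is already available as a 2-D array. The 3×3 kernels are illustrative stand-ins for the four second-order detectors of (4.2), the quantile rule is one possible realization of the content-adaptive threshold, and a mean-removed cross-correlation is used in place of the full normalized cross-correlation.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

# Illustrative 3x3 second-order detectors targeting horizontal,
# vertical, 45-degree and 135-degree edges on the DC map.
KERNELS = [
    np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),   # horizontal
    np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),   # vertical
    np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),   # 45 degrees
    np.array([[-1, -1,  2], [-1,  2, -1], [ 2, -1, -1]]),   # 135 degrees
]

def binary_edge_maps(dc_map, keep_ratio=0.1):
    """Apply the detectors and threshold each difference map.

    The content-adaptive threshold is approximated here by keeping
    only the strongest `keep_ratio` responses (an assumption).
    """
    maps = []
    for k in KERNELS:
        d = np.abs(convolve2d(dc_map, k, mode='same'))
        thr = np.quantile(d, 1.0 - keep_ratio)
        maps.append((d >= thr).astype(np.float64))
    return maps

def block_level_offset(b1, b2):
    """Peak of the 2-D cross-correlation between two binary maps."""
    xcorr = correlate2d(b1 - b1.mean(), b2 - b2.mean(), mode='full')
    di, dj = np.unravel_index(np.argmax(xcorr), xcorr.shape)
    # Peak position minus the zero-lag index gives the block offset;
    # multiplying by 8 converts it to pixels, since each DC sample
    # stands for one 8x8 block.
    return 8 * (di - (b2.shape[0] - 1)), 8 * (dj - (b2.shape[1] - 1))
```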
5.2.2 Moving Object Alignment

In this step, the motion vectors of the major moving object obtained from all frames in one GOP are accumulated so that its trajectory can be formed. Based on this information, a trajectory-based alignment process can be applied to enhance the alignment accuracy. Note that the diameter of the bouncing ball in our experiments is around 48 pixels, which corresponds to 3 macroblocks. Thus, in order to avoid incorrect information provided by motion estimation, a motion vector is not considered if its magnitude exceeds a predetermined threshold (set to 3 times the macroblock size in the given example). Once the candidate motion vectors of each frame are determined, a one-dimensional correlation-based sequence alignment is performed, and the optimal parameter is determined at the position where the maximum correlation occurs. Since each GOP of the input sequence consists of 15 frames, we obtain 14 refinement parameters per GOP in total, denoted by (r_ik, r_jk), k = 1, 2, ..., 14.

There exists a tradeoff between the size of the moving object and the speed of the process. Usually, a larger moving object is preferred since it clearly and strongly represents the behavior of the cluster of macroblocks that contains the object. That is, it is easy to tell whether a macroblock belongs to the actual moving object or is just an estimation error. However, in this case more motion vectors have to be considered, which requires more processing time. On the other hand, if the moving object is small, say within one macroblock, only one motion vector is taken into consideration. Even though the computational complexity is lower, the robustness of the estimate is lower as well.

5.2.3 Displacement Parameter Estimation

Following the procedures described in the last two subsections, the coarse-alignment and motion-based refinement parameters, (a_i, a_j) and (r_ik, r_jk), k = 1, 2, ..., 14, are obtained. The final displacement parameter, (d_i, d_j), can be computed as

(d_i, d_j) = \alpha (a_i, a_j) + (1 - \alpha) \left[ \frac{1}{14} \sum_{k=1}^{14} (r_{ik}, r_{jk}) \right].   (5.2)

In words, (d_i, d_j) is a weighted average of (a_i, a_j) and the (r_ik, r_jk), k = 1, 2, ..., 14. In our experiments, we tried different values of α and found that α = 0.5 provides a reasonably good result. The same procedure is applied to all GOPs of the input sequences.

Note that the GOP of the generated input videos is 15 frames; since the frame rate is 30 fps (frames per second), the displacement parameters are updated at every I frame, i.e. every 0.5 seconds. Thus, if an error occurs in P or B frames, it will not propagate for long, and severe visual degradation of the output can be avoided. If an abrupt scene change occurs within one GOP, the residual signal of one particular frame becomes quite large. It is not difficult to find a threshold to detect such a scene-change frame. We can then split the GOP into two separate parts and apply the proposed alignment process to each part individually.
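A minimal sketch of the motion-vector filtering of Section 5.2.2 and the fusion rule of Eq. (5.2) is given below; the helper names and the usage numbers are hypothetical.

```python
import numpy as np

MACROBLOCK = 16  # MPEG-2 macroblock size in pixels

def filter_motion_vectors(mvs, max_blocks=3):
    """Drop motion vectors whose magnitude exceeds the threshold
    (3 macroblocks in the reported experiments)."""
    limit = max_blocks * MACROBLOCK
    return [mv for mv in mvs if np.hypot(*mv) <= limit]

def fuse_displacement(a, refinements, alpha=0.5):
    """Weighted average of Eq. (5.2).

    a           : (a_i, a_j) from the static-background alignment.
    refinements : list of (r_ik, r_jk), one per non-I frame in the GOP.
    alpha = 0.5 gave reasonably good results in the experiments.
    """
    a = np.asarray(a, dtype=float)
    r = np.mean(np.asarray(refinements, dtype=float), axis=0)
    return tuple(alpha * a + (1.0 - alpha) * r)

# Usage sketch (hypothetical numbers):
# d = fuse_displacement((410, 480), [(408, 478)] * 14)
```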
5.2.4 Experimental Results

For the first example, the leading I frames of the two input MPEG-2 sequences are shown in Fig. 5.4. As shown in this figure, the moving object is a yellow bouncing ball in front of a poster, with a horizontal translational motion only. Fig. 5.5 shows the portion around the boundaries of the 15th, 30th, and 45th stitched frames of the two input sequences. The displacement parameters determined from the I frames and motion vectors of the first three GOPs are (410, 480), (408, 480), and (410, 480), respectively. Thus, we are able to use the background and motion information to perform the alignment and generate a mosaic video of high quality.

Figure 5.4: The first frames of the two input sequences for the 1st experiment.

Figure 5.5: The portion around the boundaries of stitched frames: (a) the 15th frame, (b) the 30th frame, and (c) the 45th frame.

The two input sequences for the second example are outdoor scenes, as shown in Fig. 5.6. The 15th, 30th, and 45th stitched frames are shown in Fig. 5.7. We see that these stitched frames have good quality. When comparing the obtained displacements with the actual ones, we observe that the estimation errors are no larger than half a block (i.e. 4 pixels).

Figure 5.6: The first frames of the two input sequences for the 2nd experiment.

Figure 5.7: The portion around the boundaries of stitched frames: (a) the 15th frame, (b) the 30th frame, and (c) the 45th frame.

5.2.4.1 Discussion

Generally speaking, the proposed DCT-domain and motion-vector-based alignment cannot provide sufficiently accurate information to reach 100% alignment accuracy, since it relies on block- or macroblock-based features, so some estimation error will result. However, through averaging and weighting, these effects can be reduced to a satisfactory degree. Also, our algorithm belongs to the family of area-based alignment techniques, which are usually more robust than feature-point-based alignment, since some feature points may disappear in one of the two frames and feature tracking is not easy.

Experimental results show that reasonable accuracy can be reached in the first step, based on the alignment of DC coefficients alone, in most cases. However, the proportion of the overlapping area to the original image size and the number of useful features within the overlapping parts affect the quality of the composition. If there are few textured feature points in the original images, the performance degrades. On the other hand, if the overlapping region contains highly regular periodic texture patterns, the accuracy of the alignment decreases too. In the second step, since only moving objects are considered, the characteristics of the objects play an important role.

The two steps of the proposed algorithm are both conducted on coded video data. Thus, we do not have to seek additional image/video features, and the tedious conversion between the spatial and compressed domains can be avoided; as a result, a lot of computation is saved. Also, since only DC coefficients are taken into consideration for the rough alignment, the DC map can be treated as a downsized version of the original image by a factor of 1/64. For these two reasons, the computational complexity is greatly reduced compared with traditional spatial-domain processing. This is the main advantage of the proposed algorithm.

5.3 DCT Block Analysis and Classification

Traditional image registration techniques can be categorized into two groups: feature-based and area-based. Both are conducted in the pixel domain; in other words, the detection process is performed over the whole image. To reduce the computational complexity, we seek an efficient way to analyze the image block content in the DCT domain so that different processing techniques can be applied to blocks of different characteristics. For example, an image can be classified into background, textures, edges, and so on. Then, based on the group type, we can treat the blocks differently.

5.3.1 DCT-Domain Block Classification

A block classification scheme based on DCT-domain information is proposed here.
By examining the definition of the 2D (two-dimensional) DCT transform for an 8×8 block given below,

F_{uv} = \frac{C_u C_v}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} f(i,j),   (5.3)

we see that each DCT coefficient is the weighting of one of the 64 basis functions shown in Fig. 5.8.

Figure 5.8: The 8×8 array of basis images for the 2D DCT.

Each basis function has different vertical and horizontal spatial frequencies. The upper-left coefficient, F_00, is called the DC coefficient, and the remaining 63 coefficients, F_ij, are called AC coefficients. The DC coefficient represents the weighting of the lowest spatial frequency within the 8×8 block, while the bottom-right AC coefficient represents the weighting of the highest spatial frequency. In other words, the content of an 8×8 block in the pixel domain can be guessed by observing those 64 DCT coefficients. As shown in Fig. 5.9, several groups are formed according to specific geometric properties.

Figure 5.9: Area grouping of DCT coefficients for defining the ratios for block classification.

• The coefficients F_ij for i = 0, ..., 3 and j = 0, ..., 3 are clustered as G_lowFreq. The shadowed blocks in Fig. 5.9 form a group called G_simpleEdge, which contains only simple vertical and horizontal edges.

• The first row, F_0j for j = 0, ..., 7, characterizes the behavior of vertical edges, while the first column, F_i0 for i = 0, ..., 7, characterizes the activity of horizontal edges. They form the groups G_verEdge and G_horEdge, respectively.

• The remaining coefficients are grouped as G_highFreq, which consists of blocks with high spatial frequencies, and the whole 8×8 block is defined as G_total.

Let s_lowFreq, s_simpleEdge, s_horEdge, s_verEdge, s_highFreq, and s_total denote the total energy of each group. Then, several ratios can be defined for block classification:

R_{DC}(i,j) = DC(i,j)/s_{total}(i,j),        R_{simpleEdge}(i,j) = s_{simpleEdge}(i,j)/s_{total}(i,j),
R_{lowFreq}(i,j) = s_{lowFreq}(i,j)/s_{total}(i,j),   R_{highFreq}(i,j) = s_{highFreq}(i,j)/s_{total}(i,j),
R_{verEdge}(i,j) = s_{verEdge}(i,j)/s_{total}(i,j),   R_{horEdge}(i,j) = s_{horEdge}(i,j)/s_{total}(i,j),
for 0 ≤ i ≤ h and 0 ≤ j ≤ w,   (5.4)

where h and w are 1/8 of the height and the width of the original images, respectively, and the values of the ratios lie between 0 and 1. An appropriate threshold is set for each ratio so that the weak activities of each group can be eliminated. For example, if only the top 10% of the edge blocks are needed, a threshold is adopted to filter out the 90% of blocks with smaller ratio values. These blocks are called inactive blocks and are not taken into consideration in the following steps, which saves some computational complexity. After the threshold for each group is defined, every block can be categorized into a group by following the tree structure given in Fig. 5.10. Note that the blocks of each group have relatively strong strength with respect to a specific geometric property. The block classification diagram can be explained as follows.

• A block is first separated into the plain background and complex areas according to the R_DC value. The DC value can be treated as the average of each 8×8 block, and its corresponding geometric pattern is a plain area with spatial frequencies equal to (0,0). Thus, if the DC energy dominates the total strength of the block at position (i,j), i.e. R_DC(i,j) is high, then the block very likely belongs to the background or a smooth area, which contains no useful information for registration.
• Blocks in the group of complex areas can be further categorized into the texture group and the non-texture group by setting a threshold on R_highFreq, since texture blocks usually have higher spatial frequencies.

• As for the non-texture group, some blocks might contain only simple edges, i.e. purely vertical or horizontal edges, and can be extracted by considering R_simpleEdge, i.e. the behavior of the first few AC coefficients. If vertical edges are of interest, blocks with only vertical edges can be taken out of the group of simple edges.

One advantage of this tree structure is that one can choose any leaf of it. In other words, blocks can be classified based on different features; each path to a leaf is just a combination of several decisions. For the horEdge/verEdge results, we can even use the method proposed before to compute the edge strength and classify blocks into even smaller groups. A minimal sketch of this tree-structured classification is given below.

Figure 5.10: The block classification diagram.
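The sketch below walks one 8×8 coefficient block through the decision tree, using squared coefficients as the energy measure; all thresholds are illustrative placeholders, and the simple-edge group is approximated by the low-order AC terms of the first row and first column, since the exact shadowed blocks are defined pictorially in Fig. 5.9.

```python
import numpy as np

def classify_block(F, t_dc=0.9, t_high=0.5, t_simple=0.5):
    """Walk the classification tree of Fig. 5.10 for one 8x8 block of
    DCT coefficients F. The thresholds are illustrative; the thesis
    tunes them so that only the strongest blocks of each group survive.
    """
    E = F.astype(float) ** 2
    total = E.sum() + 1e-12                  # guard against all-zero blocks
    if E[0, 0] / total > t_dc:               # R_DC: DC energy dominates
        return 'background'
    # R_highFreq: energy outside G_lowFreq, the first row and the
    # first column (the remaining coefficients of Fig. 5.9).
    r_high = (total - E[0:4, 0:4].sum() - E[0, 4:].sum() - E[4:, 0].sum()) / total
    if r_high > t_high:
        return 'texture'
    # R_simpleEdge, approximated by the low-order AC terms of the
    # first row (vertical edges) and first column (horizontal edges).
    ver, hor = E[0, 1:4].sum(), E[1:4, 0].sum()
    if (ver + hor) / total > t_simple:
        return 'vertical edge' if ver >= hor else 'horizontal edge'
    return 'others'
```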
5.3.2 Experimental Results

The experimental results for two test images of size 480×640 are displayed in Figs. 5.11 and 5.12. In these figures, blocks with intensity one in (a) and (b) show the extracted background areas, blocks containing texture are displayed in (c) and (d), and blocks with simple edges are shown in (e) and (f). The background is extracted perfectly in both cases, so processing tasks such as registration and super resolution can be skipped for those blocks. Blocks in the texture group can also be ignored during registration, since repetitive patterns may cause confusion in the alignment process and human eyes are not sensitive to the displacement of textured regions. It is apparent that more attention should be paid to blocks containing edges, since they provide useful features for registration; sharp edges are especially preferred. For the reasons described above, different weights can be assigned to blocks of different characteristics.

To conclude, features are first extracted, and more complicated enhancement techniques are applied only to those blocks that have a higher weight. The total computational complexity can then be reduced for blocks of less importance while the visual quality remains satisfactory.

Figure 5.11: The block classification results for the 1st test image: (a), (b) background blocks, (c), (d) texture blocks, and (e), (f) simple edge blocks of images 1 and 2, respectively.

Figure 5.12: The block classification results for the 2nd test image: (a), (b) background blocks, (c), (d) texture blocks, and (e), (f) simple edge blocks of images 1 and 2, respectively.

Chapter 6
Block-Adaptive Image Upsampling and Enhancement Techniques

Several techniques for color compensation and registration of coded images were introduced in previous chapters. The proposed methods provide efficient ways to correct color distortion and composite images to form a panoramic output. Note that the degradation of image/video quality during the capturing process has not yet been considered in either case. If image/video content is displayed on electronic devices of various resolutions, the quality degradation may become severe. To overcome this problem and generate an output image of higher quality, super resolution and image enhancement techniques are discussed in this chapter. These techniques will be integrated with the DCT-domain block classification technique to form an integrated enhancement system.

Block classification, which can be conducted in either the pixel domain or the DCT domain, categorizes each image block into several types. It serves as a pre-processing step to analyze the image content so that different processing techniques can be selected for different groups. In this chapter, image upsampling and enhancement techniques are developed based on block classification results. They are explained in Sections 6.1 and 6.2.

6.1 Block-Adaptive Image Up-Sampling Techniques

A block-adaptive super resolution technique for image up-sampling is proposed in this section. Several issues are examined, including the computational complexity, the visual quality, and the difference between the original HR image and the image after down- and up-sampling.

6.1.1 Complexity Comparison

With the maximum a posteriori (MAP) approach as the backbone of the image upsampling system, we compare the computational complexity of the traditional image-based and block-adaptive approaches. The whole image is treated as a single data vector in the image-based method. In contrast, the image is divided into several blocks of equal size in the block-adaptive method, where each sub-block is viewed as a small image and processed individually. A block size of 8×8 is chosen here for its compatibility with prevalent image/video coding schemes.

Different interpolation methods, including zero-order hold (ZOH), bilinear interpolation (BLI), block-adaptive super resolution (BSR), and traditional MAP estimation (MAP), are applied to images of different sizes for the processing-time comparison in Table 6.1. Note that the original image is treated as a single data vector for the ZOH, BLI, and MAP methods, while it is divided into 8×8 blocks in the block-adaptive algorithm. As shown in Table 6.1, the ZOH and BLI methods have a lower computational cost than the block-based super resolution algorithm or the traditional MAP method.

Table 6.1: Comparison of processing time (sec.) of different interpolation methods, including the zero-order-hold (ZOH), bilinear interpolation (BLI), block-adaptive super resolution (BSR) and traditional MAP estimation (MAP).

Image Size   ZOH     BLI     BSR       MAP
8×8          0.0161  0.0160  0.5620    0.5620
16×16        0.0162  0.0150  2.1250    3.4852
32×32        0.0150  0.0161  6.4220    119.8280
64×64        0.0320  0.0620  23.8910   N/A
128×128      0.2030  0.1250  80.4690   N/A
256×256      1.2650  1.4680  241.2650  N/A

Fig. 6.1 shows the relationship between the image size and the normalized processing time for each interpolation method. The x-axis denotes the original image size, ranging from 8×8 to 256×256, while the y-axis represents the normalized processing time for each method (in seconds per pixel). As shown in Fig. 6.1, the normalized processing time of the zero-order hold and the bilinear interpolation does not fluctuate much as the image size increases. For the traditional image-based MAP (the black line) in Fig. 6.1, the computational complexity increases dramatically with the input image size. An image of size N×N is represented by an N^2×1 vector as the input to the MAP function. Its gradient (first-order derivative) is then an N^2×1 data vector, and the second-order derivative is of size N^2×N^2.
When the image size N becomes large enough, the matrix size grows as O(N^4), which explains the missing experimental data for MAP when the image size exceeds 64×64 in the table and the figure. In the proposed block-adaptive algorithm, we apply different interpolation techniques to different block types. Since bilinear interpolation has a lower computational complexity than MAP, the proposed algorithm has a lower complexity than MAP.

Figure 6.1: The complexity of different methods measured in terms of the processing time (in seconds) as a function of the image size (in pixels).

6.1.2 Visual Quality Comparison

Comparing the block-adaptive algorithm with the traditional MAP method, we see that their outputs have similar perceptual quality, except that the former exhibits blocking artifacts. As shown in Figs. 6.3 to 6.5, the difference between the two results lies only in boundary areas. Moreover, there is little difference whether the algorithm is applied in the RGB domain or the YCbCr domain. This is because super resolution techniques relate more to spatial characteristics than to color characteristics. Thus, either the RGB or the YCbCr domain can be chosen for applying the super resolution techniques.

Figure 6.2: Visual quality comparison of different image-upsampling methods for blocks of size 8×8 in the RGB domain ((a) and (b)) and in the YCbCr domain ((c) and (d)).

Figure 6.3: Visual quality comparison of different image-upsampling methods for blocks of size 16×16 in the RGB domain ((a) and (b)) and in the YCbCr domain ((d) and (e)); (c) and (f) are difference maps.

Figure 6.4: Visual quality comparison of different image-upsampling methods for blocks of size 32×32 in the RGB domain ((a) and (b)) and in the YCbCr domain ((d) and (e)); (c) and (f) are difference maps.

Figure 6.5: Visual comparison of output images with image sizes of 64×64 ((a),(d)), 128×128 ((b),(e)), and 256×256 ((c),(f)), respectively.

6.1.3 Image Re-sizing

Consider an image that is first downsized by a factor of two horizontally and vertically and then up-sampled back to its original size. Some content information is lost during this process, so the output image has poorer quality. It is worth mentioning that the degree of degradation is not uniform throughout the whole image; it is actually content-dependent, and it is not efficient to apply the same processing to the whole image. If the severely degraded areas can be localized, we can focus on enhancing those regions only to improve the visual quality.

The difference between the original HR image and the resized LR image is compared in Fig. 6.6, where the two images are divided into 8×8 blocks and the difference is computed block by block. Blocks whose difference exceeds a certain threshold are marked and examined for their group type after block classification. Then, a more advanced processing technique can be applied to handle these difficult regions.

Figure 6.6: Comparison between the original and the degraded images due to image resizing.

We show the original and the resized images in Figs. 6.7 (a) and (b), respectively, where only downsampling/upsampling is considered, without any blurring, for the resized image. Their absolute difference is shown in Fig. 6.7(c). We see from this figure that the major difference occurs in the edge/texture areas, which is consistent with our expectation.
Thus, we only have to focus on the edge/texture blocks for visual enhancement.

According to the above observations, we conclude that the proposed block-adaptive algorithm saves computational complexity while keeping good performance. First, block-adaptive processing reduces the computational cost efficiently, since the number of degrees of freedom is much smaller than for the whole image. Second, the visual quality of the image-based and block-adaptive methods is close except in regions near block boundaries; if the boundaries are handled carefully, we can improve the image quality at a lower cost. Third, degradation during the resizing process is mainly localized in regions that contain edges and/or textures, so, to reduce the complexity further, image enhancement techniques can be applied to those areas rather than to the whole image. For these three reasons, the proposed content-adaptive image up-sampling method provides a good balance between processing complexity and resulting image quality.
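The block-wise localization of Section 6.1.3 can be summarized by the short sketch below; the default threshold (the mean block difference) is an illustrative assumption, as the text does not fix a specific value.

```python
import numpy as np

def degraded_block_mask(hr, resized, block=8, thr=None):
    """Mark 8x8 blocks whose absolute difference between the original HR
    image and its down/up-sampled version exceeds a threshold.
    """
    h = (hr.shape[0] // block) * block
    w = (hr.shape[1] // block) * block
    diff = np.abs(hr[:h, :w].astype(float) - resized[:h, :w].astype(float))
    # Sum the per-pixel differences inside each block.
    per_block = diff.reshape(h // block, block, w // block, block).sum(axis=(1, 3))
    if thr is None:
        thr = per_block.mean()          # illustrative default threshold
    return per_block > thr              # True = block needing enhancement
```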
6.1.4 Initialization for Block MAP Iteration

As mentioned above, bilinear interpolation is used as the basic image enhancement technique for smooth blocks, while the block-based MAP estimator is adopted for blocks that contain edges. To perform the block MAP, we need an initial image for the iterative enhancement. Two MAP-optimized images, initialized by the bilinear interpolation method and the zero-order-hold method, are shown in Figs. 6.8 (a) and (b), respectively. We see that bilinear interpolation yields a better result in terms of smoothness. The reason is explained in Fig. 6.9. Consider two blocks: an edge block (the green one) and a texture block (the blue one). The MAP method is applied to the edge block, while bilinear interpolation is performed on the texture block. If the zero-order-hold method is used for initialization, the expanded matrix looks like the right-hand side of the figure: there is some discontinuity between the two expanded blocks from the beginning. Since MAP is performed from this initialization, the discontinuity tends to persist even after several iterations. Thus, bilinear interpolation should be adopted so that the two blocks behave similarly from the very beginning.

Although the block-adaptive algorithm provides an efficient way to enhance the visual quality and spatial resolution of edges and textures, the output images contain blocking artifacts due to the different processing applied to adjacent blocks. Although the use of bilinear interpolation for initialization helps eliminate the blocking artifacts, some may still remain. Thus, a multi-mode enhancement technique is required for even better output quality. In other words, instead of distinguishing only complicated and non-complicated cases, the image should be divided into several groups, such as plain areas, textures, edges, and others, so that adaptive enhancement algorithms can be chosen for different needs. For plain areas, an effortless method can be used so that the computational budget is reserved for applying more complicated processing to edge blocks. The details and the experimental results are given in the next section.

6.2 Image Up-Sampling with Adaptive Enhancement

In this section, another approach to image quality enhancement is introduced. Again, the DCT-domain block classification is utilized to segment the image into several types: smooth areas, textures, edges, and others. Instead of enhancing the quality of edge blocks only, the algorithm introduced in this section is separated into four parts according to the block types.

Blocks belonging to the plain background group contain smooth surfaces. Since there is not much variation in those areas, an effortless zero-order-hold method can be adopted to expand the image content without degrading the visual quality much. In contrast, texture can be treated as the spatial repetition of a certain local pattern. Bilinear interpolation followed by a technique called "unsharp masking" [41] is applied to texture blocks to enlarge the block size while magnifying the variations at the same time. This cascaded operation yields an output image block of good quality. The parameters of the unsharp mask, e.g. the size of the impulse response array and the weighting coefficients, control the sharpness of the output image. They can be chosen adaptively for different applications.

Since human eyes are more sensitive to edges, upsampling of edge blocks demands special treatment. By viewing an image as a gray-level intensity surface, it can be approximated by a facet model, which is built to minimize the difference between an intensity surface and the observed image data. The facet model is modified to fit the needs of image enhancement, as detailed in the next subsection. Blocks that do not belong to these three groups are categorized as others. The complexity of those blocks falls between the plain background and the edges, so bilinear interpolation, whose computational cost lies between that of the zero-order-hold method and that of facet modeling with unsharp masking, is chosen for upsampling.

More details of the key processing steps, facet modeling and unsharp masking, which are adopted for image upsampling, are given in the following two subsections.

6.2.1 Facet Modeling

As mentioned before, a facet model is built to minimize the difference between an intensity surface and the observed image data. A piecewise quadratic polynomial is used in Haralick's facet model [41]. That is, an image F(j,k) is approximated by

\hat{F}(r,c) = k_1 + k_2 r + k_3 c + k_4 r^2 + k_5 rc + k_6 c^2 + k_7 rc^2 + k_8 r^2 c + k_9 r^2 c^2,   (6.1)

where the k_n are weighting coefficients to be determined, and r and c are the row and column Cartesian indices of the image F(j,k) within a specified region. The determination of the coefficients k_n, 1 ≤ n ≤ 9, calls for a least-squares solution. However, since the polynomials r^m c^n, m,n = 0,1,2, are not orthogonal, solving for the coefficients is an ill-conditioned problem. To convert the ill-conditioned problem into a well-conditioned one, a set of orthogonal polynomials is used in the polynomial expansion instead. For example, we may use the 3×3 Chebyshev orthogonal polynomials given below:

P_1(r,c) = 1,   P_2(r,c) = r,   P_3(r,c) = c,
P_4(r,c) = r^2 - 2/3,   P_5(r,c) = rc,   P_6(r,c) = c^2 - 2/3,
P_7(r,c) = c(r^2 - 2/3),   P_8(r,c) = r(c^2 - 2/3),
P_9(r,c) = (r^2 - 2/3)(c^2 - 2/3),   (6.2)

where r, c ∈ {-1, 0, 1}. As a result, the approximation can be rewritten in the form

\hat{F}(r,c) = \sum_{n=1}^{N} a_n P_n(r,c),   (6.3)

where the a_n are polynomial coefficients, which can be determined by convolving the image with a set of impulse response arrays.

To obtain the facet model, we set up observation equations at integer parameters r and c to approximate the image values in a local region. For image upsampling, we evaluate F̂(r,c) at non-integer r and c values, which allows interpolation with any upsampling factor. For example, F̂(0.5, 0.5) can be computed and inserted between F̂(0,0) and F̂(1,1), as shown in Fig. 6.10, so that the image size is enlarged by a factor of two. Similarly, the image can be adjusted to any desired size by substituting different non-integer parameters, such as (1/3, 1/3) or (1/4, 1/4), into the approximating polynomial.
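A minimal sketch of facet-model interpolation over a 3×3 neighborhood is given below. Because the Chebyshev polynomials of Eq. (6.2) are orthogonal over the 3×3 grid, the least-squares coefficients a_n reduce to simple projections; the normalization and the usage example are illustrative choices.

```python
import numpy as np

def _basis():
    """The nine 3x3 Chebyshev basis polynomials of Eq. (6.2),
    sampled at r, c in {-1, 0, 1}."""
    r, c = np.meshgrid([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], indexing='ij')
    return np.stack([np.ones_like(r), r, c, r**2 - 2/3, r*c, c**2 - 2/3,
                     c*(r**2 - 2/3), r*(c**2 - 2/3),
                     (r**2 - 2/3)*(c**2 - 2/3)])

def facet_coefficients(patch):
    """Least-squares fit of a 3x3 patch to the expansion of Eq. (6.3).
    Orthogonality over the 3x3 grid makes each a_n a projection:
    a_n = <patch, P_n> / <P_n, P_n>."""
    return np.array([(patch * P).sum() / (P * P).sum() for P in _basis()])

def facet_eval(a, r, c):
    """Evaluate the fitted model at (possibly non-integer) r, c."""
    P = [1.0, r, c, r**2 - 2/3, r*c, c**2 - 2/3,
         c*(r**2 - 2/3), r*(c**2 - 2/3), (r**2 - 2/3)*(c**2 - 2/3)]
    return float(np.dot(a, P))

# Usage sketch: interpolate midway within a 3x3 patch.
# a = facet_coefficients(image[i-1:i+2, j-1:j+2])
# v = facet_eval(a, 0.5, 0.5)
```

Note that resizing to a different factor only changes the (r, c) values passed to facet_eval; the coefficients a_n are computed once per neighborhood.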
To compare the performance of bilinear interpolation and facet modeling, some test results are shown in Fig. 6.11. The five input images are a vertical rectangle, a 45-degree triangle, a fan shape, a 135-degree triangle, and a horizontal rectangle, while the output images are enlarged versions of the inputs by a scaling factor of two in each dimension. Fig. 6.11 (a) shows images upsampled by bilinear interpolation and (b) those interpolated using facet modeling. The two methods have similar performance for vertical and horizontal edges. However, for edges with other orientations or curved lines, facet modeling outperforms bilinear interpolation. Compared with the blocky results of bilinear interpolation, facet modeling captures the behavior of an edge more accurately, so the output image has smooth edges without annoying artifacts. When other scaling factors are considered, the facet model has an additional advantage: the polynomial coefficients are computed only once. To interpolate an image to a different size, we only have to find the proper non-integer r and c values for the facet model evaluation. Generally speaking, facet modeling is a good choice for modeling an edge when dealing with image up-conversion.

6.2.2 Unsharp Masking

Unsharp masking is designed to sharpen an image, with edges and details emphasized. It is commonly applied to digital images and is available in many editing software packages. More details, including 2D and 1D unsharp masking, are given in the following two subsections.

6.2.2.1 2-D Unsharp Masking for Texture Blocks

The unsharp masking technique utilizes a blurred version of the original image, obtained by convolving it with a uniform L×L impulse response array. After generating this low-resolution image, the unsharp-masked image G(j,k) is derived by subtracting the blurred version F_L(j,k), with a certain weighting, from the original image F(j,k), i.e.

G(j,k) = \frac{c}{2c-1} F(j,k) - \frac{1-c}{2c-1} F_L(j,k),   (6.4)

where c is a weighting constant that usually lies in the range 3/5 ≤ c ≤ 5/6. Generally speaking, the sharpening effect gets stronger as c decreases and L increases.

Unsharp masking is not capable of adding extra detail to the image. Instead, it enhances the appearance of detail by narrowing the transition band around an edge, i.e. by increasing the acutance. As shown in Fig. 6.12, the unsharp mask neither increases the spatial resolution nor transforms the edge into the ideal one (the blue line). However, the image after unsharp masking (the red line) has a larger contrast, which results in better visual quality. Note that the 2D unsharp mask considered here has no directional preference, which fits the characteristics of isotropic texture. Therefore, the isotropic 2D unsharp mask is suitable for enhancing the visual quality of isotropic texture. Some examples are given in Fig. 6.13.
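Eq. (6.4) translates directly into a few lines of code. The sketch below assumes a uniform L×L blur and clips the result to an 8-bit gray range; both choices are illustrative defaults.

```python
import numpy as np
from scipy.signal import convolve2d

def unsharp_mask_2d(F, L=3, c=0.7):
    """2-D unsharp masking of Eq. (6.4).

    F : 2-D float array (one texture block or a whole image).
    L : side of the uniform blur kernel; larger L sharpens more.
    c : weighting constant, usually 3/5 <= c <= 5/6; smaller c sharpens more.
    """
    blur = np.ones((L, L)) / (L * L)                 # uniform LxL low-pass
    F_L = convolve2d(F, blur, mode='same', boundary='symm')
    G = (c / (2*c - 1)) * F - ((1 - c) / (2*c - 1)) * F_L
    return np.clip(G, 0, 255)                        # keep a valid gray range
```

With c = 0.7, the two weights are 1.75 and 0.75; their difference is 1, so flat regions pass through unchanged while transitions are steepened.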
6.2.2.2 1-D Unsharp Masking for Edge Blocks

When edges are taken into account, the situation differs slightly from texture enhancement. Since an edge has an orientation, its sharpness can be enhanced more efficiently with 1D directional unsharp masking. The mask dimension is reduced to one so that it can be oriented along the direction normal to the edge, where the maximum effect is achieved. Again, a smaller c and a larger L provide stronger sharpening but require longer processing time; there is a tradeoff between visual quality and computational complexity.

6.2.3 Experimental Results

In this section, some preliminary experimental results are reported. The test images are of different sizes and content complexities. Images are interpolated by bilinear interpolation and by the proposed content-adaptive method by a factor of two in each dimension. Since objective measurements such as MMSE do not reflect visual quality accurately, we show four test results in Figs. 6.14 and 6.15, where images in column (a) are results of bilinear interpolation and those in column (b) are results of the proposed method. It is clear that the proposed algorithm outperforms bilinear interpolation in the resulting visual quality.

If we zoom into the result by examining the 1D image data across an edge, as shown in Fig. 6.16, we see that the starred line (the proposed method) has a narrower transition band than the dashed line (bilinear interpolation). Moreover, the curve of the proposed method matches the ideal curve better. Overall, the proposed method performs better, especially in areas that contain edges.

Generally speaking, the proposed method treats an image as a composition of numerous smaller blocks with different contents. From this viewpoint, an image can be categorized into several groups so that adaptive algorithms can be applied more efficiently to different regions. Experimental results show that the DCT-domain block classification provides fairly good segmentation of image blocks. Although it does not always reach 100% accuracy, it serves as a pre-processing step that analyzes the image content and helps algorithmic development meet the requirements of different applications.

For the upsampling process, different methods are applied to different areas according to their content complexity. Facet modeling for edges has the advantage of flexibility in scaling: an image can be enlarged by any factor simply by changing the evaluation coordinates, without recalculation, whereas a traditional interpolation method may require upsampling followed by downsampling to accommodate a desired image size. Furthermore, based on the edge orientation information, the 1D post-processing provides a way to enhance the visual quality even more.

DCT-domain processing is increasingly important for today's applications, since most images and videos are compressed with the DCT. In this work, a geometric property inherent in the DCT coefficients was investigated and used for block classification. Experimental results show that the proposed tree-structured block classification works well. The proposed upsampling algorithm based on block classification is content-adaptive: it adopts relatively low-cost processing for regions that contain less important information, so as to reserve computation for critical areas that require more sophisticated processing. Experimental results show that the visual quality is improved, with sharper edges and more detail in texture areas. How to scale an image sequence efficiently is an interesting topic worth further investigation.
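A sketch of the 1-D variant is given below. For simplicity, it only distinguishes vertical from horizontal edges and applies the mask along the corresponding image axis; orienting the mask along an arbitrary edge normal, as described above, would require resampling along that direction.

```python
import numpy as np
from scipy.ndimage import convolve1d

def unsharp_mask_1d(F, L=5, c=0.7, vertical_edge=True):
    """1-D directional unsharp masking for edge blocks (Section 6.2.2.2).

    The 1-D uniform mask is applied normal to the edge: across columns
    for a vertical edge, across rows for a horizontal one. Axis-aligned
    orientations are a simplification made for this sketch.
    """
    axis = 1 if vertical_edge else 0
    F_L = convolve1d(F.astype(float), np.ones(L) / L, axis=axis, mode='reflect')
    G = (c / (2*c - 1)) * F - ((1 - c) / (2*c - 1)) * F_L
    return np.clip(G, 0, 255)
```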
Figure 6.7: Detecting the difference between the original and the resized image using bilinear interpolation: (a) original image, (b) upsampled image, (c) absolute difference (pixel-wise), (d) block-based absolute difference map, (g) blocks with higher difference, and (h) image blocks with higher difference.

Figure 6.8: The block-based MAP estimator with different initialization methods: (a) zero-order hold, and (b) bilinear interpolation.

Figure 6.9: Comparison of differences between two initialization methods.

Figure 6.10: The coordinates of a facet model.

Figure 6.11: Experimental results of (a) bilinear interpolation, (b) the facet model and (c) 1D directional unsharp masking.

Figure 6.12: Comparison of pixel intensity before and after applying an unsharp mask.

Figure 6.13: Experimental results of unsharp-masked texture patterns: (a) the original texture patterns and (b) the unsharp-masked texture patterns.

Figure 6.14: Experimental results of the first two test patterns: (a) bilinear interpolation and (b) the proposed content-adaptive upsampling method.

Figure 6.15: Experimental results of the other two test patterns: (a) bilinear interpolation and (b) the proposed content-adaptive upsampling method.

Figure 6.16: The 1D image data across an edge.

Chapter 7
Conclusion and Future Work

7.1 Conclusion

The objective of this research is to develop an efficient system for generating an image/video mosaic from multiple image/video inputs captured by different cameras under various conditions. Several techniques were proposed in this work to compensate for the color discrepancy and spatial displacement between inputs, so as to achieve a high-resolution, natural-looking mosaic output under simplifying assumptions. For example, temporal synchronization and focal length distortion problems are assumed to be resolved in advance. The developed algorithms are briefly summarized below.

• Color Matching of Coded Image/Video

In Chapter 3, we considered the problem that two images appear different in their color tones and have only a translational displacement between them. Under the assumption that the overlapping region is well aligned, we focused on matching colors in the compressed domain. We proposed two methods, histogram matching and polynomial contrast stretching, to compensate for the different color tones of two input images in the DCT domain. Both proposed methods are applied only to the DC values. The DC value, which is the average of the 64 pixel values within each 8×8 block, represents the average behavior of the block. Therefore, if the original image size is not too small relative to 8×8, the pixel-domain relationship between two images can still be preserved by the DC values in the compressed domain. Experimental results showed that polynomial approximation outperforms histogram matching in terms of output quality and memory requirements. Note that the polynomial used was of second order.
Although higher-order polynomials can reduce the matching bias, they demand more computation and may also increase variability. This was demonstrated by our performance evaluation experiments using the MSE measurement, defined as the difference between the updated DC values and the expected mean values in the overlapped region. From these experiments, we found that increasing the order does not always lead to performance improvement.

The overlapped regions of the two input images were assumed to be well aligned at the block level, i.e. the displacement vector equals [8m, 8n] for integers m and n. This assumption is, however, not practical in real-world applications. Thus, an improvement was made to handle input images with an arbitrary displacement vector of the form [m, n]. That is, we perform an interpolation that computes the pseudo DC value located at the well-aligned position, and then apply the color matching to those pseudo DC values.

For video color matching, the proposed algorithm can be applied directly to all I frames. In other words, the stretching coefficients are updated at every I frame, or whenever a scene change is detected, for higher efficiency. For P and B frames, the same procedure can be applied to the residuals to obtain another set of parameters. Based on the updated I frame values and the updated residuals of the P and B frames, a final estimate can be computed. For both image and video color matching, only DC values are taken into consideration. Since there is one DC value per 8×8 DCT block, the computational cost is reduced to 1/64 of that of spatial-domain methods. Also, the output of the approximation system is a set of three coefficients, which can be stored efficiently and reused over several frames when dealing with image sequences. The proposed color matching technique can produce an image/video mosaic of satisfactory quality while requiring only a small amount of computation. The color matching work was published in [28] and [29].

• Block-Level Coded Image Registration

We considered the problem of block-level coded image registration in Chapter 4. For image registration, we assume that the two input images are translated with respect to each other, without rotation or scaling. Since our target is coded image registration, we consider registration techniques performed in the DCT domain. We developed two algorithms, based on edge estimation and edge detection, respectively.

The method based on edge estimation consists of three steps. First, image segmentation is performed using the DC coefficients of the luminance component for foreground extraction. Second, for the foreground region, the edge orientation within each 8×8 block is estimated from the DCT coefficients of the first row and first column, followed by an 8-level quantization. Finally, a correlation-based technique is performed to find the displacement vector between the two images.

The method based on edge detection also consists of three steps: edge detection on the DC map, thresholding, and parameter determination. For edge detection, four 3×3 second-order edge detectors are applied to the DC coefficients of the luminance component of each input image. Each detector extracts a different edge property, so the generated difference maps preserve edges of various orientations, e.g. horizontal, vertical, 45-degree and 135-degree edges. Next, a threshold is set for each difference map to produce a binary map that filters out minor edges.
Finally, the displacement parameters are determined from the binary maps of the input images generated by the same detector, and the actual displacement vector in the pixel domain is calculated by averaging the parameters obtained from all detectors.

Experimental results demonstrated that the proposed algorithms save more than 90% of the computational cost compared with traditional pixel-domain techniques, while the output visual quality remains about the same. The performance is consistent for both indoor and outdoor scenes. Although the processing is block-based, the alignment quality can be enhanced to sub-block (4-pixel) accuracy. The results of the coded image registration research were published in [30], [31] and [32].

• Advanced Coded Image/Video Mosaic Techniques

In Chapter 5, we investigated three advanced coded image/video mosaic techniques, as summarized below.

- Hybrid Block/Pixel Alignment Technique

A post-processing technique, called hybrid block/pixel-level alignment, was proposed to enhance the resolution of the displacement vector from the block level to the pixel level. After applying line detectors to the DC map of an image, the energy of the vertical edges of each block is obtained. A threshold is then set to choose the candidates belonging to the high-energy group, and these candidates are given a weight in order to distinguish them from blocks with other behaviors. The same procedure is performed to label blocks with horizontal edges, so that a four-valued map of the image is available. Several 3×3 geometric patterns are predefined for the purpose of determining whether the central block contains a corner in the spatial domain. If a block is classified as a corner block, it and its eight neighboring blocks are transformed back to the pixel domain for more accurate alignment. Compared with traditional spatial-domain processing, we do not perform the inverse DCT on the whole image but only on selected blocks. Experiments showed that the proposed algorithm saves around 40% of the computational complexity while achieving the same quality.

- Coded Video Registration

The problem of stitching two MPEG sequences with a frame rate of 30 fps (frames per second) into a mosaic video output was investigated, under the assumption that the two image sequences are well aligned in the temporal domain and have the same GOP structure. The proposed algorithm first segments the I frame of each GOP (15 frames in our experiments) into the static background and moving objects. For the static background, the DC values of the luminance component are extracted to form a DC map. Then, based on the DCT-domain image registration technique presented in Chapter 4, a set of displacement parameters is determined. For the moving objects, motion vectors are extracted from the remaining frames within the same GOP. Incorrect motion vectors can be filtered out based on prior information about the moving object. The displacement parameters can be updated every GOP based on the motion information. Experimental results showed that the proposed approach provides satisfactory performance while keeping the computation low. The video registration results were published in [33].

- DCT Block Classification

DCT-domain techniques are attractive since many image and video inputs are compressed using the DCT representation. It is important to analyze the properties of the DCT coefficients so that we can bridge the information between raw and coded image/video data more conveniently.
Sinceeach DCTcoe±cientrepresentstheenergyofaspeci¯cpatternwithdi®erentvertical and horizontal spatial frequencies, we de¯ned some ratio values and developed a tree structure so as to group blocks into di®erent categories based on the 131 distributionofDCTcoe±cientsinan8£8block. Itwasshownbyexperimental resultsthattheproposedtreestructurecancapturesomeimportantblocktypes such as the plain background, smooth areas, textures, and edges. Based on the classi¯cation result, we can adopt di®erent processing techniques in di®erent areas to save computations. { Super Resolution and Image Enhancement The DCT-domain processing becomes more important for applications nowa- days since most image and video are compressed by DCT. The geometric prop- erty associated with DCT coe±cients has been investigated and used for block classi¯cationinthiswork. Experimentalresultsshowedthattheproposedblock classi¯cation using the tree structure works well. The proposed upsampling al- gorithmbasedonblockclassi¯cationiscontent-adaptive. Thatis, itappliesthe processing techniques of relatively low complexity to regions that contain less important information to save computational complexity for critical areas that require more sophisticated processing. It was shown by experimental results that the visual quality has been improved with sharper edges and more details in texture areas. 7.2 Future Work Thedemandon°exiblemediacontentconversionacrossheterogeneouscaptureanddisplay terminals will continue to grow when more and more terminals are linked by networks. Users will not be only satis¯ed by rich functionalities of an isolated device but also by 132 compatibilities between di®erent terminals so that they can get the best output based on the platform available. The di®erence between terminals has to be compensated by softwarealgorithmstofacilitatemultimediadatamigrationfromonemachinetotheother with minimal degradation. The emphasis will be the balance of computational complexity and resultant im- age/video quality. Unlike traditional methods, we conduct processing directly in DCT domain and adopting geometric property inherently in DCT coe±cients for processing speedup. However, the proposed algorithms have limitations in applicability. More re- search e®orts towards an integrated system that o®ers °exibility and compatibility among heterogeneous terminals are expected in the near future. Some research issues are high- lighted as follows. ² Eliminating Blocking Artifacts Resulting from Block-based Algorithm In our proposed system, blocks in a whole image frame are classi¯ed into several groups by following the tree structure proposed in Chapter 5 based on the distribu- tion of DCT coe±cients. Each group has its own speci¯c geometric properties. That is, an image is classi¯ed into the plain background, smooth areas, textures, or areas with strong edges or corners. Di®erent geometric properties provide di®erent visual e®ects. For example, areas with strong edges require better algorithms to improve the resolution since human eyes are more sensitive to those regions. For the areas of the plain background or smooth areas, a simple zero-order-hold method or a bilinear interpolationoperationcanproducegoodresults. Sinceblocksaremanipulatedwith di®erent processing techniques individually, there may be arti¯cial block boundaries 133 generated as a result of block partitioning. Thus, a low complexity post-processing technique is required to remove blocking artifacts. 
• Enhanced Resolution of Moving Objects

For multiple video sequences, the regions of interest containing target objects can be combined with the motion information of the following B and P frames for moving-object extraction. Based on the movement of the object, we may develop an algorithm especially tailored to enhance the resolution of moving objects. As observed by several researchers [43], [42], [15], [2], [16], the quantization step size provides important information about the feasibility of the solution, and the estimated solution can be verified using it. If the quality of the output video is not satisfactory, post-processing techniques for further resolution enhancement can be considered. Since we only deal with the regions of interest here, the number of iterations required for the optimal solution is expected to be smaller than in the traditional iterative approach. The computational cost can therefore be reduced while maintaining good visual quality.

Reference List

[1] M. S. Alam, J. G. Bognar, R. C. Hardie, and B. J. Yasuda, "Infrared image registration and high-resolution reconstruction using multiple translationally shifted aliased video frames," IEEE Trans. Instrum. Meas., vol. 49, pp. 915-923, Oct. 2000.

[2] Y. Altunbasak, A. J. Patti, and R. M. Mersereau, "Super-resolution still and video reconstruction from MPEG coded video," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 4, pp. 217-226, 2002.

[3] S. S. Beauchemin and J. L. Barron, "The computation of optical flow," ACM Computing Surveys, vol. 27, pp. 433-467, 1995.

[4] S. Borman and R. Stevenson, "Spatial resolution enhancement of low-resolution image sequences: a comprehensive review with directions for future research," Technical Report, Laboratory for Image and Signal Analysis, University of Notre Dame, 1998.

[5] S. Borman and R. L. Stevenson, "Super-resolution from image sequences - a review," Proceedings of the Midwest Symposium on Circuits and Systems, Aug. 1998.

[6] N. K. Bose, H. C. Kim, and H. M. Valenzuela, "Recursive total least squares algorithm for image reconstruction from noisy, undersampled multiframes," Multidimensional Systems and Signal Processing, vol. 4, no. 3, pp. 253-268, July 1993.

[7] L. G. Brown, "A survey of image registration techniques," ACM Computing Surveys, vol. 24, no. 4, pp. 325-376, 1992.

[8] Y. Caspi and M. Irani, "A step toward sequence-to-sequence alignment," CVPR, 2000.

[9] P. Cheeseman, B. Kanefsky, R. Kraft, J. Stutz, and R. Hanson, "Super-resolved surface reconstruction from multiple images," in Maximum Entropy and Bayesian Methods, pp. 293-308, Kluwer, Santa Barbara, CA, 1996.

[10] M. Elad and A. Feuer, "Super-resolution restoration of continuous image sequence using the LMS algorithm," Proceedings of the 18th IEEE Conference in Israel, Tel-Aviv, Israel, Mar. 1995.

[11] M. Elad and A. Feuer, "Super-resolution reconstruction of an image," Proceedings of the 19th IEEE Conference in Israel, pp. 391-394, Jerusalem, Israel, Nov. 1996.

[12] M. Elad and A. Feuer, "Restoration of a single superresolution image from several blurred, noisy, and undersampled measured images," IEEE Trans. Image Processing, vol. 6, no. 12, pp. 1646-1658, Dec. 1997.

[13] A. T. Erdem, M. I. Sezan, and M. K. Ozkan, "Motion-compensated multiframe Wiener restoration of blurred and noisy image sequences," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 293-296, San Francisco, CA, Mar. 1992.
[14] J. Foote and D. Kimber, "FlyCam: practical panoramic video and automatic camera control," IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1419-1422, Aug. 2000.

[15] B. K. Gunturk, Y. Altunbasak, and R. M. Mersereau, "Multiframe resolution-enhancement methods for compressed video," IEEE Signal Processing Letters, vol. 9, pp. 170-174, June 2002.

[16] B. K. Gunturk, Y. Altunbasak, and R. M. Mersereau, "Super-resolution reconstruction of compressed video using transform-domain statistics," IEEE Transactions on Image Processing, vol. 13, no. 1, Jan. 2004.

[17] R. C. Hardie, K. J. Barnard, and E. E. Armstrong, "Joint MAP registration and high-resolution image estimation using a sequence of undersampled images," IEEE Trans. Image Processing, vol. 6, no. 12, pp. 1621-1633, Dec. 1997.

[18] M. Holm, "Toward automatic rectification of satellite images using feature based matching," Proceedings of the International Geoscience and Remote Sensing Symposium, pp. 2439-2442, 1991.

[19] J. W. Hsieh, H. Y. M. Liao, K. C. Fan, and M. T. Ko, "A fast algorithm for image registration without predetermining correspondence," Proceedings of the International Conference on Pattern Recognition, pp. 765-769, 1996.

[20] J.-W. Hsieh, H.-Y. M. Liao, K.-C. Fan, M.-T. Ko, and Y.-P. Hung, "Image registration using a new edge-based approach," Computer Vision and Image Understanding, vol. 67, no. 2, pp. 112-130, Aug. 1997.

[21] J. Y. Hu and A. Mojsilovic, "Optimal color composition matching of images," International Conference on Pattern Recognition, vol. 4, pp. 47-50, 2000.

[22] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 850-863, 1993.

[23] G. Jacquemod, C. Odet, and R. Goutte, "Image resolution enhancement using sub-pixel camera displacement," Signal Processing, vol. 26, no. 1, pp. 139-146, 1992.

[24] S. P. Kim, N. K. Bose, and H. M. Valenzuela, "Recursive reconstruction of high resolution image from noisy undersampled multiframes," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, no. 6, pp. 1013-1027, 1990.

[25] L. Kitchen and A. Rosenfeld, "Gray-level corner detection," Pattern Recognition Letters, vol. 1, pp. 95-102, 1982.

[26] T. Komatsu, T. Igarashi, K. Aizawa, and T. Saito, "Very high resolution imaging scheme with multiple different aperture cameras," Signal Processing: Image Communication, vol. 5, pp. 511-526, Dec. 1993.

[27] L. Landweber, "An iteration formula for Fredholm integral equations of the first kind," American Journal of Mathematics, vol. 73, pp. 615-624, 1951.

[28] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "Pixel- and compressed-domain color matching techniques for video mosaic applications," Electronic Imaging, Jan. 2004.

[29] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "Color matching techniques for video mosaic applications," ICME, June 2004.

[30] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "DCT-domain image registration techniques for compressed video," ITCom, 2004.

[31] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "Compressed-domain registration techniques for MPEG video," Electronic Imaging, Jan. 2005.

[32] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "A fast compressed-domain image registration technique for video mosaic," ISCAS, 2005.

[33] M.-S. Lee, M. Shen, and C.-C. J. Kuo, "A DCT-domain video alignment technique for MPEG sequences," MMSP, 2005.

[34] H. Li, B. S. Manjunath, and S. K. Mitra, "A contour-based approach to multisensor image registration," IEEE Trans. Image Processing, vol. 4, pp. 320-334, 1995.
320{334, 1995. [35] AditiMajumder, GopiMeenakshisundaram, W. BrentSeales andHenry Fuchs, \Im- mersive teleconferencing: a new algorithm to generate seamless panoramic video im- agery," Proceeding of the Seventh ACM International Conference on Multimedia, Oc- tober 30 { November 5, 1999. [36] N. Nguyen and P. Milanfar, \An e±cient wavelet-based algorithm for image super- resolution," Proceedings of International Conference on Image Processing, vol. 2, pp. 351-354, 2000. [37] P. Oskoui-Fard and H. Stark, \Tomographic image reconstruction using the theory of convex projections," IEEE Transactions on Medical Imaging, vol. 7, no. 1, pp. 45-58, Mar 1988. [38] N. Otsu, \A threshold selection method from gray-level histograms," IEEE Transac- tions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979. [39] A. C. Park, M. K. Park, and M. G. Kang, \Super-resolution Image Reconstruction: A Technical Overview," IEEE signal processing magazine, pp. 21-36, May 2003. 137 [40] A. J. Patti, A. M. Tekalp, and M. I. Sezan, \A New Motion Compensated Reduced Order Model Kalman Filter for Space-Varying Restoration of Progressive and Inter- laced Video," IEEE Transactions on Image Processing, vol. 7, no. 4, pp. 543-554, Apr 1998. [41] W.-K. Pratt, Digital Image Processing, John Wiley & Sons, Inc., 1978. [42] M. A. Robertson and R. L. Stevenson, \DCT Quantization Noise in Compressed Images," Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 185-188 2001. [43] R. R. Schultz and R. L. Stevenson, \Extraction of highresolution frames from video sequences," IEEE Trans. IP, vol. 5, no.6, pp. 996-1011, June 1996. [44] M. Sester, H. Hild, and D. Fritsc, \De¯nition of ground control features for image registration using GIS data," Proceedings of the Symposium on Object Recognition and Scene Classi¯cation from Multispectral and Multisensor Pixels, vol. 32, pp. 537{543, 1998. [45] N.R. Shah and A. Zakhor, \Resolution enhancement of color video sequences," IEEE Trans. Image Processing, vol. 8, pp. 879-885, June 1999. [46] B. Shen and I. K. Sethi, \Direct feature extraction from compressed domain images," Storage and Retrieval for Image and Video Databases IV, vol. 2670, 1996. [47] A. M. Tekalp, M. K. Ozkan, and M. I. Sezan, \Highresolution Image Reconstruc- tion from Lower-resolution Image Sequences and Space-varying Image Restoration," ICASSP, vol. III, pp. 169-172, San Francisco, 1992. [48] B. C. Tom, A. K. Katsaggelos, and N. P. Galatsanos, \Reconstruction of a high reso- lutionimagefromregistrationandrestorationoflowresolutionimages,"Proceedings of the IEEE International Conference on Image Processing, vol. III, pp. 553-557, Austin, TX, 1994. [49] B.C.TomandA.K.Katsaggelos,\Reconstructionofahigh-resolutionimagebysimul- taneous registration, restoration, and interpolation of low-resolution images," Proceed- ings of 1995 IEEE International Conference on Image Processing, vol. 2, pp. 539-542, Washington, DC, Oct 1995. [50] R.Y. Tsai and T.S. Huang, \Multipleframe image restoration and registration," Ad- vances in Computer Vision and Image Processing, pp. 317-339, Greenwich, CT:JAI Press Inc., 1984. [51] M.TsukadaandJ.Tajima,\Colormatchingalgorithmbasedoncomputationalcolor- constancy theory," Proceedings of International Conference on Image Processing, vol. 3, pp. 60{64, 1999. [52] A. S. Vasileisky, B. Zhukov, and M. Berger, \Automated image coregistration based on linear feature recognition," Proceedings of the second Conference Fusion of Earth Data, pp. 59{66, 1998. 
138 [53] W.H.Wang,andY.C.Chen,\Imageregistrationbycontrolpointspairingusingthe invariantpropertiesoflinesegments,"Patern Recognition Letters,vol.18,pp.269{281, 1997. [54] Z.Zheng,H.Wang,andE.K.Teoh,\Analysisofgraylevelcornerdetection,"Pattern Recognition Letters, vol. 20, pp. 149{162, 1999. [55] B. Zitova, J. Kautsky, G. Peters, and J. Flusser, \Robust detection of signi¯cant points in multiframe image," Pattern Recognition Letters, vol. 20, pp. 199{206, 1999. [56] Barbara Zivota, and Jan Flusser, \Image registration methods: a survey," Image and Vision Computing 21, pp. 977{1000, 2003. 139
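The quantization-interval check mentioned in the future-work item above can be made concrete. In DCT-based coders, a coefficient dequantized to a value d with step size q must have originated from the interval [d - q/2, d + q/2], so any feasible reconstruction must reproduce coefficients inside those intervals; this is the constraint-set idea used in [2], [15], [16], [42], and [43]. The following Python sketch projects an estimate onto that constraint set. It is a minimal illustration only, assuming the estimate has already been mapped through the acquisition model (blur, motion, down-sampling) onto the coded frame's 8x8 grid; the function and variable names (project_onto_quantization_set, decoded_coeffs, q_table) are hypothetical, not from the dissertation.

    import numpy as np
    from scipy.fft import dctn, idctn  # orthonormal 2-D DCT/IDCT

    def project_onto_quantization_set(estimate, decoded_coeffs, q_table):
        """Clip each 8x8 DCT block of `estimate` into the quantization
        intervals implied by the decoded coefficients.

        estimate       : 2-D array, current guess on the coded frame's grid
        decoded_coeffs : dequantized DCT coefficients, same shape as estimate
        q_table        : 8x8 array of quantization step sizes
        """
        out = estimate.astype(np.float64).copy()
        h, w = out.shape
        for i in range(0, h, 8):
            for j in range(0, w, 8):
                block = dctn(out[i:i+8, j:j+8], norm='ortho')
                dec = decoded_coeffs[i:i+8, j:j+8]
                # A coefficient quantized to `dec` must have come from
                # [dec - q/2, dec + q/2]; clip violating coefficients back.
                block = np.clip(block, dec - q_table / 2, dec + q_table / 2)
                out[i:i+8, j:j+8] = idctn(block, norm='ortho')
        return out

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        q = np.full((8, 8), 16.0)                  # flat toy quantization table
        frame = rng.uniform(0.0, 255.0, (16, 16))  # toy observed frame
        # Simulate what the decoder knows: dequantized coefficients per block.
        dec = np.empty_like(frame)
        for i in range(0, 16, 8):
            for j in range(0, 16, 8):
                c = dctn(frame[i:i+8, j:j+8], norm='ortho')
                dec[i:i+8, j:j+8] = q * np.round(c / q)
        noisy_guess = frame + rng.normal(0.0, 5.0, frame.shape)
        refined = project_onto_quantization_set(noisy_guess, dec, q)

In a full system this projection would be interleaved with the motion-compensated back-projection step of the iterative reconstruction; since only the regions of interest are processed, each iteration touches far fewer blocks than whole-frame methods, which is the source of the computational saving claimed above.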
Abstract
Several challenging issues in high-resolution image/video mosaicking and up-sampling applications are addressed here. All of the proposed techniques operate mainly in the DCT (Discrete Cosine Transform) domain so that lower computational complexity can be achieved.
Conceptually similar
Texture processing for image/video coding and super-resolution applications
Complexity scalable and robust motion estimation for video compression
Efficient coding techniques for high definition video
Rate control techniques for H.264/AVC video with enhanced rate-distortion modeling
Focus mismatch compensation and complexity reduction techniques for multiview video coding
Hybrid mesh/image-based rendering techniques for computer graphics applications
Efficient transforms for graph signals with applications to video coding
Advanced intra prediction techniques for image and video coding
Advanced techniques for high fidelity video coding
Source-specific learning and binaural cues selection techniques for audio source separation
Behavioral signal processing: computational approaches for modeling and quantifying interaction dynamics in dyadic human interactions
Image and video enhancement through motion based interpolation and nonlocal-means denoising techniques
Efficient management techniques for large video collections
Digital signal processing techniques for music structure analysis
Techniques for efficient cloud modeling, simulation and rendering
Biologically inspired auditory attention models with applications in speech and audio processing
Advanced techniques for green image coding via hierarchical vector quantization
Compression of signal on graphs with the application to image and video coding
Graph-based models and transforms for signal/data processing with applications to video coding
Advanced technologies for learning-based image/video enhancement, image generation and attribute editing
Asset Metadata
Creator: Lee, Ming-Sui (author)
Core Title: Low complexity mosaicking and up-sampling techniques for high resolution video display
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/16/2006
Defense Date: 10/19/2006
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: compressed domain image/video processing, image/video mosaicking, low-complexity algorithms, OAI-PMH Harvest
Language: English
Advisor: Kuo, C.-C. Jay (committee chair), Narayanan, Shrikanth S. (committee member), Neumann, Ulrich (committee member), Zimmermann, Roger (committee member)
Creator Email: mingsuil@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m161
Unique identifier: UC189745
Identifier: etd-Lee-20061116 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-30395 (legacy record id), usctheses-m161 (legacy record id)
Legacy Identifier: etd-Lee-20061116.pdf
Dmrecord: 30395
Document Type: Dissertation
Rights: Lee, Ming-Sui
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu