ADVANCED TECHNOLOGIES FOR LEARNING-BASED IMAGE/VIDEO ENHANCEMENT, IMAGE GENERATION AND ATTRIBUTE EDITING

by Zohreh Azizi

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING), December 2023. Copyright 2023 Zohreh Azizi.

Dedication

To my wonderful parents and lovely sister, for their unconditional love and support during this journey.

Acknowledgements

The author acknowledges the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication. URL: https://carc.usc.edu.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Contributions of the Research
    1.2.1 Noise-Aware Texture-Preserving Low-Light Image Enhancement
    1.2.2 Self-Supervised Adaptive Low-Light Video Enhancement
    1.2.3 Progressive Attribute-Guided Extendable Robust Image Generation
    1.2.4 A GMM-based Method for Facial Attribute Editing
  1.3 Organization of the Dissertation
Chapter 2: Research Background
  2.1 Low-Light Image Enhancement
  2.2 Low-Light Video Enhancement
  2.3 Deep Learning based Image Generative Models
  2.4 Unconditional and Conditional Image Generation
  2.5 Successive Subspace Learning (SSL)
  2.6 SSL-based Image Generative Models
  2.7 Facial Attribute Editing
  2.8 GMM-based Image Generative Models
Chapter 3: Noise-Aware Texture-Preserving Low-Light Enhancement
  3.1 Introduction
  3.2 Proposed NATLE Method
  3.3 Experiments
  3.4 Conclusion and Future Work
Chapter 4: SALVE: Self-Supervised Adaptive Low-Light Video Enhancement
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 NATLE
    4.2.2 Video Enhancement
  4.3 Experiments
    4.3.1 Experimental Setup
    4.3.2 Visual Comparison
    4.3.3 Quantitative Evaluation
    4.3.4 Computational Complexity
    4.3.5 User Study
    4.3.6 Ablation Study
  4.4 Conclusion
Chapter 5: PAGER: Progressive Attribute-Guided Extendable Robust Image Generation
  5.1 Introduction
  5.2 Proposed PAGER Method
    5.2.1 Motivation
    5.2.2 System Overview
      5.2.2.1 Module 1: Core Generator
      5.2.2.2 Module 2: Resolution Enhancer
      5.2.2.3 Module 3: Quality Booster
    5.2.3 Attribute-Guided Face Image Generation
  5.3 Experiments
    5.3.1 Experimental Setup
    5.3.2 Evaluation of Generated Image Quality
    5.3.3 Other Performance Metrics
    5.3.4 Discussion
  5.4 Comments on Extendability
  5.5 Conclusion and Future Work
Chapter 6: AttGMM: A GMM-based Method for Facial Attribute Editing
  6.1 Introduction
  6.2 Proposed Method
    6.2.1 Gaussian Mixture Model for Image Generation
    6.2.2 Image Reconstruction
    6.2.3 Facial Attribute Editing
    6.2.4 Refinement of Attribute Editing
  6.3 Experiments
    6.3.1 Qualitative Evaluation
    6.3.2 Quantitative Evaluation
    6.3.3 Computational Complexity
    6.3.4 Ablation Study
  6.4 Conclusion and Future Work
Chapter 7: Summary and Future Work
  7.1 Summary of the Research
  7.2 Future Work
    7.2.1 Vision Language Modeling
    7.2.2 Image Attribute Editing
    7.2.3 Image Inpainting
Bibliography

List of Tables

3.1 Objective performance evaluation.
4.1 Quantitative comparison for enhancing clean dark videos. The best scores are indicated in bold.
4.2 Quantitative comparison for enhancing noisy dark videos. The best scores are indicated in bold.
4.3 Average runtime (in seconds) comparison per RGB frame of size 530 × 942 pixels.
4.4 FLOPs comparison per pixel.
5.1 Comparison of FID scores for the MNIST, Fashion-MNIST and CelebA datasets.
5.2 Training time comparison.
6.1 Reconstruction quality of the comparison methods on the facial attribute editing task.
6.2 FLOPs (× 10^9) comparison on a color image of resolution 64 × 64.
6.3 Ablation study on reconstruction quality without/with the refinement step.

List of Figures

3.1 Illustration of NATLE's processing steps: (a) input, (b) illumination map initialization, (c) illumination map estimation, (d) noisy reflectance map, (e) reflectance map estimation, (f) output after illumination gamma correction.
3.2 Qualitative comparison of low-light enhancement results.
3.3 Parameter study for NATLE: (a) result with α = 0, (b) result with β = 0, (c) result without denoising R̂, (d) desired parameters.
4.1 Overview of SALVE. It decomposes the input frame into estimates of the illumination (L̂) and reflectance (R̂) components. In I frames, it calculates the final illumination L and reflectance R. In B/P frames, SALVE predicts these final components using a mapping it learned from the last I frame.
4.2 Visual comparison between our method and prior work on clean-darkened video frames from the DAVIS dataset.
4.3 Visual comparison between our method and prior work on noisy-darkened video frames from the DAVIS dataset.
4.4 Visualization of video frames from a real-world dark video from the LoLi-Phone dataset.
4.5 User study results, where we show users' preference in pair-wise comparisons between our method and five benchmarking methods.
4.6 Effect of the parameters α and β as well as the denoising operation on the enhanced frame's quality.
4.7 Effect of the regressor's window size on the quality of enhanced frames.
5.1 Example distributions from RGB pixels (left block) and Saab transforms (right block). The top figures correspond to single vector dimensions (I0...I2 in the RGB domain and X0...X2 in the Saab domain). The bottom figures correspond to joint distributions. Distributions are extracted from the first three components of CelebA images.
5.2 Overview of the PAGER generation method.
5.3 Examples of attribute-guided generated images for CelebA with various attribute combinations.
5.4 Examples of PAGER generated images for the MNIST (top), Fashion-MNIST (middle), and CelebA (bottom) datasets.
5.5 Example images generated by PAGER and GenHop for the CelebA dataset.
5.6 Samples generated by PAGER and prior DL-based generative models for the CelebA dataset.
5.7 Comparison of FID scores of six benchmarking methods with six training sizes (1K, 2K, 5K, 10K, 20K and 60K) for the MNIST dataset. The FID scores of PAGER are significantly less sensitive to smaller training sizes.
5.8 Comparison of normalized training time, where each bar represents the training time of a DL-based model corresponding to those shown in Table 5.2, normalized by the training time of PAGER.
5.9 Comparison of joint FID scores and GPU training time of PAGER with DL-based related work in the generation of MNIST-like images. PAGER provides the best overall performance since it is closest to the bottom-left corner.
5.10 Comparison of PAGER's FID scores with six training sample sizes for the CelebA, Fashion-MNIST and MNIST datasets. The FID scores do not increase significantly even when the number of training samples is as low as 5K for CelebA and 1K for MNIST and Fashion-MNIST.
5.11 Illustration of PAGER's application to image super-resolution for CelebA images: two top rows starting from resolution 4 × 4 (left block) and 8 × 8 (right block) and ending at resolution 32 × 32; two middle rows starting from resolution 8 × 8 (left block) and 16 × 16 (right block) and ending at resolution 64 × 64; two bottom rows starting from resolution 16 × 16 (left block) and 32 × 32 (right block) and ending at resolution 128 × 128.
5.12 Examples of generated CelebA-like images of resolution 128 × 128 and 256 × 256.
6.1 Overview of the AttGMM facial attribute editing method.
6.2 Intermediate results of reconstruction and editing using the AttGMM method.
6.3 Overview of the Refiner block in the AttGMM method.
6.4 Refined results of reconstruction and editing using the AttGMM method.
6.5 Comparison of attribute generation accuracy between AttGMM and prior work.
6.6 Attribute generation accuracy versus FLOPs.
6.7 Ablation study on visual quality with/without the refinement step.

Abstract

Recent technological advances have led to the production of massive volumes of visual data. Images and videos are nowadays important forms of content shared across social platforms for various purposes. In this manuscript, we propose novel methodologies to enhance, generate and manipulate visual content. Our contributions are outlined as follows.

Low-light image enhancement. We present a simple and effective low-light image enhancement method based on a noise-aware texture-preserving retinex model in this dissertation. The new method, called NATLE, attempts to strike a balance between noise removal and natural texture preservation through a low-complexity solution. Its cost function includes an estimated piece-wise smooth illumination map and a noise-free texture-preserving reflectance map. After decomposing an image into the illumination and reflectance maps, the illumination is adjusted to form the enhanced image together with the reflectance map. Extensive experiments are conducted on common low-light image enhancement datasets to demonstrate the superior performance of NATLE.

Low-light video enhancement. We also present a self-supervised adaptive low-light video enhancement method, called SALVE, in this dissertation. SALVE first enhances a few keyframes of an input low-light video using a retinex-based low-light image enhancement technique. For each keyframe, it learns a mapping from low-light image patches to enhanced ones via ridge regression. These mappings are then used to enhance the remaining frames in the low-light video. The combination of traditional retinex-based image enhancement and learning-based ridge regression leads to a robust, adaptive and computationally inexpensive solution to enhance low-light videos. Our extensive experiments along with a user study show that 87% of participants prefer SALVE over prior work.

Image generation. Then, we present a generative modeling approach based on successive subspace learning (SSL). Unlike most generative models in the literature, our method does not utilize neural networks to analyze the underlying source distribution and synthesize images. The resulting method, called the progressive attribute-guided extendable robust image generative (PAGER) model, has advantages in mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendibility to conditional image generation. PAGER consists of three modules: core generator, resolution enhancer, and quality booster.
The core generator learns the distribution of low-resolution images and performs unconditional image generation. The resolution enhancer increases image resolution via conditional generation. Finally, the quality booster adds finer details to generated images. Extensive experiments on the MNIST, Fashion-MNIST, and CelebA datasets are conducted to demonstrate the generative performance of PAGER.

Facial attribute editing. Finally, we present a facial attribute editing method based on a Gaussian mixture model (GMM). Our proposed method, named AttGMM, is the first to conduct facial attribute editing without exploiting neural networks. AttGMM first reconstructs the given image in a low-dimensional latent space through a posterior probability distribution. Next, it manipulates the low-dimensional latent vectors toward a certain attribute. Finally, AttGMM utilizes the difference between the results of the previous two steps, along with the given image, to generate a refined and sharp image which possesses the target attribute. We show that AttGMM has a great advantage in lowering the computational cost. We present several experimental results to demonstrate the performance of AttGMM.

Chapter 1: Introduction

1.1 Significance of the Research

High-quality visual content is increasingly demanded in many areas: from day-to-day social interactions in consumer electronics to professional artistic applications and advanced computer vision systems, high-quality visual content is appreciated.

Visual content may be collected directly from camera sensors. Many factors contribute to the quality of the image/video captured by a camera, such as the capability of the camera sensor, photography skills, and environmental conditions such as lighting. For this reason, image/video enhancement has been an ongoing research topic for decades. Since environmental conditions such as lighting are not always under one's control while capturing images/videos, low-light image/video enhancement has been a long-standing research focus and is particularly relevant to numerous industrial applications. Low-light visual content often suffers from poor visibility, high noise and low contrast. This affects not only the user's visual experience in consumer electronics, but also the performance of computer vision systems in light-constrained environments such as warehouses or night-time monitoring systems.

Visual content may also be generated from scratch or conditioned on some priors. Image generative modeling has been one of the fastest-developing research areas in artificial intelligence during the past decade, and specifically the past few years. Image generative models not only have fundamental applications, but also facilitate the improvement of further advanced generative machine learning systems. Fundamental usages of image generative modeling include data augmentation and filling in missing data. Another application is question answering, where there are multiple acceptable answers which can be predicted by generative models. Examples of further generative applications which rely on image generative modeling are tasks that require realistic image generation, such as image super-resolution or image inpainting. Recently, advanced vision-language generative models have been presented whose performance fundamentally relies on high-quality image generation.
Apart from these applications, image generative modeling is an excellent test of our ability to model high-dimensional complex probability distributions.

Manipulating visual content to edit apparent attributes is an example of conditional image generative modeling. In particular, attribute editing in human face images has drawn great attention. Collecting paired images of different facial attributes is impracticable for certain attributes such as gender or age. An important application of facial attribute editing is therefore to create paired datasets of synthetic human face images with attribute labels. A more general application of visual content manipulation is to develop intelligent image editing tools. Besides the great demand for this application with the rise of social media, such tools would help digital artists create futuristic experiences, leading to substantial advances in the animation and film industries. In the following paragraphs, we mention the challenges of accomplishing low-light image/video enhancement, image generation and facial attribute editing.

Low-light image enhancement is quite a mature research field with a good understanding of its challenges. The well-known challenge of low-light image enhancement lies in the fact that low-light images are often very noisy. Naively enhancing the image lightness would amplify the noise level. Thus, enhancing the lightness and visibility of textured details while suppressing noise is the main challenge of low-light image enhancement. Striking a balance between noise suppression and texture preservation, even in difficult corner cases, is a standing challenge. It is also important and challenging to preserve the original color while enhancing the lightness. In video enhancement, the extra challenge is to also maintain temporal consistency while enhancing the frames. Very limited paired low-light/normal-light video datasets are available online, which creates an additional difficulty for advancing low-light video enhancement in comparison with image enhancement. Both low-light image and video enhancement are widely demanded on edge devices to complement built-in cameras and improve the social media sharing experience. Considering this, any solution for low-light image and video enhancement should be computationally inexpensive and fast. Developing light-weight and low-latency solutions for these two tasks remains a challenge.

Since the development of generative adversarial networks (GANs), image generative models have made constant breakthroughs year by year. The GAN family is now capable of generating plausible high-resolution realistic images. The application of GANs is not limited to unconditional image generation; they are also used for image inpainting, image super-resolution and attribute-guided image generation. Despite the evident advantages of GANs, their training is a non-trivial task. GANs are very sensitive to training hyperparameters and generally suffer from convergence issues [49]. Furthermore, their training requires large-scale GPU clusters and an extensive number of training samples [91]. These concerns have led to the development of improved GAN training methods [41], techniques for stabilized training with less data [63, 91], and non-adversarial approaches [49]. These concerns also hold for conditional applications such as GAN-based facial attribute editing methods. Recently, probabilistic diffusion models (PDMs) have emerged in image generation with impressive visual quality.
Although they do not suffer from the concerning issues related to GANs, they require many sequential steps during both training and inference. This makes training and inference costly and time-consuming, which is not feasible for every research lab and is undesirable for a cost-efficient industrial product. Furthermore, the vanilla PDM lacks semantic meaning in its latent variables, which limits the direct use of its representation for other tasks such as attribute editing. To extend the application of PDMs to attribute editing, combined frameworks of PDMs and auto-encoders have been proposed [108].

1.2 Contributions of the Research

1.2.1 Noise-Aware Texture-Preserving Low-Light Image Enhancement

Low-light image enhancement is a highly demanded task both in consumer electronics and as a pre-processing step in computer vision systems. We propose a method called NATLE for low-light image enhancement. NATLE has specifically designed cost functions to decompose an image into a piece-wise smooth illumination component and a noise-free texture-preserving reflectance component. The cost functions have low-complexity closed-form solutions by which both components are calculated. NATLE then enhances the illumination to a normal-light level and combines it with the reflectance to output the enhanced image. NATLE has the following key contributions:

• NATLE enhances a low-light input image to a normal-light condition.
• NATLE contains noise-suppression effects in every step of its algorithm.
• Although NATLE removes noise effectively while enhancing the light, it amplifies the input image's texture and maintains the original color.
• NATLE runs about 25 times faster than the best-performing prior work.

1.2.2 Self-Supervised Adaptive Low-Light Video Enhancement

Low-light video enhancement is crucial to many industrial applications such as night-time monitoring systems or low-light warehouse digitalization. Applying an image-based low-light enhancement method to video frames often causes flickering problems. In order to avoid flickering, it is very important to take temporal information into account so that the enhanced video is consistent over time. We present a self-supervised method for low-light video enhancement called SALVE. Our new method conducts an effective low-light image enhancement on a few keyframes within the test video. At the same time, SALVE learns mappings from low-light to enhanced keyframes. It then uses these mappings to enhance the rest of the frames within the video. SALVE has the following key contributions:

• SALVE is a hybrid method. It has the advantages of both traditional and learning-based methods.
• The traditional component leads to robustness and adaptivity to new real-world environments.
• The learning component allows a computationally inexpensive and temporally consistent solution.
• Our user study confirms that at least 87% of participants prefer SALVE over prior work.

1.2.3 Progressive Attribute-Guided Extendable Robust Image Generation

Image generation has been a hot research topic during the last decade. Alongside all the advances in GANs and diffusion models, we propose an alternative approach based on successive subspace learning (SSL) and Gaussian mixture models (GMMs). Our new approach, called PAGER, has the following key contributions:

• PAGER is a mathematically transparent method.
• The progressive content generation in PAGER allows extendability to applications such as image super-resolution or high-resolution image generation without retraining.
• PAGER has a significantly lower training time than all prior work, both on CPUs and GPUs.
• PAGER performs robustly with fewer training samples, which facilitates its extension to attribute-guided image generation.
• PAGER achieves performance comparable to prior work in unconditional and conditional image generation.

1.2.4 A GMM-based Method for Facial Attribute Editing

Image manipulation has produced impressive results thanks to advances in GAN-based and diffusion-based image generative models during the past few years. Specifically for the facial attribute editing task, we propose an alternative approach based on GMMs. Our new method, named AttGMM, has the following key contributions:

• AttGMM is the first GMM-based facial attribute editing method that does not utilize neural networks, showing the potential of GMM-based generative models for conditional image generation.
• AttGMM offers a computationally inexpensive solution for facial attribute editing.
• AttGMM is self-sufficient in tackling the concern of blurry image generation in GMM-based models.
• AttGMM achieves performance comparable to prior work in facial attribute editing.

1.3 Organization of the Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we review the background, including low-light image enhancement, low-light video enhancement, deep learning-based image generative models, unconditional and conditional image generation, successive subspace learning (SSL), SSL-based image generative models, facial attribute editing, and GMM-based image generative models. In Chapter 3, we propose a noise-aware texture-preserving low-light image enhancement method. In Chapter 4, we extend the application of our proposed low-light image enhancement method to a self-supervised low-light video enhancement technique, which is fully adaptive to any input test video. In Chapter 5, we present a progressive and robust method for image generation which is easily extendable to further generative applications such as attribute-guided image generation, image super-resolution, or high-resolution image generation. In Chapter 6, we propose a facial attribute editing method based on GMMs, which is computationally inexpensive. Finally, in Chapter 7, we give concluding remarks and future research directions.

Chapter 2: Research Background

2.1 Low-Light Image Enhancement

There are two categories of traditional low-light image enhancement methods: histogram equalization and Retinex decomposition. Histogram equalization stretches the color histogram to increase the image contrast. Although it is simple and fast, it often yields unnatural colors, amplifies noise, and under/over-exposes areas inside an image. To address these artifacts, more complex priors are adopted for histogram-based image enhancement [1, 14, 52, 82, 104]. Specific penalty terms were designed and used to control the level of contrast enhancement, noise, and mean brightness in [1]. Inter-pixel contextual information was used for non-linear data mapping in [14]. To preserve the mean brightness, histogram equalization was applied to different dynamic ranges of a smoothed image in [52]. The gray-level differences between adjacent pixels were amplified to enhance image contrast based on a layered difference representation of 2D histograms in [82]. Differential gray-level histogram equalization was proposed in [104] based on the concept of differential histograms.
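To make the histogram-equalization baseline above concrete, the following is a minimal sketch (not taken from the dissertation) of global histogram equalization and its contrast-limited variant (CLAHE) applied to the luminance channel with OpenCV; the file names, clip limit, and tile size are illustrative assumptions.

```python
# Illustrative sketch: global histogram equalization vs. CLAHE on the luminance channel.
# Assumes OpenCV and NumPy are available; "dark.png" is a placeholder input path.
import cv2
import numpy as np

bgr = cv2.imread("dark.png")                      # low-light input (BGR)
ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)    # equalize luminance only to limit color shifts
y = ycrcb[:, :, 0]

y_global = cv2.equalizeHist(y)                    # global equalization: simple, may amplify noise

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
y_clahe = clahe.apply(y)                          # contrast-limited, tile-based equalization

for name, y_new in [("global_eq.png", y_global), ("clahe.png", y_clahe)]:
    out = ycrcb.copy()
    out[:, :, 0] = y_new
    cv2.imwrite(name, cv2.cvtColor(out, cv2.COLOR_YCrCb2BGR))
```

The contrast-limited variant illustrates the kind of additional prior (here, a clip limit per tile) that the histogram-based methods cited above introduce to control noise amplification and over-enhancement.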
Inspired by the human vision system, the Retinex theory [78] assumes that each image can be decomposed into two components: a reflectance (R) term containing inherent properties and an illumination (L) term describing the lightness condition. Based on the Retinex theory, another group of works attempts to decompose the input image into R and L terms and adjust the L term to a normal-light condition. Earlier works aim at identifying formulations for the R and L decomposition to acquire R and L more accurately [56, 57] for a single-scale Retinex (SSR). They also extend SSR to a multi-scale Retinex (MSR) to conduct color restoration for color images. In [81], an adaptive MSR is proposed which computes the weights of an SSR according to the content of the input image. LR3M [112], a state-of-the-art Retinex-based method, adopts an optimization framework that determines a piece-wise smooth illumination map and a noise-free contrast-enhanced reflectance map, denoted by L and R, respectively, for a low-light image. Although it yields a noise-free contrast-enhanced image, it suffers from unrealistic bold borders surrounded by white halos on edges (see Fig. 3.2e). Moreover, its run time is longer (see Table 3.1). NATLE [4] carefully designs optimization functions to find both the R and L terms. It attempts to find a balance between suppressing noise and preserving texture through these optimization functions. STAR [138] is another low-light enhancement method based on the Retinex model. It finds R and L using structure and texture maps extracted from the input image. Although STAR preserves texture well, it does not remove noise effectively.

Recently, deep learning introduced a new paradigm of low-light image enhancement methods. A group of learning-based methods [131, 135, 150] is based on the Retinex theory. In [135], a decomposition network and an illumination network are trained to perform Retinex decomposition and enhancement, respectively. In [150], another network called reflectance restoration is added to mitigate color distortion and noise. Generative adversarial networks (GANs) offer another family of learning-based methods for low-light enhancement. In [131], a GAN [39] is employed to generate the decomposed and enhanced images. RDGAN preserves texture well, yet its enhanced images tend to be noisy and have faded colors. Another GAN-based work is [55], which is trained without paired data. Although EnlightenGAN can handle over-exposure without paired data, it tends to produce noisy results, as pointed out in [43]. The application of auto-encoders in image enhancement is also investigated in [96]. Another CNN-based method, known as Zero-DCE [42], estimates enhancement curves for each pixel in an image. In [100], a multi-branch network is proposed to extract rich features at different levels and apply enhancement via multiple subnets. In [17], an end-to-end network for enhancing raw camera images is proposed.
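As a concrete illustration of the Retinex-decomposition family surveyed above, the snippet below sketches a basic single-scale Retinex (SSR) with a Gaussian surround. It is a simplified, assumed formulation for illustration only, not NATLE or any specific method cited in this section; the sigma and normalization choices are placeholders.

```python
# Illustrative single-scale Retinex (SSR) sketch: reflectance ~ log(S) - log(Gaussian * S).
import cv2
import numpy as np

def single_scale_retinex(img: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    s = img.astype(np.float64) + 1.0                 # avoid log(0)
    surround = cv2.GaussianBlur(s, (0, 0), sigma)    # smooth illumination estimate
    r = np.log(s) - np.log(surround)                 # log-domain reflectance
    r -= r.min()
    return (255.0 * r / (r.max() + 1e-8)).astype(np.uint8)

enhanced = single_scale_retinex(cv2.imread("dark.png"))
cv2.imwrite("ssr.png", enhanced)
```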
2.2 Low-Light Video Enhancement

While low-light image enhancement is a well-studied topic, low-light video enhancement is still an ongoing and challenging research topic. Applying image-based algorithms to each frame of a video yields flickering artifacts due to inconsistent enhancement results along time [76]. Thus, it is essential to take both temporal and spatial information into account in video processing. One approach is to extend 2D convolutional neural networks (CNNs) to 3D CNNs [100], which include the 2D spatial domain and the 1D temporal domain. A 3D U-Net [115] was proposed in [54] to enhance raw camera images. However, these 3D DL-based methods have huge model sizes and extremely high computational costs.

Another approach is to exploit self-consistency [16, 32]. The resulting methods operate on single frames of video but impose a similarity constraint on image pairs to improve the performance and stability of their models. A new static video dataset was proposed in [16], containing short- and long-exposure images of the same scene. They took two random frames from the same sequence in training and utilized a self-consistency temporal loss to make the network robust against noise and small changes in the scene. Different motion types were accounted for by imposing temporal stability using a regularized cost function in [32]. Another family of self-consistency-based methods [76, 143] used optical flow to estimate the motion information in a sequence. They utilized FlowNet [53] to predict the optical flow between two frames, and warped the frames based on the predicted flow to avoid inconsistent frame processing. An image segmentation network was exploited to detect the moving object regions before optical flow prediction in [143]. A model to reduce noise and estimate illumination was proposed in [132] based on the Retinex theory. It takes each frame along with two past and two future frames as input to enhance the middle frame.

All existing methods on low-light video enhancement employ deep neural networks (DNNs) as their backbone. In this work, we propose an effective and high-performance method called SALVE to achieve the same goal without the use of DNNs. Our method contributes to green video processing with a lower carbon footprint [5, 74, 116]. Additionally, SALVE does not need a training dataset; it is a self-supervised approach which utilizes the frames of the test video and adapts its enhancement strategy accordingly. As such, our approach does not rely on a massive training dataset and is robust against environmental changes.
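SALVE's learning component (Section 1.2.2 and the abstract) maps low-light patches of a keyframe to their enhanced counterparts with ridge regression and reuses that mapping on neighboring frames. The sketch below is a simplified, assumed illustration of that idea using scikit-learn; the patch size, regularization strength, and the gamma-curve stand-in for the keyframe enhancement are not taken from the dissertation.

```python
# Illustrative sketch: learn a ridge mapping from low-light patches to enhanced patches
# on a keyframe, then apply it to another frame. Patch size and alpha are assumed values.
import numpy as np
from sklearn.linear_model import Ridge

def to_patches(img, p=8):
    h, w = img.shape[:2]
    h, w = h - h % p, w - w % p
    patches = img[:h, :w].reshape(h // p, p, w // p, p, -1).swapaxes(1, 2)
    return patches.reshape(-1, p * p * img.shape[-1]), (h, w)

def from_patches(flat, shape, p=8, c=3):
    h, w = shape
    patches = flat.reshape(h // p, w // p, p, p, c).swapaxes(1, 2)
    return patches.reshape(h, w, c)

# keyframe_low / keyframe_enh: a low-light keyframe and its enhanced version (e.g., from NATLE)
keyframe_low = np.random.rand(240, 320, 3)          # placeholder data
keyframe_enh = np.clip(keyframe_low ** 0.45, 0, 1)  # stand-in for the keyframe enhancement

X, _ = to_patches(keyframe_low)
Y, _ = to_patches(keyframe_enh)
model = Ridge(alpha=1.0).fit(X, Y)                  # closed-form ridge regression

next_low = np.random.rand(240, 320, 3)              # a later frame in the same shot
Xn, shape = to_patches(next_low)
next_enh = from_patches(model.predict(Xn), shape)
```

Because ridge regression has a closed-form solution, refitting it on each new keyframe of the test video is cheap, which is what makes this kind of self-supervised, per-video adaptation computationally attractive.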
2.3 Deep Learning based Image Generative Models

DL-based image generative models can be categorized into two main classes: adversarial-based and non-adversarial-based models. GANs [39] are adversarial-based generative models that consist of a generator and a discriminator. The training procedure of a GAN is a min-max optimization in which the generator learns to generate realistic samples that are not distinguishable from those in the original dataset, while the discriminator learns to distinguish between real and fake samples. Once the GAN model is trained, the generator can be used to draw samples from the learned distribution. StyleGANs have been introduced in recent years. They exploit style information, leading to better disentanglability and interpolation properties in the latent space and enabling better control of the synthesis [64, 65, 66]. Examples of non-adversarial DL-based generative models include variational auto-encoders (VAEs) [68], flow-based models [28, 29], GLANN [49], and diffusion-based models [27, 47].

VAEs have an encoder/decoder structure that learns a variational approximation to the density function. They then generate images from samples of the Gaussian distribution learned through the variational approximation. An improved group of VAEs, called vector-quantized VAEs (VQ-VAEs), can generate outputs of higher quality. In VQ-VAEs, the encoder network outputs discrete codes and the prior is learned instead of being static [110, 130]. Flow-based methods apply a series of invertible transformations to the data to transform the Gaussian distribution into a complex distribution. Following the invertible transformations, one can generate images from the Gaussian distribution. GLANN [49] employs GLO [10] and IMLE [87] to map images to the feature and noise spaces, respectively. The noise space is then used for sampling and image generation. Recently, diffusion-based models have been developed for image generation. During the training process, they add noise to images in multiple iterations so that the data ultimately follows a Gaussian distribution. For image generation, they draw samples from the Gaussian distribution and denoise the data in multiple gradual steps until clean images emerge.

Despite the impressive results of DL-based generative models, they are not mathematically transparent due to their highly non-linear functionality. Furthermore, they are often susceptible to unexpected convergence problems [49], long training times, and dependency on a large training dataset size. As we show in our experiments, PAGER addresses the aforementioned concerns while maintaining the quality of the images generated by DL-based techniques.
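Returning to the min-max procedure described at the beginning of this section, here is a minimal, assumed PyTorch sketch of one alternating GAN update in its binary cross-entropy form. The tiny fully-connected networks and hyperparameters are illustrative placeholders, not any model discussed in the dissertation.

```python
# Minimal GAN training-step sketch (illustrative architectures and hyperparameters).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 784) * 2 - 1          # placeholder batch of "real" images in [-1, 1]

# Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
z = torch.randn(32, 64)
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: push D(G(z)) toward 1 (non-saturating form of the min-max game).
z = torch.randn(32, 64)
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

The alternating nature of these two updates is precisely where the sensitivity to hyperparameters and the convergence issues mentioned above originate.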
2.4 Unconditional and Conditional Image Generation

In unconditional image generation, sample images are drawn from an underlying distribution without any prior assumption on the images to be generated. In conditional image generation, samples are generated under a certain assumption. One example of the latter is the generation of a high-resolution image given a low-resolution image. The proposed PAGER method contains both unconditional and conditional image generation techniques. Its core generator module employs the unconditional image generation technique. Its resolution enhancer and quality booster modules perform conditional image generation. Although PAGER is an unconditional image generator by itself, it can be easily extended to conditional image generation with rich applications. We will elaborate on this point with three examples, namely attribute-guided face image generation, image super-resolution, and high-resolution image generation. Each task is elaborated below.

Attribute-guided face image generation: For a set of required facial attributes, the goal is to generate face images that meet the requirements. [97] performs attribute-guided face image generation using a low-resolution input image. It modifies the original CycleGAN [153] architecture and its loss functions to take conditional constraints into account during training and inference. In [70], synthetic labeled data are used to factorize the latent space into sections which associate with separate aspects of face images. It designs a VAE with an additional attribute vector to specify the target part in the factorized latent space. [109] proposes to learn a geometry-guided disentangled latent space using facial landmarks to preserve generation fidelity. It utilizes a conditional VAE to sample from a combination of distributions, each of which corresponds to a certain attribute.

Image super-resolution: The problem aims at generating a high-resolution image that is consistent with a low-resolution input image. One solution is the example-based method [33]. Others include auto-regressive models and normalizing flows [105, 129, 141]. Quite a few recent papers adopt the DL methodology [30]. Another line of work treats super-resolution as a conditional generation problem and utilizes GANs or diffusion-based models as conditional generative tools which use low-resolution images as the generation condition [23, 80, 120].

Progressive generation of very-high-resolution images: Generation of a very-high-resolution image of high quality is challenging and treated as a separate research track. A common solution is to take a progressive approach in training and generation to maintain model stability and generation quality. There exist both GAN-based and diffusion-based very-high-resolution image generation solutions [47, 62].

Our PAGER method can be trained for unconditional image generation as well as for conditional image generation such as attribute-guided face image generation and image super-resolution. In principle, it can also be used for progressive generation of very-high-resolution images. PAGER serves as a general framework that can bridge different generation models and applications.

2.5 Successive Subspace Learning (SSL)

In order to extract abstract information from visual data, spectral or spatial transforms can be applied to images. For example, the Fourier transform is used to capture the global spectral information of an image, while the wavelet transform can be exploited to extract joint spatial/spectral information. Two new transforms, namely the Saak transform [73] and the Saab transform [75], were recently introduced by Kuo et al. [71, 72, 73, 75] to capture joint spatial/spectral features. These transforms are derived based on the statistics of the input without supervision. Furthermore, they can be cascaded to find a sequence of joint spatial/spectral representations at multiple scales, leading to Successive Subspace Learning (SSL). The first implementation of SSL is the PixelHop system [21], where multiple stages of Saab transforms are cascaded to extract features from images. Its second implementation is PixelHop++, where channel-wise Saab transforms are utilized to achieve a reduced model size while maintaining an effective representation [22]. An interesting characteristic of the Saab transform that makes SSL a good candidate for generative applications is that it is invertible. In other words, the SSL features obtained by multi-stage Saab transforms can be used to reconstruct the original image via the inverse SSL, which is formed by multi-stage inverse Saab transforms. Once we learn the Saab transform from training data, applying the inverse Saab transform in inference is trivial (a PyTorch implementation is available at https://github.com/zohrehazizi/torch_SSL).

SSL has been successfully applied to many image processing and computer vision applications [116]. Several examples include unconditional image generation [5, 83, 84, 85], point cloud analysis [59, 60, 61, 145, 146, 147, 148], fake image detection [18, 19, 20, 154], face recognition [117, 118], medical diagnosis [94, 103], low-light image/video enhancement [4, 6], and anomaly detection [144], to name a few. Inspired by the success of SSL, we adopt this methodology in the design of a new image generative model, as elaborated in the next section.
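The invertibility property described above is central to using SSL features for generation. The snippet below is a simplified, single-stage sketch of a Saab-like transform implemented with PCA on image patches (a DC mean component plus data-driven "AC" components), showing that the forward and inverse transforms reconstruct the input. It is an assumed illustration, not the reference PixelHop or torch_SSL implementation.

```python
# Simplified single-stage Saab-like transform: DC (patch mean) + PCA components on 4x4 patches.
# Forward then inverse reconstructs the input (up to numerical error when all components are kept).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.random((500, 16, 16))                    # placeholder training images

def extract_patches(imgs, p=4):
    n, h, w = imgs.shape
    return imgs.reshape(n, h // p, p, w // p, p).swapaxes(2, 3).reshape(-1, p * p)

patches = extract_patches(images)
dc = patches.mean(axis=1, keepdims=True)              # DC component per patch
ac = patches - dc
pca = PCA(n_components=15).fit(ac)                    # learned "AC" kernels (data-driven, unsupervised)

def forward(imgs):
    pt = extract_patches(imgs)
    d = pt.mean(axis=1, keepdims=True)
    return d, pca.transform(pt - d)                   # Saab-like features

def inverse(d, coeffs, n, h=16, w=16, p=4):
    pt = pca.inverse_transform(coeffs) + d
    return pt.reshape(n, h // p, w // p, p, p).swapaxes(2, 3).reshape(n, h, w)

d, coeffs = forward(images[:10])
recon = inverse(d, coeffs, n=10)
print("max reconstruction error:", np.abs(recon - images[:10]).max())
```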
2.6 SSL-based Image Generative Models

GenHop [83] is the contemporary SSL-based image generative model in the literature. GenHop utilizes SSL for feature extraction. It applies independent component analysis (ICA) and clustering to obtain clusters of independent feature components at the last stage of SSL. Then, it finds a mapping between the distribution of the ICA features and Gaussian distributions. In this work, we do not perform ICA but model the distribution of SSL features via GMMs directly. As compared to GenHop, our approach offers several attractive features. First, it has lower computational complexity and demands less memory. Second, our method offers a progressive and modular image generation solution. It is capable of conditional and attribute-guided image generation. It can also be easily extended to other generative applications such as super-resolution or high-resolution image generation.
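The core idea stated above, modeling the distribution of SSL features with a GMM and sampling from it, can be illustrated with the following assumed sketch, where PCA features stand in for the Saab/SSL features of the previous snippet and scikit-learn's GaussianMixture provides the density model. The component counts and dimensions are illustrative, not the dissertation's settings.

```python
# Illustrative sketch: fit a GMM on low-dimensional features of training images,
# sample new feature vectors, and map them back to image space via the inverse transform.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.random((2000, 8 * 8))                 # placeholder 8x8 training images, flattened

pca = PCA(n_components=32).fit(train)             # stand-in for the (invertible) SSL feature transform
features = pca.transform(train)

gmm = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
gmm.fit(features)                                 # model the feature distribution

samples, _ = gmm.sample(16)                       # draw new feature vectors
generated = pca.inverse_transform(samples)        # back to image space
generated = np.clip(generated, 0.0, 1.0).reshape(-1, 8, 8)
```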
2.7 Facial Attribute Editing

All the contemporary facial attribute editing models in the literature use deep neural networks to accomplish the editing task. There are two categories of facial attribute editing models: optimization-based methods and learning-based methods. CNAI [88] and DFI [128] are optimization-based methods. CNAI introduces an attribute loss, which is the difference between the given face's features and the features of a set of faces with the target attribute, all extracted by a CNN. The DFI method is based on the assumption that CNNs linearize the image manifold into a Euclidean feature subspace [8]. DFI moves the features of the given image along the direction between the features of images without and with the target attribute. Then, the given face is optimized to match its features with the moved features. Optimization-based methods are not the best choice for consumer electronics due to their potentially long inference time. These methods conduct several optimization iterations during inference to reach the final edited image, which can be time-consuming [45].

There are two categories of learning-based methods: single-attribute editing models and multiple-attribute editing models. Single-attribute editing models [89, 93, 122, 151] train different models for different attributes or combinations of attributes. Li et al. [89] aim to add/remove a certain attribute to/from a given image utilizing an adversarial attribute loss and an identity feature loss. Based on a dual learning strategy, Shen and Liu [122] train two networks at the same time for adding and removing a certain attribute. GeneGAN [151] combines the features of two given images in the latent space to swap a certain attribute between them.

There are several examples of multiple-attribute editing models in the literature. These models utilize variational auto-encoders (VAEs) [68], generative adversarial networks (GANs) [39], or diffusion models [48, 123] to accomplish attribute editing in face images. VAE/GAN [79] combines a VAE and a GAN to learn a latent representation and a decoder. The attribute editing is achieved by decoding the modified features in the latent space. IcGAN [106] encodes images into an attribute-independent uniform latent space. It then utilizes a cGAN [102] to decode the representation into images conditioned on target attributes. FaderNet [77] also learns an auto-encoder with an attribute-invariant latent representation through an adversarial process. Then, it takes an attribute vector as an input to decode the representation. The attribute-independent constraint in IcGAN and FaderNet causes information loss in the representation, which may result in distortions such as over-smoothing [45]. Kim et al. [67] and DNA-GAN [137] offer methods to swap attributes like GeneGAN, but for multiple attributes. They accomplish this task by defining latent code blocks to represent different attributes and swapping several latent code blocks. StarGAN [24] employs an attribute classification loss and a cycle consistency loss to train a conditional attribute transfer network. AttGAN [45] utilizes an attribute classification loss and a reconstruction loss to generate an image with the target attributes while keeping the other attributes the same as in the input image. While both StarGAN and AttGAN take target attribute vectors as input, STGAN [92] takes the difference between the target and source attribute vectors to selectively concatenate encoder features which are relevant to the target attributes with decoder features. In comparison to StarGAN and AttGAN, there are fewer visual artifacts regarding the not-to-be-changed attributes in STGAN results. Recently, diffusion-based facial attribute editing models have been introduced. DiffAE [108] trains an encoder to learn the high-level semantics and exploits a diffusion probabilistic model (DPM) as the decoder to model the remaining variations. DiffAE takes this approach to benefit from both the DPM's generation quality and the auto-encoder's meaningful latent variables. Collaborative Diffusion [51] takes an image, a semantic segmentation mask, and a text description as input and edits the given image with respect to the mask and the description. This method employs pre-trained uni-modal diffusion models to achieve multi-modal face editing. ChatFace [142] offers a dialog-based facial attribute editing model which takes natural language queries along with an image and edits the given image according to the query.

2.8 GMM-based Image Generative Models

The contemporary GMM-based generative models in the literature are the MFA model [114] and PAGER [5]. The MFA model reduces full-size image dimensionality using probabilistic PCA (PPCA) [126, 127] or a factor analyzer (FA) [7, 38] in multiple clusters. Through several iterations, the data clustering and the PPCA/FA parameters are optimized so that the low-dimensional latent vectors in each cluster form Gaussian distributions. It then generates images by drawing samples from the Gaussian distributions of random clusters. The low-dimensional latent vectors are transformed into images through inverse PPCA/FA. In comparison to GAN-based methods, the samples generated by the MFA model are less sharp. However, the MFA model is more capable of capturing the full data distribution, addressing the long-known problem of "mode collapse" in GAN-based models. On the other hand, PAGER [5] takes a progressive approach to image generation. For the CelebA dataset, it first generates 4 × 4 face images and then gradually increases the resolution through a series of residual generative models. PAGER exploits Green Learning [74, 116] to transform images into the feature space and to inverse-transform the feature space back to the image space. Green Learning has shown great potential in generative modeling in various applications such as texture/image generation [83, 85], image/video enhancement [4, 6], and super-resolution [133]. Finally, to address the blurriness issue in GMM-based models, PAGER introduces a quality booster based on locally linear embedding (LLE) [119] to add fine details conditioned on the blurry generated images. PAGER demonstrates a lower training time and more robust performance with fewer training samples in comparison to GAN-based models. PAGER also offers attribute-guided image generation.
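The LLE-based quality booster mentioned above can be illustrated with the following assumed sketch: for each generated (blurry) patch, find its nearest neighbors among blurry training patches, compute locally linear reconstruction weights, and apply the same weights to the paired sharp training patches to transfer fine details. The neighbor count, patch handling, and regularization are illustrative choices, not the dissertation's exact design.

```python
# Illustrative LLE-style detail transfer: reconstruct a blurry query from its neighbors,
# then reuse the same weights on the paired sharp exemplars.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
sharp_bank = rng.random((5000, 64))                  # sharp training patches (8x8, flattened)
blurry_bank = sharp_bank + 0.05 * rng.standard_normal(sharp_bank.shape)  # blurry versions (stand-in)

knn = NearestNeighbors(n_neighbors=8).fit(blurry_bank)

def boost(query_blurry: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    _, idx = knn.kneighbors(query_blurry[None, :])
    nbrs = blurry_bank[idx[0]]                       # k blurry neighbors
    diff = nbrs - query_blurry                       # local coordinates
    gram = diff @ diff.T + reg * np.eye(len(nbrs))   # regularized local Gram matrix
    w = np.linalg.solve(gram, np.ones(len(nbrs)))
    w /= w.sum()                                     # locally linear reconstruction weights
    return w @ sharp_bank[idx[0]]                    # apply weights to the sharp counterparts

boosted_patch = boost(rng.random(64))
```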
Chapter 3: Noise-Aware Texture-Preserving Low-Light Enhancement

3.1 Introduction

Images captured in low-light conditions suffer from low visibility, lost details and high noise. Besides an unpleasant human visual experience, low-light images can significantly degrade the performance of computer vision tasks such as object detection and recognition. As a result, low-light enhancement is widely demanded in consumer electronics and computer vision systems. Quite a few low-light enhancement methods have been proposed. Early work is based on histogram equalization. Although simple and fast, it suffers from unnatural color, amplified noise, and under/over-exposed areas. One popular approach for low-light enhancement is based on Retinex decomposition. In this work, we propose to decompose the Retinex model into the element-wise product of a piece-wise smooth illumination map and a noise-aware texture-preserving reflectance map, solve it with a simple and low-complexity solution and, finally, use gamma correction to enhance the illumination map. We give the new method the acronym "NATLE" due to its noise-aware texture-preserving characteristics.

Our main idea is sketched below. We begin with an initialization of a piece-wise smooth illumination map. Based on the Retinex model and the initial illumination map, we can solve for the initial reflectance map. Afterwards, we apply nonlinear median filtering as well as linear filtering [37] to the RGB channels of the initial reflectance map separately for noise-free, texture-preserving reflectance estimation. The complexity of our solution is low.

Figure 3.1: Illustration of NATLE's processing steps: (a) input, (b) illumination map initialization, (c) illumination map estimation, (d) noisy reflectance map, (e) reflectance map estimation, (f) output after illumination gamma correction.

Our work has several contributions. First, it conducts low-light enhancement and denoising at the same time. Second, it has a low-complexity solution. Third, it preserves details without unrealistic edges for better visual quality. The rest of this chapter is organized as follows. Our new method is detailed in Sec. 3.2. Experimental results are shown in Sec. 3.3. Finally, concluding remarks and future work are given in Sec. 3.4.

3.2 Proposed NATLE Method

System Overview. The classic Retinex model decomposes an observed image (S) into an element-wise multiplication of its reflectance map (R) and its illumination map (L) as

\[ S = R \circ L, \quad (3.1) \]

where R represents inherent features of the image, which are decoupled from lightness, and L delineates the lightness condition. A desired reflectance map includes texture and details, while an ideal illumination map is a piece-wise smooth map indicative of the edge information. The NATLE method consists of two major steps:

• Step 1: Use the first optimization procedure to estimate L;
• Step 2: Use the second optimization procedure to estimate R based on the estimated L from Step 1.

Then, with gamma correction, NATLE yields the final output image. NATLE is summarized in Algorithm 1; its intermediate processing results are illustrated in Fig. 3.1.
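Before detailing the two optimization steps, the following is a structural sketch (assumed, not the dissertation's code) of the overall flow just described: HSV conversion, illumination estimation, reflectance estimation with denoising, gamma correction, and recomposition. The illumination and reflectance estimators here are simple placeholders; the actual closed-form solutions are given in Eqs. (3.2)-(3.11) below.

```python
# Structural sketch of the NATLE pipeline (placeholder estimators; see Eqs. (3.2)-(3.11)).
import cv2
import numpy as np

def enhance_low_light(bgr: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float64)
    s = hsv[:, :, 2] / 255.0                         # V channel plays the role of S

    # Step 1 placeholder: a smoothed luminance stands in for the piece-wise smooth L.
    L = cv2.GaussianBlur(s, (0, 0), 5)

    # Step 2 placeholder: noisy reflectance via element-wise division, then light denoising.
    R = np.clip(s / (L + 1e-3), 0, 3)
    R = cv2.medianBlur(R.astype(np.float32), 3)

    # Gamma-correct the illumination and recompose the enhanced V channel.
    v_enh = np.clip(R * np.power(L, 1.0 / gamma), 0, 1)
    hsv[:, :, 2] = v_enh * 255.0
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

out = enhance_low_light(cv2.imread("dark.png"))
```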
Step 1. To estimate L, we conduct the following optimization [111]:

\[ \operatorname*{argmin}_{L} \; \lVert L - \hat{L} \rVert_F^2 + \alpha \lVert \nabla L \rVert_1, \quad (3.2) \]

where α is a model parameter and \(\hat{L}\) is an initial estimate of L. It is set to the standard weighted average of the three RGB color components, \(\hat{L} = 0.299R + 0.587G + 0.114B\). The first term on the right-hand side of Eq. (3.2) demands that L represent the luminance, while the second term ensures that L is a piece-wise smooth map containing remarkable edges only. We approximate the second term as

\[ \lim_{\epsilon \to 0^{+}} \sum_{x} \sum_{d \in \{h, v\}} \frac{(\nabla_d L(x))^2}{|\nabla_d \hat{L}(x)| + \epsilon} = \lVert \nabla L \rVert_1, \quad (3.3) \]

where d is the gradient direction and v and h indicate the vertical and horizontal directions, respectively. Eq. (3.2) can then be rewritten as

\[ \operatorname*{argmin}_{L} \; \lVert L - \hat{L} \rVert_F^2 + \sum_{x} \sum_{d \in \{h, v\}} A_d(x)\, (\nabla_d L(x))^2, \quad (3.4) \]

where

\[ A_d(x) = \frac{\alpha}{|\nabla_d \hat{L}(x)| + \epsilon}. \quad (3.5) \]
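As a small illustration of Step 1's ingredients, the sketch below computes the luminance initialization \(\hat{L}\) and the gradient-adaptive weights A_d(x) of Eq. (3.5) with NumPy. The ϵ value and the forward-difference gradients are assumed implementation details (α follows the value used later in Sec. 3.3); the resulting quadratic problem of Eq. (3.4) is solved in closed form in Eq. (3.9).

```python
# Illustrative computation of L_hat and the weights A_h, A_v from Eq. (3.5).
import numpy as np

def illumination_init(rgb: np.ndarray) -> np.ndarray:
    # L_hat = 0.299 R + 0.587 G + 0.114 B, with rgb in [0, 1], shape (H, W, 3)
    return 0.299 * rgb[:, :, 0] + 0.587 * rgb[:, :, 1] + 0.114 * rgb[:, :, 2]

def gradient_weights(l_hat: np.ndarray, alpha: float = 0.015, eps: float = 1e-3):
    # Forward differences as a discrete stand-in for the horizontal/vertical gradients.
    grad_h = np.diff(l_hat, axis=1, append=l_hat[:, -1:])
    grad_v = np.diff(l_hat, axis=0, append=l_hat[-1:, :])
    a_h = alpha / (np.abs(grad_h) + eps)
    a_v = alpha / (np.abs(grad_v) + eps)
    return a_h, a_v

rgb = np.random.rand(64, 64, 3)          # placeholder low-light image
l_hat = illumination_init(rgb)
a_h, a_v = gradient_weights(l_hat)       # large weights on flat regions -> stronger smoothing
```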
Second, the Fast Adaptive Bilateral Filtering (fastABF) method [37] is used to remove remaining noise in each of RGB channels. FastABF employs the Gaussian kernel of different parameters at different pixel locations to adjust denoising degree according to the local noise level. Furthermore, it can implemented by a fast algorithm. After denoising, the RGB output is converted back to the HSV space. Hue and saturation are saved for the final solution. The V channel serves as b R to initializeR. The above procedure not only remove noise in theV channel of noisy b R but also in hue and saturation channels. It is worthwhile to point out that conventional high-performance denoising methods such as non-local- mean (NLM) [11] and BM3D [26] are slow in run time and weak in texture preservation. Faster denoising methods such as classic bilateral filtering [3] do not work well for images with heavy noise. The fastABF method is a good solution since it provides a good balance between low complexity, texture preservation and effective noise reduction. 23 Closed-FormSolution. The optimization problems in Eqs. (3.2) and (3.4) can be solved by differenti- ating with respect toL andR and setting the derivative to zero in a straightforward manner. There is no approximation neither high-complexity algorithm in order to solve them. Actually, the final solution can be derived in closed form as l = (I + X d∈{h,v} D T d Diag(a d )D d ) − 1 b l, eq:chapter3l (3.9) r = (I +β X d∈{h,v} D 2 d ) − 1 (b r+β X d∈{h,v} D T d g d ), eq:chapter3r (3.10) where Diag(y) is a diagonal matrix of vectorized y from matrix Y , D d is a discrete differential operator matrix that plays the role of∇ andI is the identity matrix. Once vectorsl andr are determined, they are reformed to matricesL andR, respectively. As the last step, gamma correction is applied toL to adjust illumination. The ultimate enhanced low-light image can be computed as S ′ =R◦ L 1 γ . (3.11) eq:Sprime eq:Sprime 3.3 Experiments sec:chapter3experiments We conduct experiments on Matlab R2019b with an Intel Core i7 CPU @2.7GHz. The parameters α , β , λ and γ are set to 0.015, 3, 1.1 and 2.2, respectively. We carry out performance comparison of several benchmarking methods with two sets of images: 1) 59 commonly used low-light images collected from various datasets for subjective evaluation and no-reference comparison; 2) 15 images from the LOL paired dataset for reference-based comparison. ObjectiveEvaluation. A no-reference metric, ARISMC [40], and a reference-based one, SSIM [152], are adopted for objective evaluation of several methods in Table 3.1. While ARISMC assesses quality of 24 (a) Input fig:input (b) PIE [34] fig:pie (c) SRIE [35] fig:srie (d) RetinexNet [136] fig:retinexnet (e) LR3M [111] fig:lr3m (f) NATLE (Ours) fig:ours Figure 3.2: Qualitative comparison of low-light enhancement results. fig:quality1 both luminance and chrominance, SSIM evaluates structural similarity with the ground truth. The run time is also compared in the table. LR3M has the best ARISMC performance. NATLE has the best SSIM performance and the second best ARISMC. Yet, LR3M demands 25 more times than NATLE. Subjective Evaluation. A qualitative comparison of our method with four benchmarking methods is shown in Fig. 3.2. For the first street image, PIE, SRIE and RetinexNet amplify noise in the low-light enhanced image. RetinexNet has unnatural color. LR3M has extra borders or halo next to edges. For the second lamp image, PIE and SRIE have either dark or noisy background. 
RetinexNet is noisy and over- exposed with unnatural texture and color. LR3M has false red borders around the lamp. For the third hills image, PIE and SRIE has low-light results. RetinexNet reveals square traces of BM3D denoising on trees 25 Table 3.1: Objective performance evaluation tab:score Method ARISMC↓ SSIM↑ Run Time (sec.)↓ BIMEF [140] 3.1543 0.5903 0.27 CRM [139] 3.1294 0.5366 0.32 MF [36] 3.1342 0.4910 0.96 PIE [34] 3.0636 0.5050 1.55 SRIE [35] 3.1416 0.4913 16.89 LR3M [111] 2.7262 0.4390 127 NATLE (Ours) 2.9970 0.6193 4.98 and has unnatural color. LR3M removes all texture in mountains and generates an extra border between mountain and sky. For the last bird image, PIE and SRIE have low-light shadow areas. RetinexNet and LR3M generate black border around the bird. RetinexNet has unnatural color while LR3M loses feather texture and blurs shadow area. NATLE yields noise-free images with natural edges in these examples. It enhances light adequately and preserves texture well. Discussion. It is worthwhile to highlight several characteristics of the proposed NATLE method. a) Denoising and Texture Preservation. NATLE effectively removes noise without losing texture detail when being applied to a wide range of low-light images. The optimization in Eq. (3.6) demands the en- hanced reflectance map to be as close as its noise-free form while preserving edges and textures in the input. As compared with [111], NATLE takes a moderate approach. That is, it does not denoiseR more than needed. This is the main reason why NATLE can preserve texture, remove noise and maintain natural borders without halo at the same time. b)Speed. NATLE performs fast and efficient with a closed-form solution. It is of low-complexity, since it does not demand iterations. It does not require sequential mathematical approximations, either. The performance of NATLE is affected by denoising methods. FastABF is chosen here. It is fine to adopt other denoising methods depending on the application requirement. c) Parameter Study. The impact of model parametersα ,β ,λ andγ is shown in Fig. 3.3. The results of setting α = 0 is shown in column (a). It removes the second term in Eq. (3.2), leading to a non-smooth 26 illumination map and a noisy output. Moreover, removing this term results in color distortions such as the yellow area on grass beside the pavement at the bottom of the image. Column (b) shows results with β = 0. Without the second term in Eq. (3.6), edges and details are blurred. Column (c) is very noisy. It shows the need of denoising b R. Column (d) is the result by including all model parameters, which is clearly better than the other three cases. (a) α =0 fig:alpha0 (b) β =0 fig:beta0 (c) Without denoising b R fig:noisy (d) NATLE (Ours) fig:withall Figure 3.3: Parameter study for NATLE: (a) result with α = 0, (b) result with β = 0, (c) result without denoising b R, (d) desired parameters. fig:parameters 3.4 ConclusionandFutureWork sec:chapter3conclusion A low-light image enhancement method based on a noise-aware texture-preserving retinex model, called NATLE, was proposed in this work. It has closed-form solutions to the two formulated optimization prob- lems and allows fast computation. Its superior performance was demonstrated by extensive experiments with both objective and subjective evaluations. One possible future work is to extend this framework into video low-light enhancement. The main challenge is to preserve temporal consistency of enhanced video. 
27 Chapter4 SALVE:Self-SupervisedAdaptiveLowLightVideoEnhancement chapter4 4.1 Introduction sec:chapter4intro Videos captured under low light conditions are often noisy and of poor visibility. Low-light video enhance- ment aims to improve viewers’ experience by increasing brightness, suppressing noise, and amplifying de- tailed texture. The performance of computer vision tasks such as object tracking and face recognition can be severely affected under low-light noisy environments. Hence, low-light video enhancement is needed to ensure the robustness of computer vision systems. Besides, the technology is highly demanded in con- sumer electronics such as video capturing by smart phones. While mature methods for low-light image enhancement have been developed in recent years, low lightvideo enhancement is still a standing challenge and open for further improvement. A trivial solution to low light video enhancement is to enhance each frame with an image enhancement method indepen- dently. However, since this solution disregards temporal consistency, it tends to result in flickering videos [76]. Also, frame-by-frame low light video processing can be too computationally expensive for practical applications. Several methods utilized deep learning (DL) to preserve the temporal consistency of video frames. For instance, 3D CNNs are trained to process a number of frames simultaneously in order to take temporal 28 consistency into account [54, 100]. Some papers enforce similarity between pairs of frames with a temporal loss function or loss function regularization in training [16, 32]. Other works extract the motion informa- tion and leverage redundancy among frames to ensure the temporal consistency of enhanced videos [76, 143]. On one hand, the efforts mentioned above lead to high-performance models with a range of acceptable to excellent quality results. On the other hand, their performance is dependant on the training dataset. Differences between the training and testing environments can degrade the performance of low light video enhancement severely. In other words, when deployed in the real world, the DL-based models cannot be trusted and utilized without fine-tuning. Considering the fact that paired low-light/normal-light video datasets are very scarce, fine-tuning these models can be challenging. In this chapter, we propose an alternative low-light video enhancement method to address the above- mentioned challenges. Our proposed method is called the self-supervised adaptive low-light video en- hancement (SALVE) method. By self-supervision, we mean thatSALVE directly learns to enhance an arbi- trary input video without requiring to be trained on other training videos. SALVE offers a robust solution that is highly adaptive to new real-world conditions. SALVE selects a couple of key frames from the input video and enhances them using an effective Retinex-based image enhancement method called NATLE [4]. Given NATLE-enhanced key frames of the input video, SALVE learns a mapping from low-light frames to enhanced ones via ridge regression. Finally, SALVE uses this mapping to enhance the remaining frames. SALVE does not need low- and normal-light paired videos in training. Therefore, it can be an attractive choice for non-public environments such as warehouses and diversified environments captured by phone cameras. SALVE is a hybrid method that combines components from a Retinex-based image enhancement method and a learning-based method. 
The former component leads to a robust solution which is highly adaptive to new real-world environments. The latter component offers a fast, computationally inexpensive and 29 Figure 4.1: Overview of SALVE. It decomposes the input frame into estimations of illumination ( b L) and reflectance ( b R) components. InI frames, it calculates the final illumination L and reflectance R. InB/P frames, SALVE predicts theses final components using a mapping it learnt from the last I frame. fig:chapter4overview temporally consistent solution. We conduct extensive experiments to show the superior performance of SALVE. Our user study shows that 87% of participants prefer SALVE over prior work. The rest of this chapter is organized as follows. The low light image enhancement method, NATLE, is reviewed and then, the proposed low light video enhancement method,SALVE is explained in Section 4.2. Experimental results are presented in Section 4.3. Finally, concluding remarks are given in Section 4.4. 4.2 ProposedMethod sec:chapter4method Figure 4.1 presents the overview of our proposed method. The top row shows the steps taken to enhance an input frame, which we discuss in Section 4.2.1. The bottom row shows the extension to videos, i.e. it shows how we treat different frames of the video. We discuss this process in Section 4.2.2. 4.2.1 NATLE subsec:NATLE In order to propose our method in Section 4.2.2, we need to first review NATLE [4], which is an effec- tive method for low light image enhancement. NATLE is a retinex-based low light image enhancement method. A classic retinex model decomposes an input image S into the element-wise multiplication of two components; a reflectance map ( R) and an illumination (L) map: 30 S =R◦ L, (4.1) eq:retinex eq:retinex R represents the inherent features within an image which remain the same in different lightness con- ditions. L shows the lightning condition. Ideally,R contains all the texture and details within the image andL is a piece-wise smooth map with significant edges. NATLE presents a methodology to find solutions forR andL. It then enhancesL to a normal light condition, and follows the retinex model to combine the enhancedL withR and obtain the enhanced image. In what follows, we explain the stepsNATLE takes to find solutions for R andL: Step 1 is to calculate an initial estimation ofL, namely b L, which is a weighted average of RGB color channels: b L=0.299R+0.587G+0.114B. (4.2) eq:L_hat eq:L_hat Step2 is to form an optimization function to find a piece-wise smooth solution for L: argmin L ∥L− b L∥ 2 F +α ∥∇L∥ 1 , (4.3) eq:L eq:L where∥∇L∥ 1 is approximated with: lim ϵ →0 + X x X d∈{h,v} (∇ d L(x)) 2 |∇ d b L(x)|+ϵ =∥∇L∥ 1 , (4.4) eq:Lappr eq:Lappr whered is the gradient direction andv andh indicate the vertical and horizontal directions, respectively. Eq.(4.3) is rewritten as: argmin L ∥L− b L∥ 2 F + X x X d∈{h,v} A d (x)(∇ d L(x)) 2 , (4.5) eq:L2 eq:L2 31 where: A d (x)= α |∇ d b L(x)|+ϵ . (4.6) eq:Ad eq:Ad Finally, Eq. (4.5) is solved by differentiating with respect to L and setting the derivative to zero, without any approximation. The final solution is derived in closed form as: l =(I + X d∈{h,v} D T d Diag(a d )D d ) − 1 b l, (4.7) eq:l eq:l where Diag(a d ) is a matrix witha d on its diagonal,D d is a discrete differential operator matrix that plays the role of∇ andI is the identity matrix. Once vectorl is determined, it is reshaped to matrixL. 
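For concreteness, this closed-form solve can be implemented with sparse matrices in a few lines. The sketch below is an illustration of Eqs. (4.5)–(4.7) under simple forward-difference operators and a small constant in place of the limit ε→0+, assuming SciPy; it is not the released implementation.

```python
# Sketch of the closed-form illumination solve in Eq. (4.7), assuming SciPy.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def forward_diff(m):
    # m x m forward-difference operator (simple boundary handling).
    return sp.diags([-np.ones(m), np.ones(m - 1)], [0, 1], format="csr")

def solve_illumination(L_hat, alpha=0.015, eps=1e-3):
    """Solve l = (I + sum_d D_d^T Diag(a_d) D_d)^(-1) l_hat and reshape to L."""
    h, w = L_hat.shape
    l_hat = L_hat.ravel()                                   # row-major vectorization
    Dh = sp.kron(sp.eye(h), forward_diff(w), format="csr")  # horizontal gradient
    Dv = sp.kron(forward_diff(h), sp.eye(w), format="csr")  # vertical gradient
    A = sp.eye(h * w, format="csr")
    for D in (Dh, Dv):
        a_d = alpha / (np.abs(D @ l_hat) + eps)             # weights of Eq. (4.6)
        A = A + D.T @ sp.diags(a_d) @ D
    return spsolve(A.tocsc(), l_hat).reshape(h, w)
```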
Step3 is to calculate an estimation ofR, namely b R: b R =S⊘ (L+ε)− N, (4.8) eq:Rhat eq:Rhat whereS is the input image,L is the estimated illumination obtained in Step 2,− N shows a median filter denoising followed by a bilateral filter denoising, ⊘ denotes element-wise division andε is a small value to prevent division by zero. Step4 is to form an optimization function to find R: argmin R ∥R− b R∥ 2 F +β ∥∇R− G∥ 2 F , (4.9) eq:R eq:R The first term in Eq. 4.9 ensures that R is noise-free and consistent with retinex model. The second term has a noise-removal and texture preserving dual role where G is: 32 G= 0, ∇S <ϵ g λ ∇S, (4.10) eq:G eq:G Here, ϵ g is the threshold to filter out small gradients which are assumed as noise and λ controls the degree of texture amplification. Finally, the optimization problem in Eq. (4.9) is solved by differentiating with respect toR and setting the derivative to zero, without an approximation. The final solution is derived in closed form as: r =(I +β X d∈{h,v} D 2 d ) − 1 (b r+β X d∈{h,v} D T d g d ). (4.11) eq:r eq:r whereD d ,∇ andI are defined same as in Eq. 4.7. Once vector r is determined, it is reformed to matrixR. Step 5 is to apply gamma correction on L to adjust illumination. The ultimate enhanced image is computed as: S ′ =R◦ L 1 γ . (4.12) eq:gamma eq:gamma 4.2.2 VideoEnhancement subsec:Video In [4], we showed the performance of NATLE in low light image enhancement. NATLE suppresses noise and preserves texture while enhancing dark images. In order to extend the application ofNATLE to videos, the trivial idea would be to apply NATLE separately on all the frames within a video. We evaluated the performance of this method in section 4.3. However, a series of consecutive frames within a video usually have significant correlations in structure, color, and light. We may leverage this temporal similarity in order to lower the costs of the video enhancement from a series of repetitive image enhancements. Here, we propose a self-supervised method for low light video enhancement based on learning from NATLE. By applying NATLE on selected frames within a video, we acquire pairs of dark and enhanced 33 frames, from which we learn a mapping from dark to enhanced frames. We then apply the learnt mapping to the rest of the frames to accomplish the low light video enhancement. In particular, we approximate Eqs. (4.7) and (4.11) inNATLE, which take the major portion ofNATLEś runtime. Thus, our proposed video enhancement method is significantly faster and computationally less expensive than applying frame-by- frame NATLE. Still, the results using this approach are as good as doing a frame-by-frame NATLE. In order to decide the frame on which we apply NATLE, we use the FFMPEG compression technique. In FFMPEG, there are three types of frames, namely identity (I), bidirectional (B), and predicted (P) frames. The I frames are the key frames which indicate a significant spatial or temporal change within the video. More precisely, an I frame is placed where either of the following conditions is met: • The frame remarkably differs from the previous frame. • One second has passed from the previous I frame. We explain the steps to obtain enhanced video frames using SALVE below. Step1. Apply NATLE to an I frame: I enhanced , b L I ,L I , b R I ,R I =NATLE(I), (4.13) eq:natle_I eq:natle_I whereI andI enhanced are the low-light and enhanced keyframes, respectively. 
The rest of the parameters, i.e., b L I ,L I , b R I andR I , are the results of intermediate steps in NATLE as described in Section 4.2.1. Step2. Learn two ridge regressions mapping b L I and b R I toL I andR I , respectively. To be more precise, we look forW L andW R to solve the following optimization problems: min W l ||l I − b l I W l || 2 2 +α ||W l || 2 2 , (4.14) eq:ridge_regression_L eq:ridge_regression_L min Wr ||r I − b r I W r || 2 2 +α ||W r || 2 2 , (4.15) eq:ridge_regression_R eq:ridge_regression_R 34 wherel I ∈R n× 1 is the vectorized form ofL I withn pixels. b l I ∈R n× 25 denotes5× 5 neighborhoods of each pixel in b L I . The solution,W l ∈R 25× 1 , maps each5× 5 patch in b L I to the corresponding center pixel inL I . The same process and notation is used for Eq. (4.15). Step 3. Compute b L P of B/P frames using Eq. (4.2). Obtaining b L by NATLE is computationally inex- pensive. Hence, we keep it the same while enhancing B/P frames. Step4. ComputeL P using the ridge regressorW l learned in Step 2: l P = b l P W l , (4.16) eq:ridge_regressor_L eq:ridge_regressor_L where b l P ∈R n× 25 denotes5× 5 neighborhoods of each pixel in b L P .l P ∈R n× 1 is the vectorized form of L P withn pixels. We reshapel P vector to obtainL P matrix. Step5. Compute b R P for B/P frames using Eq. (4.8). Step6. ComputeR P using the ridge regressorW r learned in Step 2; namely, r P =b r P W r , (4.17) eq:ridge_regressor_R eq:ridge_regressor_R whereb r P ∈R n× 25 denotes5× 5 neighborhoods of each pixel in b R P . r P ∈R n× 1 is the vectorized form ofR P withn pixels. We reshaper P to obtainR P . Step7. Apply gamma correction toL P for illumination adjustment. The final enhanced B/P frame is computed using Eq. (4.12). We perform Steps 1 and 2 on the I frames and Steps 3 to 7 on the subsequent B/P frames. Once a new I frame is encountered, we repeat Steps 1 and 2 and continue. This setting ensures that the self- supervised learning fromNATLE is being updated frequently enough to keep up with any significant tem- poral changes. 35 4.3 Experiments sec:chapter4experiments 4.3.1 ExperimentalSetup subsec:setup We conduct extensive qualitative and quantitative experiments to evaluate our method and show its effec- tiveness. In our experiments, we use the DAVIS video dataset [12, 107] as the ground truth. DAVIS offers two resolutions, 480P and full resolution. We use all full resolution videos from 2017 and 2019 challenges. Following [99], we synthesize dark videos by darkening the normal-light frames of DAVIS dataset with gamma correction and linear scaling: x=B× (A× y) γ (4.18) eq:darken eq:darken where y is the ground-truth (normal light) frame, x is the darkened frame, A andB are linear scaling factors and sampled from uniform distributionsU(0.9,1) andU(0.5,1), respectively, andγ is the gamma correction factor which is sampled fromU(2,3.5). We also synthesized the noisy version of the dark frames. Following [143], we add both Poisson and Gaussian noise to the low-light frames: n=P(σ p )+N(σ g ) (4.19) eq:noise eq:noise whereσ p andσ g are parameters of Poisson noise and Gaussian noise, respectively. They are both sampled from a uniform distribution U(0.01,0.04). We acquire two sets of videos, namely darkened and noisy- darkened videos. Our goal is to enhance these videos and assess them qualitatively and quantitatively. 4.3.2 VisualComparison subsec:visual We first provide a visual analysis over an example video frame from the dark-clean and dark-noisy datasets in Figures 4.2 and 4.3, respectively. 
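For reference, the darkened and noisy inputs shown in these figures are synthesized as in Eqs. (4.18)–(4.19); a minimal sketch is given below, assuming NumPy and frames scaled to [0, 1]. The exact noise parameterization follows [143], so the Poisson term is one plausible realization rather than the authors' code.

```python
# Sketch of the synthetic degradation in Eqs. (4.18)-(4.19).
import numpy as np

rng = np.random.default_rng(0)

def darken(y):
    A = rng.uniform(0.9, 1.0)            # linear scaling factors of Eq. (4.18)
    B = rng.uniform(0.5, 1.0)
    gamma = rng.uniform(2.0, 3.5)        # gamma correction factor
    return B * (A * y) ** gamma

def add_noise(x):
    sigma_p = rng.uniform(0.01, 0.04)    # Poisson and Gaussian parameters, Eq. (4.19)
    sigma_g = rng.uniform(0.01, 0.04)
    poisson = rng.poisson(x / sigma_p) * sigma_p - x   # signal-dependent component
    gaussian = rng.normal(0.0, sigma_g, size=x.shape)
    return np.clip(x + poisson + gaussian, 0.0, 1.0)
```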
We observe that methods that are not based on deep learning (LIME 36 (a) input dark video (b) LIME [44] (c) Dual [149] (d) DRP [90] (e) SDSD [132] (f) StableLLVE [143] (g) Ours (h) Ground-truth Figure 4.2: Visual comparison between our method and prior work on clean-darkened video frames from DAVIS dataset. fig:visual_clean and DUAL) do not add artifacts to the frame, but the resulting enhanced frame still lacks lightness. Among prior methods based on deep learning, DRP [90] is a self-supervised method that gives nice enhancement results. While DRP adds colorful textures to the enhanced images, the results tend to be slighly different with the ground truth and have artifacts in some regions. SDSD [132] is a supervised method which is fine- tuned to the DAVIS dataset. SDSD tends to add artifacts to enhanced images. This issue is more noticable in Figure 4.3. StableLLVE [143] is a supervised method trained on the DAVIS dataset. The enhancement results by StableLLVE have a pale color. Our method achieves enhanced frames that are fairly close to the ground-truth and avoids adding artifacts or changing the coloring of the image. The example input dark frames in Figures 4.2 and 4.3 were created synthetically. We next examined our framework on a real-world video randomly selected from the LoLi-Phone dataset [86]. Note that in this case there is no ground-truth video. We present the enhanced frames corresponding to our work and related work in Figure 4.4. We make a similar observation to that of the synthetic dataset: our method is capable of achieving an image with better lightness while keeping the coloring and visual content intact. 37 (a) input dark & noisy video (b) LIME [44] (c) Dual [149] (d) DRP [90] (e) SDSD [132] (f) StableLLVE [143] (g) Ours (h) Ground-truth Figure 4.3: Visual comparison between our method and prior work on noisy-darkened video frames from DAVIS dataset. fig:visual_noisy 4.3.3 QuantitativeEvaluation subsec:scores We use four quantitative metrics to evaluate the performance of our method and compare it with prior work. First two metrics are Peak-Signal-to-Noise Ratio (PSNR) and Structural-SIMilarity (SSIM) [134], which we apply on all frames of the videos. We also use AB(Var) from [100] and Mean Absolute Brightness Difference (MABD) from [54] to assess temporal stability and its consistency with those of the ground truth. Table 4.1 and Table 4.2 show comparisons between our method and prior work for the clean and noisy datasets, respectively. In Table 4.1 and 4.2, we take the scores of all prior work except for Dual [149], DRP [90] and SDSD [132] from [143]. For DUAL (traditional method) and DRP (self-supervised method), we use their public codes to enhance the images. For SDSD, we fine-tune their pre-trained model on the DAVIS dataset. We then use the fine-tuned SDSD model to enhance the images. We then calculate the scores of these three methods and report them in Tables 4.1 and 4.2. Low scores of the DRP method is caused by its generative flavor, which makes the results different from the ground truth. Also, their public codes are only conducted for low-resolution video content. To tailor their implementations to high-resolution video content demands special efforts. 38 Figure 4.4: Visualization of video frames from a real-world dark video from LoLi-Phone dataset. 
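The frame-level fidelity scores reported in Tables 4.1 and 4.2 can be computed as in the following sketch, assuming a recent scikit-image (≥ 0.19); the temporal metrics AB(Var) and MABD follow the definitions in [100] and [54] and are omitted here.

```python
# Sketch of the per-frame PSNR/SSIM evaluation behind Tables 4.1-4.2.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def video_scores(enhanced, reference):
    """enhanced, reference: equal-length lists of H x W x 3 uint8 frames."""
    psnr = [peak_signal_noise_ratio(r, e, data_range=255)
            for e, r in zip(enhanced, reference)]
    ssim = [structural_similarity(r, e, channel_axis=-1, data_range=255)
            for e, r in zip(enhanced, reference)]
    return float(np.mean(psnr)), float(np.mean(ssim))
```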
fig:visual_real 4.3.4 ComputationalComplexity subsec:flop In this section, we calculate the runtime and FLOPs (FLoating Point Operations) ofSALVE and prior work to offer a comparison on the computational complexity. Table 4.3 shows the runtime comparison between different methods on CPU and GPU resources. We measure the average runtime of different methods for enhancing an RGB frame of size 530× 942 on the CPU resource of Intel Xeon 6130 and the GPU resource of NVIDIA Tesla V100. Since LIME and Dual only have CPU implementations, their GPU runtimes are not mentioned in the table. Table 4.3 shows that SALVE’s runtime is better than LIME, Dual, NATLE and DRP. Specifically, SALVE is more than 2500 times faster than the self-supervised DRP method. Table 4.4 compares the numbers of FLOPs per pixel among StableLLVE, SDSD, NATLE andSALVE. We use a FLOP counter tool ∗ [124] for PyTorch to calculate FLOPs of StableLLVE and SDSD. We did not find ∗ https://github.com/sovrasov/flops-counter.pytorch 39 Table 4.1: Quantitative comparison for enhancing clean dark videos. The best scores are indicated inbold. tab:clean Method PSNR↑ SSIM↑ AB(Var)↓ MABD↓ LIME [44] 17.36 0.7386 9.65 0.37 Dual [149] 18.12 0.8283 2.13 0.07 MBLLEN [100] 18.41 0.8100 77.24 1.95 RetinexNet [135] 19.78 0.8353 1.32 0.09 SID [17] 22.95 0.9428 4.93 0.43 DRP [90] 6.89 0.3160 6.73 0.52 NATLE [4] 26.70 0.9127 2.04 0.03 MBLLVEN [100] 24.50 0.9482 1.79 0.80 SMOID [54] 24.85 0.9472 1.30 0.17 SFR [32] 23.81 0.9413 2.14 0.11 BLIND [76] 22.87 0.9344 8.66 0.43 StableLLVE [143] 24.07 0.9483 1.96 0.05 SDSD [132] 25.09 0.8783 0.98 0.01 SALVE (Ours) 28.85 0.9225 1.47 0.006 a tool to measure the number of FLOPs of (non-PyTorch) LIME, Dual, and DRP methods. For NATLE and SALVE, we calculate their FLOP numbers manually [25]. Table 4.4 shows that SALVE has a significantly lower number of FLOPs than StableLLVE and SDSD. Our explanation for the lower runtime of these two methods in Table 4.3 is that their implementations in PyTorch are very efficient. In contrast, SALVE uses several libraries including SciPy in most of the calculations. The latter is not as efficient as PyTorch. As a result, SALVE has a slightly longer runtime despite its lower number of FLOPs. 40 Table 4.2: Quantitative comparison for enhancing noisy dark videos. The best scores are indicated inbold. tab:noisy Method PSNR↑ SSIM↑ AB(Var)↓ MABD↓ LIME [44] 16.83 0.4567 8.29 0.33 Dual [149] 18.38 0.6073 2.14 0.22 MBLLEN [100] 18.38 0.7982 78.76 1.93 RetinexNet [135] 19.56 0.7475 1.45 0.09 SID [17] 22.93 0.9253 4.03 0.39 DRP [90] 5.55 0.4107 15.37 0.34 NATLE [4] 25.55 0.8237 1.48 0.06 MBLLVEN [100] 23.08 0.8839 2.81 1.02 SMOID [54] 23.42 0.9212 0.82 0.17 SFR [32] 22.82 0.9299 2.29 0.12 BLIND [76] 22.94 0.9174 7.86 0.33 StableLLVE [143] 24.01 0.9305 3.00 0.10 SDSD [132] 22.27 0.8051 1.35 0.03 SALVE (Ours) 27.06 0.8202 1.18 0.01 4.3.5 UserStudy subsec:user To further demonstrate the effectiveness of our method, we conduct a user study with 31 participants. In this study, we have 10 blind A/B tests between our method and prior works. At each time, only 2 videos are shown to the user. The 10 videos are randomly selected for this study. Each of the five prior work appears two times in the study. We show the results of this study in Figure 4.5. As seen, depending on the comparison baseline, between 87% to 100% of users prefer our enhanced videos over prior work. 
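As a reference for the complexity figures in Table 4.4, the per-forward-pass cost of the PyTorch baselines can be measured with the counter tool cited in Sec. 4.3.4 (the ptflops package). The sketch below uses a small stand-in model in place of StableLLVE or SDSD; note that the tool reports multiply-accumulate counts, and their mapping to the FLOPs in the table is an assumption here.

```python
# Sketch of counting per-forward-pass complexity with the cited counter tool.
import torch
from ptflops import get_model_complexity_info

model = torch.nn.Sequential(             # stand-in for StableLLVE / SDSD
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
)
macs, params = get_model_complexity_info(
    model, (3, 530, 942), as_strings=True, print_per_layer_stat=False
)
print(macs, params)
```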
4.3.6 AblationStudy subsec:ablation An ablation study was conducted in [4] to show the effectiveness of α ,β and denoising of b R on the final enhanced image. Here, we study the effect of these three parameters on the future enhanced frames in 41 Table 4.3: Average runtime (in seconds) comparison per RGB frame of size530× 942 pixels. tab:video_runtime Method CPU GPU LIME [44] 6.60 N/A Dual [149] 13.20 N/A DRP [90] 2760 2728 StableLLVE [143] 0.063 0.057 SDSD [132] 0.307 0.261 NATLE [4] 3.14 2.90 SALVE (Ours) 0.980 0.322 Table 4.4: FLOPs comparison per pixel. tab:flop Method FLOPs Ratio StableLLVE [143] 51.19 K 7.20× SDSD [132] 233.45 K 32.83× NATLE [4] 10.85 K 1.52× SALVE (Ours) 7.11 K 1× Figure 4.6. We see from the figure that cancelling parameter α and/or disabling the denoising operation results in noisy textures. Setting parameterβ =0 makes the edges of objects blurry and degrades texture preservance quality of the method. As mentioned in Eq. (4.17), the window size of ridge regression is5× 5. Here, we analyze the effect of the window size on enhanced frames. Figure 4.7 shows this analysis. A small window size results in artifacts in the enhanced frame (e.g., some pixels on the street light become red instead of maintaining the black color). Noisy patterns can be reduced as the window size increases. Overall, we strike a good balance between visual quality and the cost of ridge regression setting the window size to5× 5. A close look reveals noise patterns, blurry textures or artifacts in Figures 4.6 and 4.7. † † High-resolution versions of Figures 6 and 7 are available onbit.ly/3TXGxRh 42 Figure 4.5: User study results, where we show user’s preference in pair-wise comparison between our method and five benchmarking methods. fig:userstudy 4.4 Conclusion sec:chapter4 conclusion In this chapter, we proposed SALVE, a new method for low-light video enhancement. Our new self- supervised learning method is fully adaptive to the test video. SALVE enhances a few keyframes of the test video, learns a mapping from low-light to enhanced keyframes, and finally uses the mapping to enhance the rest of the frames. This approach enables SALVE to work without requiring (paired) training data. Furthermore, we conducted a user study and observed that participants preferred our enhanced videos in at least 87% of the tests. Finally, we performed an ablation study to demonstrate the contribution of each component of SALVE. 43 (a)α =0.0,β =3, denois- ing enabled. (b) α = 0.015, β = 0, de- noising enabled. (c) α = 0.015, β = 3, de- noising disabled. (d) α = 0.015, β = 3, de- noising enabled. Figure 4.6: Effect of parameters α andβ as well as the denoising operation on the enhanced frame’s quality. fig:ablation1 (a)1× 1 window (b)3× 3 window (c)5× 5 window (d)7× 7 window Figure 4.7: Effect of the regressor’s window size on the quality of enhanced frames. fig:ablation2 44 Chapter5 PAGER:ProgressiveAttribute-GuidedExtendableRobustImage Generation chapter5 5.1 Introduction sec:intro Unconditional image generation has been a hot research topic in the last decade. In image generation, a generative model is trained to learn the image data distribution from a finite set of training images. Once trained, the generative model can synthesize images by sampling from the underlying distribution. GANs have been widely used for unconditional image generation with impressive visual quality in recent years [39]. 
Despite the evident advantages of GANs, their training is a non-trivial task: GANs are sensitive to training hyperparameters and generally suffer from convergence issues [49]. Moreover, training GANs requires large-scale GPU clusters and an extensive number of training data. [91]. Limited training data usually cause the discriminator to overfit and the training to diverge [63]. These concerns have led to the development of improved GAN training methods [41], techniques for stabilized training with fewer data [63, 91], or non-adversarial approaches [49]. Yet, the great majority of existing genera- tion techniques utilize deep learning (DL), a method for learning deep neural networks, as the modeling backbone. 45 A neural network is typically trained using a large corpus of data over long episodes of iterative up- dates. Therefore, training a neural network is often a time-consuming and data-hungry process. To ensure the convergence of deep neural networks (DNNs), one has to carefully select (or design) the neural net- work architecture, the optimization objective (or the loss) function, and the training hyper-parameters. Some DL-based generative models like GANs are often specifically engineered to perform a certain task. They cannot be easily generalized to different related generative applications. For example, the archi- tectures of these neural networks for unconditional image generation have to be re-designed for image super-resolution or attribute-guided image generation. Last but not the least, due to the non-linearity of neural networks, understanding and explaining their performance is a standing challenge. To address the above-mentioned concerns, this chapter presents an alternative approach for uncondi- tional image generation based on successive subspace learning (SSL) [71, 72, 73, 75]. The resulting method, called progressive attribute-guided extendable robust image generative (PAGER) model, has several advan- tages, including mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendibility to conditional image generation. PAGER consists of three modules: 1) core generator, 2) resolution enhancer, and 3) quality booster. The core generator learns the distribution of low-resolution images and performs unconditional image generation. The resolution enhancer increases image resolution via conditional generation. Finally, the quality booster adds finer details to generated images. To demonstrate the generative performance of PAGER, we conduct extensive experiments on MNIST, Fashion-MNIST, and CelebA datasets. We show thatPAGER can be trained in a fraction of the time required for training DL based models and still achieve a similar generation quality. We then demonstrate the robustness ofPAGER to the training size by reducing the number of training samples. Next, we show that PAGER can be used in image super resolution, high-resolution image generation, and attribute-guided face image generation. In particular, the modular design of PAGER allows us to use the conditional generation 46 modules for image super resolution and high-resolution image generation. The robustness of PAGER to the number of training samples enables us to train multiple sub-models with smaller subsets of data. As a result, PAGER can be easily used for attribute-guided image generation. The rest of this chapter is organized as follows. The PAGER method is proposed in Sec. 5.2. Experi- mental results are reported in Sec. 5.3. 
Extendability and applications of PAGER are presented in Sec. 5.4. Finally, concluding remarks and possible future extensions are given in Sec. 5.5. 5.2 ProposedPAGERMethod sec:method ThePAGER method is presented in this section. First, our research motivation is given in Sec. 5.2.1. Then, an overview on PAGER and its three modules are described in Sec. 5.2.2. Finally, our attribute-guided face image generation is elaborated in Sec. 5.2.3. 5.2.1 Motivation subsec:motivation A generative model learns the distribution of the training data in the training phase. During the genera- tion phase, samples are drawn from the distribution as new data. To improve the accuracy of generative image modeling, gray-scale or color images should be first converted into dimension-reduced latent rep- resentations. After converting all training images into their (low-dimensional) latent representation, the distribution of the latent space can be approximated by a multivariate Gaussian distribution. For learning the latent representation, most prior work adopts GAN-, VAE-, and diffusion-based generative models; they train neural networks that can extract latent representations from an image source through a series of nonlinear transformations. Similarly, we need to learn such a transformation from the image space to the latent representation space. In this work, we utilize an SSL pipleline, rather than neural networks, to achieve the transformation to the latent representation space. The SSL pipeline consists of consecutive Saab transforms. In essence, 47 Figure 5.1: Example distributions from RGB pixels (left block) and Saab transforms (right block). The top figures correspond to single vector dimensions ( I 0 ...I 2 in RGB andX 0 ...X 2 in Saab domains). The bot- tom figures correspond to joint distributions. Distributions are extracted from the first three components of CelebA images. fig:gmm it receives an image, denoted by I ∈ R w× h× c , and converts it into a latent feature vector, denoted by X ∈ R n , where w, h and c are the pixel numbers of the width, height and color channels of an image whilen is the dimension of the latent vector. For the remainder of this chapter, we refer to the latent space obtained by SSL as the core space. The Saab transform utilizes mean calculation and PCA computation to extract features from its input. Due to the properties of PCA, the i-th and j-th components in the core space are uncorrelated fori̸=j. This property facilitates the use of Gaussian priors for generative model learning over the core space. Fig. 5.1 illustrates the distributions of input image pixels (I) and Saab outputs (X). In this example, we plot the distributions of the first, second and third components of I (i.e., the RGB values of the upper-left pixel of all source images) andX (i.e., the Saab transform coefficients). The RGB components are almost uniformly distributed in the marginal probability. They are highly correlated as shown in the plot of joint distributions. In contrast, Saab coefficients are close to the Gaussian distribution and they are nearly uncorrelated. While the distributions of one- and two-dimensional components of X are very close to Gaussians, the distribution of higher-dimensional vectors might not be well modeled by one multivariate Gaussian distribution. For this reason, we employ a mixture of Gaussians to represent the distribution of the core space. 48 Figure 5.2: Overview of PAGER generation method. 
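Concretely, once training images are mapped into the core space, the mixture model can be fit and sampled with standard tools. The sketch below assumes scikit-learn and uses a random stand-in for the (M × n) matrix of Saab coefficients (n = 48 for a 4×4×3 core); it illustrates the modeling choice rather than the exact PAGER code.

```python
# Fit a diagonal-covariance Gaussian mixture to core-space features and sample from it.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 48))                   # stand-in for Saab coefficients

gmm = GaussianMixture(n_components=500, covariance_type="diag", reg_covar=1e-4)
gmm.fit(X)                                          # EM, as in Eq. (5.1)

core_samples, _ = gmm.sample(16)                    # 16 new core-space vectors
# Each sample is then mapped back to pixels by the inverse Saab transform and
# passed to the resolution enhancer and quality booster.
```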
fig:overview 5.2.2 SystemOverview subsec:overview An Overview of thePAGER generation method is shown in Fig. 6.1. PAGER is an unconditional generative model with a progressive approach in image generation. It starts with unconditional generation in a low- resolution regime, which is performed by the core generator. Then, it sequentially increases the image resolution and quality through a cascade of two conditional generation modules: the resolution enhancer and the quality booster. 5.2.2.1 Module1: CoreGenerator sec:generation The core generator is the unconditional generative module inPAGER. Its goal is to generate low-resolution (e.g., 4× 4× 3) color images. This module is trained with images of shape 2 d × 2 d × 3 (e.g., d = 2). It applies consecutive Saab transforms on input images{I i } M i=1 using PixelHop++ structure [22], ultimately converting images inton-dimensional vectorsX ∈R n (n=2 d × 2 d × 3) in core space. The goal of the core generator is to learn the distribution of{X i } M i=1 . We useX to denote a random variable within{X i } M i=1 , 49 representing observed samples in core space. Let P(X) be the underlying distribution ofX ∈ R n . The generation coreG attempts to approximate the distributionP(X) with a distributionG(X). DL-based methods utilize iterative end-to-end optimization of neural networks to achieve this objec- tive. InPAGER, we model the underlying distribution of the core space using the Gaussian Mixture Model (GMM), which is highly efficient in terms of training time. This is feasible since we use SSL to decouple random variables, which we illustrated in Sec. 5.2.1. The conjunction of multi-stage Saab (SSL) features and GMMs can yield a highly accurate density modeling. Formally, the GMM approximation ofG(X) is defined as follows: G(X)= K X k=1 p k N(X,µ k ,Σ k ), (5.1) eq:GMM eq:GMM whereN(X,µ k ,Σ k ) is a multi-variate normal distribution with meanµ k and diagonal covariance matrix Σ k , and p k is a binary random variable. We have p k = 1 with probability P k , p k = 0 with probability (1− P k ) and P K k=1 P k =1. In other words, only one of theK Gaussian models will be selected at a time, and the probability of selecting thek-th Gaussian model isP k in such a GMM. The parameters of the GMM can be determined using the Expectation Maximization (EM) algorithm [113]. Once such a GMM model is obtained, one can draw a sample,X, randomly and proceed to Modules 2 and 3. The need for Modules 2 and 3 is explained below.G(X) is learned from observationsX i ,i=1··· M. When the dimension,n, of the core space is large, estimatingG(X) becomes intractable and the approxi- mation accuracy of GMM would drop. For this reason, the unconditional generation process is constrained to a low-dimensional space. Then, we employ conditional generative models (modules 2 and 3) to further increase image resolution and quality. 50 5.2.2.2 Module2: ResolutionEnhancer subsec:module_2 We represent imageI d as the summation of its DC and AC components: eq:DC_AC_decomposition I d = DC d +AC d , (5.2) DC d = U(I d− 1 ), (5.3) whereI d is an image of size2 d × 2 d ,U is the Lanczos image interpolation operator,DC d is the interpolated image of size2 d × 2 d andAC d is the residual image of size2 d × 2 d . The above decoupling of DC and AC components of an image allows to define the objective of the resolution enhancer. It aims to generate the residual imageAC d conditioned onDC d . In Fig. 6.1, a multi-stage cascade of resolution enhancers is shown. 
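The DC/AC decoupling in Eqs. (5.2)–(5.3) can be written directly as below, where Pillow's Lanczos resampling stands in for the interpolation operator U; this is an illustrative sketch, not the exact implementation.

```python
# DC/AC decomposition of Eqs. (5.2)-(5.3): Lanczos-upsample the lower-resolution
# image (DC) and keep the residual (AC). Assumes Pillow; inputs are uint8 arrays.
import numpy as np
from PIL import Image

def dc_ac_decompose(I_d, I_d_minus_1):
    """I_d: 2^d x 2^d x 3; I_d_minus_1: 2^(d-1) x 2^(d-1) x 3 (both uint8)."""
    target = (I_d.shape[1], I_d.shape[0])           # PIL expects (width, height)
    dc = Image.fromarray(I_d_minus_1).resize(target, Image.LANCZOS)
    dc = np.asarray(dc, dtype=np.float32)
    ac = I_d.astype(np.float32) - dc                # residual component
    return dc, ac
```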
The detail of a representative resolution enhancer is highlighted in the lower subfigure. To train the resolution enhancer, we first decouple the DC and AC of training samples. Then, we extract SSL features from the DC and build a GMM model withK components, denoted byG DC . By this method, we learn a distribution of the DC at a certain image resolution. Note that each DC from a training image belongs to one of the Gaussian models in G DC . Therefore, DCs (and their associated AC) are clustered intoK classes usingG DC . We gather the AC of each class and build a corresponding GMM, denoted by G AC,k wherek∈{1,··· ,K}. In total, we learnK +1 GMMs:{G DC , G AC,1 ... G AC,K }. At the test time, the resolution enhancer receives the low resolution imageI d− 1 , and upsamples it to obtain the interpolated DC, i.e., DC d = U(I d− 1 ). Then, the resolution enhancer converts the DC to its SSL features and classifies it into one of the K clusters usingG DC . Mathematically, we have X DC = SSL(DC d ), (5.4) y = arg k max{N(X DC ,µ k ,Σ k )} K k=1 , (5.5) 51 whereN(X DC ,µ k ,Σ k ) is the probability score ofX DC according to thek-th component ofG DC , and the classification label y is the maximizer index. In other words, the resolution enhancer identifies a cluster of samples that are most similar to DC d . Next, the resolution enhancer draws a sample from the AC distribution corresponding to classy: X AC ∼ G AC,y (X AC ). (5.6) With the above two-step generation, the resolution enhancer generatesX AC conditioned onX DC . After- wards,X AC is converted to the RGB domain using the inverse SSL transform: AC d = SSL − 1 (X AC ). (5.7) eq:ac_inverse eq:ac_inverse The computed AC component is masked and added to the DC to yield the higher resolution image via eq:mask I d = DC d + d AC d , (5.8) d AC d = M(DC d )⊙ AC d , (5.9) where M(DC d ) is a mask and ⊙ denotes element-wise multiplication. The mask is derived from the edge information obtained by the Canny edge detector [13]. The masking operation serves two objectives. First, it prevents details from being added to smooth regions of the DC component. Second, it suppresses unwanted noise. Once I d is generated, it is cropped into four non-overlapping regions, and each region goes through another resolution enhancement process. The process is recursively applied to each sub- region to further enhance image quality. In our experiments, we continue the recursion until a cropped window size of2× 2 is reached. 52 5.2.2.3 Module3: QualityBooster subsec:module_3 The right subfigure of Fig. 6.1 presents the quality booster module. It follows the resolution enhancer by adding detail and texture to the output of the resolution enhancer. It exploits the locally linear embedding (LLE) [119] scheme and adds extra residue values that are missed by the resolution enhancer. LLE is a well known method in building correspondence between two components in image super resolution [15, 58] or image restoration [50]. To design the quality booster, we decompose the training dataset, enhance the DC component, and compute the residuals as follows: I d = DC d +AC d , (5.10) E d = Enhancer(DC d ), (5.11) R d = I d − E d , (5.12) whereI d represents a2 d × 2 d training image,E d is the result of applying the enhancer module to the DC component of the image, andR d is the residual image. During training, the quality booster storesE i d and R i d , i = 1,··· ,M from M training samples. 
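Before turning to how the booster uses these stored pairs at generation time, the enhancer's two-step conditional sampling in Eqs. (5.4)–(5.6) can be sketched as follows, assuming scikit-learn mixtures fit as described above and a stand-in ssl() feature extractor; cluster selection here uses the mixture's posterior responsibilities, a close analogue of the likelihood maximization in Eq. (5.5).

```python
# Two-step conditional generation of the resolution enhancer (Eqs. (5.4)-(5.6)):
# classify the DC component into its most likely cluster of G_DC, then sample AC
# features from that cluster's mixture.
import numpy as np

def sample_ac_given_dc(dc_image, ssl, gmm_dc, gmm_ac):
    x_dc = np.asarray(ssl(dc_image)).reshape(1, -1)   # Eq. (5.4): SSL features of the DC
    y = int(gmm_dc.predict(x_dc)[0])                  # Eq. (5.5): most likely DC cluster
    x_ac, _ = gmm_ac[y].sample(1)                     # Eq. (5.6): draw AC features
    return x_ac[0], y
```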
In generation, the quality booster receives image E d and uses the LLE algorithm to estimate the residual image for imageE d based onE i d andR i d from the training dataset. It approximates the residual image with a summation of several elements within R i d . Readers are referred to [119] for details of LLE computation. Similar to the enhancer module, the computedR i d is masked and added toE d to boost its quality. Although the LLE in the quality booster module uses training data residues during inference, it does not affect the generation diversity for two reasons. First, the quality booster only adds some residual textures to the image. In other words, it has a sharpening effect on edges. Since its role is limited to adding residuals and sharpening, it does not have a significant role in adding or preventing diversity. Second, the 53 Figure 5.3: Examples of attribute-guided generated images for CelebA with various attribute combinations. fig:attributes weight prediction mechanism of LLE provides a method to combine various patch instances and obtain diverse patterns. 5.2.3 Attribute-GuidedFaceImageGeneration subsec:attributes In attribute-guided face image generation, the goal is to synthesize face images that have certain properties. LetA∈{− 1,+1} T denote a set ofT binary attributes. The goal is to synthesize an image that satisfies a queryq∈{− 1,0,+1} T , where -1, 0, +1 denote negative, don’t care, and positive attributes. For instance, if the attribute set is {male,smiling}, the queryq=[− 1,+1] requests an image of a female smiling person, and the queryq=[0,− 1] request an image (of any gender) that is not smiling. Without loss of generality, we explain the attribute-guided generation process with T = 7. The at- tributes selected from attribute labels in CelebA dataset include ‘gender’, ‘smiling’, ‘blond hair’, ‘black hair’, 54 ‘wearing lipstick’, ‘bangs’ and ‘young’. Given these seven binary attributes, there are 2 7 = 128 subsets of data that correspond to each unique set of selected attributes. However, some of the attribute combi- nations might not be abundant in the training data due to the existing correlation between the attributes. For instance, ‘wearing lipstick’, ‘bangs’, and ‘gender’ are highly correlated. Thus, instead of considering all 128 combinations, we partition the attributes of training data intoK subsets using k-means clustering (we setK =10 in our experiments). Based on the attribute clusters, we createK data subsets and train a separate PAGER model for each subset. At generation time, the goal is to synthesize a sample with a given attribute set, q ∈ {− 1,0,+1} 7 . To determine which of the 10 models best represents the requested attribute set, we compute the Cosine distance ofq to each of the cluster centers and select the model that gives the minimum distance. Then, we draw samples from the corresponding model. Fig. 5.3 shows generated images corresponding to 15 different attribute vectors. We see that the attribute-based generation technique can successfully synthesize images with target attributes while preserving diversity and fidelity. 5.3 Experiments sec:experiments 5.3.1 ExperimentalSetup We perform experiments on three datasets: MNIST, Fashion-MNIST, and CelebA. They are commonly used for learning unconditional image generative models. We briefly explain the experimental settings of PAGER for each dataset below. CelebA. The dataset is a set of colored human face images. Suppose that there are2 d × 2 d pixels per image. 
To derive Saab features and their distributions, we applyd-stage cascaded Saab transforms. At each stage, the Saab filter has a spatial dimension of 2× 2 with stride2. The number of GMM components in the core generator is 500. The core generator synthesizes color images of size 4× 4. Higher resolution 55 images are generated conditioned on the previous resolution with the resolution enhancer and the quality booster modules in cascade (4× 4→8× 8→16× 16→32× 32). The resolution enhancer has100 GMM components for the DC part and 3 GMM components for the AC part at each stage. LLE in the quality booster module is performed using2 nearest neighbors. MNISTandFashion-MNIST. The two datasets contain gray-scale images of digits and clothing items, respectively. The generation pipeline for these datasets is similar to CelebA except that the core generator synthesizes16× 16 padded gray-scale images for each of the10 classes. The16× 16 images are converted to 32× 32 with a single stage of resolution enhancer and quality booster. Finally, they are cropped to 28× 28. 5.3.2 EvaluationofGeneratedImageQuality SubjectiveEvaluation. We show image samples of resolution32× 32 generated by PAGER for MNIST, Fashion-MNIST and CelebA in Fig. 5.4. Generated images learned from MNIST represent the structure of digits accurately and with rich diversity. Images generated from Fashion-MNIST show diverse examples for all classes with fine details and textures. Generated images for CelebA are semantically meaningful and with fine and diverse details in skin tone, eyes, hair and lip color, gender, hairstyle, smiling, lighting, and angle of view. Fig. 5.5 compares generated images by GenHop [83], which is an earlier SSL-based method, and our PAGER for the CelebA dataset. To be compatible with GenHop, we perform comparison on generated images of resolution 32× 32. As seen, images generated by PAGER are more realistic with finer details than GenHop. Next, we compare images generated by our method and those obtained by prior DL-based generative models in Fig. 5.6. We resort our comparison to GAN [39], WGAN [2], LSGAN [101], WGAN-GP [41], GLANN [49] and Diffusion-based model [48] of resolution 64× 64. Note that these methods along with 56 the selected resolution are ones that we could find over the Internet so as to allow a fair comparison to the best available implementations. Specifically, we take generated images of GAN, WGAN and LSGAN from celeba-gan-pytorch github ∗ . We take those of WGAN-GP from WGAN-GP-DRAGAN-Celeba-Pytorch github † . For the diffusion model, we take the pre-trained model from pytorch-diffusion-model-celebahq github ‡ , which generates samples of resolution256× 256. We resize generated samples to the resolution of 64× 64 to make them comparable with other methods. Fig. 5.6 compares generated images by prior DL-based generative models and our PAGER for the CelebA dataset. It can be seen that generated images of PAGER are comparable with those of prior DL-based methods. There are some noise patterns in our results. Their suppression is an interesting future research topic. ObjectiveEvalution. We use the Frechet Inception Distance (FID) [46] score to perform quantitative comparison of our method with prior art. FID is a commonly used metric to evaluate the performance of generative models. It considers both diversity and fidelity of generated images. 
We follow the procedure described in [98] to obtain the FID scores; an Inception neural network extracts features from a set of 10K generated images as well as another set of 10K real (test) images. Two multivariate Gaussians are fit to the extracted features from two sets separately. Then, the Frechet distance between their mean vectors and covariance matrices is calculated. A smaller FID score is more desirable as it indicates a better match between the synthesized and real test samples. The FID scores of various methods for MNIST, Fashion-MNIST and CelebA datasets are compared in Table 5.1. Methods in the first and second sections are both based on DL. Methods in the first section are adversarial generative models while those in the second section are non-adversarial. The results of the first and second sections are taken from [98] and [49], respectively. For the Diffusion model, we generated 10K samples using the pre-trained model from pytorch-diffusion-model-celebahq github § and measured ∗ https://github.com/joeylitalien/celeba-gan-pytorch † https://github.com/joeylitalien/celeba-gan-pytorch ‡ https://github.com/FengNiMa/pytorch_diffusion_model_celebahq § https://github.com/FengNiMa/pytorch_diffusion_model_celebahq 57 Table 5.1: Comparison of FID scores for MNIST, Fashion-MNIST and CelebA datasets. tab:fid_score Method MNIST Fashion CelebA MM GAN [39] 9.8 29.6 65.6 NS GAN [39] 6.8 26.5 55.0 LSGAN [101] 7.8 30.7 53.9 WGAN [2] 6.7 21.5 41.3 WGAN-GP [41] 20.3 24.5 30.0 DRAGAN [69] 7.6 27.7 42.3 BEGAN [9] 13.1 22.9 38.9 VAE [68] 23.8 58.7 85.7 GLO [10] 49.6 57.7 52.4 GLANN [49] 8.6 13.0 46.3 Diffusion [48] N/A N/A 48.8 GenHop [83] 5.1 18.1 40.3 PAGER (Ours) 9.5 19.3 43.8 the FID score. GenHop in Section 3 does not use a neural network backbone. Its results are taken from [83]. We see from Table 5.1 that the FID scores ofPAGER are comparable with those of prior generative models. In trainingPAGER model for Table 5.1, we used 100K training images from CelebA and 60K training images from MNIST and Fashion-MNIST with no augmentation. PAGER is still in its preliminary development stage. Although it does not outperform prior generative models in the FID score, it does have comparable performance in all three datasets, indicating its potential to be further improved in the future. In addition, PAGER has several other advantages to be discussed in the next subsection. 5.3.3 OtherPerformanceMetrics In this section, we study additional performance metrics: robustness to the number of training samples and training time. Robustness to training dataset sizes. Fig. 5.7 presents the FID score of PAGER and five DL-based generative models (MM GAN, LSGAN, WGAN, WGAN-GP and GLANN) when the number of training 58 Table 5.2: Training time comparison. tab:runtime Method CPU GPU MM GAN [39] 93m14s 33m17s LSGAN [101] 1426m23s 45m52s WGAN [2] 48m11s 25m55s WGAN-GP [41] 97m9s 34m7s GLO [10] 1090m7s 139m18s GLANN [49] 1096m24s 142m19s GenHop [83] 6m12s N/A PAGER (Ours) 4m23s 2m59s samples is set to 1K, 2K, 5K, 10K, 20K and 60K for MNIST dataset. To produce the FID scores of the GAN- based related work, we use the open-source implementation by PyTorch-GAN github ¶ . For GLANN, we use the implementation provided by the authors. Since GLANN is not trained with less than 10K samples, its FID scores for 1K, 2K and 5K samples are not available. It is worth noting that the FID scores for 60K training samples of some prior work in Fig. 5.7 are different than those in Table 5.1. 
This happens because some of prior generative models (e.g., MM GAN, LSGAN, and WGAN) are too sensitive to training hyper- parameters and/or data augmentation [98]. The scores reported in Fig. 5.7 are the best FID scores obtained using the default hyper-parameters in the open-source library. We see from Fig. 5.7 that PAGER is least affected by the number of training samples. Even with the number of training samples as small as 1K, PAGER has an FID score of 16.2 which is still better than some prior works’ original FID scores presented in Table 5.1, such as WGAN-GP, VAE and GLO. Among prior works, GLANN is less sensitive to training size but cannot be trained with less than 10K samples. ComparisononTrainingTime. The training time ofPAGER is compared with prior work in Table 5.2 on two platforms. • CPU (Intel Xeon 6130): The CPU training time of PAGER is slightly more than 4 minutes, which is significantly less than all other methods as shown in Table 5.2. The normalized CPU training times ¶ https://github.com/eriklindernoren/PyTorch-GAN 59 of various DL-based methods againstPAGER are visualized in the left subfigure of Fig. 5.8. PAGER is 11× faster than WGAN and325× faster than LSGAN. • GPU (NVIDIA Tesla V100): The GPU training time ofPAGER is around 3 minutes, which is again less than all other methods as shown in Table 5.2. The normalized GPU run times of various methods are also visualized in the right subfigure of Fig. 5.8. PAGER is9× faster than WGAN and48× faster than GLANN. JointConsiderationofFIDScoresandTrainingTime. To provide a better picture of the tradeoff between training time and FID score, we present both of these metrics in Fig. 5.9. On this figure, points that are closer to the bottom left are more desirable. As seen, PAGER significantly outperforms prior art when considering FID scores and training time jointly. 5.3.4 Discussion Based on the above experimental results, we can draw the following conclusions. • Qualityimagegeneration. The FID scores ofPAGER are comparable with those of prior DL-based image generation techniques on common datasets. This indicates that PAGER can generate images of similar quality to prior art. • Efficienttraining. PAGER can be trained in a fraction of the time required by DL-based techniques. For example, our MNIST generative model is trained in 4 minutes on a personal computer’s CPU while the fastest prior work demands 25-minute training time on an industrial GPU. The efficiency of PAGER is achieved by the development of a non-iterative training scheme. CPU-based efficient training implies smaller energy consumption and carbon footprint than GPU-based DL methods. This is a major advantage of PAGER. 60 • Robustness to training sample size. PAGER can still yield images of reasonable quality even when the number of training samples is drastically reduced. For example, in Fig. 5.10 we show that the number of training samples can be reduced from 100K to 5K with only a negligible drop in the generated image quality for the CelebA dataset. • ImprovementsoverpriorSSL-basedgenerativemodel-GenHop. While PAGER is the second SSL-based generative model, it is worthwhile to review its improvements over the prior SSL-based generative model known as GenHop [83]. First, the great majority of CelebA generated samples by GenHop suffer from over-smoothing which blurs details and even fades out the facial components in many samples as shown in Fig. 5.5. This is because GenHop heavily relies on LLE which has a smoothing effect and limits synthesis diversity. 
On the other hand, PAGER generates diverse samples with visible facial components. Note that PAGER only uses LLE to add residuals to already generated samples. It serves as a sharpening technique and does not affect synthesis diversity. Second, GenHop limits the resolution of generated samples to 32×32. This prevents GenHop from being extended to high-resolution image generation or other generative applications such as super-resolution. Third, GenHop takes longer to train than PAGER and is not implemented for GPU training. Fourth, GenHop only conducts unconditional image generation, while PAGER has further applications such as attribute-guided image generation and super-resolution.

5.4 Comments on Extendability

In this section, we comment on another advantage of PAGER. That is, PAGER can be easily tailored to other contexts without re-training. We elaborate on three applications at the conceptual level.
• Super Resolution. PAGER's two conditional image generation modules (i.e., the resolution enhancer and the quality booster) can be directly used for image super resolution with no additional training. These modules enhance the image resolution from an arbitrary dimension $2^d \times 2^d$ to $2^{d+k} \times 2^{d+k}$, where $k$ is the number of consecutive resolution enhancer and quality booster modules needed to achieve this task. Fig. 5.11 shows several examples starting from different resolutions and ending at resolutions 32×32, 64×64 and 128×128.
• Attribute-guided Face Image Generation. To generate human face images with certain characteristics (e.g., a certain gender, hair color, etc.), we partition the training data based on the underlying attributes and construct subsets of data (Sec. 5.2.3). Each subset is used to train a different core generator that represents the underlying attributes. Examples of such attribute-guided face generation are presented in Figure 5.3. The feasibility of training PAGER using a subset of the training data is a direct result of its robustness to the training dataset size, which was empirically evaluated in Fig. 5.10. The mean FID score of CelebA-like image generation changes by only 6% when the number of training samples is reduced from 100K to as low as 5K.
• High-Resolution Image Generation. PAGER can be easily extended to generate images of higher resolution. To achieve this objective, we can place more resolution enhancer and quality booster units in cascade to reach the desired resolution. We present several generated CelebA-like samples of resolution 128×128 and 256×256 in Fig. 5.12. This gives some evidence that the current design of PAGER is extendable to higher-resolution generation. On the other hand, to generate results comparable with state-of-the-art generative models such as ProGAN [62], StyleGAN [65, 66, 64], VQ-VAE-2 [110] or diffusion-based models [27, 47], we need to further optimize our method. Further improvements to PAGER could lead to enhanced quality of generated images at higher resolutions.

5.5 Conclusion and Future Work

A non-DL-based generative model for visual data generation called PAGER was proposed in this work. PAGER adopts the successive subspace learning framework to extract multi-scale features and learns unconditional and conditional probability density functions of the extracted features for image generation. The unconditional probability model is used in the core generator module to generate low-resolution images to control the model complexity.
Two conditional image generation modules, the resolution enhancer and the quality booster, are used to enhance the resolution and quality of the generated images progressively. PAGER is mathematically transparent due to its modular design. We showed that PAGER can be trained in a fraction of the time required by DL-based models. We also demonstrated PAGER's robustness in generation quality as the number of training samples decreases. We then showed the extendability of PAGER to image super resolution, attribute-guided face image generation, and high-resolution image generation.
The model size of PAGER is primarily determined by the size of the quality booster. The number of parameters is about 46 million. The large quality booster size is due to the use of LLE in predicting residual details. We do not optimize the LLE component in the current implementation. As a future topic, we would like to replace it with a lightweight counterpart for model size reduction. For example, we might replace LLE with GMMs to learn the distribution of residual textures and reduce the model size significantly. With these techniques, we aim to reduce the model size to less than 10 million parameters.

Figure 5.4: Examples of PAGER generated images for MNIST (top), Fashion-MNIST (middle), and CelebA (bottom) datasets.

Figure 5.5: Example images generated by PAGER and GenHop for the CelebA dataset.

Figure 5.6: Samples generated by PAGER and prior DL-based generative models for the CelebA dataset.

Figure 5.7: Comparison of FID scores of six benchmarking methods with six training sizes (1K, 2K, 5K, 10K, 20K and 60K) for the MNIST dataset. The FID scores of PAGER are significantly less sensitive with respect to smaller training sizes.

Figure 5.8: Comparison of normalized training time, where each bar represents the training time of a DL-based model corresponding to those shown in Table 5.2, normalized by the training time of PAGER.

Figure 5.9: Comparison of joint FID scores and GPU training time of PAGER with DL-based related work in the generation of MNIST-like images. PAGER provides the best overall performance since it is closest to the bottom-left corner.

Figure 5.10: Comparison of PAGER's FID scores with six training sample sizes for the CelebA, Fashion-MNIST and MNIST datasets. We see that the FID scores do not increase significantly even when the number of training samples is as low as 5K for CelebA and 1K for MNIST and Fashion-MNIST.

Figure 5.11: Illustration of PAGER's application in image super-resolution for CelebA images: the two top rows start from resolution 4×4 (left block) and 8×8 (right block) and end at resolution 32×32; the two middle rows start from resolution 8×8 (left block) and 16×16 (right block) and end at resolution 64×64; the two bottom rows start from resolution 16×16 (left block) and 32×32 (right block) and end at resolution 128×128.

Figure 5.12: Examples of generated CelebA-like images of resolution 128×128 (a) and 256×256 (b).

Chapter 6
AttGMM: A GMM-based Method for Facial Attribute Editing

6.1 Introduction

Facial attribute editing aims to manipulate face images to change a certain attribute.
Unlike discriminative tasks such as face recognition [121, 125] or facial attribute prediction [31, 95], facial attribute editing is an unsupervised generative task, since paired images with and without the desired attribute are difficult or impossible to collect. During the past decade, with the great advances in image generative modeling, facial attribute editing has been tackled as a conditional generation task where an input image and a target attribute put constraints on the generation process. GAN-based methods have shown high-quality results in facial attribute editing [24, 45, 92]. Although these methods generate pleasant results, training GANs is a non-trivial task. Convergence issues are very common in training GANs since they are generally very sensitive to training hyperparameters. Another concern of training GANs is the need for substantial computational resources and extensive training data. Without enough training data, the discriminator is prone to overfitting, which leads to divergence of the training. While improved GAN training methods [41] and techniques for stabilized training with fewer data [63, 91] have been proposed to address these concerns, the use of non-adversarial approaches for facial attribute editing is less explored.
In recent years, Probabilistic Diffusion Models (DPMs) have become another mainstream approach for generative tasks. Although DPMs are generally less sensitive during training than GANs, they usually have larger model sizes and longer training and inference times due to their many time steps. Another issue with vanilla DPMs is the lack of semantic meaning in the latent variables, which limits the usage of their representations for other tasks such as attribute editing. To tackle this issue, some methods have been proposed to combine DPMs with auto-encoders to extend their application to image manipulation [108].
To address the above issues, we explored an alternative approach to the facial attribute editing task. Gaussian Mixture Models (GMMs) have been used in a few methods for full-size unconditional image generation [5, 114]. Although the results generated by GMM-based models suffer from unexpected artifacts such as over-smoothing, they show other advantages such as modeling the full data distribution [114], lower training time, and robustness to changes in the number of training samples [5]. In this work, we extend the application of GMM-based generative models to facial attribute editing for the first time. Our proposed method, named AttGMM, is the first to conduct facial attribute editing without utilizing deep neural networks. AttGMM demonstrates the potential of GMM-based generative models for conditional generation and the feasibility of using these models for image manipulation. AttGMM has a great advantage in lowering the computational cost in comparison with prior work.
AttGMM conducts facial attribute editing in three steps. First, it reconstructs the given image using a posterior probability in the latent space. Then, it manipulates the given image in the latent space so that it possesses a certain attribute. To tackle the blurriness issue of GMM-based generative models, AttGMM adds a third step that exploits the difference between the results of the first two steps to generate a refined sample having the target attribute. To demonstrate the performance of AttGMM, we evaluate attribute editing accuracy and reconstruction quality on 12 different attributes from the CelebA dataset and compare them with those of prior work.
Moreover, we offer a comparison of the computational complexity of AttGMM and prior work.

Figure 6.1: Overview of the AttGMM facial attribute editing method.

In the rest of this chapter, we present the AttGMM method in Section 6.2. Next, we go over the experimental results in Section 6.3. Finally, we give concluding remarks, limitations and possible future investigations in Section 6.4.

6.2 Proposed Method

An overview of the AttGMM facial attribute editing method is shown in Fig. 6.1. AttGMM's design for facial attribute editing is based on two priors: first, Gaussian Mixture Models (GMMs) for image generation and, second, image reconstruction based on GMMs. In this section, we first explain these two priors in Sections 6.2.1 and 6.2.2. Then, we elaborate on the design of AttGMM in Sections 6.2.3 and 6.2.4.

6.2.1 Gaussian Mixture Model for Image Generation

The Gaussian Mixture Model (GMM) is known as the simplest statistical model for image generation. However, there are two concerns we may encounter while training a GMM on full-size image datasets. The first is the dimensionality and computational cost. Assume we have 64×64 color images. The covariance matrix of a single component will have $7.5\times 10^{7}$ free parameters. Thus, we will require the memory and power to store and invert matrices of that size during training. The second concern is to learn the complexity of the distribution within a reasonable number of Gaussian components [114].
In order to address the first concern, we use Probabilistic PCA [126, 127] to reduce the data dimension and, hence, the computational cost. As for the second concern, GMMs have shown the capability to model the data distribution with a reasonable number of components for the MNIST, Fashion-MNIST and CelebA datasets [5, 114].
Probabilistic PCA multiplies a scale matrix $A_{d\times l}$ by a latent vector $z$ of low dimension $l$, where $l \ll d$. The latent vector $z$ is sampled from a standard Gaussian distribution. This distribution is modeled on a low-dimensional subspace embedded in the full data space. A noise term is also added for stability. The model for a single PPCA component is [114]:
$$x = Az + \mu + \epsilon, \quad z \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, D), \qquad (6.1)$$
where $\mu$ is the data mean and $\epsilon$ is the added noise with diagonal covariance $D$. As a result, $x$ has a Gaussian distribution $x \sim \mathcal{N}(\mu, AA^{T} + D)$. A single PPCA model has $d(l+2)$ free parameters, where $d$ and $l$ are the data and latent dimensions. In a Mixture of Probabilistic PCA (MPPCA) model with $K$ components, the number of free parameters increases to $K[d(l+2)+1]$ [114]. During training, the model is initialized with K-means clustering on the data. Then, we use the EM algorithm [7, 38] to estimate the PPCA parameters of each component [114].

6.2.2 Image Reconstruction

In our framework, the image reconstruction task does not require any additional training. The MPPCA latent space learned for image generation is sufficient for the reconstruction of images as well. In the image reconstruction task, we give an image as an input to the model. The goal is to generate an image as close to the input image as possible.
Suppose image $X \in \mathbb{R}^{d}$ ($d = 2^{m} \times 2^{m} \times 3$) is given as an input to the generative model. In order to reconstruct $X$, we need to find the corresponding latent vector $z \in \mathbb{R}^{l}$ that generates the closest image to $X$. We take three steps to find the target $z$:
Step 1. Find the most probable component $c_{\hat{k}}$, where $\hat{k} \in \{1, \cdots, K\}$, for the latent vector $z$ given the observed image $X$.
The log-likelihood of a given image $x$ belonging to component $c_k$ is:
$$L_k = \pi_k P(x|\mu_k, \Sigma_k) = e^{\log(\pi_k) + \log(P(x|\mu_k, \Sigma_k))}, \qquad (6.2)$$
where $\pi_k$ is the mixing coefficient of the $k$-th component, and $\mu_k$ and $\Sigma_k$ are the mean and covariance of the $k$-th component, respectively. We work with the log of the normal probability because of the high dimensionality. The log probability of image $X$ given the component is:
$$\log P(x|\mu, \Sigma) = -\frac{1}{2}\left[d\log(2\pi) + \log\det(\Sigma) + (x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]. \qquad (6.3)$$
To avoid the inversion of large matrices, we use the Woodbury matrix inversion lemma:
$$\Sigma^{-1} = (AA^{T} + D)^{-1} = D^{-1} - D^{-1}AL^{-1}A^{T}D^{-1}, \qquad (6.4)$$
where $L = I + A^{T}D^{-1}A$ is an $l\times l$ matrix. Using the matrix determinant lemma, we have:
$$\log\det(AA^{T}+D) = \log\det(L) + \log\det(D) = \log\det(L) + \sum_{j=1}^{d}\log d_{j}. \qquad (6.5)$$
The above approach, introduced in [114], enables computing the log-likelihood of image $X \in \mathbb{R}^{d}$ with linear complexity. For a given image $X$, once we calculate all $L_k$, $k \in \{1,\cdots,K\}$, we choose the most responsible component $c_{\hat{k}}$ with respect to $X$:
$$\hat{k} = \arg\max_{k}\,\{L_k\}_{k=1}^{K}. \qquad (6.6)$$
Step 2. Within the component $c_{\hat{k}}$, we calculate the posterior probability $P(z|x)$, which is the posterior distribution of $z$ given the observed input $x$. $P(z|x)$ is also a Gaussian:
$$P(z|x) = \mathcal{N}\!\left(A^{T}\Sigma^{-1}(x-\mu),\; L^{-1}\right), \qquad (6.7)$$
where $L = I + A^{T}D^{-1}A$ and $\Sigma^{-1}$ can be calculated efficiently using Eq. 6.4.
Step 3. We use the MAP estimate of the posterior distribution $P(z|x)$ to maximize the probability within this distribution. Since $P(z|x)$ has a Gaussian distribution, the MAP estimate $\hat{z}$ is simply the mean of the Gaussian. Finally, we reconstruct $X$ by calculating $\hat{X} = A_{c_{\hat{k}}}\hat{z} + \mu_{c_{\hat{k}}}$. We may generate other possible reconstructions of $X$ by sampling different latent vectors $z$ from $P(z|x)$ [114].

6.2.3 Facial Attribute Editing

In the facial attribute editing task, the goal is to synthesize face images that resemble a given face image in all facial attributes except for a certain attribute that we target to change. Let $A \in \{0,1\}^{T}$ denote a set of $T$ binary attributes. For a given image $X$ with original attribute set $A$, and an attribute index $t \in \{1,\cdots,T\}$, our goal is to synthesize an image $Y$ with attribute set $\hat{A}$ that satisfies the following condition:
$$\hat{A}_i = \begin{cases} 1 - A_i, & i = t,\\ A_i, & i \neq t,\end{cases} \qquad i = 1,\dots,T. \qquad (6.8)$$
To conduct facial attribute editing, we reformulate the problem as image reconstruction within the domain of the target attribute. Without loss of generality, we explain the attribute editing process with $T = 1$, where we only have one attribute. The attribute is selected from the attribute labels of the CelebA dataset, e.g., "Smiling". The training process of a generative model that is also capable of attribute editing has one extra step, taken during initialization. When training for attribute editing, before K-means clustering, we use the attribute labels to cluster the training data. Unlike the clusters formed during K-means clustering, these initial clusters formed using attribute labels remain the same during further training steps. This means that we have two super-components $S_0$ and $S_1$, containing $K_0$ and $K_1$ components respectively, as explained in Section 6.2.1. In the case of $T > 1$, we have $T\times 2$ super-components $S_{tq}$, each containing $K_{tq}$ components, where $t\in\{1,\cdots,T\}$ and $q\in\{0,1\}$. Assume $T = 1$ and we target the attribute "Smiling" from the CelebA dataset.
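Before walking through this example, the following is a minimal NumPy sketch of Steps 1-3 above, i.e., the per-component scoring of Eqs. 6.2-6.6 and the MAP reconstruction of Eq. 6.7. It assumes each PPCA component is given as a tuple (pi_k, A_k, mu_k, d_k), where d_k holds the diagonal entries of D_k; the function and variable names are illustrative placeholders, not the actual implementation.

```python
import numpy as np

def component_log_likelihoods(x, components):
    """Score log(pi_k) + log N(x | mu_k, A_k A_k^T + D_k) for every component,
    using the Woodbury identity (Eq. 6.4) so only l x l systems are solved."""
    scores = []
    for pi_k, A, mu, d in components:            # d: diagonal entries of D_k
        dim, lat = A.shape                       # data dim d, latent dim l
        r = x - mu
        DinvA = A / d[:, None]                   # D^{-1} A
        L = np.eye(lat) + A.T @ DinvA            # L = I + A^T D^{-1} A
        # Sigma^{-1} r = D^{-1} r - D^{-1} A L^{-1} A^T D^{-1} r   (Eq. 6.4)
        Sinv_r = r / d - DinvA @ np.linalg.solve(L, DinvA.T @ r)
        # log det Sigma = log det L + sum_j log d_j                (Eq. 6.5)
        logdet = np.linalg.slogdet(L)[1] + np.sum(np.log(d))
        ll = -0.5 * (dim * np.log(2 * np.pi) + logdet + r @ Sinv_r)
        scores.append(np.log(pi_k) + ll)         # Eq. 6.2 in log form
    return np.array(scores)

def map_reconstruct(x, components):
    """Steps 1-3: pick the most responsible component (Eq. 6.6), take the MAP
    latent (the posterior mean of Eq. 6.7), and map it back to image space."""
    k = int(np.argmax(component_log_likelihoods(x, components)))
    _, A, mu, d = components[k]
    r = x - mu
    DinvA = A / d[:, None]
    L = np.eye(A.shape[1]) + A.T @ DinvA
    Sinv_r = r / d - DinvA @ np.linalg.solve(L, DinvA.T @ r)
    z_map = A.T @ Sinv_r                         # mean of P(z|x)
    return A @ z_map + mu                        # reconstruction A z_hat + mu
```

For attribute editing, the same routine would simply be applied with the components of the target super-component only, which is exactly the inference procedure described next.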
During inference, we take an input image $X$ with $A = 0$, meaning $X$ is a facial image without the "Smiling" attribute. We would like to synthesize an image $Y$ that resembles image $X$ but with $A = 1$, i.e., with the "Smiling" attribute. To do this, we first determine the super-component. Since we target $A = 1$, the super-component is $S_1$. Given the super-component, we take the three steps indicated in Section 6.2.2 to synthesize image $Y$. In other words, we reconstruct the "non-Smiling" image $X$ using a model trained only on "Smiling" images. With this approach, we force the model to synthesize a "Smiling" image $Y$ while maintaining the properties of $X$.
In a more general scenario where $T > 1$, given image $X$ with attribute set $A$, attribute index $t \in \{1,\cdots,T\}$ and attribute query $q \in \{0,1\}$, we determine the super-component $S_{tq}$. Within the given super-component, we follow the first step of Section 6.2.2 to find the most responsible component $\hat{k} \in \{1,\cdots,K_{tq}\}$. Within the component $c_{\hat{k}}$, we use the MAP estimate of the posterior probability $P(z|x)$ to find $\hat{z}$ and calculate the edited image $Y = A_{c_{\hat{k}}}\hat{z} + \mu_{c_{\hat{k}}}$.

Figure 6.2: Intermediate results of reconstruction and editing using the AttGMM method.

6.2.4 Refinement of Attribute Editing

In Sections 6.2.2 and 6.2.3, we explained our method for image reconstruction and editing. We show the visual results of applying these methods to images with different attributes in Fig. 6.2. As shown in Fig. 6.2, the reconstruction results resemble the original images, while the target attribute in the editing results is successfully altered. However, both the reconstruction and editing results are over-smoothed and lack the fine details and local textures of the original images. This happens because we only preserve the $l$ most important features of the $d$-dimensional image. In the case of 64×64 color images, $d = 12288$ and we choose $l = 10$ in our experiments. Since $l \ll d$, losing fine local properties in the generated images is inevitable.

Figure 6.3: Overview of the Refiner block in the AttGMM method.

In order to overcome the smoothing effect, we propose the refinement pipeline demonstrated in Fig. 6.3. The refiner takes three inputs: the reconstructed image, the edited image, and the input image. We divide all three inputs into small patches before passing them to the refiner. In our experiments, we set the patch size to 4×4. For each patch in the original image, the refiner first finds the most responsible component $c_p$ for the reconstruction of that patch. Then, using the parameters of the chosen component $c_p$, the refiner calculates the posterior probability $P(z|x_p)$ through Eq. 6.7. Since the posterior probability $P(z|x_p)$ forms a Gaussian distribution, the MAP estimate of the posterior distribution is the mean of the Gaussian. Next, we calculate three MAP estimates, each corresponding to one of the three input patches. Using this approach, we obtain three MAP estimates in the same latent space, as shown in Fig. 6.3. Then, we move the MAP estimate of the original patch in the direction of the difference between the MAP estimates of the intermediate edited and reconstructed samples. The final MAP estimate for the patch is:
$$z' = z'_x + (z'_e - z'_r), \qquad (6.9)$$
where $z'_x$, $z'_e$, and $z'_r$ are the MAP estimates of the original patch, edited patch, and reconstructed patch, respectively. Then, we use $z'$ to calculate a high-quality edited patch in color space through $A_p z' + \mu_p$.
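As a concrete illustration of the per-patch refinement of Eq. 6.9, the hypothetical sketch below reuses component_log_likelihoods from the earlier snippet; patch_components is assumed to hold the patch-level PPCA components in the same tuple format, and the names are again placeholders rather than the actual implementation.

```python
def refine_patch(x_patch, recon_patch, edit_patch, patch_components):
    """Shift the original patch's MAP latent by the editing direction observed
    between the intermediate edited and reconstructed patches (Eq. 6.9)."""
    # The most responsible patch component c_p is chosen from the original patch.
    p = int(np.argmax(component_log_likelihoods(x_patch, patch_components)))
    _, A, mu, d = patch_components[p]
    DinvA = A / d[:, None]
    L = np.eye(A.shape[1]) + A.T @ DinvA

    def map_latent(patch):
        # Posterior mean of Eq. 6.7 for this patch within component c_p.
        r = patch - mu
        Sinv_r = r / d - DinvA @ np.linalg.solve(L, DinvA.T @ r)
        return A.T @ Sinv_r

    # z' = z'_x + (z'_e - z'_r), then back to color space via A_p z' + mu_p.
    z_prime = map_latent(x_patch) + (map_latent(edit_patch) - map_latent(recon_patch))
    return A @ z_prime + mu
```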
After repeating this process for all the patches within the image, we stitch the patches together to obtain the complete refined edited image.

Figure 6.4: Refined results of reconstruction and editing using the AttGMM method.

6.3 Experiments

We perform experiments and evaluation on the CelebA dataset. The CelebA dataset contains aligned color human face images cropped to 178×218, with 40 binary (with/without) attribute labels for each image. In our experiments, we crop and resize the images to 64×64 resolution. At this resolution, the input dimension $d$ is 12288. We choose the latent vector dimension $l$ and the number of Gaussian components $K$ for each attribute label to be 10 and 30, respectively. Following prior work [45, 92], we target 12 attributes from the CelebA dataset, which cover most of the attributes and are the most distinctive in appearance. These attributes are Bangs, Black Hair, Blond Hair, Brown Hair, Bushy Eyebrows, Eyeglasses, Male, Mouth Slightly Open, Mustache, No Beard, Pale Skin and Young.

6.3.1 Qualitative Evaluation

We show the facial attribute editing results of our method in Fig. 6.4. In this figure, the first column contains various face image inputs, the second column shows the reconstruction results, and the rest of the columns show editing results for each of the target attributes. In comparison to the intermediate results shown in Fig. 6.2, the final results contain more details and resemble the input image more closely. The changes in the final results are also bolder than in the intermediate results. On the other hand, the noise patterns and color artifacts are more visible in Fig. 6.4.

6.3.2 Quantitative Evaluation

We evaluate the performance of our attribute editing method in two aspects: attribute editing accuracy and image quality. For attribute editing accuracy, we use a pre-trained attribute classification model∗ which is commonly used by prior work for the evaluation of attribute generation [45, 92]. In Fig. 6.5, we compare the attribute editing accuracy among different methods. The accuracy for each of the 12 attributes and the average accuracy over all the attributes are demonstrated for each method. As shown in Fig. 6.5, our method ranks in the middle of the accuracy comparison. Although we do not beat the three more sophisticated GAN-based models (AttGAN, StarGAN, and STGAN), we perform better than the earlier neural-network-based methods and achieve an accuracy of more than 50%.

∗ https://github.com/csmliu/STGAN/tree/master/att_classification

Figure 6.5: Comparison of attribute generation accuracy between AttGMM and prior work.

For image quality assessment, we calculate the reconstruction quality of each method. Table 6.1 shows the PSNR/SSIM scores of reconstruction. Our method outperforms all prior work in reconstruction quality.

Table 6.1: Reconstruction quality of the comparison methods on the facial attribute editing task.

Method      IcGAN        FaderNet     AttGAN       StarGAN      STGAN        AttGMM
PSNR/SSIM   15.28/0.430  30.62/0.908  24.07/0.841  22.80/0.819  31.67/0.948  37.38/0.990

6.3.3 Computational Complexity

In this section, we calculate the floating point operations (FLOPs) of our method and prior work to offer a comparison of computational complexity.
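The FLOPs of the PyTorch baselines are obtained with an off-the-shelf counting tool; assuming a counter such as ptflops [124], a minimal, hypothetical usage sketch (with a placeholder model rather than any of the actual baselines) is:

```python
import torch
from ptflops import get_model_complexity_info

# Placeholder model standing in for one of the PyTorch baselines; the
# complexity is measured for a 3 x 64 x 64 input, matching Table 6.2.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, kernel_size=3, padding=1),
)
complexity, params = get_model_complexity_info(
    model, (3, 64, 64), as_strings=True, print_per_layer_stat=False
)
print(complexity, params)  # complexity and parameter counts reported by the tool
```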
Table 6.2 shows a comparison of FLOPs per color image of size 64×64. The numbers in Table 6.2 represent Giga FLOPs. We use an online tool for calculating the FLOPs of PyTorch models. Our model has the fewest FLOPs among the compared methods.

Table 6.2: FLOPs (×10^9) comparison on a color image of resolution 64×64.

Method   FaderNet   AttGAN   STGAN   AttGMM
FLOPs    0.212      2.405    3.420   0.170
Ratio    1.2×       10.8×    16.0×   1×

In Fig. 6.6, we present a comparison of average attribute generation accuracy versus FLOPs. The most favorable point in this graph is the upper-left one, where a high accuracy is obtained with a negligible number of FLOPs. Although STGAN and AttGAN have higher accuracies than our method, they require far more FLOPs to achieve those scores. Our method has a higher accuracy than FaderNet with slightly fewer FLOPs.

Figure 6.6: Attribute generation accuracy versus FLOPs.

6.3.4 Ablation Study

In this section, we study the effect of refinement on the attribute editing quality. Fig. 6.7 shows a comparison between the intermediate results of editing before refinement and the final results of editing after refinement. The results after refinement show more details but also more visible noise patterns and color artifacts. After refinement, the results look more like the input image. Also, the changes in the attributes are stronger. Table 6.3 offers a comparison of the PSNR/SSIM reconstruction scores of our method before and after refinement. Both scores are significantly better after refinement.

Figure 6.7: Ablation study on visual quality with/without the refinement step.

Table 6.3: Ablation study on reconstruction quality without/with the refinement step.

Method      AttGMM without refiner   AttGMM
PSNR/SSIM   19.41/0.654              37.38/0.990

6.4 Conclusion and Future Work

A new method for facial attribute editing, called AttGMM, was proposed in this chapter. AttGMM offers a computationally inexpensive solution for facial attribute editing, based on GMMs. Being the first method in the literature to accomplish this task without exploiting neural networks, AttGMM demonstrates the great potential of GMMs in conditional generative modeling. AttGMM edits a certain attribute in a given image through three steps. In the first two steps, AttGMM reconstructs and edits the given image separately, using proper posterior probability distributions. To tackle the over-smoothing effect of GMM-based generative models, AttGMM refines the edited result. In the third step, the difference between the reconstructed and edited results is derived and utilized in the refining process. We demonstrated the performance of AttGMM through qualitative and quantitative experiments.
While AttGMM shows the potential of GMM-based generative modeling for facial attribute editing, it has some limitations in its current form. The latent space in AttGMM has only 10 dimensions, which captures the global structure of images well. However, color information and fine textures are mostly lost in the transformation to the low-dimensional space. Therefore, AttGMM shows better performance for attributes such as "Mouth Slightly Open" or "Eyeglasses", which are associated with global structure, than for hair color or age, which are related to color and fine textures, respectively. To address this issue, we would like to extract and keep spatial-spectral features of images, instead of only spectral ones. To this end, we aim to utilize Green Learning [74] and conduct AttGMM in a channel-wise manner. We aim to reach better accuracy on all attributes by this means.
Chapter 7
Summary and Future Work

7.1 Summary of the Research

In this dissertation, we focused on four problems related to high-quality visual content: noise-aware texture-preserving low-light image enhancement, self-supervised adaptive low-light video enhancement, progressive attribute-guided extendable robust image generation, and a GMM-based method for facial attribute editing.
In the first work, we present a low-light image enhancement method based on a noise-aware texture-preserving retinex model. Our method, called NATLE, proposes two specifically designed optimization functions to decompose the input image into a piece-wise smooth illumination map (L) and a noise-free texture-preserving reflectance map (R). These optimization functions have closed-form solutions, which allows fast computation. The superior performance of NATLE is demonstrated by extensive experiments with both objective and subjective evaluations.
In the second work, we extend NATLE's application to low-light video enhancement through a new method called SALVE. Our new self-supervised learning method for low-light video enhancement is fully adaptive to the test video. While enhancing a few key frames of the test video, it also learns a mapping from low-light to enhanced key frames. It then uses this mapping to enhance the rest of the frames within the video. This approach enables SALVE to work without requiring (paired) training data. We performed a user study showing that participants preferred our videos at least 87% of the time. We also performed an ablation study, showing the contribution of each SALVE component.
In the third work, we propose a generative model for visual data generation called PAGER. Unlike the common trend in today's research community, our new method does not use deep learning. PAGER adopts the successive subspace learning framework to extract multi-scale features and learns unconditional and conditional probability density functions of the extracted features for image generation. The unconditional probability model is used in the core generator module to generate low-resolution images to control the model complexity. Two conditional image generation modules, the resolution enhancer and the quality booster, are used to enhance the resolution and quality of the generated images progressively. PAGER is mathematically transparent due to its modular design. We showed that PAGER can be trained in a fraction of the time required by deep-learning-based models. We also demonstrated PAGER's robustness in generation quality as the number of training samples decreases. We then showed the extendability of PAGER to image super resolution, attribute-guided face image generation, and high-resolution image generation.
In the fourth work, we present a GMM-based method for facial attribute editing. Our new method, called AttGMM, offers a computationally inexpensive solution for facial attribute editing, based on GMMs. Being the first method in the literature to accomplish this task without exploiting neural networks, AttGMM demonstrates the great potential of GMMs in conditional generative modeling. AttGMM edits a certain attribute in a given image through three steps. In the first two steps, AttGMM reconstructs and edits the given image separately, using proper posterior probability distributions. To tackle the over-smoothing effect in GMM-based generative models, AttGMM refines the edited result.
In the third step, the difference between the reconstructed and edited results is derived and utilized in the refining process. We demonstrated the performance of AttGMM through qualitative and quantitative experiments.

7.2 Future Work

We proposed a progressive attribute-guided extendable robust image generative model in Chapter 5. Our new method, called PAGER, showed advantages in mathematical transparency, progressive content generation, extendability to further generative applications, and robustness to the training sample size. These advantages make PAGER a good candidate for the backbone of further conditional image generative models. We would like to pursue the following research topics in the future:

7.2.1 Vision Language Modeling

Nowadays, multi-modal deep learning models are very popular. The current version of PAGER may take a list of binary attributes as an input. Then, it generates an image following those attributes. We would like to upgrade this application to visual generation from a textual input. For this application, we need to find a correspondence between the input words and the pre-defined attributes. Then, the problem would be reduced to what PAGER is already capable of. In a more complicated problem, we may take a sentence as an input. In this case, we need to find the keywords and determine whether the sentence is affirmative or negative. Then, we associate the keywords with the pre-defined attributes and let PAGER generate images accordingly.

7.2.2 Image Attribute Editing

The current version of PAGER can take a list of facial attributes as an input and generate an image accordingly. We would like to advance this application to facial attribute editing. In this application, an image of a human face is taken as one of the inputs. The second input is the target attribute. In this problem setting, we would like to keep all the attributes within the input image untouched, except for the target attribute. For example, we may want to change a non-smiling face into a smiling face, or change the hair color or style. In order to achieve this goal, we need to keep all the facial areas the same and only change the region that is associated with the target attribute.
Inspired by PAGER's design, we can fit several Gaussian mixture models (GMMs), one for each joint group of a facial landmark and a related attribute. For example, we fit separate GMMs for smiling lips and non-smiling lips. We fit these GMMs conditioned on the local neighborhood of a certain facial landmark. At inference time, we first take the local neighborhood of the target facial landmark as a condition to decide which GMM to draw samples from. Then, we generate a new facial landmark following the target attribute. Finally, we place the generated facial landmark at the correct location within the input image. This method may require post-processing techniques to avoid any mismatch or visual artifacts around the target facial landmark. We may use the color and texture information of the original facial landmark in the input image to adapt the generated facial landmark to the input image.

7.2.3 Image Inpainting

In image inpainting, given a masked input image, we fill in the masked area and output a complete image. The resulting image should look semantically and visually realistic. It is desirable for the output to be sharp and not show any visual artifacts or distortions. Image inpainting is a form of conditional image generation, where the unmasked area in the input image is the generation condition.
For simplicity, let us focus on human face images because they all have meaningful landmarks. Inspired by PAGER's structure, we can fit several GMMs for various landmarks within the image. We fit these GMMs conditioned on the local neighborhood of each landmark. At inference time, we first find which landmarks in the face are affected by the mask. Then, we generate one landmark at a time and fill in the image step by step. If there are more landmarks left, we take the newly updated image as the condition and generate the next masked landmark. After each step of GMM-based sampling and generation, the recently added region may look quite blurry. In order to add finer details, we may adopt the quality booster module from PAGER.
In the approach explained above, conditioning the generation on the local neighborhood regions is meant to enforce visual and semantic consistency between the unmasked image and our generated region. If our experiments show that this condition is not adequate for enforcing such consistency, we will consider additional measures.
We also proposed a GMM-based method for facial attribute editing in Chapter 6. Our new method, called AttGMM, showed advantages in computational complexity. AttGMM demonstrates the potential of GMM-based generative modeling for facial attribute editing; however, it has some limitations in its current form. The low-dimensional latent space in AttGMM captures the global structure of images well. However, color information and fine textures are mostly lost in the transformation to the low-dimensional space. Therefore, AttGMM shows better performance for attributes such as "Mouth Slightly Open" or "Eyeglasses", which are associated with global structure, than for hair color or age, which are related to color and fine textures, respectively. To address this issue, we would like to extract spatial-spectral features of images, instead of only spectral ones. To this end, we aim to utilize Green Learning [74] and conduct AttGMM in a channel-wise manner. We aim to reach better accuracy on all attributes by this means.

Bibliography

[1] Tarik Arici, Salih Dikbas, and Yucel Altunbasak. "A histogram modification framework and its application for image contrast enhancement". In: IEEE Transactions on image processing 18.9 (2009), pp. 1921–1935.
[2] M. Arjovsky, S. Chintala, and L. Bottou. "Wasserstein generative adversarial networks". In: International conference on machine learning. PMLR. 2017, pp. 214–223.
[3] Volker Aurich and Jörg Weule. "Non-Linear Gaussian Filters Performing Edge Preserving Diffusion". In: Mustererkennung 1995. Ed. by Gerhard Sagerer, Stefan Posch, and Franz Kummert. Springer Berlin Heidelberg, 1995, pp. 538–545. isbn: 978-3-642-79980-8.
[4] Z. Azizi, X. Lei, and C-C Jay Kuo. "Noise-aware texture-preserving low-light enhancement". In: 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 443–446.
[5] Zohreh Azizi, C-C Jay Kuo, et al. "PAGER: Progressive attribute-guided extendable robust image generation". In: APSIPA Transactions on Signal and Information Processing 11.1 (2022).
[6] Zohreh Azizi, C-C Jay Kuo, et al. "SALVE: Self-supervised Adaptive Low-light Video Enhancement". In: APSIPA Transactions on Signal and Information Processing 12.4 (2022).
[7] David J Bartholomew, Martin Knott, and Irini Moustaki. Latent variable models and factor analysis: A unified approach. Vol. 904. John Wiley & Sons, 2011.
[8] Yoshua Bengio, Grégoire Mesnil, Yann Dauphin, and Salah Rifai.
“Better mixing via deep representations”. In: International conference on machine learning. PMLR. 2013, pp. 552–560. [9] D. Berthelot, T. Schumm, and L. Metz. “Began: Boundary equilibrium generative adversarial networks”. In: arXiv preprint arXiv:1703.10717 (2017). [10] P. Bojanowski, A. Joulin, D. Lopez-Paz, and A. Szlam. “Optimizing the latent space of generative networks”. In: arXiv preprint arXiv:1707.05776 (2017). [11] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. “A non-local algorithm for image denoising”. In: CVPR. 2005, pp. 60–65. 90 [12] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. “The 2019 davis challenge on vos: Unsupervised multi-object segmentation”. In: arXiv preprint arXiv:1905.00737 (2019). [13] J. Canny. “A computational approach to edge detection”. In: IEEE Transactions on pattern analysis and machine intelligence 6 (1986), pp. 679–698. [14] Turgay Celik and Tardi Tjahjadi. “Contextual and variational contrast enhancement”. In: IEEE Transactions on Image Processing 20.12 (2011), pp. 3431–3441. [15] H. Chang, D. Yeung, and Y. Xiong. “Super-resolution through neighbor embedding”. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. Vol. 1. IEEE. 2004, pp. I–I. [16] Chen Chen, Qifeng Chen, Minh N Do, and Vladlen Koltun. “Seeing motion in the dark”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 3185–3194. [17] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. “Learning to see in the dark”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 3291–3300. [18] H. Chen, M. Rouhsedaghat, H. Ghani, S. Hu, S. You, and C-C Jay Kuo. “Defakehop: A light-weight high-performance deepfake detector”. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6. [19] H. Chen, K. Zhang, S. Hu, S. You, and C-C Jay Kuo. “Geo-DefakeHop: High-Performance Geographic Fake Image Detection”. In: arXiv preprint arXiv:2110.09795 (2021). [20] Hong-Shuo Chen, Shuowen Hu, Suya You, and C-C Jay Kuo. “DefakeHop++: An Enhanced Lightweight Deepfake Detector”. In: arXiv preprint arXiv:2205.00211 (2022). [21] Y. Chen and C-C Jay Kuo. “Pixelhop: A successive subspace learning (ssl) method for object recognition”. In: Journal of Visual Communication and Image Representation 70 (2020), p. 102749. [22] Y. Chen, M. Rouhsedaghat, S. You, R. Rao, and C-C Jay Kuo. “Pixelhop++: A small successive-subspace-learning-based (ssl-based) model for image classification”. In: 2020 IEEE International Conference on Image Processing (ICIP). IEEE. 2020, pp. 3294–3298. [23] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. “Fsrnet: End-to-end learning face super-resolution with facial priors”. In:ProceedingsoftheIEEEConferenceonComputerVision and Pattern Recognition. 2018, pp. 2492–2501. [24] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 8789–8797. [25] Siegfried Cools and Wim Vanroose. “The communication-hiding pipelined BiCGstab method for the parallel solution of large unsymmetric linear systems”. In: Parallel Computing 65 (2017), pp. 1–20. 91 [26] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. 
“Image denoising with block-matching and 3D filtering”. In: Image Process.: Algs. Sys., Neural Networks, and Machine Learning. Vol. 6064. 2006, pp. 354–365.doi: 10.1117/12.643267. [27] P. Dhariwal and A. Nichol. “Diffusion models beat gans on image synthesis”. In: Advances in Neural Information Processing Systems 34 (2021). [28] L. Dinh, D. Krueger, and Y. Bengio. “Nice: Non-linear independent components estimation”. In: arXiv preprint arXiv:1410.8516 (2014). [29] L. Dinh, J. Sohl-Dickstein, and S. Bengio. “Density estimation using real nvp”. In: arXiv preprint arXiv:1605.08803 (2016). [30] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. “Learning a deep convolutional network for image super-resolution”. In: European conference on computer vision. Springer. 2014, pp. 184–199. [31] Max Ehrlich, Timothy J Shields, Timur Almaev, and Mohamed R Amer. “Facial attributes classification using multi-task representation learning”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016, pp. 47–55. [32] Gabriel Eilertsen, Rafal K Mantiuk, and Jonas Unger. “Single-frame regularization for temporally stable cnns”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 11176–11185. [33] William T Freeman, Thouis R Jones, and Egon C Pasztor. “Example-based super-resolution”. In: IEEE Computer graphics and Applications 22.2 (2002), pp. 56–65. [34] X. Fu, Y. Liao, D. Zeng, Y. Huang, X. Zhang, and X. Ding. “A Probabilistic Method for Image Enhancement With Simultaneous Illumination and Reflectance Estimation”. In: IEEE Transactions on Image Processing 24.12 (2015), pp. 4965–4977. [35] X. Fu, D. Zeng, Y. Huang, X. Zhang, and X. Ding. “A Weighted Variational Model for Simultaneous Reflectance and Illumination Estimation”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2782–2790. [36] Xueyang Fu, Delu Zeng, Yue Huang, Yinghao Liao, Xinghao Ding, and John W. Paisley. “A fusion-based enhancing method for weakly illuminated images”. In: Signal Process. 129 (2016), pp. 82–96. [37] R. G. Gavaskar and K. N. Chaudhury. “Fast Adaptive Bilateral Filtering”. In: IEEE Transactions on Image Processing 28.2 (2019), pp. 779–790. [38] Zoubin Ghahramani, Geoffrey E Hinton, et al. The EM algorithm for mixtures of factor analyzers. Tech. rep. Technical Report CRG-TR-96-1, University of Toronto, 1996. [39] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. “Generative adversarial nets”. In: Advances in neural information processing systems 27 (2014). 92 [40] K. Gu, G. Zhai, W. Lin, X. Yang, and W. Zhang. “No-Reference Image Sharpness Assessment in Autoregressive Parameter Space”. In: IEEE Transion on Image Processing 24.10 (2015), pp. 3218–3231. [41] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. “Improved training of wasserstein gans”. In: Advances in neural information processing systems 30 (2017). [42] Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. “Zero-reference deep curve estimation for low-light image enhancement”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 1780–1789. [43] Chunle Guo Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. “Zero-reference deep curve estimation for low-light image enhancement”. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2020, pp. 1780–1789. [44] Xiaojie Guo, Yu Li, and Haibin Ling. “LIME: Low-light image enhancement via illumination map estimation”. In: IEEE Transactions on image processing 26.2 (2016), pp. 982–993. [45] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. “Attgan: Facial attribute editing by only changing what you want”. In: IEEE transactions on image processing 28.11 (2019), pp. 5464–5478. [46] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. “Gans trained by a two time-scale update rule converge to a local nash equilibrium”. In: Advances in neural information processing systems 30 (2017). [47] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. “Cascaded diffusion models for high fidelity image generation”. In: Journal of Machine Learning Research 23.47 (2022), pp. 1–33. [48] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models”. In: arXiv preprint arxiv:2006.11239 (2020). [49] Y. Hoshen, K. Li, and J. Malik. “Non-adversarial image synthesis with generative latent nearest neighbors”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 5811–5819. [50] C. Huang, Z. Wang, and C-C Jay Kuo. “Visible-light and near-infrared face recognition at a distance”. In: Journal of Visual Communication and Image Representation 41 (2016), pp. 140–153. [51] Ziqi Huang, Kelvin CK Chan, Yuming Jiang, and Ziwei Liu. “Collaborative diffusion for multi-modal face generation and editing”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 6080–6090. [52] Haidi Ibrahim and Nicholas Sia Pik Kong. “Brightness preserving dynamic histogram equalization for image contrast enhancement”. In: IEEE Transactions on Consumer Electronics 53.4 (2007), pp. 1752–1758. 93 [53] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. “Flownet 2.0: Evolution of optical flow estimation with deep networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 2462–2470. [54] Haiyang Jiang and Yinqiang Zheng. “Learning to see moving objects in the dark”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 7324–7333. [55] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. “Enlightengan: Deep light enhancement without paired supervision”. In: IEEE Transactions on Image Processing 30 (2021), pp. 2340–2349. [56] Daniel J Jobson, Zia-ur Rahman, and Glenn A Woodell. “A multiscale retinex for bridging the gap between color images and the human observation of scenes”. In: IEEE Transactions on Image processing 6.7 (1997), pp. 965–976. [57] Daniel J Jobson, Zia-ur Rahman, and Glenn A Woodell. “Properties and performance of a center/surround retinex”. In: IEEE transactions on image processing 6.3 (1997), pp. 451–462. [58] J. Johnson, M. Douze, and H. Jégou. “Billion-scale similarity search with gpus”. In: IEEE Transactions on Big Data 7.3 (2019), pp. 535–547. [59] P. Kadam, M. Zhang, S. Liu, and C-C Jay Kuo. “GPCO: An Unsupervised Green Point Cloud Odometry Method”. In: arXiv preprint arXiv:2112.04054 (2021). [60] P. Kadam, M. Zhang, S. Liu, and C-C Jay Kuo. “R-pointhop: a green, accurate and unsupervised point cloud registration method”. In: arXiv preprint arXiv:2103.08129 (2021). [61] P. Kadam, M. 
Zhang, S. Liu, and C-C Jay Kuo. “Unsupervised point cloud registration via salient points analysis (SPA)”. In: 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 5–8. [62] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive growing of gans for improved quality, stability, and variation”. In: arXiv preprint arXiv:1710.10196 (2017). [63] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. “Training generative adversarial networks with limited data”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 12104–12114. [64] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. “Alias-free generative adversarial networks”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 852–863. [65] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 4401–4410. [66] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. “Analyzing and improving the image quality of stylegan”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, pp. 8110–8119. 94 [67] Taeksoo Kim, Byoungjip Kim, Moonsu Cha, and Jiwon Kim. “Unsupervised visual attribute transfer with reconfigurable generative adversarial networks”. In: arXiv preprint arXiv:1707.09798 (2017). [68] D. P. Kingma and M. Welling. “Auto-encoding variational bayes”. In: arXiv preprint arXiv:1312.6114 (2013). [69] N. Kodali, J. Abernethy, J. Hays, and Z. Kira. “On convergence and stability of gans”. In: arXiv preprint arXiv:1705.07215 (2017). [70] Marek Kowalski, Stephan J Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. “Config: Controllable neural face image generation”. In: European Conference on Computer Vision. Springer. 2020, pp. 299–315. [71] C-C Jay Kuo. “The CNN as a guided multilayer RECOS transform [lecture notes]”. In: IEEE signal processing magazine 34.3 (2017), pp. 81–89. [72] C-C Jay Kuo. “Understanding convolutional neural networks with a mathematical model”. In: Journal of Visual Communication and Image Representation 41 (2016), pp. 406–413. [73] C-C Jay Kuo and Y. Chen. “On data-driven saak transform”. In: Journal of Visual Communication and Image Representation 50 (2018), pp. 237–246. [74] C-C Jay Kuo and Azad M Madni. “Green learning: Introduction, examples and outlook”. In: Journal of Visual Communication and Image Representation (2022), p. 103685. [75] C-C Jay Kuo, M. Zhang, S. Li, J. Duan, and Y. Chen. “Interpretable convolutional neural networks via feedforward design”. In: Journal of Visual Communication and Image Representation 60 (2019), pp. 346–359. [76] Wei-Sheng Lai, Jia-Bin Huang, Oliver Wang, Eli Shechtman, Ersin Yumer, and Ming-Hsuan Yang. “Learning blind video temporal consistency”. In: Proceedings of the European conference on computer vision (ECCV). 2018, pp. 170–185. [77] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc’Aurelio Ranzato. “Fader networks: Manipulating images by sliding attributes”. In: Advances in neural information processing systems 30 (2017). [78] Edwin H Land. “The retinex theory of color vision”. In: Scientific american 237.6 (1977), pp. 108–129. 
Abstract
Recent technological advances have led to the production of massive volumes of visual data. Images and videos are now among the most important forms of content shared across social platforms for a wide range of purposes. In this manuscript, we propose novel methodologies to enhance, generate and manipulate visual content. Our contributions are outlined as follows:
Low-light image enhancement. We first present a simple and effective low-light image enhancement method based on a noise-aware, texture-preserving retinex model. The new method, called NATLE, attempts to strike a balance between noise removal and natural texture preservation through a low-complexity solution. Its cost function is designed to yield a piecewise-smooth illumination map and a noise-free, texture-preserving reflectance map. After decomposing an image into these two maps, NATLE adjusts the illumination and recombines it with the reflectance map to form the enhanced image. Extensive experiments on common low-light image enhancement datasets demonstrate the superior performance of NATLE.
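To make the decompose-adjust-recombine idea concrete, the following is a minimal Python sketch of a generic retinex-style pipeline. The Gaussian smoothing, the gamma value and the recombination step are illustrative assumptions only; they do not reproduce NATLE's actual cost function or solver.

import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_enhance(img, gamma=2.2, sigma=3.0, eps=1e-6):
    # img: float RGB image in [0, 1], shape (H, W, 3).
    # Rough illumination estimate: per-pixel maximum over color channels.
    illum = img.max(axis=2)
    # Placeholder for a piecewise-smooth illumination map (NATLE solves an
    # optimization problem here; Gaussian smoothing is only a stand-in).
    illum = gaussian_filter(illum, sigma=sigma)
    # Reflectance = image / illumination (the texture carrier in a retinex model).
    reflect = img / (illum[..., None] + eps)
    # Brighten the illumination with a gamma curve and recombine.
    illum_adj = np.power(illum, 1.0 / gamma)
    return np.clip(reflect * illum_adj[..., None], 0.0, 1.0)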
Low-light video enhancement. We also present a self-supervised adaptive low-light video enhancement method, called SALVE, in this dissertation. SALVE first enhances a few keyframes of an input low-light video using a retinex-based low-light image enhancement technique. For each keyframe, it learns a mapping from low-light image patches to enhanced ones via ridge regression. These mappings are then used to enhance the remaining frames in the low-light video. The combination of traditional retinex-based image enhancement and learning-based ridge regression leads to a robust, adaptive and computationally inexpensive solution to enhance low-light videos. Our extensive experiments along with a user study show that 87% of participants prefer SALVE over prior work.
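As an illustration of the keyframe idea, the sketch below learns a ridge-regression mapping from low-light patches of one keyframe to its enhanced counterpart and applies it to another frame. The patch size, regularization strength and non-overlapping patch layout are assumptions made for illustration, not SALVE's actual configuration.

import numpy as np
from sklearn.linear_model import Ridge

def to_patches(frame, size=8):
    # Non-overlapping size x size patches of an (H, W, 3) frame, flattened.
    h, w, c = frame.shape
    rows = range(0, h - size + 1, size)
    cols = range(0, w - size + 1, size)
    patches = np.stack([frame[i:i + size, j:j + size].reshape(-1)
                        for i in rows for j in cols])
    return patches, list(rows), list(cols)

def fit_keyframe_mapping(low_key, enhanced_key, alpha=1.0, size=8):
    # Learn a patch-to-patch mapping on a single keyframe via ridge regression.
    X, _, _ = to_patches(low_key, size)
    Y, _, _ = to_patches(enhanced_key, size)
    return Ridge(alpha=alpha).fit(X, Y)

def enhance_frame(model, low_frame, size=8):
    # Apply the keyframe mapping to the patches of a non-key frame.
    X, rows, cols = to_patches(low_frame, size)
    preds = model.predict(X)
    out = np.zeros_like(low_frame, dtype=float)
    k = 0
    for i in rows:
        for j in cols:
            out[i:i + size, j:j + size] = preds[k].reshape(size, size, -1)
            k += 1
    return np.clip(out, 0.0, 1.0)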
Image generation. Next, we present a generative modeling approach based on successive subspace learning (SSL). Unlike most generative models in the literature, our method does not rely on neural networks to analyze the underlying source distribution and synthesize images. The resulting method, called the progressive attribute-guided extendable robust image generative (PAGER) model, offers mathematical transparency, progressive content generation, lower training time, robust performance with fewer training samples, and extendability to conditional image generation. PAGER consists of three modules: a core generator, a resolution enhancer, and a quality booster. The core generator learns the distribution of low-resolution images and performs unconditional image generation. The resolution enhancer increases image resolution via conditional generation. Finally, the quality booster adds finer details to the generated images. Extensive experiments on the MNIST, Fashion-MNIST, and CelebA datasets demonstrate the generative performance of PAGER.
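The three-module layout can be sketched as a simple cascade. The stand-in stages below (random low-resolution samples, nearest-neighbor upsampling, additive refinement) are placeholders only and do not reproduce PAGER's SSL-based modules.

import numpy as np

class ThreeStageGenerator:
    # Cascade: unconditional core generator -> conditional resolution
    # enhancer -> quality booster, mirroring the module layout described above.
    def __init__(self, core_generator, resolution_enhancer, quality_booster):
        self.core_generator = core_generator
        self.resolution_enhancer = resolution_enhancer
        self.quality_booster = quality_booster

    def sample(self):
        low_res = self.core_generator()               # unconditional generation
        high_res = self.resolution_enhancer(low_res)  # conditioned on low_res
        return self.quality_booster(high_res)         # adds finer details

# Toy usage with placeholder stages.
rng = np.random.default_rng(0)
generator = ThreeStageGenerator(
    core_generator=lambda: rng.random((8, 8)),
    resolution_enhancer=lambda x: np.kron(x, np.ones((4, 4))),  # 8x8 -> 32x32
    quality_booster=lambda x: np.clip(x + 0.02 * rng.standard_normal(x.shape), 0.0, 1.0),
)
print(generator.sample().shape)  # (32, 32)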
Facial attribute editing. Finally, we present a facial attribute editing method based on the Gaussian mixture model (GMM). Our proposed method, named AttGMM, is the first to conduct facial attribute editing without exploiting neural networks. AttGMM first reconstructs the given image in a low-dimensional latent space through a posterior probability distribution. Next, it manipulates the low-dimensional latent vectors toward a target attribute. Finally, AttGMM uses the difference between the results of the previous two steps, together with the given image, to generate a refined, sharp image that possesses the target attribute. We show that AttGMM offers a substantial advantage in computational cost and present several experimental results to demonstrate its performance.
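The latent-space editing workflow can be illustrated with off-the-shelf PCA and Gaussian mixture components. The latent dimensionality, the per-attribute mixture, and the mean-pulling edit rule below are illustrative assumptions rather than AttGMM's actual reconstruction and manipulation steps.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_latent_models(images_with, images_without, latent_dim=64, n_components=5):
    # Shared low-dimensional latent space plus a GMM for the "with-attribute"
    # class. Assumes enough training images for the chosen latent_dim.
    X_with = images_with.reshape(len(images_with), -1)
    X_without = images_without.reshape(len(images_without), -1)
    pca = PCA(n_components=latent_dim).fit(np.vstack([X_with, X_without]))
    gmm_with = GaussianMixture(n_components=n_components).fit(pca.transform(X_with))
    return pca, gmm_with

def edit_attribute(image, pca, gmm_with, strength=0.5):
    # Project to the latent space, pull the code toward the most responsible
    # "with-attribute" Gaussian, and add the decoded change to the input image.
    x = image.reshape(1, -1).astype(float)
    z = pca.transform(x)
    k = gmm_with.predict(z)[0]
    z_edit = (1.0 - strength) * z + strength * gmm_with.means_[k]
    delta = pca.inverse_transform(z_edit) - pca.inverse_transform(z)
    return np.clip(x + delta, 0.0, 1.0).reshape(image.shape)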
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Green image generation and label transfer techniques
Efficient machine learning techniques for low- and high-dimensional data sources
Efficient graph learning: theory and performance evaluation
3D inference and registration with application to retinal and facial image analysis
Advanced techniques for object classification: methodologies and performance evaluation
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Green learning for 3D point cloud data processing
A green learning approach to image forensics: methodology, applications, and performance evaluation
A learning‐based approach to image quality assessment
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Recording, reconstructing, and relighting virtual humans
Graph-based models and transforms for signal/data processing with applications to video coding
Labeling cost reduction techniques for deep learning: methodologies and applications
Syntax-aware natural language processing techniques and their applications
Advanced techniques for green image coding via hierarchical vector quantization
Texture processing for image/video coding and super-resolution applications
Advanced machine learning techniques for video, social and biomedical data analytics
Multimodal image retrieval and object classification using deep learning features
Green unsupervised single object tracking: technologies and performance evaluation
Green knowledge graph completion and scalable generative content delivery
Asset Metadata
Creator
Azizi, Zohreh (author)
Core Title
Advanced technologies for learning-based image/video enhancement, image generation and attribute editing
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2023-12
Publication Date
03/06/2024
Defense Date
08/29/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
attribute-guided image generation, facial attribute editing, generative models, high-resolution image generation, image attribute editing, image generation, low-light image enhancement, low-light video enhancement, OAI-PMH Harvest, super-resolution
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Nakano, Aiichiro (committee member), Ortega, Antonio (committee member)
Creator Email
azizizohre94@gmail.com, zazizi@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113303420
Unique identifier
UC113303420
Identifier
etd-AziziZohre-12326.pdf (filename)
Legacy Identifier
etd-AziziZohre-12326
Document Type
Dissertation
Format
theses (aat)
Rights
Azizi, Zohreh
Internet Media Type
application/pdf
Type
texts
Source
20230907-usctheses-batch-1092 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu