NOVEL OPTIMIZATION TOOLS FOR STRUCTURED SIGNALS RECOVERY: CHANNELS ESTIMATION AND COMPRESSIBLE SIGNAL RECOVERY

by Sajjad Beygiharchegani

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2018

Copyright 2018 Sajjad Beygiharchegani

Dedication

To my loving wife and my dear parents, for their support, encouragement, and constant love.

Acknowledgments

This thesis is a product of my close collaboration with my research advisor, Prof. Urbashi Mitra. Words cannot express my feelings of gratitude towards her for her constant support, encouragement, and guidance. She is the main reason I can now look back and say that I have had an incredible last five years at USC. I am also grateful to Prof. Erik Ström, Prof. Mariane Rembold Petraglia, Dr. Shirin Jalali, and Prof. Arian Maleki, who are also my co-authors. They have always been a constant source of guidance and inspiration. I am also grateful to Prof. Behnam Jafarpour and Prof. Justin Haldar for being on my thesis committee. I am greatly indebted to them for their encouragement and enthusiasm about my work. I would like to thank all my Ph.D. friends for their sympathetic ear, altruistic help, and the stimulating discussions we had about our research and lives. Finally, last but not least, I want to thank my family for their constant support all through my life; no achievement in my life would have been possible without them. I am forever indebted to my wife, Negin Golrezaei, and my parents, Ghaffar Beygi and Leyla Beygi. Their love, encouragement, and support have been the core of my strength throughout the quest of attaining my Ph.D. degree. This thesis is dedicated to them. I also express a feeling of bliss for having loving siblings like Ali, Lotfolah, Raziyeh, Azam, and Somayeh. They have always been genuinely concerned and ever encouraging.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Structured Signal Recovery: Compressed Sensing
  1.2 Wireless Communication
    1.2.1 Multipath propagation, time- and frequency-dispersion
    1.2.2 Deterministic description of LTV channels
    1.2.3 Global parameters of Cellular, V2V, and UWA channels

2 V2V Channel Estimation
  2.1 Introduction
  2.2 Jointly Sparse Signal Estimation
    2.2.1 Proximity Operator
    2.2.2 Optimality of the Nesting of Proximity Operators
    2.2.3 Proximity Operator of Sparse-Inducing Regularizers
  2.3 Communication System Model
  2.4 Joint Sparsity Structure of V2V Channels
  2.5 Observation Model and Leakage Effect
  2.6 Channel Estimation
  2.7 Numerical Results
  2.8 Real Channel Measurements
    2.8.1 Measurement setup
  2.9 Conclusions
  2.10 Appendix A: Sparsity Inducing Regularizers
  2.11 Appendix B: Proof of Theorem 1
  2.12 Appendix C: Region and Group Specification
  2.13 Appendix D: Proximal ADMM Iteration Development
  2.14 Appendix E: Proof of Lemma 1

3 MSML Channel Estimation
  3.1 Introduction
  3.2 Signal and Channel Models
  3.3 Noiseless Channel Estimation
  3.4 Structural Property of Noiseless Signal
  3.5 Channel Estimation Algorithms
    3.5.1 Non-convex Rank Regularizer
    3.5.2 Convex Rank Regularizer
    3.5.3 Channel Parameter Extraction
  3.6 Short Review of Prior Methods
    3.6.1 Sparse Approximation Method
    3.6.2 Semi-Definite Program Method
  3.7 Simulation and Discussion
    3.7.1 Approximation Assessment
    3.7.2 Performance Comparison
    3.7.3 Complexity Analysis
  3.8 Conclusions
  3.9 Appendix A: Proof of Lemma 4
  3.10 Appendix B: Proof of Theorem 2
  3.11 Appendix C: Proof of Theorem 3

4 Narrowband TV Channel Estimation
  4.1 Introduction
  4.2 System Model
  4.3 Parametric Signal Representation
  4.4 Structured Estimation of Time-varying Narrowband Channels
    4.4.1 Non-convex Approach
    4.4.2 Convex Approach
  4.5 Numerical Simulations
    4.5.1 Parameters setting
    4.5.2 Performance Comparison
    4.5.3 Doppler Estimation and Resolution Constraint
    4.5.4 Bit Error Performance Comparison
  4.6 Conclusions
  4.7 Appendix A: Proof of Theorem 4

5 Compression-based Compressed Sensing
  5.1 Background
    5.1.1 Definitions
    5.1.2 Compression-based compressed sensing problem statement
    5.1.3 Compressible signal pursuit
  5.2 Our main contributions
    5.2.1 Compression-based gradient descent (C-GD)
    5.2.2 Convergence Analysis of C-GD
  5.3 Standard signal classes
    5.3.1 Sparse signals
    5.3.2 Piecewise polynomial functions
  5.4 Simulation Results and Discussion
    5.4.1 Parameters setting
    5.4.2 Algorithms and comparison criteria
    5.4.3 Compressive imaging with i.i.d. measurement matrices
    5.4.4 Compressive imaging with partial-Fourier matrices
    5.4.5 Convergence rate evaluation
  5.5 Conclusions
  5.6 Appendix A: Proofs of Theorems
    5.6.1 Proof of Theorem 6
    5.6.2 Proof of Theorem 7
    5.6.3 Proof of Theorem 8
  5.7 Appendix B: Concentration of Measure Background
  5.8 Appendix C: Finding the best piecewise polynomial approximation

Reference List

List of Tables

1.1 Global parameters of wireless communication channels.
2.1 Proposed V2V channel estimation method.
3.1 The non-convex approach to remove noise from the received data using the low-rank structure of the data matrix.
3.2 D-update step for rank regularization using the nuclear norm of the data matrix.
3.3 Proposed MSML channel estimation methods.
5.1 PSNR of 128 × 128 reconstructions with no measurement noise, sampled by a random Gaussian measurement matrix.
5.2 PSNR of reconstructions of 128 × 128 test images with Gaussian measurement noise at various SNR values, sampled by a random Gaussian measurement matrix.
5.3 PSNR of 512 × 512 reconstructions with no noise, sampled by a random partial-Fourier measurement matrix.
5.4 PSNR of 512 × 512 reconstructions with Gaussian measurement noise at various SNR values, sampled by a random partial-Fourier measurement matrix.

List of Figures

2.1 MCP and SCAD proximity operators; λ = 1 is considered.
2.2 Geometric representation of the V2V channel. The shaded areas on each side of the road contain static discrete (SD) and diffuse (DI) scatterers, while the road area contains both SD and moving discrete (MD) scatterers.
2.3 Delay-Doppler contribution of line-of-sight (LOS) and of static scatterers (SD/DI) placed on a parallel line beside the road.
2.4 Delay-Doppler domain representation of the V2V channel. The delay-Doppler spreading function for diffuse components is confined to a U-shaped area.
2.5 Schematic of the V2V channel vector partitioning for group vectors.
2.6 Comparison of NMSE vs. SNR for the Wiener filter estimator [107], the HSD estimator [13, 68], the CS method [4, 95], and the proposed method with different regularizers, i.e., Nested-soft [90, 26], Nested-MCP, and Nested-SCAD.
2.7 NMSE of the channel estimators. The (*) in the legend means that the leakage effect is not compensated, i.e., G = I is assumed in the channel estimation algorithm.
2.8 Performance of the algorithm for different values of N_DI and N_SD + N_MD: NMSE for varying N_SD + N_MD and fixed N_DI = 500.
2.9 Performance of the algorithm for different values of N_DI and N_SD + N_MD: NMSE for varying N_DI and fixed N_SD + N_MD = 50.
2.10 Performance of the proposed algorithm under different values of K = γ⌈(N_r − 1)/2⌉.
2.11 The channel delay-Doppler scattering function for real channel measurement data [1].
2.12 Delay-Doppler spreading function. Diffuse components are confined to a U-shaped area; Δτ = 0.3 μs and Δν = 500 Hz.
2.13 Delay-Doppler spreading function. Diffuse components are confined to a U-shaped area; Δτ = 0.3 μs and Δν = 200 Hz.
2.14 Comparison of NMSE vs. SNR for the CS method [4, 95] and the proposed method, i.e., group-wise (λ_e = 0), Nested-soft [90, 26], and Nested-SCAD regularizers.
2.15 Performance of the proposed algorithm for different values of Δτ and Δν.
2.16 Discrete representation of the regions R_1, R_2, and R_3.
3.1 The rank-one approximation's mean-squared error (MSE) versus the number of dominant paths in the channel for different values of Doppler scale support.
3.2 Normalized MSE versus signal-to-noise ratio (SNR): performance comparison of our proposed convex and non-convex approaches with the SA and SDP methods.
3.3 Comparison of normalized MSE versus the number of dominant paths in an MSML channel: our proposed convex and non-convex approaches with the SA method proposed in [7].
4.1 Normalized mean-squared error (NMSE) of estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side: leaked channel matrix estimation.
4.2 Normalized mean-squared error (NMSE) of estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side: channel Doppler shift estimation.
4.3 Doppler shifts are the roots of Q(ν); the figure depicts 1 − Q(ν) vs. ν for n_T = 100, p_0 = 5, and m_0 = 10.
4.4 Doppler shifts are the roots of Q(ν); the figure depicts 1 − Q(ν) vs. ν for n_T = 200, p_0 = 5, and m_0 = 10.
4.5 BER vs. SNR performance comparison. The label "Convex - not satisfied" indicates that the resolution constraint is not enforced in the channel realization, while "Convex - satisfied" indicates that the Doppler shifts are well separated as given in Theorem 4.
5.1 Test images used in our simulations.
5.2 Image reconstruction using partial-Fourier matrices with m/n = 10%, 30%, and 50% noisy measurements at SNR = 30 dB. The first row illustrates the images reconstructed by the NLR-CS method and the second row the images reconstructed by the JP2K-GD method. The test image Barbara is resized to 512 × 512.
5.3 Normalized reconstruction error in each iteration of the JP2K-GD method on compressive measurements, sampled by a random partial-Fourier measurement matrix; House 512 × 512 test image.
5.4 Reconstructed images in different iterations of the JP2K-GD method from compressive measurements sampled by a random partial-Fourier measurement matrix. The first row of images is associated with m/n = 5% and the second row with m/n = 10%. Numbers between the figures indicate the corresponding iteration number.

Abstract

In this thesis, we investigate and design new optimization techniques and algorithms to promote the target signal structures in their estimation/recovery from an ill-posed linear system of equations. In particular, we consider the estimation of time-varying wireless channels and the acquisition of image/video signals from a small set of measurements.

We present our proposed approaches to estimate two-dimensional (2D), time-varying wireless communication channels by promoting their prior physical structures. We show that geometric information and the intrinsic sparse structures of wireless communication channels can be exploited via proper regularization functions in the estimation problem to improve the accuracy of channel estimation with modest computational complexity. In particular, we study the channel estimation problem for vehicle-to-vehicle (V2V) channels, underwater acoustic (UWA) communication channels, and leaked time-varying narrowband communication channels.

We show that V2V channels have a joint element- and group-wise sparsity structure. To exploit this structure, we propose a nested joint sparse recovery method. Our method solves the jointly element/group sparse channel (signal) estimation problem using the proximity operators of a broad class of regularizers, based on the alternating direction method of multipliers. Furthermore, key properties of the proposed objective functions are proven, which ensure that the optimal solution is found by the new algorithm.

We also investigate the underwater channel estimation problem. The underwater channel can be represented by a multi-scale multi-lag (MSML) channel model. We show that the data matrix for the transmitted signal, after passing through the MSML channel, exhibits a low-rank representation. In addition, we show that the MSML channel estimation problem can be represented as a spectral estimation problem. By exploiting the intrinsic low-rank structure of the received signal, the Prony algorithm is adapted to estimate the Doppler scales (close frequencies), delays, and channel gains. Two strategies using convex and non-convex regularizers to remove noise from the corrupted signal are proposed. A bound on the reconstruction of the noiseless received signal provides guidance on the selection of the relaxation parameter in the convex optimizations.
We also investigate the estimation of a narrowband time-varying channel under the practical assumptions of finite block length and finite transmission bandwidth. We show that the signal, after passing through a time-varying narrowband channel, reveals a particular parametric low-rank structure that can be represented as a bilinear form. To estimate the channel, we propose two structured methods. The first method exploits the low-rank bilinear structure of the channel via a non-convex strategy based on alternating optimization between the delay and Doppler directions. Due to its non-convex nature, this approach is sensitive to local minima. Motivated by this issue, we propose a novel convex approach based on minimization of the atomic norm using time-domain measurements of the signal. Furthermore, for the convex approach, we characterize the optimality and uniqueness conditions and provide a theoretical guarantee for the noiseless channel estimation problem with a small number of measurements.

In the next part of this thesis, we consider employing a compression code to build an efficient (polynomial-time) compressed sensing recovery algorithm. Modern image and video compression codes employ elaborate structures in an effort to encode signals using a small number of bits. Compressed sensing recovery algorithms, on the other hand, use such structures to recover signals from a few linear observations. Despite the steady progress in the field of compressed sensing, the structures that are often used for signal recovery are still much simpler than those employed by state-of-the-art compression codes. The main goal of our study is to bridge this gap by answering the following question: can one employ a compression code to build an efficient (polynomial-time) compressed sensing recovery algorithm? In response to this question, the compression-based gradient descent (C-GD) algorithm is proposed. C-GD, which is a low-complexity iterative algorithm, is able to employ a generic compression code for compressed sensing and therefore enlarges the set of structures used in compressed sensing to those used by compression codes. We provide a convergence analysis of C-GD, a characterization of the required number of samples as a function of the rate-distortion function of the compression code, and a robustness analysis of C-GD to additive white Gaussian noise and other non-idealities in the measurement process. Finally, the presented simulation results show that, in image compressed sensing using compression codes such as JPEG2000, C-GD outperforms state-of-the-art methods.

Chapter 1
Introduction

1.1 Structured Signal Recovery: Compressed Sensing

The main problem of compressed sensing is to recover an unknown target signal x ∈ R^n from undersampled linear measurements y ∈ R^m,

    y = \mathbf{A}x + z,    (1.1)

where \mathbf{A} ∈ R^{m×n} and z ∈ R^m denote the measurement matrix and the measurement noise, respectively. Since m < n, x cannot be recovered accurately unless we leverage some prior information about the signal x. Such information can mathematically be expressed by assuming that x ∈ S, where S ⊂ R^n is a known set. Intuitively, it is expected that the "smaller" the set S, the fewer measurements m are required for recovering x. In other words, having more side information about the target signal x limits the feasible set in a way that enables the recovery algorithm to identify a high-quality estimate of x using fewer measurements.
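To make the measurement model in (1.1) concrete, the short sketch below (ours, not from the thesis) draws a k-sparse signal x, a random Gaussian measurement matrix A with m < n, and noisy measurements y; the dimensions and sparsity level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: m < n, so the system y = A x + z is underdetermined.
n, m, k = 256, 80, 8          # signal length, measurements, sparsity level

# Sparse target signal x with k nonzero entries (the structured set S here
# is the set of k-sparse vectors).
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)

# Random Gaussian measurement matrix A and additive noise z.
A = rng.standard_normal((m, n)) / np.sqrt(m)
z = 0.01 * rng.standard_normal(m)
y = A @ x + z                  # the linear observation model of (1.1)

print(f"underdetermined system: {m} equations, {n} unknowns, {k}-sparse target")
```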
In the last decade, researchers have explored several different instances of the set S, such as the class of sparse signals or low-rank matrices [34, 21, 20, 11, 12].

1.2 Wireless Communication

In this part, we briefly describe some of the physical phenomena associated with wireless channels. Wireless communication systems, i.e., systems transmitting information via electromagnetic (radio) or acoustic (sound) waves, have become ubiquitous. In many of these systems, the transmitter or the receiver is mobile. Even if both link ends are static, scatterers, i.e., objects that reflect or scatter the propagating waves, may move with significant velocities. These situations give rise to time variations of the wireless channel due to the Doppler effect. In their most general form, linear time-varying (LTV) channels are also referred to as time-frequency (TF) dispersive or doubly dispersive, as well as TF selective or doubly selective.

1.2.1 Multipath propagation, time- and frequency-dispersion

The presence of multiple scatterers (buildings, vehicles, hills, etc.) causes a transmitted radio wave to propagate along several different paths that terminate at the receiver. Hence, the receive antenna picks up a superposition of multiple attenuated copies of the transmit signal. This phenomenon is referred to as multipath propagation. Due to the different lengths of the propagation paths, the individual multipath components experience different delays (time shifts). The receiver thus observes a temporally smeared-out version of the transmit signal. Even though the medium itself is not physically dispersive (in the sense that different frequencies propagate with different velocities), such channels are termed time-dispersive. Time-dispersive channels are frequency-selective in the sense that different frequencies are attenuated differently [97, 72].

In many wireless systems, the transmitter, receiver, and/or scatterers are moving. In such situations, the emitted wave is subject to the Doppler effect. From classical physics, we know that, when the speeds of the transmitter and the receiver relative to the medium are lower than the velocity of waves in the medium, the relationship between the observed frequency f and the emitted frequency f_c is given by

    f = \left( \frac{c + v_r \cos\theta}{c + v_s \cos\theta} \right) f_c \approx \left( 1 + \frac{v \cos\theta}{c} \right) f_c,    (1.2)

where c is the velocity of waves in the medium; θ is the angle of arrival of the wave relative to the direction of motion of the receiver; v_r is the velocity of the receiver relative to the medium (positive if the receiver is moving towards the source, negative otherwise); v_s is the velocity of the source relative to the medium (positive if the source is moving away from the receiver, negative otherwise); and v = v_r − v_s. The frequency is decreased if either is moving away from the other. The approximation on the right-hand side of (1.2) holds for the practically predominant case v_r, v_s, v ≪ c.

For a general transmit signal s(t) with Fourier transform S(f), one can then show the following expressions for the receive signal (h is the complex attenuation factor):

    R(f) = h\, S(af) \;\longrightarrow\; r(t) = \frac{h}{a}\, s\!\left(\frac{t}{a}\right), \quad \text{with } a = 1 + \frac{v \cos\theta}{c_0}.    (1.3)

This shows that the Doppler effect results in a temporal/spectral scaling (i.e., compression or dilation).
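As a quick numeric sanity check of (1.2), the sketch below (ours; the V2V-style parameters are loosely taken from Table 1.1 further below) compares the exact Doppler relation with its first-order approximation for a head-on encounter.

```python
import numpy as np

def observed_freq_exact(f_c, v_r, v_s, theta, c=3e8):
    # Exact Doppler relation of Eq. (1.2): f = (c + v_r cos(theta)) / (c + v_s cos(theta)) * f_c
    return (c + v_r * np.cos(theta)) / (c + v_s * np.cos(theta)) * f_c

def observed_freq_approx(f_c, v, theta, c=3e8):
    # First-order approximation in (1.2), valid when v_r, v_s << c.
    return (1 + v * np.cos(theta) / c) * f_c

# V2V-like numbers: f_c = 5.6 GHz, relative closing speed 50 m/s, head-on (theta = 0).
f_c, v_r, v_s, theta = 5.6e9, 25.0, -25.0, 0.0
f_exact = observed_freq_exact(f_c, v_r, v_s, theta)
f_approx = observed_freq_approx(f_c, v_r - v_s, theta)
print(f"Doppler shift (exact):  {f_exact - f_c:.1f} Hz")   # ~933 Hz
print(f"Doppler shift (approx): {f_approx - f_c:.1f} Hz")  # nearly identical, since v << c
```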
In many practical cases, the transmit signal is effectively band-limited around the carrier frequency f_c, i.e., S(f) is effectively zero outside a band [f_c − B/2, f_c + B/2], where B ≪ f_c. The approximation af = f + \frac{v\cos\theta}{c_0} f \approx f + \frac{v\cos\theta}{c_0} f_c (whose accuracy increases with decreasing normalized bandwidth B/f_c) then implies

    R(f) \approx h\, S(f + f_d) \;\longrightarrow\; r(t) \approx h\, s(t)\, e^{-j2\pi f_d t}, \quad \text{with } f_d = \frac{v \cos\theta}{c_0} f_c.    (1.4)

Here, the Doppler effect essentially results in a frequency shift, with the Doppler shift frequency f_d being proportional to both the velocity v and the carrier frequency f_c. The relations (1.3) and (1.4) are often referred to as the wideband and narrowband Doppler effect, respectively [71].

In the general case of multipath propagation and moving transmitter, receiver, and/or scatterers, the received multipath components (echoes) experience different Doppler shifts, since the angles of arrival/departure and the relative velocities associated with the individual multipath components are typically different. Hence, the transmit signal is spread out in the frequency domain: it experiences frequency dispersion.

1.2.2 Deterministic description of LTV channels

To describe the time-dispersion and frequency-dispersion effects on the transmit signal, in general we can consider the time-scale channel model, i.e., each reflection is a delayed and time-scaled copy of the transmitted signal. Relative motion of the transmitter, scatterers, or receiver causes time dilations/contractions of the transmitted waveform s(t). Thus, each reflection is of the form [54, 60, 64]

    y_{\tau,a}(t) = h_{\tau,a}\, s\!\left(\frac{t-\tau}{a}\right),    (1.5)

and the received signal is a superposition of the reflections, characterized by the wideband spreading function H(τ, a):

    y(t) = \iint_{-\infty}^{+\infty} H(\tau,a)\, s\!\left(\frac{t-\tau}{a}\right) d\tau\, da.    (1.6)

A time-scale channel is called a wideband channel [54, 84]. Due to the physical limitations of signal propagation, it is reasonable to expect that H(τ, a) has finite support. The maximum possible rate of change in path length, which is constrained by the speeds of the objects in the environment, limits the support of H(τ, a) to a narrow range around the a = 1 line. Causality and the propagation loss associated with increasing path length effectively limit the support of H(τ, a) to a finite range in the τ direction. The support in the a direction causes a spreading in scale of the transmitted signal, and the support in the τ direction causes a spreading in time of the transmitted signal. Thus, channels described by (1.6) are often referred to as doubly spread channels [84, 54].

As discussed in Section 1.2.1, many signals and signaling environments satisfy the narrowband condition (B/f_c ≪ 1 and v/c ≪ 1), an assumption under which the time dilations or contractions are modeled as Doppler shifts. Under this assumption, each received reflection of the signal is assumed to be of the form

    y_{\tau,\nu}(t) = h_{\tau,\nu}\, s(t-\tau)\, e^{j2\pi\nu t}.    (1.7)

In the narrowband channel model, the received signal is a superposition of time-delayed and frequency-shifted copies of the input, and the channel is characterized by the narrowband spreading function H(τ, ν):

    y(t) = \iint_{-\infty}^{+\infty} H(\tau,\nu)\, s(t-\tau)\, e^{j2\pi\nu t}\, d\tau\, d\nu,    (1.8)

where H(τ, ν) typically has finite support in τ and ν due to the physical limitations of the channel [89, 63, 72].

While both spreading functions, in the wideband and narrowband scenarios, were motivated by a specific physical model (multipath propagation, Doppler effect), we see that equations (1.6) and (1.8) are a description of a general time-varying linear system,

    y(t) = \int_{-\infty}^{+\infty} h(t,\tau)\, s(t-\tau)\, d\tau,    (1.9)

where h(t,\tau) = \int_{-\infty}^{+\infty} H((1-a)t + a\tau,\, a)\, da for wideband channels, and h(t,\tau) = \int_{-\infty}^{+\infty} H(\tau,\nu)\, e^{j2\pi\nu t}\, d\nu for narrowband channels [84, 72].
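The narrowband superposition (1.8) is easy to prototype once delays are rounded to whole samples. The sketch below (ours; all gains, delays, and Doppler shifts are made-up values) synthesizes a received signal from a specular H(τ, ν) with three paths.

```python
import numpy as np

# Discretized version of the narrowband input-output relation (1.8):
# y(t) = sum_p h_p s(t - tau_p) exp(j 2 pi nu_p t), i.e., a specular
# H(tau, nu) consisting of a few point scatterers.
fs = 1e6                                    # sample rate (Hz)
t = np.arange(0, 1e-3, 1 / fs)              # 1 ms observation window
s = np.exp(2j * np.pi * 50e3 * t)           # toy transmit signal (50 kHz tone)

paths = [                                   # (complex gain, delay [s], Doppler [Hz])
    (1.0 + 0.0j, 0.0,    0.0),              # direct path
    (0.4 - 0.2j, 20e-6,  800.0),            # delayed, Doppler-shifted echo
    (0.2 + 0.1j, 45e-6, -500.0),
]

y = np.zeros_like(s)
for h_p, tau_p, nu_p in paths:
    d = int(round(tau_p * fs))              # delay in whole samples
    delayed = np.concatenate([np.zeros(d, dtype=complex), s[:len(s) - d]])
    y += h_p * delayed * np.exp(2j * np.pi * nu_p * t)

print("received power / transmit power:", np.mean(abs(y)**2) / np.mean(abs(s)**2))
```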
A popular model of an LTV wideband channel with specular scattering is specified in terms of the impulse response as

    h(t,\tau) = \sum_{p=1}^{P} h_p\, \delta(\tau - [\tau_p - a_p t]),    (1.10)

where h_p, τ_p, and a_p denote, respectively, the complex attenuation factor, time delay, and Doppler scale associated with the pth path. This model is called the multi-scale multi-lag (MSML) model [54, 60, 64, 102, 103]. For an LTV narrowband channel, the following model is widespread in the literature [89, 63]:

    h(t,\tau) = \sum_{p=1}^{P} h_p\, \delta(\tau - \tau_p)\, e^{j2\pi\nu_p t},    (1.11)

where h_p, τ_p, and ν_p denote, respectively, the attenuation factor, time delay, and Doppler frequency associated with the pth path.

Discrete channel models - Tapped delay line model

The complete mathematical description of LTV channels is rather complex: one must characterize T seconds of a transmission, i.e., how T seconds of the receive signal depend on T seconds of the transmit signal. For simplicity, we discuss channel representations in a discrete-time setting, where the channel's input-output relation reads [97, 72]

    y[n] = \sum_{m=0}^{M-1} h[n,m]\, s[n-m].    (1.12)

Here s[n], y[n], and h[n,m] are sampled versions of the transmitted signal s(t), the received signal y(t), and the channel h(t, τ) in (1.9), respectively. Furthermore, M ≈ ⌈τ_max / T_s⌉ ≈ ⌈τ_max B⌉ is the number of discrete channel taps, i.e., the maximum discrete-time delay. Here τ_max is the maximum delay spread of the channel and T_s denotes the sampling time, T_s = 1/f_s, where the sampling frequency f_s is assumed to be larger than B + f_D, with B the transmit bandwidth and f_D the Doppler spread of the channel.
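The discrete input-output relation (1.12) translates directly into code. The following sketch (ours; the two-tap channel is an illustrative assumption) implements the tapped delay line with causal indexing.

```python
import numpy as np

def ltv_channel_output(s, h):
    """Tapped delay line of Eq. (1.12): y[n] = sum_m h[n, m] s[n - m].

    s : (N,) input samples; h : (N, M) time-varying tap gains."""
    N, M = h.shape
    y = np.zeros(N, dtype=complex)
    for n in range(N):
        for m in range(min(M, n + 1)):       # causal: only s[n - m] with n - m >= 0
            y[n] += h[n, m] * s[n - m]
    return y

# Illustrative two-tap channel whose taps drift slowly over the block.
rng = np.random.default_rng(1)
N, M = 64, 2
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
n = np.arange(N)
h = np.stack([np.exp(2j * np.pi * 0.01 * n),          # tap 0: slow phase rotation
              0.5 * np.exp(-2j * np.pi * 0.02 * n)],  # tap 1: weaker, opposite drift
             axis=1)
y = ltv_channel_output(s, h)
print(y[:4])
```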
Basis expansion models (BEM)

A popular class of low-rank channel models uses an expansion, with respect to time n, of each tap of the channel impulse response h[n,m] into a basis {u_i[n]}_{i=0,…,I−1}, i.e.,

    h[n,m] = \sum_{i=0}^{I-1} c_i[m]\, u_i[n].    (1.13)

This basis expansion model (BEM) is motivated by the observation that the temporal (n) variation of h[n,m] is usually rather smooth due to the channel's limited Doppler spread, and hence {u_i[n]}_{i=0,…,I−1} can be chosen as a small set of smooth functions. Therefore, for TV channels, the variation of the basis parameters is much slower than the temporal channel variation [80, 71].

In most cases, the BEM (1.13) is considered only within a finite interval, hereafter assumed to be [0, N_r − 1] without loss of generality (Doppler resolution 1/N_r). The ith coefficient for the mth tap in (1.13) is given by

    c_i[m] = \langle h[\cdot,m], \hat{u}_i \rangle = \sum_{n=0}^{N_r-1} h[n,m]\, \hat{u}_i^*[n],    (1.14)

where {û_i[n]}_{i=0,…,I−1} is the bi-orthogonal basis for the span of {u_i[n]}_{i=0,…,I−1} (i.e., ⟨u_i, û_{i'}⟩ = δ_{i,i'} for all i, i'). In particular, choosing complex exponentials or polynomials for the u_i[n] results in Fourier and Taylor series, respectively [6]. The usefulness of (1.13) is due to the fact that the complexity of characterizing h[n,m] on the interval [0, N_r − 1] is reduced from N_r^2 to MI ≪ N_r^2 numbers. However, it is important to note that in most practical cases, an extension of the time interval will require a proportional increase in the BEM model order (i.e., I ∝ N_r).

By taking length-N_r discrete Fourier transforms (DFTs) of (1.13) with respect to n, we obtain the following expression for the discrete spreading function:

    H[k,m] \triangleq \sum_{n=0}^{N_r-1} h[n,m]\, e^{-j2\pi \frac{kn}{N_r}}.    (1.15)

The BEM most often employed in practice uses a basis of complex exponentials [80, 71]. This can be motivated by considering the inverse DFT of (1.15), i.e.,

    h[n,m] = \frac{1}{N_r} \sum_{k=0}^{N_r-1} H[k,m]\, e^{j2\pi \frac{kn}{N_r}}.    (1.16)

Assuming H[k,m] = 0 for |k| > K, with K denoting the maximum discrete Doppler shift, results in the so-called (critically sampled) complex exponential (CE) BEM. Here the model order equals I = 2K + 1, and the basis functions and coefficients are given by u_i[n] = e^{j2\pi (i-K)n/N_r} and c_i[m] = \frac{1}{N_r} H[i-K, m], respectively. By substituting (1.16) in (1.12), we have

    y[n] = \sum_{m=0}^{M-1} \sum_{k=0}^{N_r-1} \frac{H[k,m]}{N_r}\, s[n-m]\, e^{j2\pi \frac{kn}{N_r}}.    (1.17)
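A minimal sketch of the critically sampled CE-BEM around (1.16) (ours; N_r and K are illustrative): it builds the complex-exponential basis, synthesizes one Doppler-band-limited tap trajectory, and recovers the BEM coefficients through DFT inner products.

```python
import numpy as np

# CE-BEM basis u_i[n] = exp(j 2 pi (i - K) n / N_r), model order I = 2K + 1.
N_r, K = 128, 3
n = np.arange(N_r)
I = 2 * K + 1
U = np.exp(2j * np.pi * np.outer(n, np.arange(I) - K) / N_r)   # (N_r, I) basis

# A synthetic tap trajectory h[n] band-limited to |k| <= K in discrete Doppler,
# so it is represented exactly by the CE-BEM.
rng = np.random.default_rng(2)
c = rng.standard_normal(I) + 1j * rng.standard_normal(I)        # BEM coefficients
h = U @ c                                                       # h[n] = sum_i c_i u_i[n]

# The DFT exponentials are orthogonal, so c_i = <h, u_i> / N_r recovers them.
c_hat = (U.conj().T @ h) / N_r
print("max coefficient error:", np.max(np.abs(c_hat - c)))     # numerically zero
```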
1.2.3 Global parameters of Cellular, V2V, and UWA channels

For many design and analysis tasks in wireless communications, only global channel parameters are relevant. In this section we briefly review parameters such as the Doppler spread, coherence time, delay spread, and coherence bandwidth. The delay spread, τ_s, and the Doppler spread, f_D (≈ v f_c / c), are defined as the root-mean-square (RMS) widths of the delay power profile and the Doppler power profile, respectively [97, 72]. Sometimes it is more convenient to work with the reciprocals of the Doppler spread and the delay spread,

    T_c \propto \frac{1}{f_D},    (1.18)
    B_c \propto \frac{1}{\tau_s},    (1.19)

which are known as the coherence time and coherence bandwidth, respectively. These two parameters can be used to quantify the duration and bandwidth within which the channel is approximately constant (or, at least, strongly correlated). Further insight about the selectivity of the channel can be derived from these parameters: the coherence bandwidth indicates the severity of the channel's frequency selectivity, while the time selectivity is reflected in the RMS Doppler spread and relates to the coherence time. In Table 1.1, we summarize these parameters for cellular, V2V, and UWA channels.

Table 1.1: Global parameters of wireless communication channels

  Parameter                       Cellular       V2V            UWA
  Propagation speed (c)           3×10^8 m/s     3×10^8 m/s     1500 m/s
  Relative velocity TX-RX (v)     20-30 m/s      50-100 m/s     10-20 m/s
  Carrier frequency (f_c)         900 MHz        5.6 GHz        10-20 kHz
  Bandwidth (B)                   100-500 kHz    20-100 MHz     1-10 kHz
  Delay spread (τ_s)              0.01-1 μs      1-2 μs         5-50 ms
  Doppler spread (f_D)            100-200 Hz     1-2 kHz        10-100 Hz
  Doppler scales (a)              -              -              0.99-1.01
  Coherence bandwidth (B_c)       1-100 MHz      200-700 kHz    100-200 Hz
  Coherence time (T_c)            1-5 ms         100-500 μs     0.1-1 s
  Bandwidth ratio (B/f_c)         10^-4          10^-3          0.5-1
  Velocity ratio (v/c)            10^-7          10^-6          0.01
  Channel type                    Narrowband     Narrowband     Wideband
  Number of channel taps (M)      1              50-200         50-500

Cellular communication channels

From the data presented in Table 1.1, we see that cellular communication channels are narrowband because B/f_c ≈ 10^-4 ≪ 1 and v/c ≈ 10^-7 ≪ 1. Furthermore, these channels are often non-selective (in both frequency and time) due to the fact that B ≪ B_c and the symbol duration is much shorter than T_c. Furthermore, the number of discrete channel taps is M ≈ ⌈τ_max B⌉ = 1. This means that cellular communication channels can be well modeled with a single gain (tap), i.e., y[n] = H[n] s[n]. Since the channel coherence time is much larger than a symbol duration, during the transmission of a block of data we can simply consider H[n] ≈ H.

V2V communication channels

From the data presented in Table 1.1, we see that V2V communication channels are narrowband because B/f_c ≈ 10^-2 ≪ 1 and v/c ≈ 10^-6 ≪ 1. Based on results reported in [76, 73, 1] for V2V channels, the coherence times calculated from practical measurements range from T_c = 100 to 500 μs, and the coherence bandwidths range from B_c = 200 kHz to 700 kHz. Hence the vehicular channel is strongly time- and frequency-selective. The number of discrete channel taps is M ≈ ⌈τ_max / T_s⌉ ≈ ⌈τ_max B⌉ ≈ 50 to 200. Therefore, in the V2V channel, the possible number of non-zero channel components is larger and, due to the channel's frequency/time-selectivity, the effect of diffuse components in the channel is significant. In Chapter 2, we propose a novel approach to estimate the V2V channel based on the relationships in equations (1.12) and (1.17) and on regularization of the particular structure of V2V channels in the delay-Doppler domain.

UWA communication channels

From the data presented in Table 1.1, we see that UWA communication channels are wideband because of the large bandwidth-to-carrier-frequency ratio, i.e., B/f_c ≈ 0.5 to 1, and the low speed of (acoustic) signals in water, i.e., v/c ≈ 10^-2. To the best of our knowledge, there is no rule of thumb to calculate the coherence time for underwater communication. However, based on practical measurement results reported in [44, 59] and [47, Chapter 9], the coherence time for UWA communication channels is about 0.1 to 1 s, and the coherence bandwidth is about 100 Hz to 1 kHz. The number of discrete channel taps is M ≈ ⌈τ_max / T_s⌉ ≈ ⌈τ_max B⌉ ≈ 50 to 500. However, due to the sparse nature of UWA channels, the number of non-zero channel coefficients is much smaller than these numbers [60, 44, 59]. In Chapter 3, we propose a novel approach to estimate the UWA channel based on the MSML channel model for wideband channels given in (1.10).

Chapter 2
V2V Channel Estimation

2.1 Introduction

Vehicle-to-vehicle (V2V) communication is central to future intelligent transportation systems, which will enable efficient and safer transportation with reduced fuel consumption [65]. In general, V2V communication is anticipated to be short range, with transmission ranges varying from a few meters to a few kilometers between two mobile vehicles on a road. A big challenge in realizing V2V communication is the inherent fast channel variation (faster than in cellular systems [8, 75]) due to the mobility of both the transmitter and the receiver. Since channel state information can improve communication performance, a fast algorithm to accurately estimate V2V channels is of interest. Furthermore, V2V channels are highly dependent on the geometry of the road and the local physical environment [73, 65].

A popular estimation strategy for fast time-varying channels is to apply Wiener filtering [6, 107]. Recently, [107, 8] presented an adaptive Wiener filter to estimate V2V channels using subspace selection. The main drawback of Wiener filtering is that knowledge of the scattering function is required [6]; however, the scattering function is typically not known at the receiver. Often, a flat spectrum in the delay-Doppler domain is assumed, which introduces performance degradation due to the mismatch with respect to the true scattering function [107].

In this chapter, we adopt a V2V channel model in the delay-Doppler domain, using the geometry-based stochastic channel model proposed in [57]. Our characterization of this model reveals the special structure of the V2V channel components in the delay-Doppler domain. We show that the delay-Doppler representation of the channel exhibits three key regions; within these regions, the channel is a mixture of specular reflections and diffuse components. While the specular contributions appear sparsely all over the delay-Doppler plane, the diffuse contributions are concentrated in specific regions of the delay-Doppler plane. Channel measurements from a real data experiment also confirm our analysis of the V2V channel structure in the delay-Doppler domain.
In our prior work [68, 69], a Hybrid Sparse/Diffuse (HSD) model was presented for a mixture of sparse and Gaussian diffuse components in a static channel, which we have adapted to estimate a V2V channel [13]. This approach requires information about the V2V channel, such as the statistics and power delay profile (PDP) of the diffuse and sparse components [68]. Another approach for time-varying frequency-selective channel estimation is via compressed sensing (CS) or sparse approximation based on an l1-norm regularization [4, 95, 24]. These algorithms perform well for channels with a small number of scatterers or clusters of scatterers. For V2V channels, diffuse contributions from reflections along the roadside will degrade the performance of CS methods that consider only element-wise sparsity [95, 107].

Herein, we exploit the particular structure of the V2V channel, inspired by recent work in 2D sparse signal estimation [92, 37], to design a novel joint element- and group-wise sparsity estimator that estimates the 2D time-varying V2V channel from received data. Our proposed method provides general machinery to solve the jointly sparse structured estimation problem with a broad class of regularizers that promote sparsity. We show that our proposed algorithm covers both well-known convex and non-convex regularizers, such as the smoothly clipped absolute deviation (SCAD) regularizer [38] and the minimax concave penalty (MCP) [108], which were proposed for element-wise sparsity estimation. We also presented a general way to design a proper regularizer function for the joint sparsity problem in our previous work [14]. Our method can be applied to scenarios beyond V2V channels.

Recent algorithms for hierarchical sparsity (sparse groups with sparsity within the groups) [90, 26, 40] also consider a mixture of penalty functions (group-wise and element-wise). Of particular note is [26], where a similar nested solution is determined, also in combination with the alternating direction method of multipliers (ADMM), as we do herein. Our modeling assumptions can be viewed as a generalization of their assumptions, which results in the need for different methods for proving the optimality of the nested structures. In particular, [26] examines a particular proximity operator for which the original regularizing function is never specified. (Footnote 1: Given a particular regularization function, there is a unique proximity operator; but for a given proximity operator, there may exist more than one regularizer function.) This proximity operator is built by generalizing the structure of the proximity operator for the l_p norm. In contrast, we begin with a general class of regularizing functions and specify the properties needed for such functions (allowing for both convex and non-convex functions). Thus, our proof methods rely only on the properties induced by these assumptions. Furthermore, our results are also applicable to the problem of hierarchical sparsity [90, 26, 40]. We observe that the results in [90, 26, 40] cannot be applied to non-convex regularizer functions such as SCAD and MCP, due to their concavity and a non-linear dependence on the regularization parameter.

To find the optimal solution of our joint sparsity objective function, we take advantage of ADMM [36], which is a very flexible and efficient tool for optimization problems whose objective functions are a combination of multiple terms. Furthermore, we use the proximity operator [31] to show that the estimation can be performed using simple thresholding operations in each iteration, resulting in low complexity. We also address the channel leakage effect, due to finite block length and bandwidth, for channel estimation in the delay-Doppler plane.
In [95], the basis expansion for the scattering function is optimized to compensate for the leakage, which can degrade performance. The resulting expansion in [95] is computationally expensive. Herein, we take an alternative view and show that, with the proper sampling resolution in time and frequency, we can explicitly derive the leakage pattern and robustify the channel estimator with this knowledge at the receiver, to improve the sparsity, compensate for leakage, and maintain modest algorithm complexity.

The main contributions of this chapter are as follows:
• A general framework for the joint sparsity estimation problem is proposed, which covers a broad class of regularizers, including convex and non-convex functions. Furthermore, we show that the solution of the joint sparse estimation problem is computed by applying the element-wise and group-wise structure in a nested fashion using simple thresholding operations.
• We provide a simple model for the V2V channel in the delay-Doppler plane, using the geometry-based stochastic channel modeling proposed in [57]. We characterize the three key regions in the delay-Doppler domain with respect to the presence of sparse specular and diffuse components. This structure is verified by experimental channel measurement data, as presented in Section 2.8.
• The leakage pattern is explicitly computed and a compensation procedure is proposed.
• A low-complexity joint element- and group-wise sparsity V2V channel estimation algorithm is proposed, exploiting the aforementioned channel model and optimization result.
• We use extensive numerical simulation and experimental channel measurement data to investigate the performance of the proposed joint sparse channel estimators and show that our method outperforms classical and compressed sensing methods [107, 13, 4, 95].

The rest of this chapter is organized as follows. In Section 2.2, we review some definitions from variational analysis and present our key optimization result for joint sparse and group-sparse signal estimation. In Section 2.3, the system model for V2V communications is presented. In Section 2.4, the geometry-based V2V channel model is developed. The observation model and leakage effect are computed in Section 2.5. In Section 2.6, the channel estimation algorithm for the time-varying V2V channel model using the joint sparsity structure is presented. In Section 2.7, we provide simulation results and compare the performance of the estimators. In Section 2.8, real channel measurements are provided to confirm the validity of the channel model and the numerical simulations. Finally, Section 2.9 concludes this chapter. We present our proofs, the region specifier algorithm, and the analysis of our proximal ADMM in Appendices 2.10-2.14.

2.2 Jointly Sparse Signal Estimation

In this section, we propose a unified framework using proximity operators [86] to solve the optimization problem imposed by a structured sparse signal estimation problem. (Footnote 2: A jointly sparse signal in this paper is a signal that has both element-wise and group-wise sparsity.) Then, we apply this machinery to estimate the V2V channel, exploiting the group- and element-wise sparsity structure discovered in Section 2.4.
Proximal methods have drawn increasing attention in the signal processing community (e.g., [31] and numerous references therein) and the machine learning community (e.g., [3, 106] and references therein), due to their convergence rates (optimal for the class of first-order techniques) and their ability to accommodate large, non-smooth, convex (and non-convex) problems. In proximal algorithms, the base operation is evaluating the proximal operator of a function, which involves solving a small optimization problem. These sub-problems can be solved with standard methods, but they often admit closed-form solutions or can be solved efficiently with specialized numerical methods. Our main theoretical contribution is stated in Theorem 1 in Section 2.2.2. In this theorem, we show that, for our proposed class of regularization functions, the nested joint sparse structure can be recovered by applying the element-wise sparsity and group-wise sparsity structures in a nested fashion; see Eq. (2.10).

2.2.1 Proximity Operator

We start with the definition of a proximity operator from variational analysis [86].

Definition 1. Let φ(a; λ) be a continuous real-valued function of a ∈ R^N. The proximity operator P_{λ,φ}(b) is defined as

    P_{\lambda,\varphi}(b) := \operatorname{argmin}_{a \in \mathbb{R}^N} \left\{ \frac{1}{2}\, \|b - a\|_2^2 + \varphi(a;\lambda) \right\},    (2.1)

where b ∈ R^N and λ > 0.

Remark 1. If φ(·) is a separable function, i.e., φ(a; λ) = \sum_{i=1}^{N} f(a[i]; λ), then [P_{λ,φ}(a)]_i = P_{λ,f}(a[i]).

Remark 2. If the objective function J(a) = \frac{1}{2}\|b - a\|_2^2 + φ(a; λ) is a strictly convex function, the proximity operator of φ(a; λ) admits a unique solution.

Remark 3. Furthermore, P_{λ,φ}(b) is characterized by the inclusion

    \forall (a^*, b), \quad a^* = P_{\lambda,\varphi}(b) \iff b - a^* \in \partial\varphi(a^*;\lambda),    (2.2)

where ∂φ(·) is the subgradient of the function φ [86].

Note that φ does not need to be a convex or differentiable function to satisfy the conditions noted in Remarks 2 and 3. Proximity operators have a very natural interpretation in terms of denoising [31, 86]. Consider the problem of estimating a vector a ∈ R^N from an observation b ∈ R^N, namely b = a + n, where n is additive white Gaussian noise. If we consider the regularization function φ(a; λ) as the prior information about the vector a, then P_{λ,φ}(b) can be interpreted as a maximum a posteriori (MAP) estimate of the vector a [43].

Two well-known types of prior information about the structure of a vector are element-wise sparsity and group-wise sparsity. An N-vector a is element-wise sparse if the number of non-zero (or larger than some threshold) entries in the vector is small compared to the length of the vector. To define group-wise sparsity, let {I_i}_{i=1}^{N_g} be a partition of the index set {1, 2, …, N} into N_g groups, namely ∪_{i=1}^{N_g} I_i = {1, 2, …, N} and I_i ∩ I_j = ∅ for all i ≠ j. We define group vectors as follows:

    a_i[k] = \begin{cases} a[k], & k \in \mathcal{I}_i \\ 0, & k \notin \mathcal{I}_i \end{cases}    (2.3)

for i = 1, …, N_g, where N_g is the total number of group vectors. Based on the above definition, we have a = \sum_{i=1}^{N_g} a_i, and the nonzero elements of a_i and a_j are non-overlapping for i ≠ j. The vector a is called a group-wise sparse vector if the number of group vectors a_i, i = 1, …, N_g, with non-zero l2-norm (or l2-norm larger than some threshold) is small compared to the total number of group vectors, N_g.
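Definition 1 can be verified numerically for the workhorse case φ(a; λ) = λ‖a‖₁, whose proximity operator is soft thresholding (this closed form reappears as Eq. (2.13) below). A minimal check, assuming SciPy is available:

```python
import numpy as np
from scipy.optimize import minimize

def prox_l1(b, lam):
    # Closed-form proximity operator of phi(a) = lam * ||a||_1 (soft thresholding).
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Check the closed form against a direct numerical solution of Definition 1:
# argmin_a 0.5 * ||b - a||^2 + lam * ||a||_1.
rng = np.random.default_rng(3)
b, lam = rng.standard_normal(5), 0.7
obj = lambda a: 0.5 * np.sum((b - a) ** 2) + lam * np.sum(np.abs(a))
a_num = minimize(obj, x0=b, method="Nelder-Mead", options={"xatol": 1e-8}).x
print(prox_l1(b, lam))
print(a_num)   # the two solutions should closely agree
```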
2.2.2 Optimality of the Nesting of Proximity Operators

We consider the estimation of the vector a from the vector b as noted above. Furthermore, suppose that the vector a is jointly sparse. The desired optimization problem is then

    \hat{a} = \operatorname{argmin}_{a \in \mathbb{R}^N} \left\{ \frac{1}{2}\, \|b - a\|_2^2 + \varphi_g(a;\lambda_g) + \varphi_e(a;\lambda_e) \right\},    (2.4)

where φ_g(a; λ_g) is a regularization term that induces group sparsity and φ_e(a; λ_e) is a term that induces element-wise sparsity. In general, the weighting parameters λ_g > 0, λ_e > 0 can be selected from a given range via cross-validation, by varying one of the parameters and keeping the other fixed [40]. We further consider penalty functions φ_g(a; λ_g) and φ_e(a; λ_e) of the form

    \varphi_g(a;\lambda_g) = \sum_{j=1}^{N_g} f_g(\|a_j\|_2;\lambda_g),    (2.5)
    \varphi_e(a;\lambda_e) = \sum_{i=1}^{N} f_e(a[i];\lambda_e),    (2.6)

where f_g : R → R and f_e : R → R are continuous functions that promote sparsity on groups and elements, respectively, N is the length of the vector a, and N_g is the number of groups in the vector a. Our goal here is to derive the solution of the optimization problem in (2.4) using the proximity operators of the functions f_e and f_g. We state the conditions imposed on the regularizers in terms of the univariate functions f_g(x; λ) and f_e(x; λ), which promote sparsity and also control the stability of the solution of the optimization problem in (2.4).

Assumption I: For k ∈ {e, g},
  i. f_k is a non-decreasing function of x for x ≥ 0; f_k(0; λ) = 0; and f_k(x; 0) = 0.
  ii. f_k is differentiable except at x = 0.
  iii. For all z ∈ ∂f_k(0; λ), we have |z| ≤ λ, where ∂f_k(0; λ) is the subgradient of f_k at zero. (Footnote 3: A precise definition of subgradients of functions is given in [86], page 301.)
  iv. There exists a μ ≤ 1/2 such that the function f_k(x; λ) + μx² is convex.
  v. f_g is a homogeneous function, i.e., f_g(αx; αλ) = α² f_g(x; λ) for all α > 0.
  vi. f_e is a scale-invariant function, i.e., f_e(αx; λ) = f_e(x; αλ) = α f_e(x; λ) for all α > 0.

It can be observed that conditions (i), (iv), (v), and (vi) ensure the existence of the minimizer of the optimization problem in Eq. (2.4), and they induce norm properties on the regularizer function. Assumption (ii) promotes sparsity; (iii) controls the stability of the solution in Eq. (2.4) and guarantees the optimality of the solution of the optimization problem (P_0) in Section 2.6. Finally, Assumption (iv) enables the inclusion of many non-convex functions in the optimization problem. Note that the scale-invariance property of f_e implies that f_e also satisfies (v) (is a homogeneous function).

Many pairs of regularizer functions satisfy Assumption I. For instance, the l1-norm, namely f_g(x; λ_g) = λ_g|x| and f_e(x; λ_e) = λ_e|x|, satisfies Assumption I (see Appendix 2.10). We note that two recently popularized non-convex functions, the SCAD and MCP regularizers, also satisfy Assumption I. It is worth pointing out that SCAD and MCP are more effective in promoting sparsity than the l_p norms. The SCAD regularizer [38] is given by

    f_g(x;\lambda) = \begin{cases} \lambda|x|, & |x| \le \lambda \\ -\dfrac{x^2 - 2\mu_S\lambda|x| + \lambda^2}{2(\mu_S - 1)}, & \lambda < |x| \le \mu_S\lambda \\ \dfrac{(\mu_S + 1)\lambda^2}{2}, & |x| > \mu_S\lambda \end{cases}    (2.7)

where μ_S > 2 is a fixed parameter, and the MCP regularizer [108] is

    f_g(x;\lambda) = \operatorname{sign}(x) \int_0^{|x|} \left( \lambda - \frac{z}{\mu_M} \right)_+ dz,    (2.8)

where (x)_+ = max(0, x) and μ_M > 0 is a fixed parameter. In Appendix 2.10, we show that Assumption I is met for SCAD and MCP.
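For reference, here is a direct transcription of the SCAD penalty (2.7) and a closed-form evaluation of the MCP integral (2.8) on |x| (our sketch; we evaluate the unsigned penalty, and the parameter defaults are illustrative).

```python
import numpy as np

def scad_penalty(x, lam, mu_s=3.7):
    # SCAD regularizer of Eq. (2.7); requires mu_s > 2.
    ax = np.abs(x)
    return np.where(
        ax <= lam, lam * ax,
        np.where(ax <= mu_s * lam,
                 -(ax**2 - 2 * mu_s * lam * ax + lam**2) / (2 * (mu_s - 1)),
                 (mu_s + 1) * lam**2 / 2))

def mcp_penalty(x, lam, mu_m=2.0):
    # MCP of Eq. (2.8) evaluated on |x|: the integral of (lam - z/mu_m)_+ from 0
    # to |x| equals lam|x| - x^2/(2 mu_m), saturating at mu_m lam^2 / 2.
    ax = np.abs(x)
    return np.where(ax <= mu_m * lam,
                    lam * ax - ax**2 / (2 * mu_m),
                    mu_m * lam**2 / 2)

x = np.linspace(-5, 5, 11)
print(scad_penalty(x, lam=1.0))
print(mcp_penalty(x, lam=1.0))
```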
The following theorem is our main technical result; it presents the solution of the optimization problem in (2.4) based on the proximity operators of the functions f_g and f_e.

Theorem 1. Consider functions f_e and f_g that satisfy Assumption I. The optimization problem in (2.4) can be decoupled as follows:

    \hat{a}_i = \operatorname{argmin}_{a_i \in \mathbb{R}^N} \left\{ \frac{1}{2}\, \|b_i - a_i\|_2^2 + g(a_i;\lambda_g) + E(a_i;\lambda_e) \right\},    (2.9)

where the index i = 1, …, N_g denotes the group number, g(a_i; λ_g) = f_g(‖a_i‖_2; λ_g), and E(a_i; λ_e) = \sum_j f_e(a_i[j]; λ_e). Then,

    \hat{a}_i = P_{\lambda_g, g}\big( P_{\lambda_e, E}(b_i) \big),    (2.10)

where P_{λ_g,g} and P_{λ_e,E} are the proximity operators of g and E, respectively (see Definition 1), and can be written as

    P_{\lambda_g, g}(b) = \frac{P_{\lambda_g, f_g}(\|b\|_2)}{\|b\|_2}\, b,    (2.11)

    [P_{\lambda_e, E}(b)]_j = P_{\lambda_e, f_e}(b[j]).    (2.12)

The proof is provided in Appendix 2.11. This result states that, within a group, joint sparsity is achieved by first applying the element proximity operator and then the group proximity operator. We observe that f_g and f_e can be chosen structurally different, the resultant complexity is modest, and we can use our result with any optimization algorithm based on proximity operators, e.g., ADMM [36], proximal gradient methods [86], proximal splitting methods [31], and so on.

2.2.3 Proximity Operator of Sparse-Inducing Regularizers

In this section, we compute closed-form expressions for the proximity operators of the sparsity-inducing regularizers introduced in Section 2.2.2, using Definition 1 and Remarks 1 to 3. All of the aforementioned regularizers satisfy the condition noted in Remark 2 due to property (iv) in Assumption I. Using Definition 1, we can compute the proximity operators of the l_p, SCAD, and MCP regularizers. The proximity operator for the l_p-norm is given by P_{\lambda, f_g}(x) = \operatorname{sign}(x) \max\{0, \lambda u\}, where u^{p-1} + \frac{u}{p} = \frac{|x|}{\lambda p}. For p = 1, i.e., f_e(x; λ) = λ|x|, the resulting operator is often called soft-thresholding (see e.g. [40]):

    P_{\lambda, f_e}(x) = \operatorname{sign}(x) \max\{0, |x| - \lambda\}.    (2.13)

The closed-form solution of the proximity operator for the SCAD regularizer [38] can be written as

    P_{\lambda, f_g}(x) = \begin{cases} 0, & |x| \le \lambda \\ x - \operatorname{sign}(x)\lambda, & \lambda < |x| \le 2\lambda \\ \dfrac{x - \operatorname{sign}(x)\frac{\mu_S\lambda}{\mu_S - 1}}{1 - \frac{1}{\mu_S - 1}}, & 2\lambda < |x| \le \mu_S\lambda \\ x, & |x| > \mu_S\lambda \end{cases}    (2.14)

and finally the proximity operator for the MCP regularizer [108] is

    P_{\lambda, f_g}(x) = \begin{cases} 0, & |x| \le \lambda \\ \dfrac{x - \operatorname{sign}(x)\lambda}{1 - \frac{1}{\mu_M}}, & \lambda < |x| \le \mu_M\lambda \\ x, & |x| > \mu_M\lambda \end{cases}    (2.15)

Figure 2.1: MCP and SCAD proximity operators; λ = 1 is considered (μ_S ∈ {2.1, 3, 300}, μ_M ∈ {1.1, 2, 200}).

In Fig. 2.1, the MCP and SCAD proximity operators are depicted for λ = 1 and three different values of the parameters μ_S and μ_M. It is clear that when μ_S and μ_M are large, both SCAD and MCP operators behave like the soft-thresholding operator (for x smaller than μ_S λ and μ_M λ, respectively).

In the sequel, we first model a V2V communication system. Then, we show that the V2V channel representation in the delay-Doppler domain has both element-wise and group-wise sparsity structure. Finally, we apply our key optimization result derived in this section to estimate the V2V channel using an ADMM algorithm.
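A compact sketch of Theorem 1 in action (ours, not the thesis code): the element-wise proximity operator (2.12) is applied first, then the group shrinkage (2.11), which is exactly the nesting of Eq. (2.10). Soft thresholding (2.13) is used for both stages by default; prox_scad implements (2.14) and can be swapped in.

```python
import numpy as np

def prox_soft(x, lam):
    # Soft thresholding, Eq. (2.13).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_scad(x, lam, mu_s=3.7):
    # Closed-form SCAD proximity operator, Eq. (2.14).
    ax = np.abs(x)
    mid = (x - np.sign(x) * mu_s * lam / (mu_s - 1)) / (1 - 1 / (mu_s - 1))
    return np.where(ax <= lam, 0.0,
                    np.where(ax <= 2 * lam, x - np.sign(x) * lam,
                             np.where(ax <= mu_s * lam, mid, x)))

def nested_prox(b, groups, lam_e, lam_g, prox_elem=prox_soft, prox_scalar=prox_soft):
    """Jointly sparse estimate via Theorem 1, Eq. (2.10): element-wise proximity
    operator first, then the group-wise one, applied per group."""
    a = np.zeros_like(b)
    for idx in groups:
        u = prox_elem(b[idx], lam_e)            # element-wise step, Eq. (2.12)
        norm = np.linalg.norm(u)
        if norm > 0:                            # group shrinkage, Eq. (2.11)
            a[idx] = (prox_scalar(norm, lam_g) / norm) * u
    return a

b = np.array([3.0, 0.2, -0.1, 0.05, 0.1, -0.02])
groups = [np.arange(0, 3), np.arange(3, 6)]     # two groups of three entries
print(nested_prox(b, groups, lam_e=0.15, lam_g=0.5))
# the first group survives (driven by its large entry); the second is zeroed out
```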
2.3 Communication System Model

We will consider communication between two vehicles as shown in Fig. 2.2. The transmitted signal s(t) is generated by the modulation of the transmitted pilot sequence s[n] onto the transmit pulse p_t(t) as

    s(t) = \sum_{n=-\infty}^{+\infty} s[n]\, p_t(t - nT_s),    (2.16)

where T_s is the sampling period. Note that this signal model is quite general and encompasses OFDM signals as well as single-carrier signals. The signal s(t) is transmitted over a linear, time-varying V2V channel. The received signal y(t) can be written as

    y(t) = \int_{-\infty}^{+\infty} h(t,\tau)\, s(t-\tau)\, d\tau + z(t).    (2.17)

Here, h(t, τ) is the channel's time-varying impulse response, and z(t) is complex white Gaussian noise. At the receiver, y(t) is converted into a discrete-time signal using an anti-aliasing filter p_r(t). That is,

    y[n] = \int_{-\infty}^{+\infty} y(t)\, p_r(nT_s - t)\, dt.    (2.18)

The relationship between the discrete-time signal s[n] and the received signal y[n], using Eqs. (2.16)-(2.18), can be written as

    y[n] = \sum_{m=-\infty}^{+\infty} h_l[n,m]\, s[n-m] + z[n],    (2.19)

where h_l[n,m] is the discrete time-delay representation of the observed channel (Footnote 4: The subscript "l" hereafter denotes the channel with leakage. We discuss the channel leakage effect in more detail in Section 2.5.), which is related to the continuous-time channel impulse response h(t, τ) as follows:

    h_l[n,m] = \iint_{-\infty}^{+\infty} h(t + nT_s, \tau)\, p_t(t - \tau + mT_s)\, p_r(-t)\, dt\, d\tau.

With some loss of generality, we assume that p_r(t) has a root-Nyquist spectrum with respect to the sample duration T_s, which implies that z[n] is a sequence of i.i.d. circularly symmetric complex Gaussian random variables with variance σ_z², and that h_l[n,m] is causal with maximum delay M − 1, i.e., h_l[n,m] = 0 for m ≥ M and m < 0. We can then write

    y[n] = \sum_{m=0}^{M-1} \sum_{k=-K}^{K} H_l[k,m]\, e^{j\frac{2\pi nk}{2K+1}}\, s[n-m] + z[n],    (2.20)

for n = 0, 1, …, N_r − 1, where 2K + 1 ≥ N_r, and

    H_l[k,m] = \frac{1}{2K+1} \sum_{n=0}^{N_r-1} h_l[n,m]\, e^{-j\frac{2\pi nk}{2K+1}}, \quad |k| \le K,    (2.21)

is the discrete delay-Doppler spreading function of the channel. Here, N_r denotes the total number of received signal samples used for the channel estimation.
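The observation model (2.20) can be simulated directly from a sparse spreading function H_l[k,m]. The sketch below (ours; the sizes are illustrative and a random unit-modulus pilot stands in for an actual pilot design) generates noisy received samples from two specular components.

```python
import numpy as np

def received_signal(s, H, K, N_r, noise_std=0.0, rng=None):
    """Discrete delay-Doppler observation model of Eq. (2.20):
    y[n] = sum_m sum_k H[k, m] exp(j 2 pi n k / (2K+1)) s[n - m] + z[n].

    H : (2K+1, M) spreading function, rows indexed k = -K..K; needs 2K+1 >= N_r."""
    rng = rng or np.random.default_rng()
    M = H.shape[1]
    n = np.arange(N_r)
    y = np.zeros(N_r, dtype=complex)
    for m in range(M):
        s_del = np.concatenate([np.zeros(m, dtype=complex), s[:N_r - m]])
        for ki, k in enumerate(range(-K, K + 1)):
            y += H[ki, m] * np.exp(2j * np.pi * n * k / (2 * K + 1)) * s_del
    if noise_std > 0:
        y += noise_std * (rng.standard_normal(N_r)
                          + 1j * rng.standard_normal(N_r)) / np.sqrt(2)
    return y

# Illustrative sparse spreading function: two specular components.
N_r, K, M = 64, 32, 8
H = np.zeros((2 * K + 1, M), dtype=complex)
H[K + 0, 0] = 1.0          # LOS: zero Doppler, zero delay
H[K + 5, 3] = 0.3j         # one discrete scatterer
rng = np.random.default_rng(4)
s = np.exp(2j * np.pi * rng.random(N_r))   # unit-modulus pilot
y = received_signal(s, H, K, N_r, noise_std=0.05, rng=rng)
print(y[:3])
```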
2.4 Joint Sparsity Structure of V2V Channels

In this section, we adopt the V2V geometry-based stochastic channel model from [57] and analyze the structure such a model imposes on the delay-Doppler spreading function. The V2V channel model considers four types of multipath components (MPCs): (i) the effective line-of-sight (LOS) component, which may contain ground reflections; (ii) discrete components generated from reflections off discrete mobile scatterers (MD), e.g., other vehicles; (iii) discrete components reflected from discrete static scatterers (SD) such as bridges, large traffic signs, etc.; and (iv) diffuse components (DI). Thus, the V2V channel impulse response can be written as

$$h(t,\tau) = h_{LOS}(t,\tau) + \sum_{i=1}^{N_{MD}} h_{MD,i}(t,\tau) + \sum_{i=1}^{N_{SD}} h_{SD,i}(t,\tau) + \sum_{i=1}^{N_{DI}} h_{DI,i}(t,\tau), \qquad (2.22)$$

where $N_{MD}$ denotes the number of discrete mobile scatterers, $N_{SD}$ is the number of discrete static scatterers, and $N_{DI}$ is the number of diffuse scatterers. Typically, $N_{DI}$ is much larger than $N_{SD}$ and $N_{MD}$ [57]. In the above representation, the multipath components can be modeled as

$$h_i(t,\tau) = \eta_i\,\delta(\tau-\tau_i)\,e^{-j2\pi\nu_i t}, \qquad (2.23)$$

where $\eta_i$ is the complex channel gain, $\tau_i$ is the delay, $\nu_i$ is the Doppler shift associated with path $i$, and $\delta(t)$ is the Dirac delta function.

Figure 2.2: Geometric representation of the V2V channel. The shaded areas on each side of the road contain static discrete (SD) and diffuse (DI) scatterers, while the road area contains both SD and moving discrete (MD) scatterers.

The channel description in (2.22) and (2.23) explicitly models distance-dependent pathloss and scatterer parameters [7]. We assume that these parameters can be approximated as time-invariant over the pilot signal duration. This is a reasonable assumption in practical systems, as will be illustrated in Figures 2.11 and 2.13 for the experimental channel measurement data. The spatial distribution of the scatterers and the statistical properties of the complex channel gains are specified in [57] for rural and highway environments. It is shown that the channel power delay profile is exponential. Further details about the spatial evolution of the gains can be found in [57, 18]. In geometry-based stochastic channel modeling, point scatterers are randomly distributed in a geometry according to a specified distribution. The position and speed of the scatterers, transmitter, and receiver determine the delay-Doppler parameters for each MPC, which in turn, together with the transmit and receive pulse shapes, determine $H_l[k,m]$.

We next determine the delay and Doppler contributions of an ensemble of point scatterers of types (i)-(iv) for the road geometry depicted in Fig. 2.2. If vehicles are assumed to travel parallel to the x-axis, the overall Doppler shift for the path from the transmitter (at position TX) via the point scatterer (at position P) to the receiver (at position RX) can be written as [57]

$$\nu(\theta_t,\theta_r) = \frac{1}{\lambda_\nu}\big[(v_T-v_P)\cos\theta_t + (v_R-v_P)\cos\theta_r\big], \qquad (2.24)$$

where $\lambda_\nu$ is the wavelength; $v_T$, $v_P$, and $v_R$ are the speeds of the transmitter, scatterer, and receiver, respectively; and $\theta_t$ and $\theta_r$ are the angles of departure and arrival, respectively. The path delay is

$$\tau = \frac{d_1+d_2}{c_0}, \qquad (2.25)$$

where $c_0$ is the propagation speed, $d_1$ is the distance from TX to P, and $d_2$ is the distance from P to RX. The path parameters $\theta_t$, $\theta_r$, $d_1$, and $d_2$ are easily computed from TX, P, and RX. The delay and Doppler parameters of each component (i)-(iv) can now be specified by Eqs. (2.24) and (2.25).

LOS If it exists, the most significant component of the V2V channel is the line-of-sight (LOS) component, which has delay and Doppler parameters $\tau_0 = \frac{d_0}{c_0}$ and $\nu_0 = \frac{1}{\lambda_\nu}(v_T-v_R)\cos\theta$, where $d_0$ is the distance between TX and RX and $\theta$ is the angle between the x-axis (i.e., the direction of motion) and the line passing through TX and RX.

Diffuse Scatterers The diffuse (DI) scatterers are static ($v_P = 0$) and uniformly distributed in the shaded regions in Fig. 2.2. Suppose we place a static scatterer at the coordinates $(x,y)$. The delay-Doppler pair, $(\tau(x,y),\nu(x,y))$, for the corresponding MPC can be calculated from Eqs. (2.24) and (2.25). If we fix $y = y_0$ and vary $x$ from $-\infty$ to $+\infty$, the delay-Doppler pair traces out a U-shaped curve in the delay-Doppler plane, as depicted in Fig. 2.3. Repeating this procedure for the permissible y-coordinates of the DI scatterers, $|y_0|\in[D/2,\,d+D/2]$, results in a family of curves that are confined to a U-shaped region, see Fig. 2.4. Hence, the DI scatterers result in multipath components with delay-Doppler pairs inside this region. The maximum and minimum Doppler values of the region are easily found from Eq. (2.24). In fact, it follows from Eq. (2.24) that the Doppler parameter of an MPC due to a static scatterer is confined to the symmetric interval $[-\nu_S,\nu_S]$, where $\nu_S = \frac{1}{\lambda_\nu}(v_T+v_R)$.
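To make the geometry concrete, the following sketch evaluates Eqs. (2.24) and (2.25) for static scatterers swept along a line parallel to the road, tracing the U-shaped curve of Fig. 2.3. All numerical values and the angle conventions (angles measured from the x-axis) are illustrative assumptions, not parameters from the text:

    import numpy as np

    def delay_doppler(tx, rx, p, v_t, v_r, v_p, lam_nu, c0=3e8):
        # Delay, Eq. (2.25), and Doppler, Eq. (2.24), for a scatterer at p.
        d1 = np.linalg.norm(p - tx)
        d2 = np.linalg.norm(p - rx)
        cos_t = (p[0] - tx[0]) / d1      # cos(theta_t), angle of departure
        cos_r = (p[0] - rx[0]) / d2      # cos(theta_r), angle of arrival
        nu = ((v_t - v_p) * cos_t + (v_r - v_p) * cos_r) / lam_nu
        tau = (d1 + d2) / c0
        return tau, nu

    # Static scatterers (v_p = 0) on the line y = 30 m trace a U-shaped curve.
    tx, rx = np.array([0.0, 0.0]), np.array([150.0, 0.0])
    lam_nu = 3e8 / 5.8e9                 # wavelength at 5.8 GHz
    curve = [delay_doppler(tx, rx, np.array([x, 30.0]), 30.0, 30.0, 0.0, lam_nu)
             for x in np.linspace(-1000.0, 1000.0, 400)]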
Static Discrete Scatterers The static discrete (SD) scatterers can appear outside the shaded regions in Fig. 2.2. In fact, the y-coordinates of the SD scatterers are drawn from a Gaussian mixture consisting of two Gaussian pdfs with the same standard deviation $\sigma_{y,SD}$ and means $y_{1,SD}$ and $y_{2,SD}$ [57]. The delay-Doppler pair for an MPC due to an SD scatterer can therefore appear also outside the U-shaped region in Fig. 2.4. However, since the SD scatterers are static, the Doppler parameter is in the interval $[-\nu_S,\nu_S]$, i.e., the same interval as for the diffuse scatterers.

Mobile Discrete Scatterers We assume that no vehicle travels with an absolute speed exceeding $v_{max}$. It then follows from Eq. (2.24) that the Doppler due to a mobile discrete (MD) scatterer is in the interval $[-\nu_{max},\nu_{max}]$, where $\nu_{max} = \frac{4v_{max}}{\lambda_\nu}$. For example, in Fig. 2.4, the Doppler shift $\nu_p$ is due to an MD scatterer (vehicle) that travels in the oncoming lane ($v_P < 0$).

Based on the analysis above, we can conclude that the delay-Doppler parameters of the multipath components can be divided into three regions:

$$R_1 \triangleq \big\{(\tau,\nu)\in\mathbb{R}^2 : \tau\in(\tau_0,\tau_0+\Delta\tau),\ \nu\in(-\nu_S,\nu_S)\big\}$$

$$R_2 \triangleq \big\{(\tau,\nu)\in\mathbb{R}^2 : \tau\in[\tau_0+\Delta\tau,\tau_{max}],\ |\nu|\in[\nu_S-\Delta\nu,\nu_S)\big\}$$

$$R_3 \triangleq \big\{(\tau,\nu)\in\mathbb{R}^2 : \tau\in[\tau_0,\tau_{max}],\ |\nu|\le\nu_{max}\big\}\setminus(R_1\cup R_2),$$

where $\tau_{max}-\tau_0$ is the maximum significant excess delay of the V2V channel. Here, $\Delta\tau$ and $\Delta\nu$ are chosen such that the contributions from all diffuse scatterers are confined to $R_1\cup R_2$. The exact choice of $\Delta\tau$ is somewhat arbitrary. In this paper, we consider a thresholding rule that compares the noise level and the channel components, which results in a particular choice of $\Delta\tau$; the method is specified in Appendix 2.12.

Figure 2.3: Delay-Doppler contributions of the line-of-sight (LOS) component and of static scatterers (SD/DI) placed on a line parallel to the road.

However, regardless of the method, once $\Delta\tau$ is chosen, we can compute the height $\Delta\nu$ of the two strips that make up $R_2$. This can be done by placing an ellipsoid with its foci at the transmitter and receiver such that the path from the transmitter to the receiver via any point on the ellipsoid has propagation delay $\tau_0+\Delta\tau$. By computing the associated Doppler along the part of the ellipsoid that lies in the diffuse region (i.e., in the strips just outside the highway, see Fig. 2.2), we can determine the smallest absolute Doppler value among them as $\nu_0$ and calculate $\Delta\nu$ as $\Delta\nu = \nu_S-\nu_0$. In Appendix 2.12, we present a data-driven approach to approximate $\Delta\nu$.

Note that the regions gather channel components with similar behavior. Region $R_1$ contains the LOS, ground reflections, and (strong) discrete and diffuse components due to scatterers near the transmitter and receiver. In region $R_2$, the delay-Doppler contributions of static discrete and diffuse scatterers at farther locations appear. Region $R_3$ contains contributions from moving discrete and static discrete scatterers only.

Figure 2.4: Delay-Doppler domain representation of the V2V channel, indicating the MD scatterers, SD scatterers, LOS, and DI scatterers. The delay-Doppler spreading function of the diffuse components is confined to a U-shaped area.

Remark 4. In Fig. 2.4, we see that there are sparse contributions from the SD and MD scatterers in all regions $R_1$, $R_2$, and $R_3$. However, clusters of DI components are confined to $R_1\cup R_2$. Therefore, the V2V channel components can be divided into two main clusters: one is the element-wise sparse components (mobile and static discrete scatterers) that are distributed over all three regions $R_1$, $R_2$, and $R_3$; the other is the group-wise sparse components (diffuse components) that are located in regions $R_1$ and $R_2$. We note that in V2V channels, due to the geometry of the channel and the antenna heights, there exist more diffuse components than in cellular communication.
Therefore, a proper V2V channel estimation algorithm should estimate the diffuse components with higher quality than a cellular channel estimator would. Since the diffuse components are located in a specific part of the delay-Doppler domain and the rest of the channel support in the delay-Doppler domain is essentially zero, we partition the channel into group vectors and take advantage of the group-wise sparsity of the channel to enhance the accuracy of the estimate of the diffuse components. In Sec. 2.6, we propose a method based on joint element-wise and group-wise sparsity and Theorem 1 of Section 2.2.2 to estimate the discrete delay-Doppler representation of the V2V channel exploiting this structure.

2.5 Observation Model and Leakage Effect

In this section, we show how pulse shaping and a finite-length training sequence can be taken into account when formulating a linear observation model of the V2V channel. We assume that $p_t(t)$ and $p_r(t)$ are causal with support $[0,T_{supp})$. The contribution to the received signal from one of the $1+N_{MD}+N_{SD}+N_{DI}$ terms in (2.22) is of the form

$$s(t)*h_i(t,\tau) = \sum_{l=-\infty}^{\infty} s[l]\,\eta_i\,e^{j2\pi\nu_i t}\,p_t(t-lT_s-\tau_i) \approx \sum_{l=-\infty}^{\infty} s[l]\,\eta_i\,e^{j2\pi\nu_i(lT_s+\tau_i)}\,p_t(t-lT_s-\tau_i), \qquad (2.26)$$

where the approximation is valid if we make the (reasonable) assumption that $\nu_i T_{supp}\ll 1$, and $*$ denotes convolution. If we let $p(t) = p_t(t)*p_r(t)$, we can write the contribution after filtering and sampling as

$$y_i[n] = s(t)*h_i(t,\tau)*p_r(t)\big|_{t=nT_s} = \sum_{l=-\infty}^{\infty} s[l]\,\eta_i\,e^{j2\pi\nu_i(lT_s+\tau_i)}\,p((n-l)T_s-\tau_i) = \sum_{m=-\infty}^{\infty} s[n-m]\,\eta_i\,e^{j2\pi\nu_i((n-m)T_s+\tau_i)}\,p(mT_s-\tau_i), \qquad (2.27)$$

and identify $h_i[n,m] = \eta_i\,e^{j2\pi\nu_i((n-m)T_s+\tau_i)}\,p(mT_s-\tau_i)$. Suppose we have access to $h_i[n,m]$ for $n = 0,1,\ldots,N_r-1$ and let $\omega_{2K+1} = \exp(j2\pi/(2K+1))$. The $(2K+1)$-point DFT of $h_i[n,m]$, where we choose $2K+1\ge N_r$, is

$$H_{l,i}[k,m] = \frac{1}{2K+1}\sum_{n=0}^{N_r-1} h_i[n,m]\,\omega_{2K+1}^{-nk} = \eta_i\,e^{-j2\pi\nu_i(mT_s-\tau_i)}\,p(mT_s-\tau_i)\,w(k,\nu_i), \quad k\in\mathcal{K}, \qquad (2.28)$$

where $\mathcal{K} = \{0,\pm1,\pm2,\ldots,\pm K\}$ and $w(k,x)$ is the $(2K+1)$-point DFT of a discrete-time complex exponential with frequency $x$, truncated to $N_r$ samples:

$$w(k,x) = \begin{cases} \dfrac{N_r}{2K+1}, & x = \dfrac{k}{2K+1} \\[2mm] \dfrac{e^{-j\pi\left(\frac{k}{2K+1}-x\right)(N_r-1)}}{2K+1}\,\dfrac{\sin\!\big(\pi\big(\frac{k}{2K+1}-x\big)N_r\big)}{\sin\!\big(\pi\big(\frac{k}{2K+1}-x\big)\big)}, & \text{otherwise.} \end{cases} \qquad (2.29)$$

We note that the leakage in the delay-Doppler plane is due to the non-zero support of $p(\cdot)$ and $w(\cdot,\cdot)$. The leakage with respect to Doppler decreases with the observation length, $N_r$, and the leakage with respect to delay decreases with the bandwidth of the transmitted signal. We compute the (effective) channel coefficients at time sample $m_i = \left[\frac{\tau_i}{T_s}\right]$ and Doppler sample $k_i = [\nu_i T_s(2K+1)]$, where $[\cdot]$ denotes rounding to the closest integer. Note that the true channel parameters $\tau_i$ and $\nu_i$ are not restricted to integer multiples of a sampling interval. However, we do seek to estimate the effective channel after appropriate sampling. Thus, at the receiver side, to compensate for (but not perfectly remove) the channel leakage, the effective channel components at those sampled times are computed to equalize the channel (and these may differ from the actual channel components). We can then write

$$H_{l,i}[k,m] = \eta_i\,g[k,m,k_i,m_i], \quad k\in\mathcal{K},\ m\in\mathcal{M}, \qquad (2.30)$$

where $\mathcal{M} = \{0,1,\ldots,M-1\}$ and

$$g[k,m,k_0,m_0] = \omega_{2K+1}^{-k_0(m-m_0)}\,w(k-k_0,0)\,p((m-m_0)T_s). \qquad (2.31)$$

Due to the linearity of the discrete Fourier transform, we can conclude that the channel with leakage is given by

$$H_l[k,m] = \sum_i H_{l,i}[k,m] = \sum_i \eta_i\,g[k,m,k_i,m_i], \qquad (2.32)$$

where the summation is over the LOS component and all the $N_{MD}+N_{SD}+N_{DI}$ scatterers.
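The leakage kernels (2.29) and (2.31) can be evaluated directly. A sketch follows, in which the combined pulse $p(t) = p_t(t)*p_r(t)$ is passed in as a callable; this, and the function names, are assumptions of the sketch, since the closed form of $p$ depends on the chosen pulses:

    import numpy as np

    def w_dirichlet(k, x, K, N_r):
        # Eq. (2.29): (2K+1)-point DFT of a length-N_r complex exponential
        # of (normalized) frequency x, evaluated at bin k.
        d = k / (2 * K + 1) - x
        if np.isclose(np.sin(np.pi * d), 0.0):
            return N_r / (2 * K + 1)
        return (np.exp(-1j * np.pi * d * (N_r - 1)) / (2 * K + 1)
                * np.sin(np.pi * d * N_r) / np.sin(np.pi * d))

    def g_leak(k, m, k0, m0, K, N_r, Ts, pulse):
        # Eq. (2.31): g[k,m,k0,m0] = w_{2K+1}^{-k0(m-m0)} w(k-k0, 0) p((m-m0) Ts).
        omega = np.exp(2j * np.pi / (2 * K + 1))
        return (omega ** (-k0 * (m - m0)) * w_dirichlet(k - k0, 0.0, K, N_r)
                * pulse((m - m0) * Ts))

The leakage matrix G of (2.37) below is then filled column by column, with column j holding vec(G_j) for j = m0(2K+1) + k0 + K.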
The channel without leakage is

$$H[k,m] = \sum_i \eta_i\,\delta[k-k_i]\,\delta[m-m_i], \qquad (2.33)$$

where $\delta[n]$ is the Kronecker delta function. The channels in (2.32) and (2.33) can be represented, for $k\in\mathcal{K}$ and $m\in\mathcal{M}$, by the vectors $\mathbf{x}_l\in\mathbb{C}^N$ and $\mathbf{x}\in\mathbb{C}^N$, respectively, where $N = |\mathcal{K}||\mathcal{M}| = (2K+1)M$, as

$$\mathbf{x}_l = \operatorname{vec}\begin{bmatrix} H_l[-K,0] & \cdots & H_l[-K,M-1] \\ \vdots & & \vdots \\ H_l[K,0] & \cdots & H_l[K,M-1] \end{bmatrix}, \qquad (2.34)$$

$$\mathbf{x} = \operatorname{vec}\begin{bmatrix} H[-K,0] & \cdots & H[-K,M-1] \\ \vdots & & \vdots \\ H[K,0] & \cdots & H[K,M-1] \end{bmatrix}, \qquad (2.35)$$

where $\operatorname{vec}(\mathbf{H})$ is the vector formed by stacking the columns of $\mathbf{H}$ on top of each other. The relationship between $\mathbf{x}_l$ and $\mathbf{x}$ can be written as

$$\mathbf{x}_l = \mathbf{G}\mathbf{x}, \qquad (2.36)$$

where $\mathbf{G}\in\mathbb{C}^{N\times N}$ is defined as

$$\mathbf{G} = \big[\operatorname{vec}(\mathbf{G}_0)\ \operatorname{vec}(\mathbf{G}_1)\ \cdots\ \operatorname{vec}(\mathbf{G}_{N-1})\big], \qquad (2.37)$$

$$\mathbf{G}_j = \begin{bmatrix} g[-K,0,k_0,m_0] & \cdots & g[-K,M-1,k_0,m_0] \\ \vdots & & \vdots \\ g[K,0,k_0,m_0] & \cdots & g[K,M-1,k_0,m_0] \end{bmatrix},$$

where the one-to-one correspondence between $j = 0,1,\ldots,(2K+1)M-1$ and $(k_0,m_0)\in\mathcal{K}\times\mathcal{M}$ is given by $j = m_0(2K+1)+k_0+K$. The structure of $\mathbf{G}$ is a direct consequence of how we vectorize $H_l[k,m]$ in (2.35). If we consider an alternative way of vectorizing $H_l[k,m]$, then the leakage matrix $\mathbf{G}$ needs to be recomputed accordingly, by appropriate permutation of the columns and rows of the leakage matrix defined in (2.37). As $K$, $M$, and the pulse shape are known, $\mathbf{G}$ in (2.36) is completely determined. Thus, we can utilize the relationship in (2.36) to compensate for leakage.

Consider that the source vehicle transmits a sequence of $N_r+M-1$ pilots, $s[n]$, for $n = -(M-1),-(M-2),\ldots,N_r-1$, over the channel. We collect the $N_r$ received samples in a column vector

$$\mathbf{y} = [y[0],y[1],\ldots,y[N_r-1]]^T. \qquad (2.38)$$

Using (2.20), we have the following matrix representation:

$$\mathbf{y} = \mathbf{S}\mathbf{x}_l + \mathbf{z}, \qquad (2.39)$$

where $\mathbf{z}\sim\mathcal{CN}(\mathbf{0},\sigma_z^2\mathbf{I}_{N_r})$ is a Gaussian noise vector, and $\mathbf{S}$ is an $N_r\times N$ block data matrix of the form

$$\mathbf{S} = [\mathbf{S}_0,\ldots,\mathbf{S}_{M-1}], \qquad (2.40)$$

where each block $\mathbf{S}_m\in\mathbb{C}^{N_r\times(2K+1)}$ is of the form

$$\mathbf{S}_m = \operatorname{diag}\{s[-m],\ldots,s[N_r-m-1]\}\,\mathbf{\Omega}, \qquad (2.41)$$

for $m = 0,1,\ldots,M-1$, and $\mathbf{\Omega}\in\mathbb{C}^{N_r\times(2K+1)}$ is a Vandermonde matrix, $\mathbf{\Omega}[i,j] = \omega_{2K+1}^{i(j-K)}$, where $i = 0,1,\ldots,N_r-1$ and $j = 0,1,\ldots,2K$. Finally, by combining (2.39) and (2.36), we have

$$\mathbf{y} = \mathbf{S}\mathbf{x}_l + \mathbf{z} = \mathbf{A}\mathbf{x} + \mathbf{z}, \qquad (2.42)$$

where $\mathbf{A} = \mathbf{S}\mathbf{G}$ and $\mathbf{A}\in\mathbb{C}^{N_r\times N}$.
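For completeness, the observation model (2.39)-(2.42) can be assembled as follows. A sketch (NumPy assumed; the pilot array stores s[-(M-1)], ..., s[N_r - 1] in order, a layout convention of ours):

    import numpy as np

    def pilot_matrix(s, N_r, K, M):
        # Eqs. (2.40)-(2.41): S = [S_0, ..., S_{M-1}] with
        # S_m = diag{s[-m], ..., s[N_r - m - 1]} @ Omega and
        # Omega[i, j] = w_{2K+1}^{i (j - K)}.
        i = np.arange(N_r)
        j = np.arange(2 * K + 1)
        Omega = np.exp(2j * np.pi * np.outer(i, j - K) / (2 * K + 1))
        blocks = [Omega * s[(M - 1) - m + i][:, None] for m in range(M)]
        return np.hstack(blocks)

    # Eq. (2.42): with a precomputed leakage matrix G, the model is
    #   y = S @ G @ x + z,   i.e.,   A = S @ G.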
2.6 Channel Estimation

Based on our analysis in Section 2.4, the components of the vector $\mathbf{x}$ exhibit both element- and group-wise sparsity. Given estimates of the parameters that define regions $R_1$, $R_2$, and $R_3$ (see Section 2.4 and Appendix 2.12), we illustrate how to partition the elements of the channel vector $\mathbf{x}$ to enforce the group-wise sparsity structure. Our partitioning is based on the sparsity structure of regions $R_1$, $R_2$, and $R_3$, and on Eq. (2.35), which maps the entries of $\mathbf{x}$ to channel components $H[k,m]$ for $k = -K,\ldots,K$ and $m = 0,\ldots,M-1$. To partition the elements in regions $R_1$ and $R_2$, we collect channel components with a common Doppler value into a single group, e.g., $\mathcal{I}_1,\mathcal{I}_2,\ldots,\mathcal{I}_{N_{g,1}}$ and $\mathcal{I}_{N_{g,1}+1},\ldots,\mathcal{I}_{N_{g,1}+N_{g,2}}$, as depicted in Fig. 2.5. For $R_3$, we know that this region contains only element-wise sparse components; thus we consider each element of $\mathbf{x}$ in this region as a single partition, e.g., $\mathcal{I}_{N_{g,1}+N_{g,2}+1},\ldots,\mathcal{I}_{N_g}$, where $N_g = N_{g,1}+N_{g,2}+N_{g,3}$, as depicted in Fig. 2.5.

Figure 2.5: Schematic of the partitioning of the V2V channel vector into group vectors over the regions $R_1$, $R_2$, and $R_3$.

Now that we have a partition of all the elements of $\mathbf{x}$, we can easily determine the group vectors $\mathbf{x}_i$ as follows:

$$\mathbf{x}_i[k] = \begin{cases} \mathbf{x}[k], & k\in\mathcal{I}_i \\ 0, & k\notin\mathcal{I}_i, \end{cases} \quad \text{for } i = 1,\ldots,N_g. \qquad (2.43)$$

We know that most of the components in $R_3$ are zero (or close to zero) and that there is a significant number of non-zero diffuse components in regions $R_1$ and $R_2$. Therefore, enforcing the group sparsity regularization over this partitioning of the elements of $\mathbf{x}$ improves the quality of the estimate of the diffuse components. We next specify the regularizations that exploit the jointly sparse structure of the V2V channel as follows:

$$\hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{x}\in\mathbb{C}^N} \left\{ \frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \phi_g(|\mathbf{x}|;\lambda_g) + \phi_e(|\mathbf{x}|;\lambda_e) \right\}, \qquad (P_0)$$

where $|\mathbf{x}| = [|\mathbf{x}[1]|,\ldots,|\mathbf{x}[N]|]^T$ with $N = M(2K+1)$ and $|\mathbf{x}[i]| = \sqrt{\operatorname{Re}(\mathbf{x}[i])^2+\operatorname{Im}(\mathbf{x}[i])^2}$. Here the regularization functions are

$$\phi_g(|\mathbf{x}|;\lambda_g) = \sum_{j=1}^{N_g} f_g(\|\mathbf{x}_j\|_2;\lambda_g), \qquad (2.44)$$

$$\phi_e(|\mathbf{x}|;\lambda_e) = \sum_{i=1}^{N} f_e(|\mathbf{x}[i]|;\lambda_e). \qquad (2.45)$$

We develop a proximal ADMM algorithm to solve problem $P_0$. Problem $P_0$ can be rewritten using an auxiliary variable $\mathbf{w}$ as follows:

$$\min_{\mathbf{x},\mathbf{w}\in\mathbb{C}^N}\ \frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2 + \phi_g(|\mathbf{w}|;\lambda_g) + \phi_e(|\mathbf{w}|;\lambda_e) \quad \text{s.t. } \mathbf{w} = \mathbf{x}. \qquad (2.46)$$

For the optimization problem in (2.46), ADMM consists of the following iterations.

• Initialize: $\rho\neq 0$, $\lambda_{\rho g} = \frac{\lambda_g}{\rho}$, $\lambda_{\rho e} = \frac{\lambda_e}{\rho}$, $\boldsymbol{\theta}^0 = \mathbf{w}^0 = \mathbf{0}$, $\mathbf{A}_0 = (\rho^2\mathbf{I}+\mathbf{A}^H\mathbf{A})^{-1}$, and $\mathbf{x}_0 = \mathbf{A}_0\mathbf{A}^H\mathbf{y}$.

• Update-x:
$$\mathbf{x}^{n+1} = \rho^2\mathbf{A}_0\big(\mathbf{w}^n-\boldsymbol{\theta}_\rho^n\big)+\mathbf{x}_0, \qquad (2.47)$$
where $\boldsymbol{\theta}_\rho^n = \frac{\boldsymbol{\theta}^n}{\rho^2}$.

• Update-w: for $i = 1,2,\ldots,N_g$,
$$\mathbf{w}_i^{n+1} = \operatorname*{argmin}_{\mathbf{w}_i}\ \frac{1}{2}\big\|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n-\mathbf{w}_i\big\|_2^2 + g(|\mathbf{w}_i|;\lambda_{\rho g}) + E(|\mathbf{w}_i|;\lambda_{\rho e}), \qquad (2.48)$$
where the index $i$ denotes the group number, and
$$E(|\mathbf{w}_i|;\lambda_{\rho e}) = \sum_j f_e\!\left(\frac{|\mathbf{w}_i[j]|}{\rho};\lambda_{\rho e}\right), \qquad (2.49)$$
$$g(|\mathbf{w}_i|;\lambda_{\rho g}) = f_g\!\left(\frac{\|\mathbf{w}_i\|_2}{\rho};\lambda_{\rho g}\right). \qquad (2.50)$$

• Update dual variable θ:
$$\boldsymbol{\theta}_\rho^{n+1} = \boldsymbol{\theta}_\rho^n + \mathbf{x}^{n+1} - \mathbf{w}^{n+1}. \qquad (2.51)$$

Details of this derivation are provided in Appendix 2.13. Note that both $\mathbf{A}_0$ and $\mathbf{x}_0$ are known and can be computed in advance. In the update-w step, the index $i$ denotes the group number; thus this step can be performed in parallel for all groups simultaneously.

Since the vectors in the optimization problem for updating $\mathbf{w}$ in (2.48) are complex, Theorem 1 in Section 2.2.2 cannot be applied directly to find a closed-form solution for this problem. In order to apply Theorem 1, we introduce the following notation and lemma. The vector $\mathbf{w}\in\mathbb{C}^n$ can be written as $\mathbf{w} = |\mathbf{w}|\odot\operatorname{Phase}(\mathbf{w})$, where the $n$th element of $\operatorname{Phase}(\mathbf{w})$ is $\exp(j\operatorname{Ang}(\mathbf{w}[n]))$, $\operatorname{Ang}(\mathbf{w}[n])$ is the angle of $\mathbf{w}[n]$ in polar form, i.e., $\mathbf{w}[n] = |\mathbf{w}[n]|\exp(j\operatorname{Ang}(\mathbf{w}[n]))$, and $\odot$ is component-wise multiplication (the Schur product).

Lemma 1. For any $\mathbf{c}\in\mathbb{C}^N$,
$$\operatorname*{argmin}_{\mathbf{w}\in\mathbb{C}^N}\ \|\mathbf{c}-\mathbf{w}\|_2^2 = \operatorname{Phase}(\mathbf{c})\odot\operatorname*{argmin}_{|\mathbf{w}|\in\mathbb{R}^N}\ \big\||\mathbf{c}|-|\mathbf{w}|\big\|_2^2. \qquad (2.52)$$

The proof is provided in Appendix 2.14. Since the last two terms in (2.48) are independent of the phase of $\mathbf{w}_i$, we use Lemma 1 to write the $i$th group problem in (2.48) as

$$\mathbf{w}_i^{n+1} = \operatorname{Phase}\!\big(\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n\big)\odot\operatorname*{argmin}_{|\mathbf{w}_i|}\left(\frac{1}{2}\big\||\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n|-|\mathbf{w}_i|\big\|_2^2 + g(|\mathbf{w}_i|;\lambda_{\rho g}) + E(|\mathbf{w}_i|;\lambda_{\rho e})\right). \qquad (2.53)$$
Now the vectors in the optimization problem in (2.53) are real, so the solution can be computed directly using Theorem 1. We determine a closed-form solution for the update-w step in Corollary 1 below, using the proximity operators of the univariate functions $f_e$ and $f_g$. This update rule is a direct consequence of Theorem 1.

Corollary 1. The second step, update-w, can be performed as follows:
$$\mathbf{w}_i^{n+1} = P_{\lambda_{\rho g},g}\!\Big(P_{\lambda_{\rho e},E}\!\big(\big|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n\big|\big)\Big)\odot\operatorname{Phase}\!\big(\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n\big),$$
where $E$ and $g$ are defined in Equations (2.49) and (2.50).

Based on Corollary 1, the update-w step depends only on the proximity operators of the regularizer functions $f_e$ and $f_g$. The proposed algorithm for estimating the channel vector $\mathbf{x}$ from the received data vector $\mathbf{y}$ is summarized in Table 2.1.

Table 2.1: Proposed V2V Channel Estimation Method
Input: $\mathbf{y}$, $\mathbf{A}$, $\lambda_g$, $\lambda_e$, $\rho$, $n_{max}$, $\epsilon$.
Initialize: $\mathbf{w}^0 = \boldsymbol{\theta}_\rho^0 = \mathbf{0}$.
Pre-computation: $\mathbf{A}_0 = (\rho^2\mathbf{I}+\mathbf{A}^H\mathbf{A})^{-1}$, $\mathbf{x}_0 = \mathbf{A}_0\mathbf{A}^H\mathbf{y}$.
For $n = 0$ to $n_{max}-1$:
  $\mathbf{x}^{n+1} = \rho^2\mathbf{A}_0(\mathbf{w}^n-\boldsymbol{\theta}_\rho^n)+\mathbf{x}_0$
  $\mathbf{w}_i^{n+1} = P_{\lambda_{\rho g},g}\big(P_{\lambda_{\rho e},E}(|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n|)\big)\odot\operatorname{Phase}(\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n)$ for $i = 1,2,\ldots,N_g$
  $\boldsymbol{\theta}_\rho^{n+1} = \boldsymbol{\theta}_\rho^n + (\mathbf{x}^{n+1}-\mathbf{w}^{n+1})$
  if $\|\mathbf{x}^{n+1}-\mathbf{x}^n\|_2 < \epsilon$ then break
End
Output: Vector $\mathbf{x}$.

The complexity of our joint sparse signal estimation algorithm is as follows. Step 1, update-x, incurs a computational complexity of $O(N^2)$, where $N = M(2K+1)$, due to a matrix-vector multiplication. Step 2, update-w, requires $N_g$ group-wise threshold comparisons and $N$ element-wise threshold comparisons; thus it has complexity $O(N+N_g)\approx O(N)$. Updating the dual variable in Step 3 has $O(N)$ complexity. Therefore, the overall complexity of our proposed algorithm is $O(N^2)$. In our algorithm (as in the other methods), we compute a least-squares (LS) solution for the initialization; an LS solution can be computed with complexity $O(N^3)$. Algorithms such as the Wiener filter and the Hybrid Sparse/Diffuse (HSD) estimator, which are designed based on statistical knowledge of channel parameters, require a large number of samples to estimate the required covariance matrices. If the correlation matrices are known, then the Wiener filter and the HSD estimator have computational complexity of order $O(N^3)$.
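A compact sketch of Table 2.1 follows; prox_g and prox_e stand for the scalar proximity operators $P_{\lambda_{\rho g},f_g}$ and $P_{\lambda_{\rho e},f_e}$ (e.g., Eqs. (2.13)-(2.15)), and the variable names are ours:

    import numpy as np

    def nested_prox(v, lam_g, lam_e, prox_g, prox_e):
        # Corollary 1: element-wise prox on |v|, then the group prox (2.11),
        # then restore the phases via Lemma 1.
        mag = prox_e(np.abs(v), lam_e)
        nrm = np.linalg.norm(mag)
        if nrm > 0.0:
            mag = mag * (prox_g(nrm, lam_g) / nrm)
        return mag * np.exp(1j * np.angle(v))

    def estimate_channel(y, A, groups, lam_g, lam_e, rho, prox_g, prox_e,
                         n_max=200, eps=1e-6):
        # Proximal ADMM of Table 2.1; `groups` is a list of index arrays
        # partitioning {0, ..., N-1} into the sets I_1, ..., I_Ng of Sec. 2.6.
        N = A.shape[1]
        A0 = np.linalg.inv(rho ** 2 * np.eye(N) + A.conj().T @ A)
        x0 = A0 @ (A.conj().T @ y)
        x = x0.copy()
        w = np.zeros(N, dtype=complex)
        theta = np.zeros(N, dtype=complex)        # scaled dual, theta / rho^2
        for _ in range(n_max):
            x_new = rho ** 2 * (A0 @ (w - theta)) + x0        # Eq. (2.47)
            for idx in groups:                                # Eq. (2.48), per group
                w[idx] = nested_prox(x_new[idx] + theta[idx],
                                     lam_g / rho, lam_e / rho, prox_g, prox_e)
            theta = theta + x_new - w                         # Eq. (2.51)
            if np.linalg.norm(x_new - x) < eps:
                return x_new
            x = x_new
        return x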
2.7 Numerical Results

In this section, we demonstrate the performance gains that can be achieved with our proposed structured estimator using both convex and non-convex sparsity-inducing regularizers, in comparison to prior methods such as Wiener filtering [107], the Hybrid Sparse/Diffuse (HSD) estimator [68, 69, 13], and the compressed sensing (CS) method [4, 95]. To simulate the channel, we consider a geometry of length 1 km around the transmitter-receiver pair, road width $D = 50$ m, and a diffuse-strip width around the road of $d = 25$ m, as in Fig. 2.2. The locations of the transmitter and receiver are chosen in this geometry at a distance of 100 m to 200 m from each other. The speeds of the transmit and receive vehicles are chosen randomly from the interval [60, 160] km/h, the speed limits for a highway. It is assumed that the number of MD scatterers is $N_{MD} = 10$, with speeds also chosen randomly from the interval [60, 160] km/h; we have $N_{SD} = 10$ and $N_{DI} = 400$ SD and DI scatterers, respectively [57]. Using these parameters, we compute the delay and Doppler values for each scatterer. The statistical parameter values for the different scatterer types are selected as in Table I of [57], which were determined from experimental measurements. The scatterer amplitudes were randomly drawn from zero-mean complex Gaussian distributions with three different power delay profiles for (i) the LOS and mobile discrete (MD) scatterers, (ii) the static discrete scatterers, and (iii) the diffuse scatterers. We assume that the mean power of the static scatterers is 10 dB less than the mean power of the LOS and MD scatterers, and that the mean power of the diffuse scatterers is 20 dB less than the mean power of the LOS and MD scatterers [57]. Furthermore, we consider $f_c = 5.8$ GHz, $T_s = 10$ ns, $N_r = 1024$, $K = 512$, and $M = 256$. The pilot samples are drawn from a zero-mean, unit-variance Gaussian distribution. The interpolation/anti-aliasing filters $p_t(t) = p_r(t)$ are root-raised-cosine filters with roll-off factor 0.25 and $T_{supp} = 1\,\mu$s. The required regularization parameters were found by trial and error using a cross-validation algorithm on the data [77].

Performance is measured by the normalized mean square error (NMSE), normalized by the mean energy of the channel coefficients: $\text{NMSE} = \mathbb{E}\{\|\hat{\mathbf{x}}-\mathbf{x}\|_2^2/\|\mathbf{x}\|_2^2\}$, where $\hat{\mathbf{x}}$ is the estimated channel vector, and the SNR is defined as $\text{SNR} = \mathbb{E}\{\|\mathbf{y}-\mathbf{z}\|_2^2\}/\mathbb{E}\{\|\mathbf{z}\|_2^2\}$.

Fig. 2.6 depicts the NMSE of our proposed estimator, our previously proposed hybrid sparse/diffuse (HSD) estimator as adapted to V2V channels [13], a compressed sensing (CS) method [4, 95], and the Wiener-based estimator of [107]. We have also included a curve (known support) corresponding to the case in which the support (the locations of the non-zero components) of the vector $\mathbf{x}$ is known when we apply our joint sparse estimation method. This curve provides a lower bound on the performance of our proposed estimator. The HSD estimator models the channel components as a sum of sparse and diffuse components, i.e., $\mathbf{x} = \mathbf{x}_s+\mathbf{x}_d$. The sparse components, $\mathbf{x}_s$, are modeled by the element-wise product of an unknown deterministic amplitude, $\mathbf{a}_s$, and a random Bernoulli vector, $\mathbf{b}_s$: $\mathbf{x}_s = \mathbf{a}_s\odot\mathbf{b}_s$. Furthermore, the diffuse components are assumed to follow a Gaussian distribution with exponential profile, $\mathbf{x}_d\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_d)$, where $\boldsymbol{\Sigma}_d$ is diagonal and is the Kronecker product of the covariance matrices of the channel component vectors with common Doppler values [13]. The profile parameters are retrieved using the expectation-maximization algorithm [68]. The HSD estimation procedure can be stated as follows: first, the locations of the sparse components, i.e., the Bernoulli vector, are determined as $\hat{\mathbf{b}}_s[k] = \mathbb{1}\big(|\mathbf{x}_{LS}[k]|^2 \ge \gamma\,(\boldsymbol{\Sigma}_e[k,k]^{-1}+\boldsymbol{\Sigma}_d[k,k])\big)$, where $\mathbb{1}(\cdot)$ is the indicator function, $\mathbf{x}_{LS} = (\rho^2\mathbf{I}+\mathbf{A}^H\mathbf{A})^{-1}\mathbf{A}^H\mathbf{y}$ is the regularized LS estimate, and $\boldsymbol{\Sigma}_e$ is the covariance matrix of the
noise vector after LS estimation, and $\gamma$ is a known parameter [13]. Then, the sparse components are computed as $\hat{\mathbf{x}}_s = \mathbf{x}_{LS}\odot\hat{\mathbf{b}}_s$, and finally the diffuse components are estimated as $\hat{\mathbf{x}}_d = \boldsymbol{\Sigma}_d(\boldsymbol{\Sigma}_d+\boldsymbol{\Sigma}_e^{-1})^{-1}(\mathbf{x}_{LS}-\hat{\mathbf{x}}_s)$ [13, 68].

Figure 2.6: Comparison of NMSE vs. SNR for the Wiener filter estimator [107], the HSD estimator [13, 68], the CS method [4, 95], and the proposed method with different regularizers, i.e., Nested-soft [90, 26], Nested-MCP, and Nested-SCAD.

The Wiener-based estimator [107] estimates the channel as $\hat{\mathbf{x}} = \mathbf{R}_x\mathbf{A}^H(\mathbf{A}\mathbf{R}_x\mathbf{A}^H+\sigma_z^2\mathbf{I})^{-1}\mathbf{y}$ with $\mathbf{R}_x = \mathbb{E}\{\mathbf{x}\mathbf{x}^H\}$, which is not known at the receiver side, but is approximated by assuming a delay-Doppler scattering function prototype with flat spectrum over a 2-D region as in [6]. The maximum path delay and Doppler in the support of the scattering function are taken as $\tau_{max} = 1.5\,\mu$s and $\nu_{max} = 860$ Hz, respectively. As we will see, this idealized scattering function assumption results in degraded performance.

It is clear from Fig. 2.6 that there is a performance improvement when we consider the joint structural information of the channel in the delay-Doppler domain. For our proposed method, we have considered different types of regularizers. In Fig. 2.6, Nested-Soft corresponds to our proposed structured estimator with a soft-thresholding regularizer, i.e., $f_g(x;\lambda_g) = \lambda_g|x|$ and $f_e(x;\lambda_e) = \lambda_e|x|$ (note that this special case of our algorithm is the case considered in [90] and [26] for unknown parameter $p = 1$); Nested-SCAD corresponds to the case where $f_g(x;\lambda_g)$ is the SCAD regularizer with $\mu_S = 3$ in Equation (2.7) and $f_e(x;\lambda_e) = \lambda_e|x|$; and Nested-MCP corresponds to the case where $f_g(x;\lambda_g)$ is the MCP regularizer with $\mu_M = 2$ in Equation (2.8) and $f_e(x;\lambda_e) = \lambda_e|x|$. The penalty parameters for the simulations were taken as $\lambda_g/\lambda_e\approx 10$ and $\lambda_g\in[0,10]$. Fig. 2.6 shows that the non-convex regularizers improve estimation quality by about 5 dB at low SNR and 7 dB at high SNR, at the same computational complexity, compared to the convex soft-thresholding regularizer. There is also a significant improvement in effective SNR due to the exploitation of the V2V channel structure in the delay-Doppler domain. For instance, to achieve NMSE = -20 dB, the performance curve of the structured estimator shows a 10 dB improvement in SNR compared to that of the HSD estimator, and a 15 dB improvement in SNR compared to that of the Wiener filter estimator. From the results in Fig. 2.6, we can conclude that, since the channel components in V2V channels (sparse and diffuse components) have different levels of energy, regularizers such as SCAD and MCP, with the multi-threshold nature of their proximity operators (see Fig. 2.1 and Eqs. (2.14) and (2.15)), are better suited to the channel structure.

In Fig. 2.7, we consider the effect of leakage compensation (Section 2.5) on the performance of sparse estimators of V2V channels, namely the HSD estimator and our proposed joint sparsity estimator using the SCAD regularizer for group sparsity. From the results in Fig. 2.7, we observe that uncompensated leakage reduces performance severely, by more than 7 dB, particularly at higher SNR, due to the channel mismatch introduced by the leakage.

Figure 2.7: NMSE of the channel estimators. The (∗) in the legend indicates that the leakage effect is not compensated, i.e., G = I is assumed in the channel estimation algorithm.

Next, we assess the performance of our proposed V2V channel estimation algorithm for different values of $N_{SD}$, $N_{DI}$, and $N_{MD}$. In Fig. 2.8, the performance of our algorithm for different choices of $N_{SD}+N_{MD}$ is depicted.

Figure 2.8: Performance of the algorithm: NMSE for varying $N_{SD}+N_{MD}$ with fixed $N_{DI} = 500$.

Note that the static (SD)
and mobile discrete (MD) components have similar effects on the channel sparsity pattern; therefore, we consider the combined value $N_{SD}+N_{MD}$ in our simulations. In Fig. 2.9, the performance of our algorithm for different choices of $N_{DI}$ is depicted. For this numerical simulation, we consider $K = 1024$, $M = 512$, $N_r = 2048$, and SNR = 10 dB.

Figure 2.9: Performance of the algorithm: NMSE for varying $N_{DI}$ with fixed $N_{SD}+N_{MD} = 50$.

It can be seen that, as the number of channel components decreases, the performance of our method improves.

The value of the parameter $K$ is lower bounded by $N_r$, the total number of measurements, in the sense that $2K+1\ge N_r$, as required by Eq. (2.21). To decrease the computational complexity, we consider $K = \lceil(N_r-1)/2\rceil$, which provides good performance, as seen in Fig. 2.10. However, we can consider $K\ge\lceil(N_r-1)/2\rceil$ by zero padding the DFT in Eq. (2.21). The larger $K$ is, the better the leakage is compensated, but the higher the dimension of the signal $\mathbf{x}$. The former improves signal recovery, but the latter degrades the reconstruction, as more unknown variables are introduced. To understand the effect of increasing $K$, we consider $K = \gamma\lceil(N_r-1)/2\rceil$, where $\gamma\ge 1$. For computational efficiency, we take $\gamma = 2^n$, where $n$ is a nonnegative integer. Furthermore, $N_r = 512$, $M = 256$, $N_{SD}+N_{MD} = 20$, and $N_{DI} = 200$ are considered.

Figure 2.10: Performance of the proposed algorithm for different values of $K = \gamma\lceil(N_r-1)/2\rceil$.

The results in Fig. 2.10 show that, by increasing $\gamma$, the performance of the algorithm improves for $n\le 4$; but for $n\ge 5$, increasing $K$ (and thus the signal dimension) increases the NMSE.

2.8 Real Channel Measurements

In this section, we use experimental channel data, recorded by WCAE⁵ measurements in a highway environment, to model the V2V channel and to assess the performance of our proposed channel estimation algorithm. The channel measurement data were collected using the RUSK-Lund channel sounder in the cities of Lund and Malmö in Sweden. The complex 4×4 multi-input multi-output (MIMO) channel transfer functions were recorded at a 5.6 GHz carrier frequency over a bandwidth of 240 MHz in several different propagation environments, with two standard Volvo V70 cars used as the TX and RX cars during the measurements [1].

⁵ Wireless Communication in Automotive Environment

2.8.1 Measurement setup

V2V channel measurements were performed with the RUSK-Lund sounder using a multi-carrier signal with carrier frequency 5.6 GHz to sound the channel and record the time-variant complex channel transfer function $H(t,f)$. The measurement bandwidth was 240 MHz, and a test signal length of 3.2 μs was used. The time-varying channel was sampled every 0.307 ms, corresponding to a sampling frequency of 3255 Hz during a time window of roughly 31.2 ms. This sampling frequency implies a maximum resolvable Doppler shift of 1.5 kHz, which corresponds to a relative speed of about 350 km/h at 5.6 GHz. By inverse discrete Fourier transforming (IDFT) the recorded frequency responses $H(t,f)$, with a Hanning window to suppress side lobes, the complex channel impulse responses $h(t,\tau)$ are obtained. Finally, by taking the discrete Fourier transform (DFT) of $h(t,\tau)$ with respect to $t$, the channel scattering function in the delay-Doppler domain, $H[k,m]$, is computed.
In this experiment, $K = 116$ and $M = 256$ are considered. Three recorded channel scattering functions, $H[k,m]$, are plotted in Figures 2.11 and 2.13. Fig. 2.11 presents a V2V channel in the delay-Doppler domain. A discrete component is visible at approximately 0.65 μs propagation delay. Also plotted in the figure is the Doppler shift vs. distance as produced by Equations (2.24) and (2.25), i.e., for scatterers located on a line parallel to (and a distance of 5 m away from) the TX/RX direction of motion. We notice that the V2V channel scattering functions (in the delay-Doppler domain) in Figures 2.11 and 2.13 (a) and (b) are highly structured, as predicted in Section 2.4. As seen in these figures, the diffuse components are confined to a U-shaped area, which was also predicted by our analysis in Section 2.4. Furthermore, we can observe the sparse structure of the discrete components in all of the figures.

In Fig. 2.14, we compare the performance of our proposed nested estimators with the CS method [4, 95] for estimating the channel given in Fig. 2.13 (a). The training pilot samples are generated as discussed in Section 2.7. To vary the SNR, we add white Gaussian noise to the signal at the output of the channel. Note that the bandwidth and observation time of the channel measurements are large enough to allow us to ignore the leakage effect. The results in Fig. 2.14 confirm our numerical analysis in Section 2.7 and show that our proposed joint sparse estimation algorithm outperforms the (only) element-wise sparse estimation methods [4, 95].

In Fig. 2.15, we investigate the effect of the region specification on the channel estimation algorithm. We consider the channel given in Fig. 2.13 (b) for this experiment. We have considered three different region scenarios. In the first scenario, we determine the regions as computed by our heuristic method given in Appendix 2.12. In the second scenario, we keep $\Delta\tau$ the same as in the first scenario, but set $\Delta\nu = \frac{\nu_{max}}{2}$, which means that we neglect the structure of the diffuse components in region $R_2$ and assume that the diffuse components can occur over the entire delay-Doppler domain. Finally, in the third scenario, we set $\Delta\tau = \tau_{max}$, i.e., $R_1$ extends to cover the entire delay-Doppler domain. The results in Fig. 2.15 indicate that considering the structural information of the V2V channel in the delay-Doppler domain (three regions) significantly improves the performance of our joint sparse estimation algorithm.

2.9 Conclusions

We provide a comprehensive analysis of V2V channels in the delay-Doppler domain using well-known geometry-based stochastic channel modeling. Our characterization reveals that the V2V channel model has three key regions, and these regions exhibit different sparse/hybrid structures which can be exploited to improve channel estimation. Using this structure, we have proposed a joint element- and group-wise sparse approximation method using general regularization functions. We prove that, for the needed optimization, the optimal solution results in a nested estimation of the channel vector based on the group- and element-wise penalty functions.

Figure 2.11: The channel delay-Doppler scattering function for real channel measurement data [1]; a discrete component and the curve of Equation (2.24) are indicated.

Our proposed method exploits proximity operators and the alternating direction method of multipliers, resulting in a modest-complexity approach with excellent performance.
We characterized the leakage effect on the sparsity of the channel and robustified the channel estimator by explicitly compensating for pulse-shape leakage at the receiver using the leakage matrix. Simulation results reveal that exploiting the joint sparsity structure with non-convex regularizers yields a 5 dB improvement in SNR compared to our previous state-of-the-art HSD estimator at low SNR. Furthermore, using experimental V2V channel data from a WCAE measurement campaign, we showed that our estimator yields a 4 dB to 6 dB improvement in SNR compared to the compressed sensing method in [4, 95].

Figure 2.12: Delay-Doppler spreading function. Diffuse components are confined to a U-shaped area; $\Delta\tau = 0.3\,\mu$s and $\Delta\nu = 500$ Hz.

2.10 Appendix A: Sparsity Inducing Regularizers

In this section, we show that the regularizer functions summarized in Section 2.2.2 satisfy Assumption I. We note that verifying the assumptions for the convex regularizers is straightforward and thus omitted; we focus on the non-convex regularizers, which will be used to induce group sparsity.

SCAD regularizer [38]: This penalty takes the form

$$f_g(x;\lambda) = \begin{cases} \lambda|x|, & |x|\le\lambda \\ -\dfrac{x^2-2\mu_S\lambda|x|+\lambda^2}{2(\mu_S-1)}, & \lambda<|x|\le\mu_S\lambda \\ \dfrac{(\mu_S+1)\lambda^2}{2}, & |x|>\mu_S\lambda, \end{cases} \qquad (2.54)$$

where $\mu_S > 2$ is a fixed parameter. This penalty function is non-decreasing, $f_g(0;\lambda) = f_g(x;0) = 0$, and it is clear that $f_g(\alpha x;\alpha\lambda) = \alpha^2 f_g(x;\lambda)$ for all $\alpha > 0$. The derivative of the SCAD penalty function for $x\neq 0$ is given by

$$f_g'(x;\lambda) = \operatorname{sign}(x)\left(\frac{(\mu_S\lambda-|x|)_+}{\mu_S-1}\,\mathbb{I}\{|x|>\lambda\} + \lambda\,\mathbb{I}\{|x|\le\lambda\}\right), \qquad (2.55)$$

where $\mathbb{I}(\cdot)$ is the indicator function, and any point in the interval $[-\lambda,+\lambda]$ is a valid subgradient at $x = 0$, so condition (iv) is satisfied. For $\mu = \frac{1}{\mu_S-1}$, the function $f_g(x;\lambda)+\mu x^2$ is also convex. Thus, Assumption I holds for the SCAD penalty function with $\mu_S\ge 3$.

Figure 2.13: Delay-Doppler spreading function. Diffuse components are confined to a U-shaped area; $\Delta\tau = 0.3\,\mu$s and $\Delta\nu = 200$ Hz.

Figure 2.14: Comparison of NMSE vs. SNR for the CS method [4, 95] and the proposed method, i.e., Group-wise ($\lambda_e = 0$), Nested-soft [90, 26], and Nested-SCAD regularizers.

Figure 2.15: Performance of the proposed algorithm for different values of $\Delta\tau$ and $\Delta\nu$: $\Delta\tau = \tau_{max}$; $\Delta\nu = \nu_{max}/2$; and estimated $\Delta\tau$, $\Delta\nu$.

MCP regularizer [108]: This penalty takes the form

$$f_g(x;\lambda) = \operatorname{sign}(x)\int_0^{|x|}\left(\lambda-\frac{z}{\mu_M}\right)_+\,dz, \qquad (2.56)$$

where $\mu_M > 0$ is a fixed parameter. This penalty function is non-decreasing for $x\ge 0$ and $f_g(0;\lambda) = f_g(x;0) = 0$. Also, $f_g(\alpha x;\alpha\lambda) = \alpha^2 f_g(x;\lambda)$ for all $\alpha > 0$. The derivative of the MCP penalty function for $x\neq 0$ is given by

$$f_g'(x;\lambda) = \operatorname{sign}(x)\left(\lambda-\frac{|x|}{\mu_M}\right)_+, \qquad (2.57)$$

and any point in $[-\lambda,+\lambda]$ is a valid subgradient at $x = 0$. For $\mu = \frac{1}{\mu_M}$, the function $f_g(x;\lambda)+\mu x^2$ is also convex. Thus, Assumption I holds for the MCP penalty function with $\mu_M\ge 2$.
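Both penalties have simple closed forms; a sketch of (2.54) and of the closed form of the integral in (2.56) follows (we evaluate the even extension in |x|, which is the form used when penalizing magnitudes and norms; function names are ours):

    import numpy as np

    def scad_penalty(x, lam, mu_s=3.0):
        # Eq. (2.54), piecewise in |x|.
        ax = np.abs(x)
        quad = -(ax ** 2 - 2 * mu_s * lam * ax + lam ** 2) / (2 * (mu_s - 1))
        out = np.where(ax <= lam, lam * ax, quad)
        return np.where(ax > mu_s * lam, (mu_s + 1) * lam ** 2 / 2, out)

    def mcp_penalty(x, lam, mu_m=2.0):
        # Closed form of Eq. (2.56): integral of (lam - z/mu_m)_+ over [0, |x|].
        ax = np.abs(x)
        out = lam * ax - ax ** 2 / (2 * mu_m)
        return np.where(ax > mu_m * lam, mu_m * lam ** 2 / 2, out)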
2.11 Appendix B: Proof of Theorem 1

We first prove two lemmas needed for the proof of Theorem 1 and Corollary 1.

Lemma 2. Suppose that $g(\mathbf{a};\lambda) = f\!\left(\frac{\|\mathbf{a}\|_2}{\rho};\lambda\right)$, where $f(x;\lambda)$ is a non-decreasing function of $x$. Furthermore, $f(x;\lambda)$ is homogeneous, i.e., $f(\alpha x;\alpha\lambda) = \alpha^2 f(x;\lambda)$. Then:

i) $P_{\lambda,g}(\mathbf{a}) = \gamma\mathbf{a}$, where $\gamma\in[0,1]$;

ii)
$$\gamma = \begin{cases} \dfrac{1}{\|\mathbf{a}\|_2}\,P_{\rho\lambda,\frac{f}{\rho^2}}(\|\mathbf{a}\|_2), & \|\mathbf{a}\|_2 > 0 \\ 0, & \|\mathbf{a}\|_2 = 0. \end{cases}$$

Proof: i) Every $\mathbf{z}$ can be written as $\mathbf{z} = \mathbf{a}_\perp+\gamma\mathbf{a}$, where $\mathbf{a}_\perp\perp\mathbf{a}$ and $\gamma\in\mathbb{R}$. Therefore, based on the proximity operator definition, we have

$$P_{\lambda,g}(\mathbf{a}) = \operatorname*{argmin}_{\mathbf{z}}\ \frac{1}{2}\|\mathbf{a}-\mathbf{z}\|_2^2 + g(\mathbf{z};\lambda) = \operatorname*{argmin}_{\mathbf{z}=\mathbf{a}_\perp+\gamma\mathbf{a}}\ \frac{1}{2}\big\|\mathbf{a}-\gamma\mathbf{a}-\mathbf{a}_\perp\big\|_2^2 + g\big(\gamma\mathbf{a}+\mathbf{a}_\perp;\lambda\big). \qquad (2.58)$$

Since $\|\gamma\mathbf{a}+\mathbf{a}_\perp\|_2\ge\max\big[|\gamma|\|\mathbf{a}\|_2,\,\|\mathbf{a}_\perp\|_2\big]$ and $f$ is non-decreasing, we have $g(\gamma\mathbf{a}+\mathbf{a}_\perp;\lambda)\ge g(\gamma\mathbf{a};\lambda)$. Therefore, $\mathbf{a}_\perp = \mathbf{0}$ in the optimization problem (2.58), and we can rewrite it as

$$P_{\lambda,g}(\mathbf{a}) = \mathbf{a}\,\operatorname*{argmin}_{\gamma}\left\{\frac{1}{2}(\gamma-1)^2\|\mathbf{a}\|_2^2 + g(\gamma\mathbf{a};\lambda)\right\}. \qquad (2.59)$$

The two terms in the objective function in (2.59) are increasing as $\gamma$ increases from $\gamma = 1$ or decreases from $\gamma = 0$. Hence, the minimizer lies in the interval $[0,1]$, and we have $P_{\lambda,g}(\mathbf{a}) = \gamma\mathbf{a}$ with $\gamma\in[0,1]$.

ii) Let $t = \gamma\|\mathbf{a}\|_2$. Then the optimization problem in (2.59), for $\|\mathbf{a}\|_2 > 0$, can be written as

$$P_{\lambda,g}(\mathbf{a}) = \frac{\mathbf{a}}{\|\mathbf{a}\|_2}\,\operatorname*{argmin}_{t\in[0,\|\mathbf{a}\|_2]}\left[\frac{1}{2}(\|\mathbf{a}\|_2-t)^2 + f\!\left(\frac{t}{\rho};\lambda\right)\right] \overset{(a)}{=} \frac{\mathbf{a}}{\|\mathbf{a}\|_2}\,\operatorname*{argmin}_{t\in[0,\|\mathbf{a}\|_2]}\left[\frac{1}{2}(\|\mathbf{a}\|_2-t)^2 + \frac{1}{\rho^2}f(t;\rho\lambda)\right] \overset{(b)}{=} \frac{\mathbf{a}}{\|\mathbf{a}\|_2}\,P_{\rho\lambda,\frac{f}{\rho^2}}(\|\mathbf{a}\|_2). \qquad (2.60)$$

Equality (a) follows from the homogeneity of $f$, i.e., $f\!\left(\frac{t}{\rho};\lambda\right) = f\!\left(\frac{t}{\rho};\frac{\rho\lambda}{\rho}\right) = \frac{1}{\rho^2}f(t;\rho\lambda)$. Equality (b) follows from the definition of the proximity operator. Thus, $\gamma = \frac{1}{\|\mathbf{a}\|_2}P_{\rho\lambda,\frac{f}{\rho^2}}(\|\mathbf{a}\|_2)$, and the proof of the lemma is complete.

Lemma 3. If the function $f(x;\lambda)$ is homogeneous, i.e., $f(\alpha x;\alpha\lambda) = \alpha^2 f(x;\lambda)$ for all $\alpha > 0$, then $P_{\alpha\lambda,f}(\alpha b) = \alpha P_{\lambda,f}(b)$ for all $b\in\mathbb{R}$ and $\lambda > 0$.

Proof: By the definition of the proximity operator, we have $P_{\alpha\lambda,f}(\alpha b) = \operatorname{argmin}_x\left[\frac{1}{2}(\alpha b-x)^2+f(x;\alpha\lambda)\right]$. Substituting $x = \alpha z$ and using the homogeneity of $f$, we have $P_{\alpha\lambda,f}(\alpha b) = \alpha\,\operatorname{argmin}_z\left[\frac{\alpha^2}{2}(b-z)^2+\alpha^2 f(z;\lambda)\right] = \alpha\,\operatorname{argmin}_z\left[\frac{1}{2}(b-z)^2+f(z;\lambda)\right] = \alpha P_{\lambda,f}(b)$.
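Lemma 3 is easy to check numerically for the homogeneous penalty $f(x;\lambda) = \lambda|x|$, whose proximity operator is the soft-thresholding map (2.13); a small verification sketch (names ours):

    import numpy as np

    def prox_soft(x, lam):
        return np.sign(x) * np.maximum(0.0, np.abs(x) - lam)

    rng = np.random.default_rng(0)
    b, lam, alpha = rng.standard_normal(1000), 0.7, 2.3
    # Lemma 3: P_{alpha*lam, f}(alpha*b) = alpha * P_{lam, f}(b), for alpha > 0.
    assert np.allclose(prox_soft(alpha * b, alpha * lam), alpha * prox_soft(b, lam))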
Proof of Theorem 1: Since the regularizer functions $\phi_e$ and $\phi_g$ are separable, it is easy to show that the solution of the optimization problem in Equation (2.4) can be computed in parallel for all the groups as $\hat{\mathbf{a}}_i = \operatorname{argmin}_{\mathbf{a}_i\in\mathbb{R}^N}\left\{\frac{1}{2}\|\mathbf{b}_i-\mathbf{a}_i\|_2^2+g(\mathbf{a}_i;\lambda_g)+E(\mathbf{a}_i;\lambda_e)\right\}$ for $i = 1,\ldots,N_g$, where $g(\mathbf{a}_i;\lambda_g) = f_g(\|\mathbf{a}_i\|_2;\lambda_g)$ and $E(\mathbf{a}_i;\lambda_e) = \sum_j f_e(\mathbf{a}_i[j];\lambda_e)$. To simplify the notation in the proofs of Theorem 1 and Corollary 1, we drop the group index and consider $\hat{\mathbf{a}} = \operatorname{argmin}_{\mathbf{a}}\left\{\frac{1}{2}\|\mathbf{b}-\mathbf{a}\|_2^2+g(\mathbf{a};\lambda_{\rho g})+E(\mathbf{a};\lambda_{\rho e})\right\}$, where $g(\mathbf{a};\lambda_{\rho g}) = f_g\!\left(\frac{\|\mathbf{a}\|_2}{\rho};\lambda_{\rho g}\right)$ and $E(\mathbf{a};\lambda_{\rho e}) = \sum_j f_e\!\left(\frac{\mathbf{a}[j]}{\rho};\lambda_{\rho e}\right)$. Here $\lambda_{\rho g} = \frac{\lambda_g}{\rho}$ and $\lambda_{\rho e} = \frac{\lambda_e}{\rho}$. Note that for $\rho = 1$ we recover the claim of Theorem 1. Assume $\mathbf{v} = P_{\lambda_{\rho e},E}(\mathbf{b})$ and $\mathbf{u} = P_{\lambda_{\rho g},g}(\mathbf{v})$.

Based on the above definitions, to prove the claim of Theorem 1 we need to show that $\mathbf{a} = \mathbf{u}$ is the minimizer of $J(\mathbf{a}) = \frac{1}{2}\|\mathbf{b}-\mathbf{a}\|_2^2+g(\mathbf{a};\lambda_{\rho g})+E(\mathbf{a};\lambda_{\rho e})$. We consider two cases: I: $\mathbf{u}\neq\mathbf{0}$, and II: $\mathbf{u} = \mathbf{0}$.

Case (I): $\mathbf{u}\neq\mathbf{0}$. Since $\mathbf{u} = P_{\lambda_{\rho g},g}(\mathbf{v})$ and $g(\mathbf{a};\lambda_{\rho g}) = f_g\!\left(\frac{\|\mathbf{a}\|_2}{\rho};\lambda_{\rho g}\right)$, and $f_g$ is a homogeneous non-decreasing function, Lemma 2 gives $\mathbf{u} = \gamma\mathbf{v}$ for some $\gamma\in(0,1]$. Furthermore, $\mathbf{u}$ must satisfy the first-order optimality condition for the objective function in $\mathbf{u} = \operatorname{argmin}_{\mathbf{a}}\left[\frac{1}{2}\|\mathbf{v}-\mathbf{a}\|_2^2+g(\mathbf{a};\lambda_{\rho g})\right]$, namely

$$\mathbf{0}\in\mathbf{u}-\mathbf{v}+\partial g(\mathbf{u};\lambda_{\rho g}). \qquad (2.61)$$

Using the definition of the proximity operator and Remark 1, we have $[P_{\lambda_{\rho e},E}(\mathbf{b})]_i = P_{\lambda_{\rho e},f_e}\!\left(\frac{\mathbf{b}[i]}{\rho}\right)$. Since $f_e$ is homogeneous, Lemma 3 gives $P_{\gamma\lambda_{\rho e},f_e}\!\left(\gamma\frac{\mathbf{b}[i]}{\rho}\right) = \gamma P_{\lambda_{\rho e},f_e}\!\left(\frac{\mathbf{b}[i]}{\rho}\right)$, or equivalently

$$P_{\gamma\lambda_{\rho e},E}(\gamma\mathbf{b}) = \operatorname*{argmin}_{\mathbf{a}}\ \frac{1}{2}\|\gamma\mathbf{b}-\mathbf{a}\|_2^2+E(\mathbf{a};\gamma\lambda_{\rho e}) = \gamma\,\operatorname*{argmin}_{\mathbf{z}}\ \frac{1}{2}\|\mathbf{b}-\mathbf{z}\|_2^2+E(\mathbf{z};\lambda_{\rho e}) = \gamma P_{\lambda_{\rho e},E}(\mathbf{b}) = \gamma\mathbf{v} = \mathbf{u}, \qquad (2.62)$$

and by the first-order optimality condition (of $\mathbf{u}$) for the objective function in (2.62), we have $\mathbf{0}\in\mathbf{u}-\gamma\mathbf{b}+\partial E(\mathbf{u};\gamma\lambda_{\rho e})$. Since $\gamma\neq 0$, this can be rewritten as

$$\mathbf{0}\in\mathbf{v}-\mathbf{b}+\frac{1}{\gamma}\partial E(\mathbf{u};\gamma\lambda_{\rho e}). \qquad (2.63)$$

Since $E(\mathbf{u};\lambda_{\rho e}) = \sum_j f_e\!\left(\frac{\mathbf{u}[j]}{\rho};\lambda_{\rho e}\right)$, applying the scale-invariance property of $f_e$, i.e., $f_e\!\left(\frac{\mathbf{u}[j]}{\rho};\gamma\lambda_{\rho e}\right) = \gamma f_e\!\left(\frac{\mathbf{u}[j]}{\rho};\lambda_{\rho e}\right)$, we have $\frac{1}{\gamma}\partial E(\mathbf{u};\gamma\lambda_{\rho e}) = \partial E(\mathbf{u};\lambda_{\rho e})$. Therefore, we can rewrite (2.63) as

$$\mathbf{0}\in\mathbf{v}-\mathbf{b}+\partial E(\mathbf{u};\lambda_{\rho e}). \qquad (2.64)$$

Summing Equations (2.61) and (2.64), we have $\mathbf{0}\in\mathbf{u}-\mathbf{b}+\partial g(\mathbf{u};\lambda_{\rho g})+\partial E(\mathbf{u};\lambda_{\rho e})$, which is the first-order optimality condition of $\mathbf{u}$ for the objective function $J(\mathbf{a})$.

Case (II): $\mathbf{u} = \mathbf{0}$. Here we need to show the first-order optimality condition of $\mathbf{u} = \mathbf{0}$ for the objective function $J(\mathbf{a})$, i.e., $\mathbf{0}\in[\mathbf{u}-\mathbf{b}+\partial g(\mathbf{u};\lambda_{\rho g})+\partial E(\mathbf{u};\lambda_{\rho e})]\big|_{\mathbf{u}=\mathbf{0}} = \partial g(\mathbf{0};\lambda_{\rho g})+\partial E(\mathbf{0};\lambda_{\rho e})-\mathbf{b}$. This is equivalent to showing the existence of a $\boldsymbol{\chi}_1\in[-1,+1]^N$ (corresponding to the term $\partial E(\mathbf{0};\lambda_{\rho e})$) and a $\boldsymbol{\chi}_2$ with $\|\boldsymbol{\chi}_2\|_2\le\frac{1}{\rho}$ (corresponding to the term $\partial g(\mathbf{0};\lambda_{\rho g})$) such that $\mathbf{b} = \lambda_{\rho e}\boldsymbol{\chi}_1+\lambda_{\rho g}\boldsymbol{\chi}_2$, due to property (iii) in Assumption I. By the definition of the proximity operator, we have

$$\mathbf{u} = P_{\lambda_{\rho g},g}(\mathbf{v}) = \operatorname*{argmin}_{\mathbf{z}}\ \frac{1}{2}\|\mathbf{v}-\mathbf{z}\|_2^2+g(\mathbf{z};\lambda_{\rho g}) = \operatorname*{argmin}_{\mathbf{z}}\left[\frac{1}{2}\|\mathbf{v}-\mathbf{z}\|_2^2+f_g\!\left(\frac{\|\mathbf{z}\|_2}{\rho};\lambda_{\rho g}\right)\right]. \qquad (2.65)$$

Using the first-order optimality condition of $\mathbf{u} = \mathbf{0}$ for the objective function in (2.65), we have

$$\mathbf{0}\in-\mathbf{v}+\frac{\partial f_g(0;\lambda_{\rho g})\,\partial(\|\mathbf{0}\|_2)}{\rho}. \qquad (2.66)$$

Since $\partial(\|\mathbf{0}\|_2) = \{\mathbf{x}\in\mathbb{R}^N : \|\mathbf{x}\|_2\le 1\}$ and $|z|\le\lambda_{\rho g}$ for all $z\in\partial f_g(0;\lambda_{\rho g})$ (using property (iii) in Assumption I), Equation (2.66) gives $\|\mathbf{v}\|_2\le\frac{\lambda_{\rho g}}{\rho}$. Furthermore, since $\mathbf{v} = P_{\lambda_{\rho e},E}(\mathbf{b})$, the first-order optimality condition implies that $\mathbf{0}\in\partial E(\mathbf{v};\lambda_{\rho e})+\mathbf{v}-\mathbf{b}$. Thus, for $\boldsymbol{\chi}_1\in\partial E(\mathbf{v};\lambda_{\rho e})$ and $\boldsymbol{\chi}_2 = \frac{\mathbf{v}}{\lambda_{\rho g}}$, we have $\mathbf{b} = \lambda_{\rho e}\boldsymbol{\chi}_1+\lambda_{\rho g}\boldsymbol{\chi}_2$, and the proof is complete.

Figure 2.16: Discrete representation of the regions $R_1$, $R_2$, and $R_3$ in the $(k,m)$ plane.

2.12 Appendix C: Region and Group Specification

Here, we propose a heuristic method to find the regions $R_1$, $R_2$, and $R_3$ depicted in Fig. 2.16 and introduced in Section 2.4. To describe the regions, we need to compute the values of $k_S$, $\Delta k$, and $\Delta m$, i.e., the discrete Doppler and delay parameters that correspond to $\nu_S$, $\Delta\nu$, and $\Delta\tau$, respectively. To estimate $\Delta m$ and $\Delta k$, we use a regularized least-squares estimate of $\mathbf{x}$ given by $\mathbf{x}_{LS} = \mathbf{A}_0\mathbf{y} = \mathbf{A}_0(\mathbf{A}\mathbf{x}+\mathbf{z})\approx\mathbf{x}+\mathbf{e}$, where $\mathbf{e} = \mathbf{A}_0\mathbf{z}$, $\mathbf{A}_0 = (\mathbf{A}^H\mathbf{A}+\rho^2\mathbf{I})^{-1}\mathbf{A}^H$, and $\rho$ is a small real value. Based on the relationship in Eq. (2.35), we can write $\mathbf{x}_{LS} = \operatorname{vec}\{\mathbf{H}_{LS}\}$, where $\mathbf{H}_{LS}$ is an estimate of the discrete delay-Doppler spreading function. Let us define the function $E_d(m) = \frac{1}{m}\sum_{j=1}^m\sum_{i=-K}^{+K}|\mathbf{H}_{LS}[i,j]|^2$ for $1\le m\le M$. This function represents the energy profile of the channel components in the delay direction. Then,

$$\Delta m = \min_m\{m\,|\,E_d(m)\le T_1\}, \qquad (2.67)$$

where $T_1 = \alpha_d E_d(0)$. Here $0<\alpha_d<1$ is a tuning parameter. For a highway environment [57], based on our numerical analysis, $\alpha_d\approx 0.4$ is a reasonable choice.

After computing $\Delta m$, to compute the values of $\Delta k$ and $k_s$ as labeled in Fig. 2.16, owing to the symmetry of the channel's diffuse components around the zero Doppler value, we define the function $E_\nu(k) = \sum_{i=\Delta m}^M\big(|\mathbf{H}_{LS}[k,i]|^2+|\mathbf{H}_{LS}[-k,i]|^2\big)$ for $0\le k\le K$. This function represents the energy profile of the channel components in the Doppler direction. Let us define $k_0 = \operatorname{argmax}_k E_\nu(k)$. Then, we can estimate $\Delta k$ and $k_s$ using the following equations:

$$k_s-\Delta k = \max_k\{k\,|\,E_\nu(k)<T_2,\ 0\le k<k_0\}, \qquad (2.68)$$

$$k_s = \min_k\{k\,|\,E_\nu(k)<T_2,\ k_0<k\le K\}, \qquad (2.69)$$

where $T_2 = \alpha_\nu E_\nu(k_0)$ with $0<\alpha_\nu<1$. For a highway environment, based on our numerical analysis, $\alpha_\nu\approx 0.6$ is a good choice.
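The heuristic admits a direct implementation. A sketch follows; the 0-based indexing conventions and the handling of edge cases (e.g., whether any bin falls below a threshold) are our assumptions:

    import numpy as np

    def estimate_regions(H_ls, alpha_d=0.4, alpha_nu=0.6):
        # Heuristic of Appendix 2.12. H_ls is the (2K+1) x M regularized-LS
        # estimate of the spreading function, rows indexed by k = -K, ..., K.
        twoK1, M = H_ls.shape
        K = (twoK1 - 1) // 2
        P = np.abs(H_ls) ** 2
        # Delay energy profile E_d(m) and threshold T_1, Eq. (2.67).
        E_d = np.cumsum(P.sum(axis=0)) / np.arange(1, M + 1)
        dm = int(np.argmax(E_d <= alpha_d * E_d[0]))
        # Doppler energy profile E_nu(k), folded over +/- k, for delays >= dm.
        E_nu = np.array([P[K + k, dm:].sum() + P[K - k, dm:].sum()
                         for k in range(K + 1)])
        k0 = int(np.argmax(E_nu))
        below = np.where(E_nu < alpha_nu * E_nu[k0])[0]
        ks = int(below[below > k0].min())          # Eq. (2.69)
        dk = ks - int(below[below < k0].max())     # Eq. (2.68)
        return dm, ks, dk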
2.13 Appendix D: Proximal ADMM Iteration Development

For the optimization problem given in (2.46), we form the augmented Lagrangian

$$L_\rho(\mathbf{x},\mathbf{w},\boldsymbol{\theta}) = \frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2+\phi_g(|\mathbf{w}|;\lambda_g)+\phi_e(|\mathbf{w}|;\lambda_e)+\langle\boldsymbol{\theta},\mathbf{x}-\mathbf{w}\rangle+\frac{\rho^2}{2}\|\mathbf{x}-\mathbf{w}\|_2^2, \qquad (2.70)$$

where $\boldsymbol{\theta}$ is the dual variable, $\rho\neq 0$ is the augmented Lagrangian parameter, and $\langle\mathbf{a},\mathbf{b}\rangle = \operatorname{Re}(\mathbf{b}^H\mathbf{a})$. Thus, ADMM consists of the iterations:

• update-x: $\mathbf{x}^{n+1} = \operatorname{argmin}_{\mathbf{x}}\ \frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2+\langle\boldsymbol{\theta}^n,\mathbf{x}-\mathbf{w}^n\rangle+\frac{\rho^2}{2}\|\mathbf{x}-\mathbf{w}^n\|_2^2$.

• update-w: $\mathbf{w}^{n+1} = \operatorname{argmin}_{\mathbf{w}}\ \phi_g(|\mathbf{w}|;\lambda_g)+\phi_e(|\mathbf{w}|;\lambda_e)+\langle\boldsymbol{\theta}^n,\mathbf{x}^{n+1}-\mathbf{w}\rangle+\frac{\rho^2}{2}\|\mathbf{x}^{n+1}-\mathbf{w}\|_2^2$.

• update dual variable: $\boldsymbol{\theta}^{n+1} = \boldsymbol{\theta}^n+\rho^2(\mathbf{x}^{n+1}-\mathbf{w}^{n+1})$.

Deriving a closed-form expression for update-x is straightforward: $\mathbf{x}^{n+1} = \rho^2\mathbf{A}_0(\mathbf{w}^n-\boldsymbol{\theta}_\rho^n)+\mathbf{x}_0$, where $\boldsymbol{\theta}_\rho^n = \frac{\boldsymbol{\theta}^n}{\rho^2}$, $\mathbf{A}_0 = (\rho^2\mathbf{I}+\mathbf{A}^H\mathbf{A})^{-1}$, and $\mathbf{x}_0 = \mathbf{A}_0\mathbf{A}^H\mathbf{y}$. If we pull the linear terms into the quadratic ones in the objective function of update-w and ignore additive terms independent of $\mathbf{w}$, we can express this step as

$$\mathbf{w}^{n+1} = \operatorname*{argmin}_{\mathbf{w}}\left\{\frac{1}{2}\big\|\mathbf{x}^{n+1}+\boldsymbol{\theta}_\rho^n-\mathbf{w}\big\|_2^2+\frac{1}{\rho^2}\big(\phi_g(|\mathbf{w}|;\lambda_g)+\phi_e(|\mathbf{w}|;\lambda_e)\big)\right\} = \operatorname*{argmin}_{\mathbf{w}}\sum_{i=1}^{N_g}\left\{\frac{1}{2}\big\|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n-\mathbf{w}_i\big\|_2^2+f_g\!\left(\frac{\|\mathbf{w}_i\|_2}{\rho};\frac{\lambda_g}{\rho}\right)+\sum_j f_e\!\left(\frac{|\mathbf{w}_i[j]|}{\rho};\frac{\lambda_e}{\rho}\right)\right\}, \qquad (2.71)$$

where $\mathbf{x}_i$, $\mathbf{w}_i$, and $\boldsymbol{\theta}_{\rho i}$ are computed using the partitions introduced for the channel vector in Section 2.6. Thus, we can perform the update-w step in parallel for all groups:

$$\mathbf{w}_i^{n+1} = \operatorname*{argmin}_{\mathbf{w}_i}\left\{\frac{1}{2}\big\|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n-\mathbf{w}_i\big\|_2^2+f_g\!\left(\frac{\|\mathbf{w}_i\|_2}{\rho};\frac{\lambda_g}{\rho}\right)+\sum_j f_e\!\left(\frac{|\mathbf{w}_i[j]|}{\rho};\frac{\lambda_e}{\rho}\right)\right\}. \qquad (2.72)$$

Here, for simplicity of representation, we define $\lambda_{\rho g} = \frac{\lambda_g}{\rho}$ and $\lambda_{\rho e} = \frac{\lambda_e}{\rho}$. In addition, we define $E(|\mathbf{w}_i|;\lambda_{\rho e}) = \sum_j f_e\!\left(\frac{|\mathbf{w}_i[j]|}{\rho};\lambda_{\rho e}\right)$ and $g(|\mathbf{w}_i|;\lambda_{\rho g}) = f_g\!\left(\frac{\|\mathbf{w}_i\|_2}{\rho};\lambda_{\rho g}\right)$. Thus, we have

$$\mathbf{w}_i^{n+1} = \operatorname*{argmin}_{\mathbf{w}_i}\ \frac{1}{2}\big\|\mathbf{x}_i^{n+1}+\boldsymbol{\theta}_{\rho i}^n-\mathbf{w}_i\big\|_2^2+g(|\mathbf{w}_i|;\lambda_{\rho g})+E(|\mathbf{w}_i|;\lambda_{\rho e}). \qquad (2.73)$$

To guarantee convergence to the optimal solution of $(P_0)$, the overall objective function, i.e., $\frac{1}{2}\|\mathbf{y}-\mathbf{A}\mathbf{x}\|_2^2+\phi_g(|\mathbf{x}|;\lambda_g)+\phi_e(|\mathbf{x}|;\lambda_e)$, should be convex [36]. Note that since the first term in the objective function, i.e., the quadratic penalty, is convex, for any functions $f_e$ and $f_g$ that satisfy condition (iv) in Assumption I, the overall objective function is convex as well. Thus, ADMM converges for all choices of the convex and non-convex functions given in Section 2.2.2.

2.14 Appendix E: Proof of Lemma 1

The function $\|\mathbf{c}-\mathbf{w}\|_2^2 = \|\mathbf{c}\|_2^2+\|\mathbf{w}\|_2^2-2\operatorname{Re}\{\mathbf{c}^H\mathbf{w}\} = \||\mathbf{c}|\|_2^2+\||\mathbf{w}|\|_2^2-2\operatorname{Re}\{\mathbf{c}^H\mathbf{w}\}$ is minimized with respect to the phase of $\mathbf{w}$ when $\operatorname{Re}\{\mathbf{c}^H\mathbf{w}\}$ is maximized. Now,

$$\operatorname{Re}\{\mathbf{c}^H\mathbf{w}\} = \sum_{n=1}^N|\mathbf{c}[n]||\mathbf{w}[n]|\cos\big(\operatorname{Ang}(\mathbf{c}[n])-\operatorname{Ang}(\mathbf{w}[n])\big) \le \sum_{n=1}^N|\mathbf{c}[n]||\mathbf{w}[n]| = |\mathbf{c}|^T|\mathbf{w}|, \qquad (2.74)$$

with equality if and only if $\operatorname{Phase}(\mathbf{w}) = \operatorname{Phase}(\mathbf{c})$, which in turn implies that $\|\mathbf{c}-\mathbf{w}\|_2^2 = \||\mathbf{c}|-|\mathbf{w}|\|_2^2$. Hence,

$$\operatorname*{argmin}_{|\mathbf{w}|\odot\operatorname{Phase}(\mathbf{w})\in\mathbb{C}^N}\|\mathbf{c}-\mathbf{w}\|_2^2 = \operatorname*{argmin}_{|\mathbf{w}|\odot\operatorname{Phase}(\mathbf{c})\in\mathbb{C}^N}\big\||\mathbf{c}|-|\mathbf{w}|\big\|_2^2 = \operatorname{Phase}(\mathbf{c})\odot\operatorname*{argmin}_{|\mathbf{w}|\in\mathbb{R}^N}\big\||\mathbf{c}|-|\mathbf{w}|\big\|_2^2, \qquad (2.75)$$

and the lemma follows.

Chapter 3

MSML Channel Estimation

3.1 Introduction

The underwater acoustic (UWA) communication channel is wideband due to the low speed of propagation of sound in water, as well as the relatively high Doppler induced by mobility [54, 64, 84]. Recent measurement campaigns [47, 56, 99, 7] and modeling efforts for wideband time-varying channels suggest the consideration of time-scaling as well as multipath for UWA channels. That is, the received signal in UWA communication is modeled as the superposition of differently scaled, delayed, and attenuated versions of the transmitted signal [102, 64, 60, 103, 54, 10].
The multipath arises from reflections of the signal off scatterers in the environment. The delays are due to differing path lengths from the transmitter to the scatterer to the receiver. Finally, the relative motion of the transmitter, scatterers, and receiver dilates/contracts the transmitted waveform in time, causing the Doppler scale effect [84, 54]. In much prior work on estimating underwater channels [60, 105, 64], it was assumed that all channel paths experience a common, single Doppler scale. Then, using resampling [103, 10, 60, 7], the single Doppler scale was compensated, delays were estimated by classical methods such as MUSIC and ESPRIT [7], and channel gains were computed by least-squares estimation [7, 64, 60]. The inherent multi-scale nature of the channel implies that additional inter-carrier interference suppression is needed if a single-scale model is adopted [10, 103, 60]; furthermore, [7, 47] specifically consider the losses incurred by using a single-scale model in multi-scale channel estimation.

In this chapter, we consider the multi-scale multi-lag (MSML) channel model estimation problem for underwater acoustic communication using OFDM signals. There has been significant recent interest in the use of OFDM for underwater acoustic communications, e.g., [60, 105, 7]. One of the challenges in OFDM is the sensitivity of carrier orthogonality to time-varying multipath and motion-induced Doppler distortion. Therefore, high-quality channel estimation is needed for equalization. For narrowband channels, maximum-likelihood (ML) approaches, which reduce to correlation methods, are effective for estimating the Doppler effect [98, 105]. This follows from the fact that in narrowband communication channels, the Doppler effect can be well modeled as a frequency offset. However, MSML channel estimation via ML requires solving a multi-dimensional non-linear least-squares problem, incurring high complexity and typically requiring an exhaustive search.

Due to the sparse nature of the UWA channel, sparse approximation methods [7, 104, 24] have been employed to estimate the MSML channel. The dictionary is based on discretizing the support of the Doppler scales and channel delays. MSML channel estimation is very sensitive to errors in scale, and basis mismatch [29] can cause considerable degradation. In principle, this issue can be overcome: spectral estimation with a small number of unknown frequencies can be recast as a semi-definite program (SDP) using the atomic norm [94, 93, 27]; however, such schemes have a frequency resolution constraint for one-dimensional spectral estimation [94, 27], which is strongly violated in UWA channels and would thus necessitate an impractical number of samples if the SDP approach were used.

In contrast, we exploit the closeness of the Doppler scales to show that our MSML data matrix has a low-rank, Hankel structure with rank equal to the number of active subcarriers in the training signal. For our suggested training scheme, the rank is approximately one. Maximum-likelihood estimation of the unknown channel parameters is adapted to spectral estimation, and the Prony algorithm [32] is employed for scale estimation.¹ In classical spectral estimation from noisy measurements, the closer the unknown frequencies, the worse the performance.

¹ Due to the nature of spectral estimation, we could also adapt matrix pencil methods [50, 88] to our problem; however, the Prony and matrix pencil methods achieve comparable performance; see, e.g., [88].
We use a deterministic training signal; furthermore, no random projections or random sampling are required. Convex and non-convex regularizers are employed to enforce the low-rank structure in our optimizations, and the Alternating Direction Method of Multipliers (ADMM) [36] is used to solve the optimization. The nature of our objective function, as well as the complexity and convergence properties of ADMM, motivate its use. We observe that [62] employs ADMM for a low-rank structure as well. In [62], a system identification problem is considered and a Hankel structure also occurs; there the low-rank property is a further assumption which is invoked, whereas in our problem the low-rank structure is inherent.

To summarize, the contributions of this chapter are as follows. (1) We show that the closeness of the scales can be an asset, by proving that the data matrix is both Hankel in nature and low-rank; thus, frequency resolution does not pose an issue. (2) We adapt the Prony method for spectral estimation to exploit the low rank of the data matrix and the inherent sparsity of the UWA MSML channel. And (3) we provide a bound on the error between the true noiseless received signal and the estimated one, which informs the choice of the regularization parameter in our objective function. We note that while the channel is varying within our observation interval, the parameters we wish to estimate are relatively static, and thus our estimation algorithm is not compromised. Furthermore, we are able to achieve our results with relatively few samples (two times the sparsity level of the channel), simple deterministic training signals, and without estimating correlation matrices or using randomized sampling or randomized projections. Finally, our proposed methods achieve strong performance gains over newly proposed estimation strategies based on sparse approximation for spectral estimation.

The rest of this chapter is organized as follows. In Section 3.2, the signal and channel models are presented. The proposed noiseless channel estimation algorithm is presented in Section 3.3. Section 3.4 presents the near low-rank property of the received OFDM signal. In Section 3.5, we propose an optimization problem that exploits the structural information to remove the noise and estimate the MSML channel, and we provide a bound on the regularization/relaxation parameter for the convex approach. Then, in Section 3.6, we review the sparse approximation approach based on the basis pursuit algorithm and the SDP method for MSML channel estimation, against which our methods will be compared. Section 3.7 presents numerical simulations verifying the performance of our proposed algorithms, and the final section concludes the chapter.

Notation: We denote a scalar by $x$, a column vector by $\mathbf{x}$ with $i$th element $\mathbf{x}[i]$, and similarly a matrix by $\mathbf{X}$ with $(i,j)$th element $\mathbf{X}[i,j]$. The transpose of $\mathbf{X}$ is denoted $\mathbf{X}^T$ and its conjugate transpose $\mathbf{X}^H$. An $N\times N$ identity matrix is written $\mathbf{I}_N$. The real part of a complex number is denoted $\operatorname{Re}(\cdot)$. The operator $\operatorname{tr}\{\mathbf{A}\}$ denotes the trace of the matrix $\mathbf{A}$. We denote the matrix inner product by $\langle\mathbf{A},\mathbf{B}\rangle = \operatorname{Re}\big(\operatorname{tr}\{\mathbf{A}^H\mathbf{B}\}\big)$. $\mathbf{X} = \mathcal{H}(\mathbf{x})$ denotes the Hankel relationship between the vector $\mathbf{x}$ and the matrix $\mathbf{X}$, namely $\mathbf{X}[i,j] = \mathbf{x}[i+j]$. The set of real numbers is denoted $\mathbb{R}$, and the set of complex numbers $\mathbb{C}$.

3.2 Signal and Channel Models

We consider OFDM signaling over an MSML channel.
The transmitted passband OFDM signal is given by $x(t) = \operatorname{Re}\{s(t)e^{j2\pi f_c t}\}$, where

$$s(t) = \sum_{k=-K/2}^{K/2-1} s_k\,e^{j2\pi f_k t}, \quad (0\le t\le T), \qquad (3.1)$$

where $T$ is the OFDM symbol duration; $K$ is the number of subcarriers; $s_k$ is the data modulated onto the $k$th subcarrier; $f_k = k\Delta f$ is the $k$th subcarrier frequency, where $\Delta f = \frac{1}{T}$ is the subcarrier spacing; $f_c$ is the carrier frequency; and $B = K\Delta f$ is the bandwidth of the system. A rectangular pulse shape over the interval $t\in[0,T]$ is employed. Note that we assume that the cyclic prefix is longer than the delay spread and that the cyclic postfix has sufficient duration to ensure signal continuity in the observation interval [103].

The time-varying channel model can be represented by

$$h(t,\tau) = \sum_{p=1}^P h_p(t)\,\delta(\tau-\tau_p(t)), \qquad (3.2)$$

where $h_p(t)$ is the path amplitude, $\tau_p(t)$ is the time-varying path delay, and $P$ is the number of dominant propagation paths. In underwater acoustic communication, the continuously time-varying delays are caused by the motion of the transmitter/receiver, as well as by scattering off the moving sea surface and refraction due to sound-speed variations. The path amplitudes change with the delays, as the attenuation is related to the distance traveled and to the physics of the scattering and propagation processes. For the duration of an OFDM symbol, the time variation of the path delays $\tau_p(t)$ evolves linearly as a function of time, namely $\tau_p(t) = \tau_p-a_p t$, where $a_p$ is the Doppler scaling factor [64]. The delay and scale values are assumed to lie within finite intervals,

$$\tau_p\in[0,\tau_{max}] \quad \text{and} \quad a_p\in[-a_{max},a_{max}], \qquad (3.3)$$

where $\tau_{max}$ denotes the maximum delay spread of the channel and $a_{max}$ is the maximum Doppler scale. When the transmitter and receiver move in the same direction, the sign of the Doppler scale is positive, and when they move in opposite directions, the sign of the Doppler scale is negative. Results derived from practical measurement campaigns show that usually $a_{max}\le 0.001$ and $\tau_{max}\le 20$ ms [79, 54]. Hence the channel impulse response can be simplified to

$$h(t,\tau) = \sum_{p=1}^P h_p\,\delta\big(\tau-[\tau_p-a_p t]\big). \qquad (3.4)$$

This representation of the channel is called the multi-scale multi-lag (MSML) model of the underwater acoustic communication channel [60, 7]. We assume that the channel gains and delays are constant during a transmission packet. The OFDM block duration is often less than 100 ms when the number of subcarriers is less than 1024, while the channel coherence time is on the order of seconds [47, 99, 7]. Therefore, this assumption is reasonable within this duration.

The bandpass signal received through the linear time-varying (LTV) channel can be written as

$$y(t) = \int_{-\infty}^{+\infty} h(t,\tau)\,x(t-\tau)\,d\tau + n(t), \qquad (3.5)$$

where $n(t)$ is additive white Gaussian noise. The received bandpass signal, $y(t)$, can be represented as

$$y(t) = \sum_{p=1}^P h_p\,x((1+a_p)t-\tau_p) + n(t). \qquad (3.6)$$

If we let $y(t) = \operatorname{Re}\{r(t)e^{j2\pi f_c t}\}$ and $n(t) = \operatorname{Re}\{w(t)e^{j2\pi f_c t}\}$, then we can express the baseband system model as

$$r(t) = \sum_{p=1}^P h_p\,e^{-j2\pi f_c[\tau_p-a_p t]}\,s((1+a_p)t-\tau_p) + w(t) = \sum_{p=1}^P\sum_k c_{p,k}\,e^{j2\pi[(1+a_p)(f_c+f_k)-f_c]t} + w(t), \qquad (3.7)$$

where $c_{p,k} = h_p s_k e^{-j2\pi(f_c+f_k)\tau_p}$.
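As a sanity check of the baseband model, (3.7) can be synthesized directly; a sketch with illustrative arguments (all names ours):

    import numpy as np

    def baseband_rx(t, s_syms, h, tau, a, fc, T):
        # Eq. (3.7): r(t) = sum_p sum_k c_{p,k} e^{j2 pi [(1+a_p)(fc+f_k) - fc] t},
        # with c_{p,k} = h_p s_k e^{-j2 pi (fc + f_k) tau_p}.
        K = len(s_syms)
        f = np.arange(-K // 2, K // 2) / T       # f_k = k * (1/T)
        r = np.zeros_like(t, dtype=complex)
        for hp, tp, ap in zip(h, tau, a):
            c = hp * s_syms * np.exp(-2j * np.pi * (fc + f) * tp)
            r += (c[None, :]
                  * np.exp(2j * np.pi * ((1 + ap) * (fc + f)[None, :] - fc)
                           * t[:, None])).sum(axis=1)
        return r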
To simplify the channel estimation strategy, we consider a simple training symbol structure to estimate the MSML channel parameters. Since underwater acoustic channels exhibit fast time variation, we consider only a single OFDM symbol for the channel estimation process. Our training symbol has one active subcarrier (our theoretical results in this chapter are valid for any number of active subcarriers in the training symbol, but for simplicity we consider only one), say the $k_0$-th subcarrier, where $k_0 \in \{-K/2, \dots, K/2-1\}$, and all other subcarriers are zero (null subcarriers). Therefore, the received training signal in Eq. (3.7) simplifies to
$$ r(t) = \sum_{p=1}^{P} c_p\, e^{j2\pi[(1+a_p)(f_c+f_{k_0})-f_c]t} + w(t), \qquad (3.8) $$
where $c_p = c_{p,k_0}$. After sampling the received signal uniformly with sampling time $T_s = T/K$, we express the sampled signal as $r[i] = d[i] + w[i]$, where $i$ denotes the sample index and
$$ d[i] = \sum_{p=1}^{P} c_p z_p^i, \quad \forall i \in \{0, \dots, S-1\}, \qquad (3.9) $$
where $S \le K$ denotes the total number of training signal samples used for channel estimation, and
$$ c_p = h_p s_{k_0} e^{-j2\pi(f_c+f_{k_0})\tau_p}, \qquad (3.10) $$
$$ z_p = e^{j2\pi[(1+a_p)(f_c+f_{k_0})-f_c]T_s}. \qquad (3.11) $$
According to Eq. (3.10), the coefficients $c_p$ contain the information about the channel attenuation gains $h_p$ and delays $\tau_p$ for $1 \le p \le P$. Similarly, by Eq. (3.11), $z_p$ depends only on the scale value $a_p$, and $|z_p| = 1$. Note that the Doppler scales $a_p$ are typically very small, so the values $(f_{k_0}+f_c)(1+a_p) - f_c$ for $p = 1, \dots, P$ are very close to each other. As our sampling rate is $K/T$, we have the potential for $K$ samples per symbol.
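As a concrete illustration of the sampled model (3.9)-(3.11), the following sketch synthesizes noiseless training samples d[i] for a randomly drawn MSML channel. All numerical values are assumptions for demonstration only, and Rayleigh gains stand in for the Rician gains used later in the simulations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative system parameters (assumed values, not from a measurement campaign).
K, fc, df = 1024, 11e3, 11.72      # subcarriers, carrier freq. (Hz), spacing (Hz)
T = 1.0 / df                        # OFDM symbol duration
Ts = T / K                          # baseband sampling period
k0, s_k0 = 10, 1.0                  # active subcarrier index and its pilot symbol
f_k0 = k0 * df

# Random MSML channel realization: P paths with gains, delays, Doppler scales.
P = 5
h = rng.rayleigh(1.0, P)            # Rayleigh stand-in for Rician gains
tau = rng.uniform(0.0, 10e-3, P)    # delays within [0, tau_max]
a = rng.uniform(-1e-3, 1e-3, P)     # Doppler scales within [-a_max, a_max]

# Per-path parameters of Eqs. (3.10)-(3.11).
c = h * s_k0 * np.exp(-2j * np.pi * (fc + f_k0) * tau)
z = np.exp(2j * np.pi * ((1 + a) * (fc + f_k0) - fc) * Ts)

# Noiseless samples d[i] = sum_p c_p z_p^i of Eq. (3.9), with S = 2P samples.
S = 2 * P
d = np.array([np.sum(c * z**i) for i in range(S)])
```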
3.3 Noiseless Channel Estimation

In this section, we develop algorithms to estimate the channel parameters $(h_p, a_p, \tau_p)$, $1 \le p \le P$, from the (noiseless) received signal samples $d[i]$, $i \in \{0, 1, \dots, S-1\}$, where $S$ denotes the total number of measurements required by the channel estimation algorithm. The MSML channel estimation problem, given in (3.8), is a parametric estimation problem; it can also be interpreted as retrieving the parameters of a sum of complex exponentials, given in (3.9), from noisy samples. Thus, we can formulate the MSML channel estimation problem in the absence of additive noise as follows. Given the noiseless measurements
$$ d[i] = \sum_{p=1}^{P} c_p z_p^i, \quad \forall i \in \{0, \dots, S-1\}, \qquad (3.12) $$
where the parameters $c_p$ and $z_p$, $1 \le p \le P$, are defined in Eqs. (3.10) and (3.11), how can we determine the unknown channel parameters $h_p$, $\tau_p$, and $a_p$ from the available signal samples $d[i]$?

The key observation is that Eq. (3.12) is the solution to a homogeneous difference equation
$$ d[m-P] + q[P-1]\,d[m-P+1] + \cdots + q[0]\,d[m] = 0, \qquad (3.13) $$
where the $q[p]$, $0 \le p \le P-1$, are the unknown coefficients of the homogeneous difference equation and $m \in \mathbb{Z}$. If the coefficients $\mathbf{q}$ were known, one could retrieve the $c_p$ and $z_p$ as follows: (1) find the roots of the z-transform of Eq. (3.13), i.e., $Q(z) = z^{-P} + \sum_{p=0}^{P-1} q[p] z^{-p}$, which are in fact the $z_p$; (2) then solve a further linear system of equations to determine the weights $c_p$. This idea was first introduced by Gaspard Riche de Prony in 1795 [32]. We denote the vector $\mathbf{q} = [q[P-1], \dots, q[0]]^T$ as the coefficients of the annihilating filter of the received data; as Eq. (3.13) shows, passing the received signal through this filter yields zero. We next detail the steps to retrieve the $c_p$ and $z_p$ from the noiseless data $d[i]$.

Finding the coefficients of the filter $Q(z)$: By letting $m$ range from $P$ to $S-1$ in (3.13), we can rewrite (3.13) in matrix/vector form as
$$ \mathbf{D}\mathbf{q} = -\mathbf{d}_P, \qquad (3.14) $$
where
$$ \mathbf{D} = \begin{bmatrix} d[1] & \cdots & d[P] \\ d[2] & \cdots & d[P+1] \\ \vdots & & \vdots \\ d[S-P] & \cdots & d[S-1] \end{bmatrix}, \qquad (3.15) $$
and $\mathbf{d}_P = [d[0], \dots, d[S-P-1]]^T$. Note that $\mathbf{D} \in \mathbb{C}^{(S-P)\times P}$ is a Hankel matrix [9], and $d_P[i] = d[i]$ for $0 \le i \le S-P-1$. Hereafter, we denote the Hankel relationship between the vector $\mathbf{d} = [d[0], d[1], \dots, d[S-1]]^T$ and the matrix $[\mathbf{d}_P, \mathbf{D}]$ in Equation (3.15) by $[\mathbf{d}_P, \mathbf{D}] = \mathcal{H}(\mathbf{d})$. If $S \ge 2P$, then $\mathbf{q} = -\mathbf{D}^{+}\mathbf{d}_P$, where $\mathbf{D}^{+}$ denotes the Moore-Penrose pseudo-inverse of $\mathbf{D}$.

Finding the $z_p$: Given $\mathbf{q}$, the $z_p$ are estimated as the roots of the polynomial
$$ Q(z) = z^{-P} + \sum_{p=0}^{P-1} q[p] z^{-p} = 0. $$

Finding the $c_p$: Substituting the $z_p$ computed in the previous step into Equation (3.12), the final step is to solve for the vector $\mathbf{c}$:
$$ \begin{bmatrix} 1 & 1 & \dots & 1 \\ z_1 & z_2 & \dots & z_P \\ \vdots & \vdots & \ddots & \vdots \\ z_1^{S-1} & z_2^{S-1} & \dots & z_P^{S-1} \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_P \end{bmatrix} = \begin{bmatrix} d[0] \\ \vdots \\ d[S-1] \end{bmatrix}. \qquad (3.16) $$
Let us define the coefficient matrix $\mathbf{V} = [\mathbf{z}_1, \dots, \mathbf{z}_P]$, where $\mathbf{z}_i = [z_i^0, \dots, z_i^{S-1}]^T$. Here $\mathbf{V} \in \mathbb{C}^{S\times P}$ is a Vandermonde matrix, and $z_1, \dots, z_P$ are the roots of $Q(z)$. Then $\mathbf{c} = \mathbf{V}^{+}\mathbf{d}$.

Up to this point, we have developed an algorithm to retrieve the parameters $c_p$ and $z_p$ from noiseless data $d[i]$. However, the received data $r[i]$ are noisy, i.e., $r[i] \ne d[i]$; thus the solution to the Hankel equations (3.15) produces perturbed linear prediction coefficients. The algorithm would then find the $z_p$ by rooting the perturbed polynomial, and finally would use the perturbed $z_p$ locations to form the Vandermonde system of equations determining $\mathbf{c}$ in (3.16). Hence, errors caused by noise in the data propagate through the algorithm. Least-squares (LS) methods with a large number of measurements can be employed to improve robustness to noise [91]. The main drawback of the LS-based approaches is that they need a fairly large number of measurements ($S \gg 2P$). Increasing the size of the data matrix also increases the complexity of these methods. Furthermore, they do not use the information offered by the structure of the desired signal to remove the noise from the received measurements. In the sequel, we show that the noiseless signal $d[i]$ has a key structural feature, and then, in Section 3.5, we exploit this feature to improve the quality of channel estimation.
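The three noiseless steps above admit a compact implementation; the sketch below is ours, assumes distinct z_p and S >= 2P, and includes a synthetic sanity check rather than real channel data.

```python
import numpy as np

def prony_msml(d, P):
    """Recover (c_p, z_p) from noiseless d[i] = sum_p c_p z_p^i (Section 3.3).

    Step 1: solve the Hankel system D q = -d_P for the annihilating filter.
    Step 2: root Q(z); the roots are the z_p.
    Step 3: solve the Vandermonde system (3.16) for the weights c_p.
    """
    S = len(d)
    D = np.array([[d[i + j + 1] for j in range(P)] for i in range(S - P)])
    q = -np.linalg.pinv(D) @ d[: S - P]       # q = [q[P-1], ..., q[0]]
    # Multiplying Q(z) by z^P gives q[0] z^P + q[1] z^(P-1) + ... + q[P-1] z + 1.
    z = np.roots(np.append(q[::-1], 1.0))
    V = np.vander(z, N=S, increasing=True).T  # V[i, p] = z_p^i
    c = np.linalg.pinv(V) @ d
    return c, z

# Sanity check on synthetic data: recovery should be exact up to numerics.
rng = np.random.default_rng(1)
P = 3
c_true = rng.normal(size=P) + 1j * rng.normal(size=P)
z_true = np.exp(2j * np.pi * rng.uniform(-0.1, 0.1, P))
d = np.array([np.sum(c_true * z_true**i) for i in range(2 * P)])
c_hat, z_hat = prony_msml(d, P)
print(np.allclose(np.sort_complex(z_hat), np.sort_complex(z_true)))
```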
3.4 Structural Property of the Noiseless Signal

In this section, we show that the matrix $\mathbf{D} = \mathcal{H}(\mathbf{d})$ can be approximated by a rank-one Hankel matrix. Thus, the received signal can be approximated in a subspace of dimension less than $P$. This structural feature enables data de-noising to reduce the perturbation error at low SNR. The structure stems from the two following features: (1) the Doppler scales in an MSML channel are very small, so the newly generated frequencies $(f_{k_0}+f_c)(1+a_p) - f_c$, $1 \le p \le P$, are very close to each other and to the corresponding carrier frequency; and (2) our particular design of the OFDM training symbol, in which only one subcarrier is active and the others are null subcarriers.

To show the low-rank property of the data matrix $\mathbf{D}$, we need to state two key lemmas. Lemma 4 shows that in the MSML channel, if each path experiences a distinct Doppler scale, then the data matrix $\mathbf{D}$ is a full-rank square matrix with $S = 2P$ measurements. This indicates that in the noiseless scenario, with $S = 2P$ measurements, the algorithm proposed in Section 3.3 can exactly estimate the channel parameters. Furthermore, in Theorem 2, we show that the data matrix $\mathbf{D}$ can be approximated by a rank-one Hankel matrix. Using this feature, we propose a noise-reduction scheme for the measurement matrix $\mathbf{R}$, and then apply the algorithm developed in Section 3.3 to estimate the channel parameters.

Lemma 4. Assume that $S = 2P$. Then, in the noiseless case, the data matrix $\mathbf{D}$ can be decomposed as
$$ \mathbf{D} = \boldsymbol{\Pi}\boldsymbol{\Lambda}\boldsymbol{\Pi}^T, \qquad (3.17) $$
where $\boldsymbol{\Pi} = [\boldsymbol{\pi}_1, \dots, \boldsymbol{\pi}_P]$ is a Vandermonde matrix with $\boldsymbol{\pi}_p^T = [z_p^0, z_p^1, \dots, z_p^{P-1}]$ for $p \in \{1, 2, \dots, P\}$, and $\boldsymbol{\Lambda} = \mathrm{diag}\{c_1, c_2, \dots, c_P\}$. If the scale parameters are distinct, namely $a_i \ne a_j$ for all $i \ne j$, then the matrix $\mathbf{D}$ is full rank.

The proof is provided in Appendix 3.9. We next state another lemma, on Vandermonde decompositions of Hankel matrices, that will enable the proof of the low-rank approximation of the data matrix $\mathbf{D}$.

Lemma 5. Let $\boldsymbol{\pi}_p^T = [z_p^0, z_p^1, \dots, z_p^{P-1}]$ for $p \in \{1, 2, \dots, P\}$. Then the matrix $\mathbf{D} = \sum_{p\in A_P} \lambda_p \boldsymbol{\pi}_p\boldsymbol{\pi}_p^T$ is a Hankel matrix with rank at most $|A_P|$, where $A_P \subset \{1, 2, \dots, P\}$ is an index set and the $\lambda_p$ are arbitrary nonzero numbers. If the vectors $\boldsymbol{\pi}_p$ are orthogonal, i.e., $\boldsymbol{\pi}_{p_1}^H\boldsymbol{\pi}_{p_2} = 0$ for all $p_1, p_2 \in A_P$ with $p_1 \ne p_2$, then the rank of $\mathbf{D}$ equals $|A_P|$.

The proof is straightforward and thus omitted. In Example 1, we illustrate the low-rank behavior of the data matrix $\mathbf{D}$ due to the Doppler scale effect, using a numerical example.

Example 1. Consider a particular realization of the MSML channel
$$ h(t,\tau) = \sum_{p=1}^{P} h_p\,\delta(\tau - [\tau_p - a_p t]), $$
where the $h_p$ are generated randomly with a Rician distribution with parameters $(0, 1)$, the $a_p$ are randomly (uniformly) chosen from the interval $[-a, a]$ with $a = 0.001$, and the $\tau_p$ are uniformly chosen from $[0, \tau_{\max}]$ with $\tau_{\max} = 10$ msec [79]. Furthermore, we consider $P = 5$ and $20$. As shown in Fig. 1, surprisingly, in the noiseless scenario only one of the singular values of the data matrix $\mathbf{D}$ is large and the remaining singular values are close to zero, while in the noisy scenario (for SNR = 10 dB) all the singular values are large.

Using Lemmas 4 and 5, we prove in Theorem 2 that the data matrix $\mathbf{D}$ can be approximated by a rank-one Hankel matrix.

Theorem 2. Consider the data matrix $\mathbf{D}$ defined in Equation (3.15). Then $\mathbf{D}$ can be approximated by a rank-one Hankel matrix $\hat{\mathbf{D}} = \lambda_c \boldsymbol{\pi}_{\hat m}\boldsymbol{\pi}_{\hat m}^T$, where $\lambda_c = \sum_{p=1}^{P} c_p$ with $c_p$ defined in Equation (3.10), and $\hat m = \mathrm{argmin}_m \sum_{p=1}^{P} |h_p(a_p - a_m)|$. The approximation error is bounded as
$$ \frac{1}{P}\|\mathbf{D} - \hat{\mathbf{D}}\|_F \le \gamma_0 \sum_{p=1}^{P} |h_p(a_p - a_{\hat m})|, \qquad (3.18) $$
where $\gamma_0 = \sqrt{8}\,\pi T_s (f_{k_0}+f_c)$.

The proof is provided in Appendix 3.10. We can see from (3.18) that if $a_p \approx a_m$ for all $m \ne p$, the resulting error bound can be quite small, yielding a tight approximation. In Section 3.7, we show numerically that this approximation holds for Doppler scale values $a \le 0.001$, with normalized mean squared error upper bounded by $0.01$ when the number of dominant channel paths $P$ is less than 20. While our results are most straightforwardly shown for the case of distinct scales ($a_i \ne a_j$, $\forall i \ne j$), they generalize to the case of common scales. In fact, from Lemma 4, it is clear that if there are common scales, the rank of $\mathbf{D}$ decreases, which results in an improved low-rank approximation.
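The near-rank-one behavior described in Example 1 and Theorem 2 is easy to check numerically; the sketch below prints the normalized singular values of H(d) under an assumed parameter set of our choosing.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative MSML parameters: small Doppler scales make the z_p nearly
# identical, so the Hankel data matrix is close to rank one (Theorem 2).
P, S = 5, 10
fc, f_k0, Ts = 11e3, 117.2, 1.0 / 12e3
a = rng.uniform(-1e-3, 1e-3, P)
c = rng.normal(size=P) + 1j * rng.normal(size=P)
z = np.exp(2j * np.pi * ((1 + a) * (fc + f_k0) - fc) * Ts)

d = np.array([np.sum(c * z**i) for i in range(S)])
D = np.array([[d[i + j] for j in range(P)] for i in range(S - P + 1)])

s = np.linalg.svd(D, compute_uv=False)
print(s / s[0])   # typically one dominant singular value, the rest near zero
```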
3.5 Channel Estimation Algorithms

We use the structural results proven in Section 3.4 to estimate the MSML channel parameters. Given the low-rank structure, the best approximation of the data matrix from the available measurements is obtained via the following optimization problem:
$$ \underset{\mathbf{d}}{\text{minimize}}\; \|\mathbf{r} - \mathbf{d}\|_2^2 \quad \text{s.t.} \quad \mathrm{rank}(\mathbf{D}) = 1, \qquad (3.19) $$
where $\mathbf{D} = \mathcal{H}(\mathbf{d})$ denotes the Hankel matrix corresponding to the data vector $\mathbf{d} = [d[0], \dots, d[S-1]]^T$. In the following, we consider two approaches to solve the optimization problem defined in (3.19). In the first approach, we use a non-convex regularizer to enforce the rank constraint in the optimization problem, and in the second approach, we relax the rank constraint to a convex rank regularizer.

3.5.1 Non-convex Rank Regularizer

Recall that the Hankel matrix $\mathbf{R} = \mathcal{H}(\mathbf{r})$ has constant values on its anti-diagonals, i.e., $R[i,j] = r[i+j]$ for $0 \le i \le S-P$ and $0 \le j \le P-1$. The problem defined in (3.19) can then be rewritten as
$$ \mathbf{d} = \underset{\mathbf{d}}{\text{argmin}}\; \|\boldsymbol{\xi}\|_2^2 + f_R(\mathbf{D}) \quad \text{s.t.} \quad \mathbf{r} = \mathbf{d} + \boldsymbol{\xi}, \qquad (P_1) $$
where $\boldsymbol{\xi} = \mathbf{r} - \mathbf{d}$ is the residual vector and $f_R(\mathbf{D})$ is an indicator function for the rank of the matrix:
$$ f_R(\mathbf{D}) = \begin{cases} 0 & \text{if } \mathrm{rank}(\mathbf{D}) = 1, \\ \infty & \text{otherwise.} \end{cases} $$
We set the rank of the data matrix $\mathbf{D}$ to 1, as suggested by Theorem 2. The objective function in $(P_1)$ consists of two terms with a linear constraint: the first term depends only on the variable $\boldsymbol{\xi}$, and the second term depends on the variable $\mathbf{d}$. The problem is well suited to the alternating direction method of multipliers (ADMM) [36]. Based on the method of multipliers, we form the augmented Lagrangian
$$ L_\rho(\mathbf{d}, \boldsymbol{\xi}, \boldsymbol{\Theta}) = f_R(\mathbf{D}) + \|\boldsymbol{\xi}\|_2^2 + \langle \boldsymbol{\Theta}, \mathbf{D} + \boldsymbol{\Xi} - \mathbf{R} \rangle + \frac{\rho}{2}\|\mathbf{D} + \boldsymbol{\Xi} - \mathbf{R}\|_F^2, \qquad (3.20) $$
where $\boldsymbol{\Theta}$ is the Lagrange multiplier matrix, $\boldsymbol{\Xi} = \mathcal{H}(\boldsymbol{\xi})$, $\rho > 0$ is a constant, and the inner product is defined as $\langle \mathbf{A}, \mathbf{B} \rangle := \mathrm{Re}\{\mathrm{tr}\{\mathbf{A}^H\mathbf{B}\}\}$. For the optimization problem $(P_1)$, ADMM consists of the iterations
$$ \mathbf{d}^{k+1} = \underset{\mathbf{d}}{\text{argmin}}\; L_\rho(\mathbf{d}, \boldsymbol{\xi}^k, \boldsymbol{\Theta}^k), \qquad (3.21a) $$
$$ \boldsymbol{\xi}^{k+1} = \underset{\boldsymbol{\xi}}{\text{argmin}}\; L_\rho(\mathbf{d}^{k+1}, \boldsymbol{\xi}, \boldsymbol{\Theta}^k), \qquad (3.21b) $$
$$ \boldsymbol{\Theta}^{k+1} = \boldsymbol{\Theta}^k + \rho\big[\mathcal{H}(\boldsymbol{\xi}^{k+1} + \mathbf{d}^{k+1}) - \mathbf{R}\big]. \qquad (3.21c) $$
The algorithm is very similar to dual ascent and the method of multipliers: it consists of a $\mathbf{d}$-minimization step (3.21a), a $\boldsymbol{\xi}$-minimization step (3.21b), and a dual variable update (3.21c). The first minimization step (3.21a) is non-convex due to the projection onto a non-convex set, but fortunately it can be solved analytically in closed form. Keeping only the terms that depend on $\mathbf{d}$, we can rewrite (3.21a) as
$$ \mathbf{d}^{k+1} = \underset{\mathbf{d}}{\text{argmin}}\; f_R(\mathbf{D}) + J(\mathbf{D}), \qquad (3.22) $$
where $J(\mathbf{D}) = \langle \boldsymbol{\Theta}^k, \mathbf{D} \rangle + \frac{\rho}{2}\|\mathbf{D}\|_F^2 + \rho\langle \mathbf{D}, \boldsymbol{\Xi}^k - \mathbf{R} \rangle$ and $\mathbf{D} = \mathcal{H}(\mathbf{d})$. To solve this problem, we must find a $\mathbf{D}$ such that the indicator function $f_R(\mathbf{D})$ is zero, i.e., $\mathrm{rank}(\mathbf{D}) = 1$, while $\mathbf{D}$ minimizes $J(\mathbf{D})$. The minimizer of $J(\mathbf{D})$ can be found by simple differentiation with respect to $\mathbf{D}$; it is straightforward to see that the matrix $\mathbf{D}_J = \mathbf{R} - \boldsymbol{\Xi}^k - \rho^{-1}\boldsymbol{\Theta}^k$ minimizes $J(\mathbf{D})$. By the Eckart-Young theorem [9], if the singular value decomposition (SVD) of $\mathbf{D}_J$ is $\mathbf{D}_J = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$, then the best rank-one approximation is
$$ \mathbf{D}^{k+1} = \mathbf{U}\boldsymbol{\Sigma}_1\mathbf{V}^H, \qquad (3.23) $$
where $\boldsymbol{\Sigma}_1$ equals $\boldsymbol{\Sigma}$ except that it retains only the largest singular value (the other singular values are replaced by zero).

The second minimization step (3.21b) can be written as
$$ \boldsymbol{\xi}^{k+1} = \underset{\boldsymbol{\xi}}{\text{argmin}}\; \|\boldsymbol{\xi}\|_2^2 + \langle \boldsymbol{\Theta}^k, \boldsymbol{\Xi} \rangle + \|\boldsymbol{\Xi}\|_F^2 + \rho\langle \boldsymbol{\Xi}, \mathbf{D}^{k+1} - \mathbf{R} \rangle. \qquad (3.24) $$
Then, taking the derivative with respect to $\boldsymbol{\xi}$, we have
$$ \xi^{k+1}[i] = \frac{\rho\, t[i]\, r[i] - \sum_{m+n=i}\big\{\Theta[m,n] + \rho D^{(k+1)}[m,n]\big\}}{1 + \rho\, t[i]}, \qquad (3.25) $$
where $t[i] = i+1$ for $0 \le i \le P-1$ and $t[i] = S-i$ for $P \le i \le S-1$.

Note that ADMM is a dual method [36] (the convergence is in the dual space), and the dual objective is always concave in the dual variable, even if the primal objective is non-convex. Thus ADMM always finds the optimal solution of the dual and converges even for non-convex primal objectives. To guarantee convergence to the globally optimal solution in the primal, however, the overall objective function should be convex. But the rank indicator function is a non-convex term in the objective, so the solution found by ADMM is potentially a local minimum, although it appears to work well in practice, as we show in our numerical results. Table 3.1 summarizes the estimation of the low-rank approximation of $\mathbf{R}$ using the proposed non-convex regularizer.
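A minimal sketch of the resulting ADMM loop (summarized in Table 3.1 below) follows. The anti-diagonal bookkeeping generalizes the weights t[i] in (3.25), and the final averaging of D's anti-diagonals back to a vector is our own convenience, not a step prescribed by the text.

```python
import numpy as np

def hankel(x, rows, cols):
    return np.array([[x[i + j] for j in range(cols)] for i in range(rows)])

def antidiag_sums(M, length):
    """s[i] = sum of M[m, n] over m + n = i (the adjoint of the Hankel map)."""
    s = np.zeros(length, dtype=complex)
    rows, cols = M.shape
    for m in range(rows):
        s[m : m + cols] += M[m, :]
    return s

def denoise_nonconvex(r, P, rho=1.0, iters=5):
    """ADMM with the rank-one indicator regularizer (sketch of Table 3.1)."""
    S = len(r)
    rows = S - P + 1
    R = hankel(r, rows, P)
    t = antidiag_sums(np.ones((rows, P)), S).real      # anti-diagonal counts t[i]
    Xi = np.zeros_like(R)
    Theta = np.zeros_like(R)
    for _ in range(iters):
        # d-update: best rank-one approximation (Eckart-Young) of R - Xi - Theta/rho.
        U, sing, Vh = np.linalg.svd(R - Xi - Theta / rho)
        D = sing[0] * np.outer(U[:, 0], Vh[0, :])
        # xi-update: closed form of Eq. (3.25), via anti-diagonal sums.
        xi = (rho * t * r - antidiag_sums(Theta + rho * D, S)) / (1.0 + rho * t)
        Xi = hankel(xi, rows, P)
        # Dual update.
        Theta = Theta + rho * (D + Xi - R)
    return antidiag_sums(D, S) / t                     # average D back to a vector

# Usage: d_hat = denoise_nonconvex(r_noisy, P=5)
```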
Table 3.1: The non-convex approach to remove the noise from the received data using the low-rank structure of the data matrix.

Non-convex Received Signal Denoising
Input: r, ρ, I_t.
Initialize: R = H(r), Ξ = 0, Θ = 0.
For k = 1 : I_t
  [U, Σ, V] = SVD(R − Ξ^k − ρ⁻¹Θ^k)
  Update d: D^{k+1} = UΣ₁V^H
  Update ξ: evaluate Equation (3.25)
  Dual update: Θ^{k+1} = Θ^k + ρ[H(ξ^{k+1}) + D^{k+1} − R]
End

Furthermore, in the sequel, we relax the non-convex term by its convex surrogate to guarantee the convergence of the ADMM algorithm. (Details on the convergence rate of ADMM for a convex objective with a linear coupling constraint can be found in [36, 49].)

3.5.2 Convex Rank Regularizer

We relax the rank function to the nuclear norm of $\mathbf{D} = \mathcal{H}(\mathbf{d})$. Thus the problem stated in Equation (3.19) is modified to
$$ \underset{\mathbf{d}}{\text{minimize}}\; \|\mathbf{r} - \mathbf{d}\|_2^2 + \lambda\|\mathbf{D}\|_*, \qquad (3.26) $$
where the nuclear norm $\|\mathbf{D}\|_*$ of a matrix is the sum of its singular values $\sigma_i(\mathbf{D})$ [9], i.e., $\|\mathbf{D}\|_* = \sum_i \sigma_i(\mathbf{D})$. Since $\sigma_i(\mathbf{D}) \ge 0$, $\|\mathbf{D}\|_*$ is equivalent to the $l_1$ norm of the singular values of the matrix, which induces the rank constraint. The Lagrangian parameter $\lambda$ controls the trade-off between the two terms.

Theorem 3. Consider $\mathbf{r} = \mathbf{d} + \mathbf{n}$, where $\mathbf{n}$ is the noise vector, and define $\mathbf{N} = \mathcal{H}(\mathbf{n})$. If we choose $\lambda \ge 2\|\mathbf{N}\|_2$, then
$$ \|\mathbf{d} - \hat{\mathbf{d}}\|_2 \le \sqrt{32}\,\lambda, $$
where $\hat{\mathbf{d}}$ denotes the solution of the optimization problem in (3.26).

Remark 5. Note that for Gaussian noise $\mathbf{N} = \mathcal{H}(\mathbf{n})$, where $\mathbf{n} \sim \mathcal{N}(0, \sigma^2\mathbf{I})$, it is shown that $\|\mathbf{N}\|_2 = c\sigma$ [82], where $c$ is a constant. Therefore, given the noise variance, we can select $\lambda$ such that $\lambda \ge 2c\sigma$. For example, in our numerical results, we have considered $c \in (1, 2)$.

The optimization problem in Equation (3.26) can be restated as
$$ \mathbf{d} = \underset{\mathbf{d}}{\text{argmin}}\; \|\boldsymbol{\xi}\|_2^2 + \lambda\|\mathcal{H}(\mathbf{d})\|_* \quad \text{s.t.} \quad \mathbf{r} = \mathbf{d} + \boldsymbol{\xi}. \qquad (P_2) $$
Since the overall objective is a convex function of $\mathbf{d}$, the ADMM algorithm is guaranteed to converge to the optimal solution of $(P_2)$. The only algorithmic change is in the iterative computation of $\mathbf{d}^{k+1}$ in (3.21a); the two other steps, updating $\boldsymbol{\xi}$ and the dual variable $\boldsymbol{\Theta}$, remain as in (3.21b) and (3.21c), respectively. Thus, we have
$$ \mathbf{d}^{k+1} = \underset{\mathbf{d}}{\text{argmin}}\; L_\rho(\mathbf{d}, \boldsymbol{\xi}^k, \boldsymbol{\Theta}^k) = \underset{\mathbf{d}}{\text{argmin}}\; \lambda\|\mathbf{D}\|_* + \langle \boldsymbol{\Theta}^k, \mathbf{D} \rangle + \frac{\rho}{2}\|\mathbf{D}\|_F^2 + \rho\langle \mathbf{D}, \boldsymbol{\Xi}^k - \mathbf{R} \rangle. \qquad (3.27) $$
The objective function in (3.27) is convex but non-differentiable due to the nuclear norm. There are numerous techniques for this kind of optimization problem; here, we use a simple subgradient-based method [41]. In a classical gradient method, we have
$$ \mathbf{D}^{k+1} = \mathbf{D}^k - \mu\mathbf{G}^k, \qquad (3.28) $$
where $\mu$ is the step size and $\mathbf{G}^k$ denotes a subgradient of $L_\rho(\mathbf{d}, \boldsymbol{\xi}^k, \boldsymbol{\Theta}^k)$ at $\mathbf{d}^k$, which can be written as
$$ \mathbf{G}^k = \lambda\,\partial\|\mathbf{D}^k\|_* + \boldsymbol{\Theta}^k + \rho\big(\mathbf{D}^k + \boldsymbol{\Xi}^k - \mathbf{R}\big). $$
To compute $\mathbf{G}^k$, we need to compute the subgradient of the nuclear norm, $\partial\|\cdot\|_*$. Let $\mathbf{D}^k = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$ be the singular value decomposition of the matrix $\mathbf{D}^k$.
The subgradient of the nuclear norm at $\mathbf{D}^k$ is then given by (see [41])
$$ \partial\|\mathbf{D}^k\|_* = \mathbf{U}\mathbf{V}^H + \boldsymbol{\Omega}, $$
where $\boldsymbol{\Omega}$ and $\mathbf{D}^k$ have orthogonal row/column spaces and $\|\boldsymbol{\Omega}\| \le 1$. A possible alternative for computing the subgradient of the nuclear norm at $\mathbf{D}^k$ is to use the methods developed in [41]. Specifically, for a given matrix $\mathbf{D}^k$, the iteration
$$ \mathbf{D}^{k+1} \leftarrow \mathbf{D}^k\big(\mathbf{D}^{kH}\mathbf{D}^k + 3\mathbf{I}\big)\big(3\mathbf{D}^{kH}\mathbf{D}^k + \mathbf{I}\big)^{-1} \qquad (3.29) $$
converges globally and quadratically to the subgradient of the nuclear norm [41]. This iterative method can be faster than a direct SVD computation. The $\mathbf{d}$-update step of the ADMM iterations for our convex rank regularizer is given in Table 3.2; the other steps are the same as in the non-convex approach of Table 3.1.

Table 3.2: d-update step for rank regularization using the nuclear norm of the data matrix.

Update d: Gradient method
Input: D^k, μ, I_t, λ.
Initialize: A^0 = D^k.
For n = 0 : I_t − 1
  A^{n+1} ← A^n(A^{nH}A^n + 3I)(3A^{nH}A^n + I)⁻¹
End
G^k = λA^{I_t} + Θ^k + ρ(D^k + Ξ^k − R)
Update D^{k+1} = D^k − μG^k.
Return D^{k+1}.

3.5.3 Channel Parameter Extraction

The overall MSML channel estimation algorithms are summarized in Table 3.3. After evaluating the parameters $c_p$ and $z_p$ with the proposed algorithms, the channel parameters are computed using the following set of equations for $p \in \{1, 2, \dots, P\}$:
$$ a_p = \frac{\angle z_p/(2\pi T_s) + f_c}{f_c + f_{k_0}} - 1, \qquad h_p = \frac{c_p}{s_{k_0}}, \qquad \tau_p = -\frac{\angle(c_p/s_{k_0})}{2\pi(f_c + f_{k_0})}. $$

Table 3.3: Proposed MSML channel estimation methods.

MSML Channel Estimation Algorithms
Initialization: R = H(r).
Step A: remove the noise from R using either (a) the non-convex approach, Table 3.1, or (b) the convex approach, Tables 3.1 and 3.2. Output: R̂.
Step B:
  Find q: solve R̂q = −r̂_P.
  Find z_p: compute the roots of Q(z), i.e., solve z^{−P} + Σ_{p=0}^{P−1} q[p]z^{−p} = 0.
  Find c_p: solve the linear system of equations given in (3.16).
Output: compute h_p, τ_p, a_p (Section 3.5.3).
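The extraction step above is a direct evaluation; a small sketch follows, assuming real positive gains (so the phase of c_p/s_{k0} is attributable entirely to the delay) and ignoring possible 2π phase wrapping. Both assumptions are ours.

```python
import numpy as np

def extract_channel_params(c, z, s_k0, fc, f_k0, Ts):
    """Map the recovered (c_p, z_p) back to (h_p, tau_p, a_p), Section 3.5.3.

    Phases are taken in (-pi, pi]; delays are assumed small enough that no
    2*pi wrapping occurs (an assumption of this sketch).
    """
    a = (np.angle(z) / (2 * np.pi * Ts) + fc) / (fc + f_k0) - 1.0
    ratio = c / s_k0
    h = np.abs(ratio)                               # gain magnitude
    tau = -np.angle(ratio) / (2 * np.pi * (fc + f_k0))
    return h, tau, a
```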
3.6 Short Review of Prior Methods

In this section, we briefly review the sparse approximation (SA) method proposed in [7] and the semi-definite program (SDP) method proposed in [94, 93]. We numerically compare the performance of our new methods with these two approaches in Section 3.7.

3.6.1 Sparse Approximation Method

In [7], a sparse approximation method is proposed to estimate UWA MSML communication channels, as follows. Sampling the delay-Doppler plane on a grid, a linear and sparse representation of the channel matrix can be formulated. More precisely, the delay dimension is discretized using $N_\tau$ values within $[0, \tau_{\max}]$ with step size $\Delta\tau = \tau_{\max}/N_\tau$. The Doppler scale dimension is similarly sampled using $N_a$ values within the interval $[-a_{\max}, a_{\max}]$ with step size $\Delta a = 2a_{\max}/N_a$. Then, the channel model can be expressed as
$$ h(\tau, t) = \sum_{m=1}^{N_a}\sum_{n=1}^{N_\tau} h_{m,n}\,\delta(\tau - [\tau_n - a_m t]), \qquad (3.30) $$
where $\tau_n = n\Delta\tau$ and $a_m = m\Delta a - a_{\max}$. Substituting the channel model (3.30) into Eq. (3.5), we have
$$ r(t) = \sum_{m=1}^{N_a}\sum_{n=1}^{N_\tau} \phi_{m,n}(t)\,h_{m,n} + w(t), \qquad (3.31) $$
where
$$ \phi_{m,n}(t) = \sum_{k=-K/2}^{K/2-1} s_k\, e^{-j2\pi(f_c+f_k)\tau_n}\, e^{j2\pi[(1+a_m)f_k + a_m f_c]t}. $$
Note that in the above equation we have assumed that all the subcarriers in the training symbol are active for the SA method. The received signal is then a combination of $N_\tau N_a$ delayed and Doppler-scaled copies of the transmitted signal with weights $h_{m,n}$. Let the vector $\mathbf{h}$ contain the channel gains of the possible paths in the discretized delay-Doppler plane, many entries of which will be zero or close to zero. Therefore, after sampling the continuous signal $r(t)$ with a time resolution equal to a multiple $\zeta$ of the baseband sampling time $T/K$, i.e., at time instants $t = i\,T/(K\zeta)$ for $0 \le i \le K\zeta - 1$, one can write
$$ \mathbf{r} = \boldsymbol{\Phi}\mathbf{h} + \mathbf{w}, \qquad (3.32) $$
where $\boldsymbol{\Phi} \in \mathbb{C}^{K\zeta\times(N_a N_\tau)}$ with $\Phi[i,j] = \phi_{m,n}(i\,T/(K\zeta))$, and $h[j] = h_{m,n}$ with $j = N_\tau(m-1) + n$ for $1 \le m \le N_a$ and $1 \le n \le N_\tau$. Since the vector $\mathbf{h}$ is sparse, the basis pursuit algorithm is applied to estimate the channel parameters:
$$ \hat{\mathbf{h}} = \underset{\mathbf{h}}{\text{argmin}}\; \|\mathbf{r} - \boldsymbol{\Phi}\mathbf{h}\|_2^2 + \lambda_e\|\mathbf{h}\|_1, \qquad (3.33) $$
where $\lambda_e$ is the penalty parameter controlling the sparsity level. Note that the SA method proposed in [7] is sensitive to basis mismatch [29], namely when the actual channel delay and Doppler values do not fall on the above grid. In that case there is an unresolvable error due to the mismatch between the channel parameters and the values on the grid. In our numerical results, in the next section, we show that there can be a large performance degradation due to basis mismatch with the approach of [7].
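For reference, (3.33) can be solved with any l1 solver; the sketch below uses a generic proximal-gradient (ISTA) loop with complex soft-thresholding, which is our stand-in rather than the specific solver of [7]. Building Φ from the delay-Doppler grid is omitted.

```python
import numpy as np

def ista_bp(Phi, r, lam, iters=300):
    """Solve min ||r - Phi h||_2^2 + lam ||h||_1 by proximal gradient (ISTA).

    Complex soft-thresholding shrinks magnitudes and preserves phases.
    """
    step = 1.0 / (2 * np.linalg.norm(Phi, 2) ** 2)   # 1 / Lipschitz constant
    h = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(iters):
        g = h - step * 2 * (Phi.conj().T @ (Phi @ h - r))   # gradient step
        mag = np.abs(g)
        shrink = np.maximum(mag - step * lam, 0.0)          # soft threshold
        h = np.where(mag > 0, g * shrink / np.maximum(mag, 1e-12), 0.0)
    return h
```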
3.6.2 Semi-Definite Program Method

Let us define the set of atoms $\alpha_i(\theta) = e^{j2\pi\theta i}$. Then (3.9) can be written as
$$ \mathbf{d} = \sum_{p=1}^{P} |c_p|\,e^{j\theta_{c_p}}\,\boldsymbol{\alpha}(\theta_p), $$
where $\mathbf{d} = [d[0], d[1], \dots, d[S-1]]^T$, $\boldsymbol{\alpha}(\theta_p) = [\alpha_0(\theta_p), \dots, \alpha_{S-1}(\theta_p)]^T$, and $\theta_p = (1+a_p)(f_c+f_{k_0})T_s$ for $1 \le p \le P$. The target signal $\mathbf{d}$ may be viewed as a sparse non-negative combination of elements from the atomic set $\mathcal{A} = \{\boldsymbol{\alpha}(\theta)e^{j\theta_c} : \theta \in [0,1], \theta_c \in [0, 2\pi]\}$. The atomic norm is defined as $\|\mathbf{d}\|_{\mathcal{A}} = \inf\{\epsilon > 0 : \mathbf{d} \in \epsilon\,\mathrm{conv}(\mathcal{A})\}$. In [94], it is shown that
$$ \|\mathbf{d}\|_{\mathcal{A}} = \inf\left\{ \frac{1}{S}\,\mathrm{trace}(\mathrm{Toep}(\mathbf{u})) + \kappa \;:\; \begin{bmatrix} \mathrm{Toep}(\mathbf{u}) & \mathbf{d} \\ \mathbf{d}^H & \kappa \end{bmatrix} \succeq 0 \right\}, $$
and the target signal can be obtained by solving a semi-definite program (SDP) [93]:
$$ \underset{\mathbf{d},\mathbf{u},\kappa}{\text{minimize}}\; \frac{1}{S}\,\mathrm{trace}(\mathrm{Toep}(\mathbf{u})) + \kappa \quad \text{subject to} \quad \begin{bmatrix} \mathrm{Toep}(\mathbf{u}) & \mathbf{d} \\ \mathbf{d}^H & \kappa \end{bmatrix} \succeq 0, \quad \|\mathbf{r} - \mathbf{d}\|_2 \le \sigma, $$
where $\mathbf{r} = \mathbf{d} + \mathbf{w}$ and $\mathbf{w}$ is the noise vector. This method can reconstruct the target signal from the measurements provided that the frequencies $\theta_p$ are sufficiently far apart from one another. It is shown in [27, 94, 93] that the algorithm requires the following frequency separation criterion to work well:
$$ \min_{\forall p_i \ne p_j} |\theta_{p_i} - \theta_{p_j}| \ge \frac{2}{S}. \qquad (3.34) $$
In our UWA scenario, we have $|\theta_{p_i} - \theta_{p_j}| = |a_{p_i} - a_{p_j}|(f_c + f_{k_0})T_s$. Therefore, satisfying the above condition requires $S \approx 2\times 10^5$, which is a very large number compared with baseband sampling, i.e., $K = 1024$ samples or less. In the simulation results in the sequel, we show that, due to the violation of the frequency resolution constraint, the SDP algorithm suffers significant performance degradation. Thus, the SDP method is not well matched to UWA MSML channel estimation.

3.7 Simulation and Discussion

Herein, we perform numerical simulations to evaluate the performance of our proposed algorithms. In Section 3.7.1, we assess the quality of the approximation used in Theorem 2. We compare the performance of our proposed MSML channel estimation methods with existing algorithms in Section 3.7.2. Furthermore, in Section 3.7.3, we provide a complexity analysis of our proposed algorithms and of the compressed-sensing method based on the basis pursuit algorithm [7].

[Figure 3.1: The rank-one approximation's mean squared error (MSE) versus the number of dominant paths P in the channel, for Doppler scale supports a_max = 0.1, 0.01, 0.001, and 0.0001.]

3.7.1 Approximation Assessment

First, to assess the quality of our approximation of the received data matrix by a rank-one Hankel matrix in Theorem 2, we simulate MSML channels for different values of the maximum Doppler scale, from $a_{\max} = 0.1$ to $a_{\max} = 0.0001$. For each value of $a_{\max}$, we randomly select the Doppler scale of each path from the interval $[-a_{\max}, a_{\max}]$. As shown in Theorem 2, the upper bound on the approximation error is governed by the Doppler scale values and the number of dominant paths in the channel structure. Thus, in this first simulation, we consider the effect of these parameters on the error of approximating the noiseless data matrix by a rank-one Hankel matrix. We let the number of dominant paths $P$ vary from $P = 5$ to $P = 30$; for each path, the channel gain $h_p$ is drawn from a Rician distribution with parameters $(0, 1)$, and the delay $\tau_p$ is uniformly chosen from the interval $[0, \tau_{\max}]$ with $\tau_{\max} = 15$ msec [7, 79, 103].

Fig. 3.1 illustrates the mean squared error (MSE) between the data matrix $\mathbf{D}$ and its low-rank approximation. The MSE is normalized by the norm of the data matrix, i.e., Normalized MSE $= E\{\|\mathbf{D} - \hat{\mathbf{D}}\|_F/\|\mathbf{D}\|_F\}$, where $\hat{\mathbf{D}}$ is the rank-one approximation of $\mathbf{D}$. Fig. 3.1 shows that for $a \le 0.001$, the rank-one approximation has an average error of roughly $0.01$ for $10 \le P \le 30$, and for $P \le 10$ the approximation error decreases abruptly. For example, for $a_{\max} \le 0.001$ and $P = 5$, the average normalized MSE is about $10^{-4}$. For a large Doppler scale value such as $a = 0.01$, increasing the number of dominant paths in the channel increases the normalized MSE up to $0.1$. Lemma 4 shows that the data matrix is full rank, but only one of the singular values is much larger than the other $P-1$ remaining singular values; thus, increasing the number of dominant paths in the channel increases the approximation error, owing to the neglect of the small, but non-trivial, singular values in the rank-one approximation. Note that, in practice, for underwater acoustic channels, due to the limited speed of underwater vehicles and the low speed of acoustic waves in water, $a_{\max}$ is often less than $0.001$.

3.7.2 Performance Comparison

In this part, we compare the overall channel estimation performance of our proposed methods (convex and non-convex) with the SA method in [7] and the SDP method in [27] for spectral estimation. For this simulation, we consider OFDM signaling with $K = 1024$ subcarriers, $f_c = 11$ kHz, $\Delta f = 11.72$ Hz, and oversampling factor $\zeta = 1$. We assume that the number of dominant paths is $P = 7$. The channel parameters are generated as follows: for each path $p$ ($1 \le p \le P$), the channel gain $h_p$ is drawn from a Rician distribution with parameters $(0, 1)$; the Doppler scale $a_p$ is uniformly selected from the discretized interval $[-a_{\max}, +a_{\max}]$ with $a_{\max} = 0.001$ and step size $\Delta a = 10^{-5}$ (i.e., the support is divided into 100 equally spaced discrete points); and the delay $\tau_p$ is uniformly chosen from the discretized interval $[0, \tau_{\max}]$ with $\tau_{\max} = 10$ msec and resolution $\Delta\tau = 0.1$ msec [7, 79, 103]. The number of ADMM iterations is $I_t = 5$ for both the convex and non-convex algorithms. To evaluate the performance of the SA method, we consider all the subcarriers in the training symbol to be active.

[Figure 3.2: Normalized MSE versus signal-to-noise ratio (SNR); performance comparison of the proposed convex and non-convex approaches (Prony + Convex, Prony + Non-convex) with the SA method (no mismatch, and with mismatch ν = 0.25 and ν = 0.5) and the SDP method.]

In Fig. 3.2, the normalized MSE associated with each method is illustrated. We consider the channel vector $\boldsymbol{\beta} \in \mathbb{C}^P$, where $\beta[p] = c_p z_p$ for $1 \le p \le P$, and the MSE is computed as MSE $= E\{\|\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}\|_2/\|\boldsymbol{\beta}\|_2\}$.
As observed in Fig. 3.2, our proposed non-convex algorithm achieves the same MSE with a 5 to 7 dB improvement at low SNR (SNR less than 20 dB) compared to the BP algorithm. On the other hand, our proposed convex method achieves about a 5 dB SNR improvement at SNR greater than 30 dB (high SNR) compared to the SA method. The non-convex approach achieves a 3 to 5 dB SNR improvement at low SNR over the convex approach. The reason is that the non-convex rank-constrained algorithm forces the estimated data matrix to fall into a lower-dimensional (rank-one) subspace, whereas the convex approach relaxes this assumption via the nuclear norm; thus, the non-convex approach mitigates the noise effect more effectively. In contrast, at high SNR, the non-convex approach removes some parts of the signal; there, the convex approach captures the data matrix structure better than the non-convex approach and performs better.

The SA method proposed in [7] uses a basis pursuit algorithm to estimate the channel parameters. This algorithm is sensitive to basis mismatch [29]. To evaluate the performance of the SA method, we consider two different cases of basis selection. In the first case, labeled "SA no mismatch" in Fig. 3.2, the mismatch issue is avoided by choosing the Doppler scales and delays from the true grid. Note that in practice there usually exists a mismatch error between the basis (discretization) and the actual delay and Doppler scale values, which substantially degrades the performance of the sparse approximation method. For the second case, i.e., "SA + mismatch", to build the basis matrix $\boldsymbol{\Phi}$ we consider a grid obtained by shifting the (above) true channel grid by $\nu\Delta a$ in the scale direction and $\nu\Delta\tau$ in the delay direction, where $\nu \in (0, 1/2)$. In Fig. 3.2 we consider the two cases $\nu = 0.25$ and $\nu = 0.5$. As $\nu$ increases, the basis mismatch becomes more severe, so that the true delays and Doppler scales of the channel fall between the basis points. In Fig. 3.2, we observe that SA + mismatch suffers from large errors in the presence of basis mismatch; e.g., at a normalized MSE of $-20$ dB there is a 10 dB degradation in the SA algorithm's performance due to the basis mismatch. Note that in real applications, the basis mismatch can differ for different bases.

As mentioned in Eq. (3.34), the SDP algorithm suffers from a resolution constraint, which depends on the minimum frequency separation (normalized by the bandwidth) in the target signal. To satisfy this condition for our UWA communication channel estimation problem, we would need at least $S \approx 2/(\Delta a\, f_c\, T_s) \approx 2\times 10^5$ samples, which is very large compared to the baseband sampling rate in our problem, i.e., $K = 1024$ samples. The results in Fig. 3.2 show that the SDP method suffers significant degradation due to the violation of this constraint.

As seen in Section 3.7.1, the rank-one approximation's error grows with an increasing number of dominant paths. Thus, we next present the effect of the number of dominant paths in the channel on the performance of our proposed methods and the SA algorithm. We consider the same baseband system parameters as for Fig. 3.2 and set SNR = 20 dB. The number of dominant paths $P$ varies from 1 to 30 and $a_{\max} = 0.001$. The results in Fig. 3.3 indicate that, as the number of dominant paths increases, the performance of all algorithms degrades.
Fig. 3.3 shows that for $P \le 10$, the non-convex approach has the best performance; as the number of dominant paths increases, the performance of the convex approach improves relative to the non-convex approach.

[Figure 3.3: Normalized MSE versus the number of dominant paths in an MSML channel; comparison of the proposed convex and non-convex approaches with the SA method proposed in [7].]

3.7.3 Complexity Analysis

In this part, we compare the complexity of our proposed methods with that of the compressed sensing method proposed in [7]. To compute the computational complexity of our proposed algorithms, we consider the cost of the different operations in each algorithm. Both the convex and non-convex algorithms have two stages: in the first stage, the data matrix is de-noised, and in the second stage the noiseless channel estimation algorithm of Section 3.3 is applied to evaluate the channel parameters. Note that the second stage is common to both algorithms. Each iteration of the first stage of our proposed non-convex algorithm involves computing a singular value decomposition (SVD) of a $P\times P$ matrix, multiplying two $P\times 1$ vectors to update $\mathbf{d}$, scalar operations to update $\boldsymbol{\xi}$, and adding two $P\times P$ matrices. The complexities of these operations are on the order of $P^3$, $P^2$, $P$, and $P^2$, respectively [70, 9]. Therefore, the overall complexity of each iteration of the first stage of the non-convex algorithm is $O(P^3 + 2P^2 + P)$. The only difference between the convex and non-convex algorithms is the process of updating the vector $\mathbf{d}$, given in Table 3.2. Since the complexity of computing the inverse of a $P\times P$ matrix is on the order of $P^3$ [9], one can easily show that the overall complexity of each iteration of the algorithm in Table 3.2 is $O(5P^3 + 2P^2)$. The second stage of our algorithms, namely the noiseless channel estimation algorithm, involves computing the roots of a polynomial of degree $P$, a pseudo-inverse of a $P\times P$ matrix, and a matrix-vector multiplication to compute $\mathbf{c}$ in (3.16). The complexity of finding the roots of a complex polynomial of degree $P$ is on the order of $O(P\log^2 P)$ [2]. Therefore, the overall complexity of the second stage is $O(P\log^2 P + 2P^3)$. The SA method can also be implemented efficiently using the ADMM algorithm [19], with an overall computational complexity of $O((N_r N_\tau N_a)^3 + 3N_r N_a N_\tau)$, where $N_\tau, N_a \ge P$. We see that both proposed non-convex and convex methods have lower computational complexity per iteration than the SA method.

3.8 Conclusions

In this chapter, we have investigated the estimation of multi-scale, multi-lag channels for underwater acoustic communication and other ultra-wideband channels. We adapted spectral estimation to estimate the MSML channel, based on a path-based model. We showed that the received data matrix can be approximated by a low-rank Hankel matrix with rank equal to the number of active subcarriers in the OFDM signal. Taking advantage of the low-rank structure of the data matrix, our proposed method is robust to noise and also requires a significantly low number of measurements (two times the sparsity level of the channel) to effectively estimate the channel parameters. We proposed two iterative algorithms based on ADMM, using both non-convex and convex regularizers, to enforce the low-rank structure of the data matrix. We provide a bound on the regularization/relaxation parameter for the convex approach.
Finally, in the simulation results, we showed that the proposed non-convex algorithm provides an average improvement of 7 dB (in the SNR sense) at low SNR compared with the compressed sensing algorithm, and that our convex algorithm achieves the same MSE with almost a 5 dB SNR improvement at high SNR.

3.9 Appendix A: Proof of Lemma 4

The proof falls into two parts. In the first part, we show the validity of the Vandermonde decomposition given in (3.17), and in the second part we prove the full-rank property of the data matrix $\mathbf{D}$.

Part one: substitute $d[i] = \sum_{p=1}^{P} c_p z_p^i$ for all $i \in \{0, \dots, S-1\}$ into the matrix $\mathbf{D}$ defined in Equation (3.15). It is then a matter of matrix multiplication to validate the equality in Equation (3.17).

Part two: for $S = 2P$, both $\boldsymbol{\Pi}$ and $\boldsymbol{\Lambda}$ are square matrices. Assuming that the channel gains are nonzero, i.e., $c_p \ne 0$ for all $p$, $\boldsymbol{\Lambda}$ is invertible; in addition, since $\boldsymbol{\Pi}$ is a Vandermonde matrix and the Doppler scales are distinct, this matrix is also invertible [9]. Therefore $\mathbf{D}$ is invertible and full rank.

3.10 Appendix B: Proof of Theorem 2

Using Lemma 4, we know that $\mathbf{D} = \boldsymbol{\Pi}\boldsymbol{\Lambda}\boldsymbol{\Pi}^T$, where $\boldsymbol{\Lambda}$ is diagonal and invertible. Thus the rank and singular values of $\mathbf{D}$ are governed by the structure of $\boldsymbol{\Pi}$. If we define $\boldsymbol{\Pi} = [\boldsymbol{\pi}_1, \dots, \boldsymbol{\pi}_P]$, where $\boldsymbol{\pi}_p = [z_p^0, z_p^1, \dots, z_p^{P-1}]^T$ for $1 \le p \le P$, then the data matrix $\mathbf{D}$ can be represented as
$$ \mathbf{D} = \sum_{p=1}^{P} c_p\,\boldsymbol{\pi}_p\boldsymbol{\pi}_p^T. \qquad (3.35) $$
In the sequel, we show that $\mathbf{D}$ can be approximated by
$$ \hat{\mathbf{D}} = \lambda_c\,\boldsymbol{\pi}_m\boldsymbol{\pi}_m^T, \qquad (3.36) $$
where $\lambda_c = \sum_{p=1}^{P} c_p$ and $1 \le m \le P$ is a fixed index. It is clear from Lemma 5 that $\hat{\mathbf{D}}$ is a rank-one Hankel matrix. To bound the approximation error, we use the triangle inequality:
$$ \|\mathbf{D} - \hat{\mathbf{D}}\|_F = \Big\|\sum_{p=1}^{P} c_p\big(\boldsymbol{\pi}_p\boldsymbol{\pi}_p^T - \boldsymbol{\pi}_m\boldsymbol{\pi}_m^T\big)\Big\|_F \le \sum_{p=1}^{P} |c_p|\,\big\|\boldsymbol{\pi}_p\boldsymbol{\pi}_p^T - \boldsymbol{\pi}_m\boldsymbol{\pi}_m^T\big\|_F. \qquad (3.37) $$
By (3.10), we know that $|c_p| = |h_p|$. Let us define $\mathbf{B} := \boldsymbol{\pi}_p\boldsymbol{\pi}_p^T - \boldsymbol{\pi}_m\boldsymbol{\pi}_m^T$ and $J := \|\mathbf{B}\|_F = \sqrt{\mathrm{Tr}(\mathbf{B}^H\mathbf{B})}$. Therefore,
$$ J^2 = \mathrm{Tr}\big(\boldsymbol{\pi}_p^*\boldsymbol{\pi}_p^H\boldsymbol{\pi}_p\boldsymbol{\pi}_p^T\big) - \mathrm{Tr}\big(\boldsymbol{\pi}_p^*\boldsymbol{\pi}_p^H\boldsymbol{\pi}_m\boldsymbol{\pi}_m^T\big) - \mathrm{Tr}\big(\boldsymbol{\pi}_m^*\boldsymbol{\pi}_m^H\boldsymbol{\pi}_p\boldsymbol{\pi}_p^T\big) + \mathrm{Tr}\big(\boldsymbol{\pi}_m^*\boldsymbol{\pi}_m^H\boldsymbol{\pi}_m\boldsymbol{\pi}_m^T\big). $$
Since the trace operator is invariant to cyclic permutations of its arguments, we can rewrite $J^2$ as
$$ J^2 = \boldsymbol{\pi}_p^H\boldsymbol{\pi}_p\,\boldsymbol{\pi}_p^T\boldsymbol{\pi}_p^* - \boldsymbol{\pi}_p^H\boldsymbol{\pi}_m\,\boldsymbol{\pi}_m^T\boldsymbol{\pi}_p^* - \boldsymbol{\pi}_m^H\boldsymbol{\pi}_p\,\boldsymbol{\pi}_p^T\boldsymbol{\pi}_m^* + \boldsymbol{\pi}_m^H\boldsymbol{\pi}_m\,\boldsymbol{\pi}_m^T\boldsymbol{\pi}_m^*. \qquad (3.38) $$
The inner products of the columns of $\boldsymbol{\Pi}$ can be written as
$$ \boldsymbol{\pi}_i^H\boldsymbol{\pi}_j = \big(\boldsymbol{\pi}_i^T\boldsymbol{\pi}_j^*\big)^* = \sum_{p=1}^{P} (z_i^*)^p z_j^p = \sum_{p=1}^{P} e^{-j2\pi T_s(f_{k_0}+f_c)(a_i - a_j)p}. \qquad (3.39) $$
Furthermore, for $p = m$, we have $\boldsymbol{\pi}_p^H\boldsymbol{\pi}_p = \boldsymbol{\pi}_m^H\boldsymbol{\pi}_m = P$. Applying the identities in Equation (3.39) to simplify (3.38), we have
$$ J^2 = 2P^2\left(1 - \frac{|\boldsymbol{\pi}_p^H\boldsymbol{\pi}_m|^2}{P^2}\right). \qquad (3.40) $$
Therefore, using (3.37) and (3.40), we have
$$ \|\mathbf{D} - \hat{\mathbf{D}}\|_F \le \sqrt{2}\,P\sum_{p=1}^{P}|h_p|\sqrt{1 - \frac{|\boldsymbol{\pi}_p^H\boldsymbol{\pi}_m|^2}{P^2}}. \qquad (3.41) $$
In addition, using the geometric sum and Equation (3.39), we have
$$ |\boldsymbol{\pi}_p^H\boldsymbol{\pi}_m|^2 = \frac{1 - \cos\big(2\pi T_s(f_{k_0}+f_c)(a_p - a_m)P\big)}{1 - \cos\big(2\pi T_s(f_{k_0}+f_c)(a_p - a_m)\big)}. \qquad (3.42) $$
Using the first-order term of the Taylor expansion of the right-hand side of (3.42), we have
$$ \frac{|\boldsymbol{\pi}_p^H\boldsymbol{\pi}_m|^2}{P^2} \ge 1 - \big(2\pi T_s(f_{k_0}+f_c)(a_p - a_m)\big)^2. \qquad (3.43) $$
Substituting (3.43) into (3.41), we can rewrite it as
$$ \frac{1}{P}\|\mathbf{D} - \hat{\mathbf{D}}\|_F \le \sqrt{8}\,\pi T_s(f_{k_0}+f_c)\sum_{p=1}^{P}|h_p(a_p - a_m)|. \qquad (3.44) $$
Defining $\gamma_0 = \sqrt{8}\,\pi T_s(f_{k_0}+f_c)$, we obtain
$$ \frac{1}{P}\|\mathbf{D} - \hat{\mathbf{D}}\|_F \le \gamma_0\sum_{p=1}^{P}|h_p(a_p - a_m)|. \qquad (3.45) $$

3.11 Appendix C: Proof of Theorem 3

Consider that $\mathbf{D}^* = \mathcal{H}(\mathbf{d}^*)$ is a feasible solution for Eq. (3.26) found by convex programming and $\hat{\mathbf{D}} = \mathcal{H}(\hat{\mathbf{d}})$ is the optimal solution, with $\mathbf{r} = \hat{\mathbf{d}} + \mathbf{n}$. Note that in this proof we consider the general case where $\mathrm{rank}(\hat{\mathbf{D}}) = \eta$, while in Theorem 2 we have $\eta = 1$. Let us define $\boldsymbol{\delta} = \mathbf{d}^* - \hat{\mathbf{d}}$, $\boldsymbol{\Delta} = \mathcal{H}(\boldsymbol{\delta})$, and $\mathbf{N} = \mathcal{H}(\mathbf{n})$. We first prove a key lemma.
Lemma 6. There exists a decomposition of the error matrix $\boldsymbol{\Delta}$ as $\boldsymbol{\Delta} = \boldsymbol{\Delta}_1 + \boldsymbol{\Delta}_2$ such that (i) $\mathrm{rank}(\boldsymbol{\Delta}_1) \le 2\eta$ and (ii) $\|\boldsymbol{\Delta}_2\|_* \le 3\|\boldsymbol{\Delta}_1\|_*$.

Proof: (i) Assume that $\mathbf{D}^* = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^H$ is the singular value decomposition (SVD) of $\mathbf{D}^*$. Define $\boldsymbol{\Omega} = \mathbf{U}^H\boldsymbol{\Delta}\mathbf{V}$ and consider the block representation of $\boldsymbol{\Omega}$:
$$ \boldsymbol{\Omega} = \begin{bmatrix} \boldsymbol{\Omega}_{11} & \boldsymbol{\Omega}_{12} \\ \boldsymbol{\Omega}_{21} & \boldsymbol{\Omega}_{22} \end{bmatrix}, $$
where $\boldsymbol{\Omega}_{11}$ is an $\eta\times\eta$ matrix and $\boldsymbol{\Omega}_{22}$ is $(S-P-\eta+1)\times(S-P-\eta+1)$. Thus, if we define
$$ \boldsymbol{\Delta}_2 = \mathbf{U}\begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Omega}_{22} \end{bmatrix}\mathbf{V}^H, \quad \text{and} \quad \boldsymbol{\Delta}_1 = \boldsymbol{\Delta} - \boldsymbol{\Delta}_2, \qquad (3.46) $$
then we have
$$ \mathrm{rank}(\boldsymbol{\Delta}_1) \le \mathrm{rank}\left(\tfrac{1}{2}\begin{bmatrix} \boldsymbol{\Omega}_{11} & \boldsymbol{\Omega}_{12} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\right) + \mathrm{rank}\left(\tfrac{1}{2}\begin{bmatrix} \boldsymbol{\Omega}_{11} & \mathbf{0} \\ \boldsymbol{\Omega}_{21} & \mathbf{0} \end{bmatrix}\right) \le 2\eta. $$
(ii) Let us define the following subspaces:
$$ \Gamma_P := \{\mathbf{A} \mid \mathcal{R}\{\mathbf{A}\} \subseteq \mathbf{V}_\eta \text{ and } \mathcal{C}\{\mathbf{A}\} \subseteq \mathbf{U}_\eta\}, \qquad (3.47) $$
$$ \Gamma'_P := \{\mathbf{A} \mid \mathcal{R}\{\mathbf{A}\} \perp \mathbf{V}_\eta \text{ and } \mathcal{C}\{\mathbf{A}\} \perp \mathbf{U}_\eta\}, \qquad (3.48) $$
where $\mathcal{R}$ and $\mathcal{C}$ denote the row space and column space, respectively, of the matrix $\mathbf{A}$, and $\mathbf{V}_\eta$ and $\mathbf{U}_\eta$ represent the first $\eta$ columns of $\mathbf{V}$ and $\mathbf{U}$, respectively. With $\boldsymbol{\Delta}_2$ as defined in (3.46), we can write $\|\mathcal{P}_{\Gamma_P}(\mathbf{D}^*) + \boldsymbol{\Delta}_2\|_* = \|\mathcal{P}_{\Gamma_P}(\mathbf{D}^*)\|_* + \|\boldsymbol{\Delta}_2\|_*$, where $\mathcal{P}_{\Gamma_P}$ denotes the projection onto the subspace $\Gamma_P$. In general, we can decompose $\mathbf{D}^*$ as $\mathbf{D}^* = \mathcal{P}_{\Gamma_P}(\mathbf{D}^*) + \mathcal{P}_{\Gamma'_P}(\mathbf{D}^*)$. Since $\mathbf{D}^*$ is a rank-$\eta$ matrix, $\mathcal{P}_{\Gamma'_P}(\mathbf{D}^*) = 0$, and therefore $\mathbf{D}^* = \mathcal{P}_{\Gamma_P}(\mathbf{D}^*)$. Thus,
$$ \|\hat{\mathbf{D}}\|_* = \|\mathbf{D}^* + \boldsymbol{\Delta}\|_* = \|\mathcal{P}_{\Gamma_P}(\mathbf{D}^*) + \boldsymbol{\Delta}_1 + \boldsymbol{\Delta}_2\|_* \ge \|\mathcal{P}_{\Gamma_P}(\mathbf{D}^*) + \boldsymbol{\Delta}_2\|_* - \|\boldsymbol{\Delta}_1\|_* = \|\mathbf{D}^*\|_* + \|\boldsymbol{\Delta}_2\|_* - \|\boldsymbol{\Delta}_1\|_* =: \mathrm{RHS}. \qquad (3.49) $$
As a result, we can write
$$ \|\mathbf{D}^*\|_* - \|\hat{\mathbf{D}}\|_* \le \|\mathbf{D}^*\|_* - \mathrm{RHS} = \|\boldsymbol{\Delta}_1\|_* - \|\boldsymbol{\Delta}_2\|_*. \qquad (3.50) $$
On the other hand, since $\mathbf{d}^*$ is a feasible solution and $\hat{\mathbf{d}}$ is the optimal solution, we can write
$$ \|\mathbf{r} - \hat{\mathbf{d}}\|_2^2 + \lambda\|\hat{\mathbf{D}}\|_* \le \|\mathbf{r} - \mathbf{d}^*\|_2^2 + \lambda\|\mathbf{D}^*\|_*. \qquad (3.51) $$
Then, rearranging the terms in Equation (3.51) and substituting the error matrix $\boldsymbol{\Delta} = \mathcal{H}(\boldsymbol{\delta}) = \mathbf{D}^* - \hat{\mathbf{D}}$ and $\mathbf{r} = \hat{\mathbf{d}} + \mathbf{n}$, we can rewrite (3.51) as
$$ \|\boldsymbol{\delta}\|_2^2 \le 2\langle \mathbf{n}, \boldsymbol{\delta} \rangle + \lambda\big\{\|\hat{\mathbf{D}} + \boldsymbol{\Delta}\|_* - \|\hat{\mathbf{D}}\|_*\big\} \qquad (3.52) $$
$$ \le 2|\langle \mathbf{n}, \boldsymbol{\delta} \rangle| + \lambda\big\{\|\hat{\mathbf{D}} + \boldsymbol{\Delta}\|_* - \|\hat{\mathbf{D}}\|_*\big\}. \qquad (3.53) $$
Using the definition of the adjoint of an operator, we have $|\langle \mathbf{n}, \boldsymbol{\delta} \rangle| = |\mathrm{trace}\{\langle \mathbf{N}, \boldsymbol{\Delta} \rangle\}|$, and by the Hölder inequality for Schatten p-norms (here $p = 2$) [9], we have $|\mathrm{trace}(\langle \mathbf{N}, \boldsymbol{\Delta} \rangle)| \le \|\mathbf{N}\|_2\|\boldsymbol{\Delta}\|_2 \le \|\mathbf{N}\|_2\|\boldsymbol{\Delta}\|_*$. Thus, we have
$$ 0 \le \|\boldsymbol{\delta}\|_2^2 \le \|\mathbf{N}\|_2\|\boldsymbol{\Delta}\|_* + \lambda\big\{\|\hat{\mathbf{D}} + \boldsymbol{\Delta}\|_* - \|\hat{\mathbf{D}}\|_*\big\}. \qquad (3.54) $$
By the assumption in the theorem statement, $\lambda \ge 2\|\mathbf{N}\|_2$. Therefore, substituting (3.50) into the inequality (3.54), we obtain
$$ 0 \le \frac{\lambda}{2}\|\boldsymbol{\Delta}\|_* + \lambda\big\{\|\boldsymbol{\Delta}_1\|_* - \|\boldsymbol{\Delta}_2\|_*\big\} \le \frac{3\lambda}{2}\|\boldsymbol{\Delta}_1\|_* - \frac{\lambda}{2}\|\boldsymbol{\Delta}_2\|_*, \qquad (3.55) $$
and the proof of the lemma is complete.

By the triangle inequality for the nuclear norm, we can rewrite Eq. (3.54) as
$$ \|\boldsymbol{\delta}\|_2^2 \le \|\mathbf{N}\|_2\|\boldsymbol{\Delta}\|_* + \lambda\|\boldsymbol{\Delta}\|_*. \qquad (3.56) $$
Due to our choice of $\lambda$ in the theorem statement, i.e., $\lambda \ge 2\|\mathbf{N}\|_2$, and since $\|\boldsymbol{\delta}\|_2 = \|\hat{\mathbf{d}} - \mathbf{d}^*\|_2$, we have
$$ \|\hat{\mathbf{d}} - \mathbf{d}^*\|_2\,\|\boldsymbol{\delta}\|_2 \le 2\lambda\|\boldsymbol{\Delta}\|_*. \qquad (3.57) $$
Now we use Lemma 6 to bound the right-hand side of (3.57). We know that there exists a decomposition of the error matrix $\boldsymbol{\Delta}$ as $\boldsymbol{\Delta} = \boldsymbol{\Delta}_1 + \boldsymbol{\Delta}_2$ such that $\|\boldsymbol{\Delta}_2\|_* \le 3\|\boldsymbol{\Delta}_1\|_*$ and $\mathrm{rank}(\boldsymbol{\Delta}_1) \le 2\eta$. Therefore, by the triangle inequality, we have $\|\boldsymbol{\Delta}\|_* \le \|\boldsymbol{\Delta}_1\|_* + \|\boldsymbol{\Delta}_2\|_* \le 4\|\boldsymbol{\Delta}_1\|_*$. Furthermore, using the rank constraint, we know that $\|\boldsymbol{\Delta}_1\|_* \le \sqrt{2\eta}\|\boldsymbol{\Delta}_1\|_F \le \sqrt{2\eta}\|\boldsymbol{\Delta}\|_F \le \sqrt{2\eta}\|\boldsymbol{\Delta}\|_2 \le \sqrt{2\eta}\|\boldsymbol{\delta}\|_2$. Putting these statements together, we have $\|\hat{\mathbf{d}} - \mathbf{d}^*\|_2\|\boldsymbol{\delta}\|_2 \le 2\lambda\|\boldsymbol{\Delta}\|_* \le 8\lambda\sqrt{2\eta}\|\boldsymbol{\delta}\|_2$, or $\|\hat{\mathbf{d}} - \mathbf{d}^*\|_2 \le \sqrt{32}\,\lambda\eta$. Note that in Theorem 2, $\eta = 1$.

Chapter 4

Narrowband TV Channel Estimation

4.1 Introduction

Wireless communications have enabled intelligent traffic safety [66, 12], automated robotic networks, underwater surveillance systems [48, 11], and many other useful technologies. In all of these systems, establishing a reliable, high-data-rate communication link between the source and destination is essential.
To achieve this goal, the system requires accurate channel state information to equalize the distortion of the transmitted signal and recover the message with minimum error at the destination. One well-known approach to acquiring channel state information is to probe the channel in time/frequency with known signals and reconstruct the channel response from the output signals (see [61] and references therein). Least-squares (LS) and Wiener filters [55, 101] are classical examples of this approach. However, these methods do not take advantage of the rich intrinsic structure of wireless communication channels in their estimation process. More recent approaches have tried to exploit the intrinsic channel structure, such as sparsity [95, 4] and group sparsity or mixed/hybrid sparse and group-sparse structures [69, 68, 12], using compressed sensing/sparse approximation algorithms. Due to practical communication system constraints such as finite block length and finite transmission bandwidth, the inherent sparsity of the channel in the received signal is reduced. This effect is defined as channel leakage in [95, 12]. It has been shown that the performance of CS methods is significantly degraded by the leakage effect in practice [12, 95].

In this work, we show that under the above practical communication system constraints, the transmitted signal, after passing through a time-varying narrowband channel, can be represented by a parametric low-rank bilinear form. The low-rank structure appears due to the small number of dominant paths in wireless communication channels, and the bilinear structure is due to the separability of the (leaked) time-varying channel in the delay and Doppler directions. In addition, we show that this representation can be interpreted as a summation of rank-one matrices characterized by the key parameters of the channel and the pulse leakage. To exploit these structures of the channel matrix in our estimation process, we propose two approaches.

In the first approach, which is motivated by our prior work [30], we enforce the rank constraint and the bilinear form of the channel estimation problem via non-convex programming. To solve the non-convex optimization problem, we use the alternating direction method, owing to the bilinearity of the measurement model. In the first step of this algorithm, we recover the channel in the delay direction, and in the second step we estimate the channel in the Doppler direction. We repeat these steps iteratively until we converge to a stationary point. Due to the non-convex nature of the objective function in our first strategy, we may find a local minimum.

To improve the performance of our algorithm and avoid local minima, we consider another approach that captures the channel structures using convex programming. We define a set of atoms describing the set of rank-one matrices in our channel estimation problem. Utilizing this set of atoms, we show that the channel estimation problem can be stated as a parametric low-rank matrix recovery problem. Motivated by convex recovery for inverse problems via the atomic norm heuristic [25, 94, 28], we develop a recovery algorithm employing the atomic norm to enforce the channel model and leakage structures via a convex optimization problem. We analyze the algorithm to show that the global optimum can be recovered in the absence of noise.
Numerical results show that the proposed algorithm provides an average performance improvement (in the SNR sense) of 5-8 dB at SNR > 5 dB compared to the $l_1$-based sparse approximation method.

The rest of this paper is organized as follows. Section 4.2 derives the communication system model, which is used in Section 4.3 to derive the discrete-time observation model. Section 4.4 presents the proposed structured methods for time-varying channel estimation. Section 4.5 is devoted to discussion and numerical results, and finally Section 4.6 concludes the paper.

Notation: Scalar values are denoted by lower-case letters $x$ and column vectors by bold letters $\mathbf{x}$; the $i$-th element of $\mathbf{x}$ is given by $x[i]$. Given a vector $\mathbf{x} \in \mathbb{R}^n$, $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$ denotes the $\ell_2$ norm of $\mathbf{x}$. A matrix is denoted by bold capital letters such as $\mathbf{X}$, and its $(i,j)$-th element by $X[i,j]$. The transpose of $\mathbf{X}$ is given by $\mathbf{X}^T$ and its conjugate transpose by $\mathbf{X}^H$. A diagonal matrix with elements $\mathbf{x}$ is written as $\mathrm{diag}\{\mathbf{x}\}$, and the identity matrix as $\mathbf{I}$. The set of real numbers is denoted by $\mathbb{R}$ and the set of complex numbers by $\mathbb{C}$. The element-wise (Schur) product is denoted by $\odot$.

4.2 System Model

We consider that the transmitted signal $x(t)$ is generated by the modulation of a pilot sequence $x[n]$ onto the transmit pulse $p_t(t)$ as
$$ x(t) = \sum_{n=-\infty}^{+\infty} x[n]\,p_t(t - nT_s), $$
where $T_s$ is the sampling period. Note that this signal model is quite general and encompasses OFDM signals as well as single-carrier signals. The signal $x(t)$ is transmitted over a linear, time-varying channel. The received signal $y(t)$ can be written as
$$ y(t) = \int_{-\infty}^{+\infty} h(t,\tau)\,x(t-\tau)\,d\tau + z(t). \qquad (4.1) $$
Here, $h(t,\tau)$ is the channel's time-varying impulse response and $z(t)$ is Gaussian noise. A common model for the narrowband time-varying (TV) impulse response is
$$ h(t,\tau) = \sum_{k=1}^{p_0} \eta_k\,\delta(\tau - t_k)\,e^{j2\pi\nu_k t}, \qquad (4.2) $$
where $p_0$ denotes the number of dominant paths in the channel, and $\eta_k$, $t_k$, and $\nu_k$ denote the $k$-th channel path's attenuation gain, delay, and Doppler shift, respectively. At the receiver, $y(t)$ is converted into a discrete-time signal using an anti-aliasing filter $p_r(t)$. That is,
$$ y[n] = \int_{-\infty}^{+\infty} y(t)\,p_r(nT_s - t)\,dt. $$
We assume that $p_t(t)$ and $p_r(t)$ are causal with support $[0, T_{\mathrm{supp}})$. Under the reasonable assumption $\nu_{\max}T_{\mathrm{supp}} \ll 1$, where $\nu_{\max} = \max(\nu_1, \dots, \nu_{p_0})$ denotes the Doppler spread of the channel [66], and letting $p(t) = p_t(t) * p_r(t)$, we can write the received signal after filtering and sampling as [12]
$$ y[n] = \sum_{m=0}^{m_0-1} h_l[n,m]\,x[n-m] + z[n], \qquad (4.3) $$
where $h_l[n,m] = \sum_{k=1}^{p_0} h_{l,k}[n,m]$ and
$$ h_{l,k}[n,m] = \eta_k\,e^{j2\pi\nu_k((n-m)T_s - t_k)}\,p(mT_s - t_k) \qquad (4.4) $$
for $n \in \mathbb{N}$. Here $m_0 = \lfloor \tau_{\max}/T_s \rfloor + 1$, where $\tau_{\max}$ is the maximum delay spread of the channel, denotes the maximum discrete delay spread of the channel. Without loss of generality, assuming that $p_r(t)$ has a root-Nyquist spectrum with respect to the sample duration $T_s$ implies that $z[n]$ is a sequence of i.i.d. circularly symmetric complex Gaussian random variables with constant variance $\sigma_z^2$. The pulse leakage effect is due to the non-zero support of the pulse $p(\cdot)$ in (4.4).
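To make the leakage effect of (4.4) tangible, the following sketch computes the leaked taps h_l[n, m] for a few off-grid paths; the raised-cosine-like pulse and all numerical values are illustrative assumptions of ours, not the pulses used in the text.

```python
import numpy as np

def leaked_taps(eta, t, nu, Ts, m0, n, pulse):
    """Discrete leaked channel taps h_l[n, m] of Eq. (4.4) at time index n.

    `pulse` plays the role of p(t) = p_t(t) * p_r(t); a raised-cosine
    stand-in is used below purely for illustration.
    """
    m = np.arange(m0)
    h = np.zeros(m0, dtype=complex)
    for e, tk, nk in zip(eta, t, nu):
        h += e * np.exp(2j * np.pi * nk * ((n - m) * Ts - tk)) * pulse(m * Ts - tk)
    return h

def rc_pulse(tau, T_supp=4e-6):
    """Assumed raised-cosine-like pulse supported on [0, T_supp)."""
    out = np.zeros_like(tau, dtype=float)
    inside = (tau >= 0) & (tau < T_supp)
    out[inside] = 0.5 * (1 - np.cos(2 * np.pi * tau[inside] / T_supp))
    return out

# Example: 3 paths with off-grid delays leak energy across several taps m.
Ts, m0 = 1e-6, 8
h_l = leaked_taps(eta=[1.0, 0.5, 0.3], t=[0.7e-6, 2.3e-6, 3.1e-6],
                  nu=[50.0, -120.0, 200.0], Ts=Ts, m0=m0, n=0, pulse=rc_pulse)
print(np.round(np.abs(h_l), 3))
```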
The leakage with respect to Doppler can be decreased by increasing the observed received signal length, and the leakage with respect to delay can be decreased by increasing the bandwidth of the transmitted signal. However, due to practical constraints on the observed received signal length and the bandwidth of the transmitted signal, the leakage effect increases the number of nonzero coefficients of the observed leaked channel at the receiver side (for more details, see [12, 95]).

The main goal of the channel estimation problem is the determination of the channel coefficients, i.e., $\{h_{l,k}[n,m] \mid 1 \le k \le p_0,\, 0 \le m \le m_0-1\}$, at time instance $n$, in order to equalize their effect on the transmitted signal. Clearly, at each time instance $n$ there exist $m_0 p_0$ (unknown) channel coefficients to be estimated. These coefficients should be estimated via measurements from the model derived in (4.3), where both the transmitted signal $x[n]$ and the received signal $y[n]$ are known during the training signal transmission, i.e., $n \in \{m_0, \dots, n_T + m_0 - 1\}$, where $n_T$ denotes the total number of training signal measurements.

Traditional estimation approaches use the maximum likelihood (ML) criterion to estimate the channel coefficients, given that the noise distribution is Gaussian with zero mean and constant variance. This approach is well known as the least-squares (LS) method, e.g., [4]. The main challenge with this approach is that it requires a large number of measurements relative to the number of unknown parameters in order to perform well [4]. To combat this challenge, we need to take advantage of side information about the structure of the unknown parameters. By promoting the structure of the unknown parameters in the estimation process, we in fact reduce the size of the feasible set of solutions of our estimation problem; hence, fewer measurements are needed to find the optimal solution. In the next section, we show that the channel coefficients follow useful structures that we can exploit in the estimation process in order to shrink the set of feasible solutions for the channel coefficients.

4.3 Parametric Signal Representation

In this section, we explore useful intrinsic structures of the measurement model in (4.3). We show that even though the leakage effect destroys the sparsity of the channel coefficients, it introduces a useful parametric low-rank structure that enables accurate channel estimation from a small number of measurements. Define $l_k(t) = p(t - t_k)e^{-j2\pi\nu_k t}$; then, using (4.3) and (4.4), the received signal from the $k$-th path can be written as
$$ s_k[n] = \sum_{m=0}^{m_0-1} l_k(mT_s)\,x[n-m] = \mathbf{x}_n^T\mathbf{l}_k, \qquad (4.5) $$
where $\mathbf{l}_k = [l_k(0\cdot T_s), \dots, l_k((m_0-1)T_s)]^T$ and $\mathbf{x}_n = [x[n], x[n-1], \dots, x[n-(m_0-1)]]^T$. The vectors $\mathbf{l}_k$, $1 \le k \le p_0$, contain only the (shifted) leakage pulse-shape information, and the vector $\mathbf{x}_n$ is described by $m_0$ consecutive measurements of the training signal up to time $n$. We can represent the (aggregate) received signal in (4.3) as
$$ y[n] = \sum_{k=1}^{p_0} \bar{\eta}_k\,s_k[n]\,e^{j2\pi\bar{\nu}_k n}, \qquad (4.6) $$
where $\bar{\eta}_k = \eta_k e^{-j2\pi\nu_k t_k}$ and $\bar{\nu}_k = \nu_k T_s \in [-\frac{1}{2}, \frac{1}{2}]$. If we stack $s_k[n]$ for $m_0 \le n \le n_T + m_0 - 1$ in a vector as $\mathbf{s}_k = [s_k[m_0], \dots, s_k[n_T + m_0 - 1]]^T$, we can write
$$ \mathbf{s}_k = \mathbf{X}\mathbf{l}_k, \qquad (4.7) $$
where $\mathbf{X}$ is an $n_T$-by-$m_0$ matrix whose $i$-th row equals $\mathbf{x}_{i+m_0-1}^T$. In wireless communication systems, typically $n_T > m_0$; that is, all the $\mathbf{s}_k$ live in a common low-dimensional subspace spanned by the columns of the known $n_T\times m_0$ matrix $\mathbf{X}$. We assume $\|\mathbf{l}_k\|_2 = 1$ without loss of generality. By Equation (4.5), recovery of $\mathbf{s}_k$ is guaranteed if $\mathbf{l}_k$ can be recovered.
Therefore, the number of degrees of freedom in (4.6) becomes $O(m_0 p_0)$, which is smaller than the number of measurements $n_T$ when $p_0, m_0 \ll n_T$. Applying Equation (4.5), we can rewrite (4.6) as
$$ y[n] = \sum_{k=1}^{p_0} \bar{\eta}_k\,e^{j2\pi n\bar{\nu}_k}\,\mathbf{x}_n^T\mathbf{l}_k. \qquad (4.8) $$
Defining $\mathbf{d}(\nu) = \big[e^{-j2\pi m_0\nu}, \dots, e^{-j2\pi(n_T+m_0-1)\nu}\big]^T$, which denotes the vector of all possible Doppler shifts in the channel representation, we have
$$ y[n] = \left\langle \sum_{k=1}^{p_0} \bar{\eta}_k\,\mathbf{l}_k\mathbf{d}(\nu_k)^H,\; \mathbf{x}_n\mathbf{e}_{n-m_0+1}^T \right\rangle, \qquad (4.9) $$
for $n = m_0, \dots, n_T + m_0 - 1$, where we have defined $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{trace}(\mathbf{Y}^H\mathbf{X})$, and $\mathbf{e}_n$, $1 \le n \le n_T$, is the canonical basis of $\mathbb{R}^{n_T\times 1}$. A careful reader sees that the first term in the matrix inner product in (4.9) contains only the (narrowband) time-varying channel information, while the second term is described only by the training signal measurements. Hereafter, we define the leaked channel matrix as
$$ \mathbf{H}_l = \sum_{k=1}^{p_0} \bar{\eta}_k\,\mathbf{l}_k\mathbf{d}(\nu_k)^H. \qquad (4.10) $$
Since each term in the above summation is a rank-one matrix, one can write $\mathrm{rank}(\mathbf{H}_l) \le p_0$. We see that (4.9) leads to a parametrized rank-$p_0$ matrix recovery problem, which we write as
$$ \mathbf{y} = \Pi(\mathbf{H}_l), \qquad (4.11) $$
where the linear operator $\Pi : \mathbb{C}^{m_0\times n_T} \to \mathbb{C}^{n_T\times 1}$ is defined as $[\Pi(\mathbf{H}_l)]_n = \langle \mathbf{H}_l, \mathbf{x}_n\mathbf{e}_n^T \rangle$.

4.4 Structured Estimation of Time-varying Narrowband Channels

In this section, we propose two algorithms to estimate the leaked channel matrix $\mathbf{H}_l$ from the training signal measurements $y[n]$ by exploiting the parametrized low-rank matrix structure described in Section 4.3. We show that by applying the channel structure in the estimation process, the channel coefficients can be estimated with high accuracy from a small number of measurements.

4.4.1 Non-convex Approach

Recovering the channel from the linear measurement model $\mathbf{y} = \Pi(\mathbf{H}_l)$ described in (4.11) is equivalent to searching for a low-rank matrix $\mathbf{H}$ that satisfies this measurement model. Thus, we seek to solve
$$ \hat{\mathbf{H}}_l = \underset{\mathbf{H}}{\text{argmin}}\; \|\mathbf{y} - \Pi(\mathbf{H})\|_2^2 \quad \text{s.t.} \quad \mathrm{rank}(\mathbf{H}) \le p_0, \quad \mathbf{H} = \sum_{k=1}^{p_0} \bar{\eta}_k\,\mathbf{l}_k\mathbf{d}(\nu_k)^H. \qquad (4.12) $$
Unfortunately, due to the rank constraint, this optimization problem is in general NP-hard [81]. A tractable way to relax the rank constraint is to minimize the nuclear norm of the target matrix [81]. In particular, we can write the relaxed optimization problem as
$$ \hat{\mathbf{H}}_l = \underset{\mathbf{H}}{\text{argmin}}\; \|\mathbf{y} - \Pi(\mathbf{H})\|_2^2 + \lambda\|\mathbf{H}\|_*, \qquad (4.13) $$
where the parameter $\lambda$ in (4.13) determines the trade-off between the fidelity of the solution to the measurements $\mathbf{y}$ and its conformance to the low-rank model. Furthermore, from (4.10), we know that the channel matrix $\mathbf{H}_l$ can be represented as
$$ \mathbf{H}_l = \sum_{k=1}^{p_0} \bar{\eta}_k\,\mathbf{l}_k\mathbf{d}(\nu_k)^H = \mathbf{H}_{l,t}\mathbf{H}_{l,\nu}, \qquad (4.14) $$
where
$$ \mathbf{H}_{l,t} = [\bar{\eta}_1\mathbf{l}_1, \bar{\eta}_2\mathbf{l}_2, \dots, \bar{\eta}_{p_0}\mathbf{l}_{p_0}], \qquad (4.15) $$
$$ \mathbf{H}_{l,\nu} = [\mathbf{d}(\nu_1), \mathbf{d}(\nu_2), \dots, \mathbf{d}(\nu_{p_0})]^H. \qquad (4.16) $$
Note that the matrix $\mathbf{H}_{l,t}$ contains the channel components in the delay direction and the matrix $\mathbf{H}_{l,\nu}$ contains the channel components in the Doppler direction. We see that the leaked channel matrix $\mathbf{H}_l$ is separable in the delay and Doppler directions. Therefore, we can reformulate the optimization problem in (4.12) with $\mathbf{H} = \mathbf{H}_t\mathbf{H}_\nu$ as
$$ \underset{\mathbf{H}_t, \mathbf{H}_\nu}{\text{argmin}}\; \|\mathbf{y} - \Pi(\mathbf{H}_t\mathbf{H}_\nu)\|_2^2 + \lambda\|\mathbf{H}_t\mathbf{H}_\nu\|_*. \qquad (4.17) $$
From [82], we know that the nuclear norm of a product can be rewritten as a Frobenius-norm minimization:
$$ \|\mathbf{H}_t\mathbf{H}_\nu\|_* = \min_{\mathbf{H}_t, \mathbf{H}_\nu} \frac{1}{2}\big(\|\mathbf{H}_t\|_F^2 + \|\mathbf{H}_\nu\|_F^2\big). \qquad (4.18) $$
Thus, we can rewrite the overall optimization problem in (4.17) as
$$ \underset{\mathbf{H}_t, \mathbf{H}_\nu}{\text{argmin}}\; \|\mathbf{y} - \Pi(\mathbf{H}_t\mathbf{H}_\nu)\|_2^2 + \frac{\lambda}{2}\big(\|\mathbf{H}_t\|_F^2 + \|\mathbf{H}_\nu\|_F^2\big). \qquad (4.19) $$
From (4.16), we see that the matrix $\mathbf{H}_\nu$ is a partial Vandermonde matrix and can be fully determined by the Doppler parameters $\boldsymbol{\nu} = [\nu_1, \dots, \nu_{p_0}]$. Therefore, instead of optimizing $\mathbf{H}_\nu$ over $\mathbb{C}^{p_0\times n_T}$, we perform the optimization over the set of Doppler parameters $\boldsymbol{\nu} \in [-\frac{1}{2}, \frac{1}{2}]^{p_0}$ as follows:
$$ \underset{\mathbf{H}_t, \boldsymbol{\nu}}{\text{argmin}}\; \|\mathbf{y} - \Pi(\mathbf{H}_t\mathbf{H}_\nu(\boldsymbol{\nu}))\|_2^2 + \frac{\lambda}{2}\|\mathbf{H}_t\|_F^2 + \frac{\lambda}{2}\|\mathbf{H}_\nu(\boldsymbol{\nu})\|_F^2. $$
Note that the above optimization problem is non-convex in general [30], due to the product of unknowns inside $\Pi(\mathbf{H}_t\mathbf{H}_\nu)$. However, because the objective function is separable in the delay and Doppler directions [12], we use an alternating projections algorithm, a space-efficient technique that stores the iterates in factored form. The algorithm is extraordinarily simple and easy to interpret: in the first step, we fix one of $\mathbf{H}_t$ or $\mathbf{H}_\nu$ and optimize over the other; in the second step, we substitute the matrix computed in the first step and optimize over the second matrix. We iterate these two steps until the algorithm converges to the optimal solution (or a stationary point). Given the current estimates $\mathbf{H}_t^k$ and $\mathbf{H}_\nu^k$, the update rules can be summarized as
$$ \mathbf{H}_t^{k+1} = \underset{\mathbf{H}_t}{\text{argmin}}\; \big\|\mathbf{y} - \Pi\big(\mathbf{H}_t\mathbf{H}_\nu(\boldsymbol{\nu}^k)\big)\big\|_2^2 + \frac{\lambda}{2}\|\mathbf{H}_t\|_F^2, \qquad (4.20) $$
$$ \boldsymbol{\nu}^{k+1} = \underset{\boldsymbol{\nu}}{\text{argmin}}\; \big\|\mathbf{y} - \Pi\big(\mathbf{H}_t^{k+1}\mathbf{H}_\nu(\boldsymbol{\nu})\big)\big\|_2^2. \qquad (4.21) $$
We can further simplify these iterations using Lemma 7.

Lemma 7 (see, e.g., [45]). Suppose that $\Pi : \mathbb{C}^{m_0\times n_T} \to \mathbb{C}^{n_T\times 1}$ is a linear operator. We can express the action of this operator as
$$ [\Pi(\mathbf{H})]_n = \mathrm{Tr}(\mathbf{X}_n\mathbf{H}) = \sum_{i=1}^{m_0}\sum_{j=1}^{n_T} X_n[j,i]\,H[i,j], $$
where $\mathbf{X}_n = \mathbf{x}_n\mathbf{e}_{n-m_0+1}^T$ for $n = m_0, \dots, n_T + m_0 - 1$, and $\mathbf{H} = \mathbf{H}_t\mathbf{H}_\nu$. Then we can write
$$ \Pi(\mathbf{H}) = \mathbf{A}_t\,\mathrm{vec}(\mathbf{H}_\nu) = \mathbf{A}_\nu\,\mathrm{vec}(\mathbf{H}_t), \qquad (4.22) $$
where $\mathrm{vec}(\cdot)$ stacks the columns of its matrix argument into a single column vector. The matrices $\mathbf{A}_t$ and $\mathbf{A}_\nu$ are defined as
$$ A_t[k, l + p_0(j-1)] = \sum_{i=1}^{m_0} X_k[j,i]\,H_t[i,l] \quad \text{and} \quad A_\nu[k, i + m_0(l-1)] = \sum_{j=1}^{n_T} X_k[j,i]\,H_\nu[l,j], $$
where $l \in \{1, \dots, p_0\}$.

Applying Lemma 7, we can rewrite each iteration of the alternating projections algorithm in (4.20) and (4.21) as
$$ \mathbf{H}_t^{k+1} = \underset{\mathbf{H}_t}{\text{argmin}}\; \big\|\mathbf{y} - \mathbf{A}_\nu^k\,\mathrm{vec}(\mathbf{H}_t)\big\|_2^2 + \frac{\lambda}{2}\|\mathrm{vec}(\mathbf{H}_t)\|_2^2, $$
$$ \boldsymbol{\nu}^{k+1} = \underset{\boldsymbol{\nu}}{\text{argmin}}\; \big\|\mathbf{y} - \mathbf{A}_t^{k+1}\,\mathrm{vec}(\mathbf{H}_\nu(\boldsymbol{\nu}))\big\|_2^2. $$
Thus, the update in the delay direction is simply a ridge estimator (Tikhonov regularization) [96], and in the Doppler direction we need to find the roots of an (exponential) polynomial using root-finding algorithms [78].
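A sketch of one round of the simplified alternating updates follows: a ridge solve in the delay direction and, as a stand-in of ours for the root-finding step in the Doppler direction, a coordinate-wise grid search. The helper names and the grid-search substitution are assumptions, not the thesis's solver.

```python
import numpy as np

def forward(X_rows, Ht, nu):
    """y[n] = x_n^T H_t d_n with d_n[k] = exp(j 2 pi n nu_k); cf. Eq. (4.8)."""
    nT = X_rows.shape[0]
    n = np.arange(nT)[:, None]
    Dmat = np.exp(2j * np.pi * n * np.asarray(nu)[None, :])   # nT x p0
    return np.sum((X_rows @ Ht) * Dmat, axis=1)

def ridge_Ht(X_rows, nu, y, lam):
    """Delay-direction update (4.20): a ridge / Tikhonov-regularized LS solve."""
    nT, m0 = X_rows.shape
    p0 = len(nu)
    n = np.arange(nT)[:, None]
    Dmat = np.exp(2j * np.pi * n * np.asarray(nu)[None, :])
    # A[n, k*m0 + i] = Dmat[n, k] * x_n[i], so that y = A vec(H_t).
    A = np.einsum('nk,ni->nki', Dmat, X_rows).reshape(nT, p0 * m0)
    v = np.linalg.solve(A.conj().T @ A + lam * np.eye(p0 * m0), A.conj().T @ y)
    return v.reshape(p0, m0).T        # m0 x p0; columns are eta_k * l_k

def grid_nu(X_rows, Ht, y, grid):
    """Doppler-direction update: coordinate-wise grid search standing in
    for the root-finding step of (4.21); purely illustrative."""
    nu = np.zeros(Ht.shape[1])
    for k in range(len(nu)):
        nu[k] = min(grid, key=lambda g: np.linalg.norm(
            y - forward(X_rows, Ht, np.where(np.arange(len(nu)) == k, g, nu))))
    return nu
```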
4.4.2 Convex Approach

From the physics of signal propagation in a wireless communication system, we know that the number of dominant paths in a narrowband time-varying wireless channel is small [71]. This means that the number of terms in Equation (4.8), $p_0$, is quite small compared to the number of training-signal measurements, i.e., $p_0 \ll n_T$. In other words, the channel can be described as a summation of rank-one matrices of the form $\mathbf{l}\mathbf{d}(\nu)^H$ in (4.10). This representation motivates us to use the atomic norm heuristic to promote this structure. For this purpose, we define an atom as $\mathbf{A}(\mathbf{l},\nu) = \mathbf{l}\mathbf{d}(\nu)^H$, where $\nu \in [-\frac{1}{2}, \frac{1}{2}]$ and $\mathbf{l} \in \mathbb{C}^{m_0 \times 1}$. Without loss of generality, we consider $\|\mathbf{l}\|_2 = 1$, since we have the freedom to design both the transmit and receive pulse shapes $p_t(t)$ and $p_r(t)$ to satisfy this equality. Then, we define the set of all atoms as
$$ \mathcal{A} = \left\{ \mathbf{A}(\mathbf{l},\nu) \;\middle|\; \nu \in [-\tfrac{1}{2}, \tfrac{1}{2}],\; \|\mathbf{l}\|_2 = 1,\; \mathbf{l} \in \mathbb{C}^{m_0 \times 1} \right\}. \quad (4.23) $$
Our goal here is to find a representation of the channel with a small number of dominant paths, i.e.,
$$ \|\mathbf{H}_l\|_{\mathcal{A},0} = \inf_p \left\{ p : \mathbf{H}_l = \sum_{k=1}^{p} \bar{\eta}_k \mathbf{l}_k \mathbf{d}(\nu_k)^H \right\}. \quad (4.24) $$
Due to the combinatorial nature of the norm defined in (4.24), the above optimization problem is NP-hard. Thus, we instead consider the convex relaxation of this norm, namely the atomic norm associated with the set of atoms defined in (4.23):
$$ \|\mathbf{H}_l\|_{\mathcal{A}} = \inf\{ t > 0 : \mathbf{H}_l \in t\,\mathrm{conv}(\mathcal{A}) \} = \inf_{\bar{\eta}_k, \nu_k, \|\mathbf{l}_k\|_2 = 1} \left\{ \sum_k |\bar{\eta}_k| : \mathbf{H}_l = \sum_k \bar{\eta}_k \mathbf{l}_k \mathbf{d}(\nu_k)^H \right\}. \quad (4.25) $$
This relaxation is analogous to the relaxation of the $\ell_0$-norm of a vector by its $\ell_1$-norm, which is the prevalent relaxation in the compressed-sensing literature for avoiding the combinatorial nature of the $\ell_0$-norm in the recovery problem.

Remark 6. The atomic representation in Equation (4.25) for the matrix $\mathbf{H}_l$, i.e., $\mathbf{H}_l = \sum_k \bar{\eta}_k \mathbf{A}(\mathbf{l}_k, \nu_k) = \sum_k \bar{\eta}_k \mathbf{l}_k \mathbf{d}(\nu_k)^H$, not only captures the functional forms of its elements, but also enforces the rank-one constraint on each term in the summation, i.e., $\mathrm{rank}(\mathbf{A}(\mathbf{l}_k, \nu_k)) = 1$.

To enforce the sparsity of the atomic representation, or equivalently the low-rank representation of the received signal, we solve
$$ \operatorname*{minimize}_{\mathbf{H}} \|\mathbf{H}\|_{\mathcal{A}} \quad \text{s.t.} \quad \mathbf{y} = \Pi(\mathbf{H}). \quad (4.26) $$
From [94], we know that, due to the Vandermonde decomposition, the convex hull of the set of atoms $\mathcal{A}$ can be characterized by a semidefinite program. Therefore $\|\mathbf{H}\|_{\mathcal{A}}$ in (4.26) admits an equivalent SDP representation.

Proposition 1 (see, e.g., [23, 94]). For any $\mathbf{H} \in \mathbb{C}^{m_0 \times n_T}$,
$$ \|\mathbf{H}\|_{\mathcal{A}} = \inf_{\mathbf{z}, \mathbf{W}} \frac{1}{2n_T}\,\mathrm{trace}\!\left( \mathrm{Toep}(\mathbf{z}) + n_T\mathbf{W} \right) \quad \text{s.t.} \quad \begin{bmatrix} \mathrm{Toep}(\mathbf{z}) & \mathbf{H}^H \\ \mathbf{H} & \mathbf{W} \end{bmatrix} \succeq 0, \quad (4.27) $$
where $\mathbf{z}$ is a complex vector whose first element is real, $\mathrm{Toep}(\mathbf{z})$ denotes the $n_T \times n_T$ Hermitian Toeplitz matrix whose first column is $\mathbf{z}$, and $\mathbf{W}$ is a Hermitian $m_0 \times m_0$ matrix.

Therefore, we can use an efficient off-the-shelf SDP solver such as CVX [42] to solve the optimization problem in (4.26). For noisy measurements, we consider
$$ \operatorname*{minimize}_{\mathbf{H}} \|\mathbf{H}\|_{\mathcal{A}} \quad \text{s.t.} \quad \|\mathbf{y} - \Pi(\mathbf{H})\|_2 \le \sigma_z^2. $$
Next, we show that the solution of the optimization problem in (4.26) is the optimal solution of our channel estimation problem in the noiseless scenario. Furthermore, we discuss the conditions under which the solution of this convex program is unique and can be computed from a small number of training-signal measurements with high probability.
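For illustration, the following is a hedged sketch of the SDP characterization in Proposition 1 using cvxpy, a Python analogue of the CVX package cited above. The matrix Pi_mat (the matrix of $\Pi$ acting on column-major $\mathrm{vec}(\mathbf{H})$) and all sizes are assumed inputs; this is a sketch of (4.26)-(4.27), not the exact code used in the reported experiments.

```python
import cvxpy as cp

def atomic_norm_channel_estimate(y, Pi_mat, m0, nT):
    # Joint PSD variable Q = [[Toep(z), H^H], [H, W]] from Proposition 1
    Q = cp.Variable((nT + m0, nT + m0), hermitian=True)
    constraints = [Q >> 0]
    # Top-left block must be Toeplitz: constant along each diagonal
    for k in range(nT):
        constraints += [Q[i, i + k] == Q[0, k] for i in range(1, nT - k)]
    H = Q[nT:, :nT]     # the leaked-channel block
    W = Q[nT:, nT:]
    # Measurement consistency y = Pi(H), with vec(H) stacked column-major
    constraints.append(Pi_mat @ cp.vec(H) == y)
    obj = cp.Minimize(cp.real(cp.trace(Q[:nT, :nT]) + nT * cp.trace(W)) / (2 * nT))
    cp.Problem(obj, constraints).solve(solver=cp.SCS)  # complex SDP solver
    return H.value
```

The noisy variant replaces the equality constraint with a norm bound on the residual, exactly as in the relaxed problem stated above.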
Optimality and Uniqueness

The dual of the optimization problem in (4.26), obtained via standard Lagrangian analysis, can be written as
$$ \operatorname*{maximize}_{\boldsymbol{\lambda}} \langle \boldsymbol{\lambda}, \mathbf{y} \rangle \quad \text{s.t.} \quad \|\Pi^*(\boldsymbol{\lambda})\|_{\mathcal{A}}^* \le 1, \quad (4.28) $$
where $\Pi^*(\boldsymbol{\lambda}) = \sum_k \lambda(k)\,\mathbf{x}_k \mathbf{e}_{k-m_0+1}^T$ is the adjoint operator of $\Pi$ and $\|\cdot\|_{\mathcal{A}}^*$ denotes the dual norm of the atomic norm. Therefore, we have
$$ \|\Pi^*(\boldsymbol{\lambda})\|_{\mathcal{A}}^* = \sup_{\|\boldsymbol{\Theta}\|_{\mathcal{A}} \le 1} \langle \Pi^*(\boldsymbol{\lambda}), \boldsymbol{\Theta} \rangle = \sup_{\nu \in [-\frac{1}{2},\frac{1}{2}],\, \|\mathbf{l}\|_2 = 1} \left\langle \Pi^*(\boldsymbol{\lambda}),\, \mathbf{l}\mathbf{d}(\nu)^H \right\rangle. \quad (4.29) $$
The second equality in (4.29) holds since the set $\{\mathbf{l}\mathbf{d}(\nu)^H\}_{\nu, \mathbf{l}}$ covers all the extremal points of the atomic-norm unit ball $\{\boldsymbol{\Theta} : \|\boldsymbol{\Theta}\|_{\mathcal{A}} \le 1\}$. If we define the vector-valued function $\boldsymbol{\mu}(\nu) = \Pi^*(\boldsymbol{\lambda})\mathbf{d}(\nu)$, we have
$$ \|\Pi^*(\boldsymbol{\lambda})\|_{\mathcal{A}}^* = \sup_{\nu \in [-\frac{1}{2},\frac{1}{2}],\, \|\mathbf{l}\|_2 = 1} \mathbf{l}^H\boldsymbol{\mu}(\nu) \le \sup_{\nu \in [-\frac{1}{2},\frac{1}{2}]} \|\boldsymbol{\mu}(\nu)\|_2. \quad (4.30) $$
Now, under the condition
$$ \|\boldsymbol{\mu}(\nu)\|_2 \le 1, \quad \text{(C-1)} $$
we can rewrite the optimization problem in (4.28) as
$$ \operatorname*{maximize}_{\boldsymbol{\lambda}} \mathrm{Re}\{\langle \boldsymbol{\lambda}, \mathbf{y} \rangle\} \quad \text{subject to} \quad \|\boldsymbol{\mu}(\nu)\|_2 \le 1, \quad (4.31) $$
where $\boldsymbol{\mu}(\nu) = \Pi^*(\boldsymbol{\lambda})\mathbf{d}(\nu)$, or
$$ \boldsymbol{\mu}(\nu) = \sum_{n=m_0}^{n_T+m_0-1} \lambda(n-m_0+1)\, e^{j2\pi n\nu}\, \mathbf{x}_n^H. \quad (4.32) $$
Similarly, we have $\langle \boldsymbol{\lambda}, \mathbf{y} \rangle = \langle \Pi^*(\boldsymbol{\lambda}), \mathbf{H} \rangle = \sum_k \bar{\eta}_k^* \mathbf{l}_k^H \boldsymbol{\mu}(\nu_k)$. If we additionally assume that
$$ \boldsymbol{\mu}(\nu_k) = \mathrm{sign}(\bar{\eta}_k)\,\mathbf{l}_k, \quad \text{(C-2)} $$
for $k \in \{1, \cdots, p_0\}$, then we have $\langle \boldsymbol{\lambda}, \mathbf{y} \rangle = \sum_k |\bar{\eta}_k| \ge \|\mathbf{H}\|_{\mathcal{A}}$. Moreover, by the Hölder inequality,
$$ \langle \boldsymbol{\lambda}, \mathbf{y} \rangle \le \|\Pi^*(\boldsymbol{\lambda})\|_{\mathcal{A}}^*\, \|\mathbf{H}\|_{\mathcal{A}} \le \|\mathbf{H}\|_{\mathcal{A}}. $$
Therefore, if condition (C-2) holds, then $\langle \boldsymbol{\lambda}, \mathbf{y} \rangle = \|\mathbf{H}\|_{\mathcal{A}}$. In other words, under conditions (C-1) and (C-2) the primal (Equation (4.26)) and dual (Equation (4.28)) optimization problems have zero duality gap; thus $\mathbf{H}_l$ and $\boldsymbol{\lambda}$ are optimal solutions of the primal and dual problems, respectively. Furthermore, by contradiction, condition (C-2) ensures the uniqueness of the optimal solution. Suppose $\hat{\mathbf{H}} = \sum_k \hat{\eta}_k \hat{\mathbf{l}}_k \mathbf{d}(\hat{\nu}_k)^H$ is another optimal solution. Since $\hat{\mathbf{H}}$ and $\mathbf{H}_l$ are different, there are some $\hat{\nu}_k$ that are not in the support of $\mathbf{H}_l$. Define $\mathcal{T}_\nu = \{\nu_1, \nu_2, \cdots, \nu_{p_0}\}$ as the Doppler-shift support of $\mathbf{H}_l$. Then we have
$$ \langle \boldsymbol{\lambda}, \mathbf{y} \rangle = \left\langle \Pi^*(\boldsymbol{\lambda}), \hat{\mathbf{H}} \right\rangle = \sum_{k \in \mathcal{T}_\nu} \bar{\eta}_k^* \mathbf{l}_k^H \boldsymbol{\mu}(\nu_k) + \sum_{k \notin \mathcal{T}_\nu} \hat{\bar{\eta}}_k^* \hat{\mathbf{l}}_k^H \boldsymbol{\mu}(\hat{\nu}_k) < \sum_{k \in \mathcal{T}_\nu} |\bar{\eta}_k^*| + \sum_{k \notin \mathcal{T}_\nu} |\hat{\bar{\eta}}_k^*|, $$
which contradicts the optimality of $\hat{\mathbf{H}}$. Thus, if we can guarantee the existence of a dual polynomial $\boldsymbol{\mu}(\nu) = \Pi^*(\boldsymbol{\lambda})\mathbf{d}(\nu)$ with the two key properties (C-1) and (C-2), then the optimization problem in Equation (4.26) finds the optimal solution of the channel estimation problem. In Theorem 4, we show that a proper dual polynomial $\boldsymbol{\mu}(\nu)$ exists under two main conditions: 1) a minimum Doppler separation and 2) a sufficient number of measurements.

Theorem 4. Suppose $n_T \ge 64$ and the training sequence $\{x[n] \mid m_0 \le n \le n_T+m_0-1\}$ is generated by an i.i.d. random source with Rademacher distribution (the Rademacher distribution is a discrete probability distribution where $x_i[j] = \pm 1$ with probability $1/2$). Assume that $\min_{1 \le i < j \le p_0} |\nu_i - \nu_j| \ge \frac{4}{n_T}$. Then there exists a constant $c$ such that, for
$$ n_T \ge c\, p_0 m_0 \log^3\!\left( \frac{n_T p_0 m_0}{\delta} \right), \quad (4.33) $$
the proposed optimization problem in (4.26) recovers $\mathbf{H}_l$ with probability at least $1-\delta$.

The proof is given in Section 4.7. We observe that if (C-1) and (C-2) are satisfied, the optimization problems in (4.26) and (4.31) both recover the optimal solution of our channel estimation problem.

Channel Estimation Algorithm

After we compute the dual parameters $\boldsymbol{\lambda}$ by solving the optimization problem in (4.26), we can construct the function $\boldsymbol{\mu}(\nu)$ in (4.32).
Then, we can use it to estimate the Doppler parameters by enforcing condition (C-2), since we know that $\|\boldsymbol{\mu}(\nu_k)\|_2 = 1$ for $k \in \{1, \cdots, p_0\}$. Toward this goal, we need to find the roots of the polynomial
$$ Q(\nu) = 1 - \|\boldsymbol{\mu}(\nu)\|_2^2 = 1 - \boldsymbol{\mu}(\nu)^H\boldsymbol{\mu}(\nu), \quad (4.34) $$
which are equal to $\{\nu_k\}_{k=1}^{p_0}$. After estimating $\{\nu_k\}_{k=1}^{p_0}$, we can substitute them into (4.8) to obtain a linear system of equations for $\{\bar{\eta}_k\mathbf{l}_k\}_{k=1}^{p_0}$. Note that we do not need to evaluate $\bar{\eta}_k$ and $\mathbf{l}_k$ separately in order to equalize the channel distortion. As seen in (4.8), to construct an equalizer we only require $\bar{\eta}_k\mathbf{l}_k$ for $1 \le k \le p_0$, because using (4.4) we can rewrite the channel coefficients as
$$ h_l[n,m] = \sum_{k=1}^{p_0} \bar{\eta}_k \mathbf{l}_k[m]\, e^{j2\pi\bar{\nu}_k n}, \quad (4.35) $$
and to design a channel equalizer, one only needs to evaluate the coefficients in (4.35).

4.5 Numerical Simulations

In this section, we perform several numerical experiments to validate the performance of the proposed channel estimation algorithms. Furthermore, we compare the performance of our proposed algorithms with a sparsity-based method; see, e.g., [4] and the references therein.

[Figure 4.1: Normalized mean-squared-error (NMSE) of estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side - leaked channel matrix estimation.]

4.5.1 Parameters Setting

We construct a narrowband time-varying channel based on the model given in (4.2). We first generate the channel delay, Doppler, and attenuation parameters randomly. In our experiments, the (normalized) delay and Doppler parameters are generated via uniform random variables and the channel attenuation parameters are generated using a Rayleigh random variable, unless otherwise stated. The transmit training signal $\mathbf{x} = [x[1], x[2], \cdots, x[n_T+n_0-1]]^T$ is generated by random BPSK modulation, i.e., $\{-1,+1\}$ with equal probability. Moreover, the transmit and receive pulse shapes are Gaussian pulses with 50% window support.

4.5.2 Performance Comparison

In the first numerical simulation, we compare the normalized mean-squared-error (NMSE) of the different estimation algorithms, where
$$ \mathrm{NMSE} = \frac{\|\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}\|}{\|\boldsymbol{\theta}\|}, \quad (4.36) $$
with $\boldsymbol{\theta}$ the true value of the target parameter and $\hat{\boldsymbol{\theta}}$ its estimate. Fig. 4.1 depicts the NMSE for estimation of the leaked channel matrix $\mathbf{H}_l$ using the proposed convex and non-convex approaches and the sparsity-based method, with $n_T = 64$ training-signal measurements. From Fig. 4.1, we observe that both the convex and non-convex approaches perform better than the sparsity-based method, which uses the $\ell_1$-norm to promote sparsity of the channel coefficients. The main reason is that our approaches exploit the leakage structures discussed in Sections 4.3 and 4.4, while the sparsity-based method ignores the leakage effect and assumes the channel coefficients are sparse, which results in its poor performance. We ran our non-convex approach from different random initializations; the blue and red curves in Fig. 4.1 illustrate the best and worst performance in our simulations, respectively. In Fig. 4.1, we observe that the non-convex approach may find a local minimum (red curve) due to a bad initialization. Finding a good strategy for properly initializing the non-convex approach is an interesting research problem and is left for future research.
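For concreteness, the following is a minimal grid-based sketch of the Doppler-extraction step described above: it evaluates $\boldsymbol{\mu}(\nu)$ from (4.32) on a fine grid and locates where $\|\boldsymbol{\mu}(\nu)\|_2$ reaches $1$, i.e., the roots of $Q(\nu)$ in (4.34). A polynomial root-finder can replace the grid; the names and tolerances here are illustrative.

```python
import numpy as np

def mu_of(nu, lam, X, m0):
    # X[n] holds x_n for n = m0, ..., nT + m0 - 1 (rows); lam has length nT
    n = np.arange(m0, m0 + len(lam))
    phases = np.exp(2j * np.pi * n * nu)
    return (lam * phases) @ np.conj(X)          # mu(nu) in C^{m0}

def doppler_shifts(lam, X, m0, p0, grid=4096, tol=1e-3):
    nus = np.linspace(-0.5, 0.5, grid, endpoint=False)
    norms = np.array([np.linalg.norm(mu_of(nu, lam, X, m0)) for nu in nus])
    # Q(nu) = 1 - ||mu(nu)||^2 vanishes at the true Doppler shifts; pick the
    # p0 largest local maxima of ||mu|| that are within tol of 1 (nu is periodic,
    # so the comparison indices wrap around)
    peaks = [i for i in range(grid)
             if norms[i] >= norms[i - 1] and norms[i] >= norms[(i + 1) % grid]]
    peaks = sorted(peaks, key=lambda i: -norms[i])[:p0]
    return np.sort(nus[[i for i in peaks if 1 - norms[i] ** 2 < tol]])
```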
4.5.3 Doppler Estimation and Resolution Constraint

As discussed in Section 4.4.2, in our proposed convex approach we compute the Doppler-shift parameters $\nu_k$, $1 \le k \le p_0$, by finding the roots of the polynomial $Q(\nu)$ defined in Equation (4.34); we then use the estimated Doppler shifts to estimate the leaked channel gains via Equation (4.8). Fig. 4.2 shows the NMSE of our proposed convex algorithm for estimating the Doppler-shift parameters. In this figure, for SNR $\ge 5$ dB, we see that for all values of $n_T$ the proposed algorithm estimates the Doppler parameters with at most $0.01$ (normalized) error. Furthermore, the accuracy of the Doppler-shift estimates improves as the number of measurements increases. Note that, based on the results in Fig. 4.2, our convex approach performs quite well with a number of measurements that is only a constant factor of $m_0 p_0$.

[Figure 4.2: Normalized mean-squared-error (NMSE) of estimation strategies vs. signal-to-noise ratio (SNR) at the receiver side - channel Doppler shifts estimation.]

Figs. 4.3 and 4.4 illustrate the Doppler-shift recovery using the (dual) function $\boldsymbol{\mu}(\nu)$. The channel Doppler-shift parameters used to generate these figures were $\nu_k \in \{-0.4, -0.2, 0, 0.2, 0.3\}$. It is clear that, as $n_T$ increases from 100 to 200 measurements, the spurious peaks of the curve in Fig. 4.4 become much smaller than the peaks at the locations of the true Doppler shifts. In other words, by increasing the number of measurements $n_T$, the resolution constraint in Theorem 4 becomes smaller and we are able to estimate the Doppler shifts with higher accuracy.

[Figure 4.3: Doppler shifts are the roots of $Q(\nu)$; the figure depicts $1-Q(\nu)$ vs. $\nu$ for $n_T = 100$, $p_0 = 5$, and $m_0 = 10$.]

[Figure 4.4: Doppler shifts are the roots of $Q(\nu)$; the figure depicts $1-Q(\nu)$ vs. $\nu$ for $n_T = 200$, $p_0 = 5$, and $m_0 = 10$.]

4.5.4 Bit Error Performance Comparison

In this part, we compare the performance of our convex method and the $\ell_1$-norm-based sparse approximation approach for data recovery during the data-transmission phase. For this comparison, we build a minimum mean square error (MMSE) equalizer from the channel matrix estimated by each of the two algorithms and use it to equalize the channel distortion. We consider $m_0 = 10$, $p_0 = 5$, and $n_T = 150$ during the channel estimation phase. Then, we generate $n = 1000$ random bits, modulated by BPSK signaling, to compute the bit-error-rate (BER) during data transmission. Fig. 4.5 shows the BER vs. SNR performance of these algorithms. The label "Convex - not satisfied" indicates that the resolution constraint is not enforced in the channel realization, while "Convex - satisfied" indicates that the Doppler shifts are well-separated as specified in Theorem 4. The performance of the sparse approximation method does not depend on the Doppler-shift separation constraint, which is why only one curve for this method is shown in Fig. 4.5.

[Figure 4.5: BER vs. SNR performance comparison (curves: Sparse Appr. ($\ell_1$-norm), Convex - not satisfied, Convex - satisfied).]
From the results in Fig. 4.5, we see that the quality of the convex method at high SNR can (slightly) improve when the Doppler shifts in the channel realization are well-separated.

4.6 Conclusions

In this work, we have proposed new methods to estimate time-varying narrowband channels by promoting key received-signal structures. We showed that the received-signal measurements in the time domain follow a parametric low-rank structure, due to the pulse-shape leakage and the channel model. We proposed non-convex and convex optimization problems to estimate the channel by promoting the low-rank bilinear structure. The proposed non-convex approach is sensitive to initialization; therefore, using the atomic norm heuristic, we developed a convex method that avoids this problem. For the convex approach, we characterized the optimality and uniqueness conditions, along with a theoretical guarantee for the noiseless channel estimation problem with a small number of measurements. Numerical results showed that the proposed convex algorithm provides, on average, a 5-8 dB performance improvement (in the SNR sense) for SNR > 5 dB compared to the $\ell_1$-based sparse approximation method.

4.7 Appendix A: Proof of Theorem 4

According to our analysis in Section 4.4.2, if we can design a vector-valued function (polynomial) $\boldsymbol{\mu}(\nu)$ that satisfies conditions (C-1) and (C-2), then the optimization problem in (4.26) recovers the optimal solution of our channel estimation problem. In this proof, we use a technique called dual certifier construction [23, 94, 28]. Based on this technique, we construct a randomized dual polynomial $\boldsymbol{\mu}(\nu)$, i.e., the dual certifier, with the help of the Fejér kernel [23], and show that, given enough measurements, the constructed dual certifier satisfies both conditions (C-1) and (C-2) with high probability. Consider the squared Fejér kernel [23]
$$ f_r(\nu) = \frac{1}{n_T} \sum_{n=m_0}^{n_T+m_0-1} f_n e^{-j2\pi n\nu}, \quad (4.37) $$
where $f_n = \frac{1}{n_T} \sum_{k=n-n_T}^{n_T} \left(1 - \frac{|k|}{n_T}\right)\left(1 - \frac{|n-k|}{n_T}\right)$. Define the randomized matrix-valued version of the squared Fejér kernel as [23, 94, 28]
$$ \mathbf{F}_r(\nu) = \frac{1}{n_T} \sum_{n=m_0}^{n_T+m_0-1} f_n e^{-j2\pi n\nu}\, \mathbf{x}_n\mathbf{x}_n^H. \quad (4.38) $$
Since the training signal is generated by an i.i.d. random source with Rademacher distribution, it is clear that $\mathbb{E}\{\mathbf{F}_r(\nu)\} = f_r(\nu)\mathbf{I}_{m_0}$ and, for its derivative, $\mathbb{E}\{\mathbf{F}'_r(\nu)\} = f'_r(\nu)\mathbf{I}_{m_0}$, where $\mathbf{I}_{m_0}$ denotes the $m_0 \times m_0$ identity matrix. Now, we define a candidate vector-valued dual certifier polynomial $\boldsymbol{\mu}(\nu)$ as
$$ \boldsymbol{\mu}(\nu) = \sum_{k=1}^{p_0} \mathbf{F}_r(\nu-\nu_k)\boldsymbol{\alpha}_k + \mathbf{F}'_r(\nu-\nu_k)\boldsymbol{\beta}_k, \quad (4.39) $$
where $\boldsymbol{\alpha}_k = [\alpha_{k,1}, \cdots, \alpha_{k,m_0}]^T$ and $\boldsymbol{\beta}_k = [\beta_{k,1}, \cdots, \beta_{k,m_0}]^T$ are constant coefficient vectors. Clearly, the candidate $\boldsymbol{\mu}(\nu)$ defined in (4.39) has the valid form of $\boldsymbol{\mu}(\nu)$ given in (4.32). The coefficients $\boldsymbol{\alpha}_k$ and $\boldsymbol{\beta}_k$ are selected such that the candidate $\boldsymbol{\mu}(\nu)$ satisfies condition (C-2) and part of (C-1), namely
$$ \boldsymbol{\mu}(\nu_k) = \mathrm{sign}(\bar{\eta}_k)\mathbf{l}_k, \quad (4.40) $$
$$ \boldsymbol{\mu}'(\nu_k) = 0, \text{ i.e., the maximum occurs at } \nu_k. \quad (4.41) $$
We can summarize the above equations as
$$ \boldsymbol{\Gamma} \left[ \boldsymbol{\alpha}_1^T, \cdots, \boldsymbol{\alpha}_{p_0}^T,\, \gamma\boldsymbol{\beta}_1^T, \cdots, \gamma\boldsymbol{\beta}_{p_0}^T \right]^T = \mathbf{l}, \quad (4.42) $$
where $\gamma = \sqrt{|f''_r(0)|}$, $\mathbf{l} = \left[ \mathrm{sign}(\bar{\eta}_1)\mathbf{l}_1^T, \cdots, \mathrm{sign}(\bar{\eta}_{p_0})\mathbf{l}_{p_0}^T,\, \mathbf{0}^T, \cdots, \mathbf{0}^T \right]^T$, and the matrix $\boldsymbol{\Gamma}$ can be written as
$$ \boldsymbol{\Gamma} = \frac{1}{n_T} \sum_{n=m_0}^{n_T+m_0-1} (\boldsymbol{\nu}_n\boldsymbol{\nu}_n^H) \otimes (\mathbf{x}_n\mathbf{x}_n^H)\, f_n, \quad (4.43) $$
where $\otimes$ is the Kronecker product and $\boldsymbol{\nu}_n = \left[ e^{-j2\pi n\nu_1}, \cdots, e^{-j2\pi n\nu_{p_0}},\, \frac{j2\pi n}{\gamma} e^{-j2\pi n\nu_1}, \cdots, \frac{j2\pi n}{\gamma} e^{-j2\pi n\nu_{p_0}} \right]^H$.
Thus, if we show that $\boldsymbol{\Gamma}$ is invertible, then we can easily evaluate $\boldsymbol{\alpha}_k$ and $\boldsymbol{\beta}_k$ from the system of equations in (4.42), and accordingly $\boldsymbol{\mu}(\nu)$ will satisfy both (4.40) and (4.41). Lemma 8 shows that, for a sufficient number of measurements and well-separated Doppler-shift parameters, the matrix $\boldsymbol{\Gamma}$ is invertible with high probability.

Lemma 8 (see Proposition 16 in [94] and Lemma 2.2 in [23]). Define the event $E_\varepsilon = \{\|\boldsymbol{\Gamma} - \mathbb{E}\{\boldsymbol{\Gamma}\}\| \le \varepsilon\}$ for the generated random i.i.d. sequence $\mathbf{x}_n$ with Rademacher distribution. Then:

1. Let $0 < \delta < 1$ and $|\nu_i - \nu_j| \ge \frac{1}{n_T}$ for all $i \ne j$. Then, for any $\varepsilon \in (0, 0.5]$, as long as $n_T \ge \frac{80 m_0 p_0}{\varepsilon^2}\log\frac{4m_0 p_0}{\delta}$, the event $E_\varepsilon$ occurs with probability at least $1-\delta$.
2. Define $\bar{\boldsymbol{\Gamma}} = \mathbb{E}\{\boldsymbol{\Gamma}\}$. Let $|\nu_i - \nu_j| \ge \frac{1}{n_T}$ for all $i \ne j$. Then $\bar{\boldsymbol{\Gamma}}$ is invertible.
3. Given that $E_\varepsilon$ holds for an $\varepsilon \in (0, 0.25]$, we have $\|\boldsymbol{\Gamma}^{-1} - \bar{\boldsymbol{\Gamma}}^{-1}\| \le 2\varepsilon\|\bar{\boldsymbol{\Gamma}}^{-1}\|$ and $\|\boldsymbol{\Gamma}^{-1}\| \le 2\|\bar{\boldsymbol{\Gamma}}^{-1}\|$.

Thus, the construction of $\boldsymbol{\mu}(\nu)$ in (4.39) ensures condition (C-2) and $\boldsymbol{\mu}'(\nu_k) = 0$ for all $k$. To complete the proof, we need to show that $\|\boldsymbol{\mu}(\nu)\|_2 < 1$ for all $\nu \in [-0.5, 0.5] \setminus \mathcal{T}_\nu$, to guarantee condition (C-1). We show that this condition is satisfied by the proposed $\boldsymbol{\mu}(\nu)$ in Lemma 10 and Lemma 11. Before stating these lemmas, let us define some notation and state Lemma 9, which we use in the proofs of Lemmas 10 and 11. Let $\boldsymbol{\Gamma}^{-1} = [\mathbf{L}\; \mathbf{R}]$, where $\mathbf{L} \in \mathbb{R}^{2m_0p_0 \times m_0p_0}$ and $\mathbf{R} \in \mathbb{R}^{2m_0p_0 \times m_0p_0}$; then, using (4.42), we have
$$ \left[ \boldsymbol{\alpha}_1^T, \cdots, \boldsymbol{\alpha}_{p_0}^T,\, \gamma\boldsymbol{\beta}_1^T, \cdots, \gamma\boldsymbol{\beta}_{p_0}^T \right]^T = \mathbf{L}\mathbf{l}. $$
If we multiply both sides of the above equation by
$$ \boldsymbol{\Omega}^{(m)}(\nu) = \frac{1}{\gamma^m} \left[ \mathbf{F}_r^{(m)}(\nu-\nu_1), \cdots, \mathbf{F}_r^{(m)}(\nu-\nu_{p_0}),\, \frac{1}{\gamma}\mathbf{F}_r^{(m+1)}(\nu-\nu_1), \cdots, \frac{1}{\gamma}\mathbf{F}_r^{(m+1)}(\nu-\nu_{p_0}) \right]^H, \quad (4.44) $$
where $\boldsymbol{\Omega}^{(m)}(\nu)$ denotes the $m$-th order derivative of the function $\boldsymbol{\Omega}(\nu)$ for $m = 0, 1, 2, \cdots$, then we can express the $m$-th order entry-wise derivative of $\boldsymbol{\mu}(\nu)$ as
$$ \frac{1}{\gamma^m}\boldsymbol{\mu}^{(m)}(\nu) = \boldsymbol{\Omega}^{(m)}(\nu)^H \mathbf{L}\mathbf{l}. \quad (4.45) $$
Similarly, if we define $\bar{\boldsymbol{\mu}}(\nu) = \mathbb{E}\{\boldsymbol{\mu}(\nu)\}$ and $\bar{\boldsymbol{\Gamma}}^{-1} = [\bar{\mathbf{L}}\; \bar{\mathbf{R}}]$, then we have
$$ \frac{1}{\gamma^m}\bar{\boldsymbol{\mu}}^{(m)}(\nu) = \left[ \mathbb{E}\left\{ \boldsymbol{\Omega}^{(m)}(\nu) \right\} \right]^H (\bar{\mathbf{L}} \otimes \mathbf{I}_{m_0})\mathbf{l}. \quad (4.46) $$
Furthermore, we can write
$$ \boldsymbol{\Omega}^{(m)}(\nu) = \frac{1}{n_T} \sum_{n=m_0}^{n_T+m_0-1} \left( \frac{j2\pi n}{\gamma} \right)^m f_n e^{j2\pi n\nu}\, \boldsymbol{\nu}_n \otimes \mathbf{x}_n\mathbf{x}_n^H, \quad (4.47) $$
and $\mathbb{E}\{\boldsymbol{\Omega}^{(m)}(\nu)\} = \boldsymbol{\omega}^{(m)}(\nu) \otimes \mathbf{I}$, where
$$ \boldsymbol{\omega}^{(m)}(\nu) = \frac{1}{\gamma^m} \left[ f_r^{(m)}(\nu-\nu_1), \cdots, f_r^{(m)}(\nu-\nu_{p_0}),\, \frac{1}{\gamma}f_r^{(m+1)}(\nu-\nu_1), \cdots, \frac{1}{\gamma}f_r^{(m+1)}(\nu-\nu_{p_0}) \right]^H = \frac{1}{n_T} \sum_{n=m_0}^{n_T+m_0-1} \left( \frac{j2\pi n}{\gamma} \right)^m f_n e^{j2\pi n\nu}\, (\boldsymbol{\nu}_n \otimes \mathbf{I}). \quad (4.48) $$
Using these relationships, we can invoke Lemma 9 to show that $\boldsymbol{\mu}^{(m)}(\nu)$ is concentrated around $\bar{\boldsymbol{\mu}}^{(m)}(\nu)$ with high probability.

Lemma 9 (see the proof of Theorem 3 in [46]). Suppose $|\nu_i - \nu_j| \ge \frac{1}{n_T}$ for all $i \ne j$ and let $\delta \in (0,1)$. Then, for $m = 0, 1, 2, \cdots$, we have
$$ \frac{1}{\gamma^m} \left\| \boldsymbol{\mu}^{(m)}(\nu) - \bar{\boldsymbol{\mu}}^{(m)}(\nu) \right\|_2 \le \varepsilon \quad (4.49) $$
for $n_T \ge c\, \frac{m_0 p_0}{\varepsilon^2}\log^3\frac{n_T m_0 p_0}{\delta\varepsilon}$, where $c$ is a constant, with probability at least $1-\delta$.

Let us define $\mathcal{T}_\nu^{\mathrm{near}} = \cup_{k=1}^{p_0} [\nu_k - \nu_\varepsilon, \nu_k + \nu_\varepsilon]$ and $\mathcal{T}_\nu^{\mathrm{far}} = [-0.5, 0.5] \setminus \mathcal{T}_\nu^{\mathrm{near}}$, where $\nu_\varepsilon = O\left(\frac{1}{n_T}\right)$, e.g., $\nu_\varepsilon = \frac{0.1}{n_T}$.

Lemma 10. Assume $|\nu_i - \nu_j| \ge \frac{1}{n_T}$ for all $i \ne j$ and let $\delta \in (0,1)$. Then $\|\boldsymbol{\mu}(\nu)\|_2 < 1$ for all $\nu \in \mathcal{T}_\nu^{\mathrm{far}}$, with probability at least $1-\delta$, for $n_T \ge c\, m_0 p_0 \log^3\frac{n_T m_0 p_0}{\delta}$.

Proof. We start with the triangle inequality:
$$ \|\boldsymbol{\mu}(\nu)\|_2 \le \|\boldsymbol{\mu}(\nu) - \bar{\boldsymbol{\mu}}(\nu)\|_2 + \|\bar{\boldsymbol{\mu}}(\nu)\|_2, \quad (4.50) $$
where $\bar{\boldsymbol{\mu}}(\nu) = \mathbb{E}\{\boldsymbol{\mu}(\nu)\}$. Since, from Lemma 9, we know that $\|\boldsymbol{\mu}(\nu) - \bar{\boldsymbol{\mu}}(\nu)\|_2$ approaches zero with high probability, to complete the proof we only need to show that $\|\bar{\boldsymbol{\mu}}(\nu)\|_2 < 1$ for $\nu \in \mathcal{T}_\nu^{\mathrm{far}}$.
From (4.46), we have
$$ \|\bar{\boldsymbol{\mu}}(\nu)\|_2 = \sup_{\mathbf{x}: \|\mathbf{x}\|_2 = 1} \mathbf{x}^H\bar{\boldsymbol{\mu}}(\nu) \quad (4.51) $$
$$ = \sup_{\mathbf{x}: \|\mathbf{x}\|_2 = 1} \mathbf{x}^H \left[ \mathbb{E}\{\boldsymbol{\Omega}(\nu)\} \right]^H (\bar{\mathbf{L}} \otimes \mathbf{I})\mathbf{l} \quad (4.52) $$
$$ = \sup_{\mathbf{x}: \|\mathbf{x}\|_2 = 1} \sum_{k=1}^{p_0} \left[ \boldsymbol{\omega}(\nu)^H\bar{\mathbf{L}} \right]_k \mathbf{x}^H\bar{\eta}_k\mathbf{l}_k < 0.99992. \quad (4.53) $$
The above inequality follows from the fact that $\mathbf{x}^H\bar{\eta}_k\mathbf{l}_k \le 1$ and from the proof of Lemma 2.4 in [23] for $\nu \in \mathcal{T}_\nu^{\mathrm{far}}$ (or see Lemma 10 in [28]).

Lemma 11. Assume $|\nu_i - \nu_j| \ge \frac{1}{n_T}$ for all $i \ne j$ and let $\delta \in (0,1)$. Then $\|\boldsymbol{\mu}(\nu)\|_2 < 1$ for all $\nu \in \mathcal{T}_\nu^{\mathrm{near}}$, with probability at least $1-\delta$, for $n_T \ge c\, m_0 p_0 \log^3\frac{n_T m_0 p_0}{\delta}$.

Proof. Our choice of the coefficients implies that
$$ \frac{d\|\boldsymbol{\mu}(\nu)\|_2^2}{d\nu}\bigg|_{\nu=\nu_k} = \mathrm{Re}\left\{ 2\boldsymbol{\mu}(\nu)^H\frac{d\boldsymbol{\mu}(\nu)}{d\nu} \right\}\bigg|_{\nu=\nu_k} = 0. $$
Thus, for $\nu \in \mathcal{T}_\nu^{\mathrm{near}} = [\nu_k - \nu_\varepsilon, \nu_k + \nu_\varepsilon]$, to prove the claim of the lemma it is sufficient to show that $\frac{d^2\|\boldsymbol{\mu}(\nu)\|_2^2}{d\nu^2} < 0$. Note that
$$ \frac{1}{2}\frac{d^2\|\boldsymbol{\mu}(\nu)\|_2^2}{d\nu^2} = \|\boldsymbol{\mu}'(\nu)\|_2^2 + \mathrm{Re}\left\{ \boldsymbol{\mu}''(\nu)^H\boldsymbol{\mu}(\nu) \right\}, \quad (4.54) $$
for $\nu \in \mathcal{T}_\nu^{\mathrm{near}}$. Using Lemma 9, we can write
$$ \frac{1}{\gamma^2}\|\boldsymbol{\mu}'(\nu)\|_2^2 = \left\| \frac{1}{\gamma}\left( \boldsymbol{\mu}'(\nu) - \bar{\boldsymbol{\mu}}'(\nu) + \bar{\boldsymbol{\mu}}'(\nu) \right) \right\|_2^2 \le \varepsilon^2 + \frac{2\varepsilon\|\bar{\boldsymbol{\mu}}'(\nu)\|_2}{\gamma} + \frac{\|\bar{\boldsymbol{\mu}}'(\nu)\|_2^2}{\gamma^2}. \quad (4.55) $$
Similar to the calculations in Lemmas 2.3 and 2.4 in [23], we have $\|\bar{\boldsymbol{\mu}}'(\nu)\|_2 \le 1.6 n_T$ and $\gamma > \frac{\pi n_T}{\sqrt{3}}$ for $n_T \ge 2$. Therefore, we have
$$ \frac{1}{\gamma^2}\|\boldsymbol{\mu}'(\nu)\|_2^2 \le \varepsilon^2 + 1.75\varepsilon + \frac{\|\bar{\boldsymbol{\mu}}'(\nu)\|_2^2}{\gamma^2}. \quad (4.56) $$
Similarly, we have $\|\bar{\boldsymbol{\mu}}(\nu)\|_2 \le 1$ and $\|\bar{\boldsymbol{\mu}}''(\nu)\|_2 \le 21.15 n_T^2$ for $\nu \in \mathcal{T}_\nu^{\mathrm{near}}$; thus
$$ \frac{1}{\gamma^2}\mathrm{Re}\left\{ \boldsymbol{\mu}''(\nu)^H\boldsymbol{\mu}(\nu) \right\} = \frac{1}{\gamma^2}\mathrm{Re}\left\{ \left( \boldsymbol{\mu}''(\nu) - \bar{\boldsymbol{\mu}}''(\nu) + \bar{\boldsymbol{\mu}}''(\nu) \right)^H \left( \boldsymbol{\mu}(\nu) - \bar{\boldsymbol{\mu}}(\nu) + \bar{\boldsymbol{\mu}}(\nu) \right) \right\} \le \varepsilon^2 + 4.25\varepsilon + \frac{\mathrm{Re}\left\{ \bar{\boldsymbol{\mu}}''(\nu)^H\bar{\boldsymbol{\mu}}(\nu) \right\}}{\gamma^2}. \quad (4.57) $$
Therefore, substituting (4.56) and (4.57) into the identity (4.54), we have
$$ \frac{1}{2\gamma^2}\frac{d^2\|\boldsymbol{\mu}(\nu)\|_2^2}{d\nu^2} < 2\varepsilon^2 + 6\varepsilon + \frac{1}{\gamma^2}\left( \|\bar{\boldsymbol{\mu}}'(\nu)\|_2^2 + \mathrm{Re}\left\{ \bar{\boldsymbol{\mu}}''(\nu)^H\bar{\boldsymbol{\mu}}(\nu) \right\} \right). \quad (4.58) $$
Similar to the argument in Lemma 2.3 in [23] (see equation 2.19 and afterward), we can conclude that $\frac{1}{\gamma^2}\left( \|\bar{\boldsymbol{\mu}}'(\nu)\|_2^2 + \mathrm{Re}\left\{ \bar{\boldsymbol{\mu}}''(\nu)^H\bar{\boldsymbol{\mu}}(\nu) \right\} \right) \le -0.029$. Therefore, we have
$$ \frac{1}{2\gamma^2}\frac{d^2\|\boldsymbol{\mu}(\nu)\|_2^2}{d\nu^2} \le 2\varepsilon^2 + 6\varepsilon - 0.029 < 0, $$
for $\varepsilon$ small enough, e.g., $\varepsilon \le 10^{-5}$. This completes the proof.

Putting the results of Lemma 10 and Lemma 11 together, Theorem 4 is proved, since $\boldsymbol{\mu}(\nu)$ is verified to satisfy both conditions (C-1) and (C-2) with high probability given a sufficient number of measurements.

Chapter 5
Compression-based Compressed Sensing

The main problem of compressed sensing is to recover an unknown target signal $\mathbf{x} \in \mathbb{R}^n$ from undersampled linear measurements $\mathbf{y} \in \mathbb{R}^m$,
$$ \mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}, \quad (5.1) $$
where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{z} \in \mathbb{R}^m$ denote the measurement matrix and the measurement noise, respectively. Since $m < n$, $\mathbf{x}$ cannot be recovered accurately unless we leverage some prior information about the signal $\mathbf{x}$. Such information can be expressed mathematically by assuming that $\mathbf{x} \in \mathcal{S}$, where $\mathcal{S} \subset \mathbb{R}^n$ is a known set. Intuitively, it is expected that the "smaller" the set $\mathcal{S}$, the fewer measurements $m$ are required for recovering $\mathbf{x}$. In other words, having more side information about the target signal $\mathbf{x}$ limits the feasible set in a way that enables the recovery algorithm to identify a high-quality estimate of $\mathbf{x}$ using fewer measurements. In the last decade, researchers have explored several different instances of the set $\mathcal{S}$, such as the class of sparse signals or low-rank matrices [34, 21, 20, 11, 12]. Despite the mathematical elegance of these research forays, these methods do not always provide good performance in several key application areas. A major barrier is that real-world signals often exhibit far more complex structures than, for example, sparsity or low-rankness.
To combat such limitations, we propose an approach which enlarges the set of structures that can be considered in compressed sensing to those already employed by data compression codes. Our approach hinges upon the following simple assumption: for a given class of signals, there exists an efficient compression algorithm that is able to represent the signals in that class with a small number of bits per symbol. In many application areas, such as image and video processing, thanks to the extensive research performed in the past fifty years, such compression algorithms exist and are able to leverage the sophisticated structures shared by signals in a particular class. Compressed sensing recovery algorithms that take advantage of similarly complex structures have the potential to outperform current algorithms. This raises the following question, which we address in this paper: How can one effectively employ a compression algorithm in a compressed sensing recovery method?

In response to this question, we propose an iterative compression-based compressed sensing recovery algorithm, called compression-based gradient descent (C-GD). C-GD, with no extra effort, by employing state-of-the-art compression codes, enlarges the set of structures which can be exploited beyond the simple ones studied to date in the context of compressed sensing. Our simulation results confirm that C-GD combined with state-of-the-art image compression codes such as JPEG2000 outperforms state-of-the-art methods in compressive imaging. In addition to its remarkable performance in imaging applications, we provide theoretical analysis proving the linear convergence rate of C-GD and its robustness to additive noise and other system non-idealities.

Using compression codes for compressed sensing was first proposed in [51] and then further studied in [35]. These papers studied the problem from a theoretical standpoint for deterministic signal models and for stationary processes, respectively. In [51], inspired by the Occam's razor principle, the compressible signal pursuit (CSP) optimization was proposed. (Refer to Section 5.1.3 for a brief review of CSP.) The goal of the CSP optimization, as a compression-based recovery algorithm, is to find the signal that gives the lowest measurement error among all compressible signals. It can be proved that, in the noise-free setting, asymptotically, as $n$ goes to infinity and the per-symbol reconstruction distortion approaches zero, the CSP optimization achieves the optimal sampling rate in cases where such optimal rates are already known in the literature. In other words, for such cases, using appropriate compression codes, asymptotically, CSP is able to achieve almost lossless recovery at the minimum sampling rate required by a Bayesian recovery algorithm. While these results theoretically support the idea of designing compression-based compressed sensing algorithms, they do not provide a recipe for achieving this goal in practice. Solving the CSP optimization requires an exhaustive search over all compressible signals, i.e., all signals in the codebook, whose size is often prohibitively large: for the class of images that can be compressed in only 1000 bits, the size of the codebook is $2^{1000}$. Hence, CSP cannot be used in most real-world applications. The C-GD algorithm proposed herein, on the other hand, is inspired by projected gradient descent.
Each step of the algorithm involves two operations: i) moving in the direction of the gradient of $\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2$, and ii) projecting onto the set of compressible signals. Both steps can be performed efficiently.

Endeavors to go beyond sparsity and low-rankness in compressed sensing have been considered previously [25, 67, 5, 16]. Here, we summarize this prior art and provide comparisons to C-GD. To use the framework developed in [25] for a class of signals $\mathcal{S}$, one should construct a set of atoms such that every $\mathbf{x} \in \mathcal{S}$ can be represented using a few of those atoms. Finding non-standard atoms for real-world signals, such as images, is a sophisticated and challenging task, and hence this approach has not found any application in compressive imaging or video recording to date. The work of [67] is restricted to independently and identically distributed (i.i.d.) measurement matrices. We also examine i.i.d. measurement matrices in our theoretical analysis; however, our simulation results also consider the partial-Fourier matrices employed in radar applications, whereas D-AMP [67] fails to work with such matrices. In [5], it is assumed that many of the $\binom{n}{k}$ subspaces of $k$-sparse vectors do not belong to the signal space. As such, the methods of [5] still focus on simple structures and do not perform well in imaging applications, as demonstrated in [67]. Blumensath [16] studies a union-of-linear-subspaces signal model and proposes a generic algorithm based on projected gradient descent for signal recovery. While we also employ iterations of gradient descent, there are several major differences between our work and [16]: our model is built on compression codes, which do not form a union of subspaces, and hence the analysis of [16] cannot be applied to our algorithm.

Finally, there is significant prior art that has developed heuristic algorithms in an attempt to exploit complex structures of images. (See [33] and references therein.) In our simulations, we compare the performance of our algorithm with the state-of-the-art heuristic algorithm NLR-CS proposed in [33]. C-GD offers comparable performance; we emphasize that our method offers important advantages over heuristic approaches such as that in [33]: (i) it is general and can be applied to different applications with no extra effort for tailoring or specialization, and (ii) C-GD is theoretically analyzed and thus comes with performance guarantees, including analysis in the presence of noise.

The organization of this chapter is as follows. Section 5.1 reviews some background information. Section 5.2 presents the main contributions of this chapter, namely the C-GD algorithm and its theoretical analysis. Section 5.3 considers the application of the C-GD algorithm to some standard classes of signals. Section 5.4 presents our simulation results, which show that the C-GD algorithm achieves state-of-the-art performance in image compressed sensing. Section 5.6 provides the proofs of the main results in this chapter. Finally, Section 5.5 concludes this chapter.

Notation: Scalar values are denoted by lower-case letters such as $x$, and column vectors by bold letters such as $\mathbf{x}$. The $i$-th element of $\mathbf{x}$ is given by $x_i$. Given a vector $\mathbf{x} \in \mathbb{R}^n$, $\|\mathbf{x}\|_\infty = \max_i |x_i|$ and $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$ denote the $\ell_\infty$ norm and the $\ell_2$ norm of $\mathbf{x}$, respectively. A matrix is denoted by a bold capital letter such as $\mathbf{X}$, and its $(i,j)$-th element by $X_{i,j}$. The transpose of $\mathbf{X}$ is given by $\mathbf{X}^T$. The maximum singular value of $\mathbf{A}$ is given by $\sigma_{\max}(\mathbf{A})$. Calligraphic letters such as $\mathcal{D}$ and $\mathcal{C}$ denote sets.
The size of a set $\mathcal{C}$ is given by $|\mathcal{C}|$. The unit sphere in $\mathbb{R}^n$ is denoted by $S^{n-1}$, i.e., $S^{n-1} \triangleq \{\mathbf{u} \in \mathbb{R}^n \mid \|\mathbf{u}\|_2 = 1\}$. Throughout the paper, $\log(\cdot)$ and $\ln(\cdot)$ refer to the logarithm in base 2 and the natural logarithm, respectively. Finally, we use the $O$ and $\Omega$ notation, defined below, to describe the limiting behavior of certain quantities. That is, $f(n) = O(g(n))$ as $n \to \infty$ if and only if there exist $n_0$ and $c$ such that for any $n > n_0$, $|f(n)| \le c|g(n)|$; likewise, $f(n) = \Omega(g(n))$ as $n \to \infty$ if and only if there exist $n_0$ and $c$ such that for any $n > n_0$, $|f(n)| \ge c|g(n)|$.

5.1 Background

In this section, we first review the notation used throughout the paper and provide key definitions. Then we review the rate-distortion function of a compression code and formally define the objective of compression-based signal recovery. Finally, we review the CSP optimization, which is the first proposed method for employing compression algorithms in compressed sensing.

5.1.1 Definitions

In this chapter, the main results are proved for both Gaussian and sub-Gaussian measurement matrices. In the following, we briefly review the definitions of sub-Gaussian and sub-exponential random variables.

Definition 2 (Sub-Gaussian). A random variable $X$ is sub-Gaussian when
$$ \|X\|_{\psi_2} \triangleq \inf\left\{ L > 0 : \mathbb{E}\left[ \exp\!\left( \frac{|X|^2}{L^2} \right) \right] \le 2 \right\} < \infty. $$
Note that Gaussian random variables are also sub-Gaussian random variables; for $X \sim \mathcal{N}(0, \sigma^2)$, $\|X\|_{\psi_2} = \sqrt{\frac{8}{3}}\,\sigma$.

Definition 3 (Sub-exponential). A random variable $X$ is a sub-exponential random variable when
$$ \|X\|_{\psi_1} \triangleq \inf\left\{ L > 0 : \mathbb{E}\left[ e^{|X|/L} \right] \le 2 \right\} < \infty. \quad (5.2) $$

Using the above definitions, it is straightforward to show the following result.

Lemma 12. Let $X$ and $Y$ be sub-Gaussian random variables. Then $XY$ is sub-exponential. Moreover, $\|XY\|_{\psi_1} \le \|X\|_{\psi_2}\|Y\|_{\psi_2}$.

5.1.2 Compression-based compressed sensing problem statement

Consider a compact set $\mathcal{Q} \subset \mathbb{R}^n$. Throughout the paper, we focus on this deterministic signal model; however, all the results can be extended to the stochastic setting as well (see [83] for details). A rate-$r$ lossy compression code for the set $\mathcal{Q}$ is characterized by its encoding and decoding mappings $(f, g)$, where
$$ f : \mathcal{Q} \to \{1, 2, \ldots, 2^r\} \quad \text{and} \quad g : \{1, 2, \ldots, 2^r\} \to \mathcal{Q}. $$
The encoding and decoding mappings $(f, g)$ define a codebook $\mathcal{C}$, where $\mathcal{C} = \{g(f(\mathbf{x})) : \mathbf{x} \in \mathcal{Q}\}$; clearly, $|\mathcal{C}| \le 2^r$. The performance of a code defined by $(f, g)$ is characterized by i) its rate $r$ and ii) its maximum distortion, defined as
$$ \delta = \sup_{\mathbf{x} \in \mathcal{Q}} \|\mathbf{x} - g(f(\mathbf{x}))\|_2. $$
The problem of compression-based compressed sensing is stated as follows. Suppose that a family of compression codes $(f_r, g_r)$, parameterized by rate $r$, is given for $\mathcal{Q}$. For instance, the JPEG or JPEG2000 compression algorithms at different rates can be considered a family of compression algorithms for the class of natural images. The deterministic distortion-rate function of this family is given by
$$ \delta(r) = \sup_{\mathbf{x} \in \mathcal{Q}} \|\mathbf{x} - g_r(f_r(\mathbf{x}))\|_2. $$
Note that, for any reasonable code, $\delta(r)$ is expected to be a monotonic, non-increasing function of $r$. We also define the deterministic rate-distortion function of the compression code as
$$ r(\delta) = \inf\{r : \delta(r) \le \delta\}. $$

Remark 7. Typically, a family of compression codes is defined as a sequence of compression codes that are indexed by blocklength $n$ and operate either at a fixed rate or a fixed distortion. In this paper, we are more interested in the setting where the blocklength is fixed and the rate or the distortion changes. Therefore, we consider a family of compression codes with fixed blocklength $n$, indexed by rate $r$.
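As a toy illustration of the $(f_r, g_r)$ abstraction and the induced rate and distortion, consider a $b$-bit uniform scalar quantizer on $\mathcal{Q} = [0,1]^n$. This example is ours and is not one of the codes analyzed later in the chapter; it is only meant to make the encoder/decoder interface concrete.

```python
import numpy as np

def f_r(x, b):
    # encoder: map x in [0,1]^n to n integer indices of b bits each
    return np.clip(np.round(x * (2 ** b - 1)), 0, 2 ** b - 1).astype(int)

def g_r(idx, b):
    # decoder: map indices back to reconstruction levels in [0,1]
    return idx / (2 ** b - 1)

n, b = 16, 4
x = np.random.rand(n)
rate = n * b                                       # r = n*b bits in total
err = np.linalg.norm(x - g_r(f_r(x, b), b))        # distortion of this draw
# the code's delta is the supremum of this error over all x in Q
```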
Based on these definitions, we can formally state the objectives of this paper as follows. Consider the problem of compressed sensing of signals in a set $\mathcal{Q}$. Suppose that, instead of knowing the set $\mathcal{Q}$ explicitly, we have access to a family of compression algorithms for signals in $\mathcal{Q}$, with rate-distortion function $r(\delta)$. For $\mathbf{x} \in \mathcal{Q}$, our goal is not to compress it, but to recover it from its undersampled set of linear projections $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}$. The goal of compression-based compressed sensing is summarized in the following two questions:

Question 1: Can one employ a given compression code in an efficient (polynomial-time) signal recovery algorithm?

Question 2: Can we characterize the number of observations such an algorithm requires to accurately recover $\mathbf{x}$, in terms of the rate-distortion performance of the code?

Note that the algorithms developed in response to the above questions will automatically employ the structure captured by the compression algorithms, and hence can immediately enlarge the set of structures beyond those used in classical compressed sensing. For instance, an MPEG4-based compressed sensing recovery algorithm will use not only the intra-frame but also the inter-frame dependencies among different pixels.

5.1.3 Compressible signal pursuit

Consider the following simplified version of our main questions: Can one employ a compression code for signal recovery? Can we characterize the number of observations such an algorithm requires for accurate recovery of $\mathbf{x}$ in terms of the rate-distortion performance of the code? Note that the only simplification is that we have removed the constraint on the computational complexity of the recovery scheme. In response to this simplified question, [51] proposed the compressible signal pursuit (CSP) optimization, which estimates the signal $\mathbf{x}$ based on the measurements $\mathbf{y}$ as follows: among all the signals $\mathbf{u} \in \mathcal{Q}$ that satisfy the measurement constraint $\mathbf{y} = \mathbf{A}\mathbf{u}$, it searches for the one that can be compressed best by the compression code described by $(f_r, g_r)$. More formally, given a lossy compression code with codebook $\mathcal{C}_r = \{g_r(f_r(\mathbf{x})) : \mathbf{x} \in \mathcal{Q}\}$, the CSP optimization recovers $\mathbf{x}$ from its measurements $\mathbf{y} = \mathbf{A}\mathbf{x}$ as
$$ \hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{u} \in \mathcal{C}_r} \|\mathbf{y} - \mathbf{A}\mathbf{u}\|_2^2. \quad (5.3) $$
The performance of the CSP optimization is characterized in [51] and [35], under deterministic and stochastic signal models, respectively. Before stating the theoretical results, we should emphasize that, at this point, CSP is based on an exhaustive search over the codebook and hence is computationally infeasible. The following result from [51] characterizes the performance of the CSP optimization in the noiseless setting, where $\mathbf{z} = \mathbf{0}$.

Theorem 5 (Corollary 1 in [51]). Consider a family of compression codes $(f_r, g_r)$ for the set $\mathcal{Q}$ with corresponding codebook $\mathcal{C}_r$ and rate-distortion function $r(\delta)$. Let $\mathbf{A} \in \mathbb{R}^{m \times n}$, where the $A_{i,j}$ are i.i.d. $\mathcal{N}(0,1)$. For $\mathbf{x} \in \mathcal{Q}$ and $\mathbf{y} = \mathbf{A}\mathbf{x}$, let $\hat{\mathbf{x}}$ denote the solution of (5.3). Given $\nu > 0$ and $\zeta > 1$ such that $\frac{\zeta}{\log\frac{1}{e\delta}} < \nu$, let $m = \frac{\zeta r}{\log\frac{1}{e\delta}}$. Then,
$$ \mathbb{P}\left( \|\mathbf{x} - \hat{\mathbf{x}}\|_2 \ge \theta\,\delta^{1 - \frac{1+\nu}{\zeta}} \right) \le e^{-0.8m} + e^{-0.3\nu r}, $$
where $\theta = 2e^{-(1+\nu)/\zeta}$.

For a simpler interpretation of this result, we define the $\alpha$-dimension of a family of compression codes with rate-distortion function $r(\delta)$ as
$$ \alpha = \limsup_{\delta \to 0} \frac{r(\delta)}{\log(1/\delta)}. $$
Suppose that a small value of $\delta$ (or a large value of $r$) is used in CSP. Then, roughly speaking, Theorem 5 implies that CSP returns an almost accurate estimate of $\mathbf{x}$ as long as $m > \alpha$.
Note that $\alpha$ is usually much smaller than $n$, and hence the number of measurements CSP requires is much smaller than the ambient dimension of the signal. See Section 5.3 for some classical examples.

Remark 8. In this paper, we focus entirely on the deterministic setting. The performance of CSP in the stochastic setting is studied in [83]. Using the connection between the Rényi information dimension and the rate-distortion dimension of a random variable [58], it has been proved in [83, 53] that, for i.i.d. sources with a mixture of discrete and continuous distributions, CSP achieves the optimal sampling rate.

Remark 9. The robustness of CSP to deterministic and stochastic measurement noise has also been proved in [51]. However, for the sake of brevity we do not repeat those results here and only discuss the noiseless setting in Theorem 5.

Unfortunately, the positive theoretical properties of CSP are overshadowed by its prohibitive complexity. In the next section, we propose an efficient CS recovery algorithm that employs compression codes and compare its performance with that of CSP.

5.2 Our main contributions

5.2.1 Compression-based gradient descent (C-GD)

As discussed in the last section, CSP is based on an exhaustive search and is computationally infeasible for real-world signals. In response to this drawback of CSP, we propose a computationally efficient and theoretically analyzable approach to approximate the solution of CSP. Toward this goal, inspired by the projected gradient descent (PGD) algorithm [85], we propose the following iterative algorithm: start from some $\mathbf{x}^0 \in \mathbb{R}^n$, and for $k = 1, 2, \ldots$,
$$ \mathbf{x}^{k+1} \leftarrow \mathcal{P}_{\mathcal{C}_r}\!\left( \mathbf{x}^k + \eta_k \mathbf{A}^T\!\left( \mathbf{y} - \mathbf{A}\mathbf{x}^k \right) \right), \quad (5.4) $$
where
$$ \mathcal{P}_{\mathcal{C}_r}(\mathbf{z}) = \operatorname*{argmin}_{\mathbf{u} \in \mathcal{C}_r} \|\mathbf{u} - \mathbf{z}\|_2^2. \quad (5.5) $$
Here, the index $k$ denotes the iteration number and $\eta_k \in \mathbb{R}$ denotes the step size. We refer to this algorithm as compression-based gradient descent (C-GD). Each iteration of this algorithm involves performing two operations. In the first step, it moves in the direction of the negative gradient of $\|\mathbf{y} - \mathbf{A}\mathbf{x}\|_2^2$, to find solutions that are closer to the hyperplane $\mathbf{y} = \mathbf{A}\mathbf{u}$. The second step, i.e., the projection step, ensures that the estimate C-GD obtains belongs to the codebook. The first step of C-GD is straightforward and requires two matrix-vector multiplications. For the second step, ideally, applying the encoder and decoder to a signal $\mathbf{x}$ yields the closest codeword of the compression code. Hence, we make the following assumption about the compression code.

Assumption 1. In analyzing the performance of C-GD, we assume that the compression code $(f_r, g_r)$ satisfies
$$ \mathcal{P}_{\mathcal{C}_r}(\mathbf{x}) = g_r(f_r(\mathbf{x})). \quad (5.6) $$

Under Assumption 1, the projection step of C-GD can be implemented efficiently. In our numerical simulations, we find that this assumption holds well for most well-known compression algorithms. Furthermore, in Theorem 7, we analyze the effect of a compression algorithm's deviation from this assumption, i.e., an imperfect projection, on the convergence of the C-GD method. More precisely, under Assumption 1, the C-GD algorithm simplifies to
$$ \mathbf{x}^{k+1} \leftarrow g_r\!\left( f_r\!\left( \mathbf{x}^k + \eta_k \mathbf{A}^T\!\left( \mathbf{y} - \mathbf{A}\mathbf{x}^k \right) \right) \right). \quad (5.7) $$
Hence, each step of this algorithm requires two matrix-vector multiplications and one application of the encoder and decoder of the given compression code.
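A minimal sketch of the simplified iteration (5.7) is given below, with the fixed step size $\eta = \frac{1}{m\sigma_a^2}$ used later in Theorem 6. Here project_Cr is a stand-in for one encode/decode pass $g_r(f_r(\cdot))$ of whatever code is plugged in; the names are illustrative, and a practical implementation would use the adaptive step-size rule introduced in Section 5.4.1.

```python
import numpy as np

def c_gd(y, A, project_Cr, sigma_a2, iters=50):
    """Compression-based gradient descent, iteration (5.7)."""
    m, n = A.shape
    eta = 1.0 / (m * sigma_a2)        # step size from Theorem 6
    x = np.zeros(n)
    for _ in range(iters):
        # gradient step toward the hyperplane y = A u, then project onto
        # the codebook via one encode/decode pass of the compression code
        x = project_Cr(x + eta * (A.T @ (y - A @ x)))
    return x
```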
In the next section, we summarize our theoretical results regarding the performance of C-GD. Note that, for notational simplicity, we present all our results under Assumption 1. However, as will be discussed after Corollary 2, we can relax this assumption and still analyze the iterative algorithm proposed in (5.7).

5.2.2 Convergence Analysis of C-GD

The objective of this section is to theoretically analyze some of the properties of C-GD. As discussed before, the measurement vector is denoted by $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}$, where $\mathbf{x} \in \mathcal{Q}$, $\mathbf{A} \in \mathbb{R}^{m \times n}$, and $\mathbf{z}$ is the noise. Furthermore, we assume that a family of compression codes $(f_r, g_r)$, parameterized by the rate $r$, is known for $\mathcal{Q}$. Starting with $\mathbf{x}^0$, C-GD uses the iterations
$$ \mathbf{x}^{k+1} \leftarrow \mathcal{P}_{\mathcal{C}_r}\!\left( \mathbf{x}^k + \eta \mathbf{A}^T\!\left( \mathbf{y} - \mathbf{A}\mathbf{x}^k \right) \right) $$
to obtain a good estimate of $\mathbf{x}$. In our theoretical analysis of C-GD, we focus on popular measurement matrices in the compressed sensing area, namely dense i.i.d. Gaussian and sub-Gaussian matrices. In Section 5.4, we present simulation results that confirm the success of C-GD for partial-Fourier matrices as well; the theoretical study of this important class of matrices is left for future research. We first study the performance of C-GD for i.i.d. Gaussian measurement matrices, and then extend our results to i.i.d. sub-Gaussian measurement matrices.

Theorem 6. Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a random Gaussian measurement matrix with i.i.d. entries $A_{i,j} \sim \mathcal{N}(0, \sigma_a^2)$, and let $\mathbf{z} \in \mathbb{R}^m$ be an i.i.d. Gaussian noise vector with $z_i \sim \mathcal{N}(0, \sigma_z^2)$. Let $\eta = \frac{1}{m\sigma_a^2}$ and define $\tilde{\mathbf{x}} = \mathcal{P}_{\mathcal{C}_r}(\mathbf{x})$, where $\mathcal{P}_{\mathcal{C}_r}(\cdot)$ is defined in (5.6). If Assumption 1 holds, then, given $\epsilon > 0$, for $m \ge 80r(1+\epsilon)$, with probability larger than $1 - 2^{-2r+1}$, we have
$$ \|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 \le 0.9\|\mathbf{x}^k - \tilde{\mathbf{x}}\|_2 + 2\left( 2 + \sqrt{\frac{n}{m}} \right)^2\delta + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)r}{m}}, \quad (5.8) $$
for $k = 0, 1, 2, \ldots$

The proof of Theorem 6 is given in Section 5.6.1. A few important features of this theorem are discussed in the following remarks.

Remark 10. According to Theorem 6, C-GD requires $\Omega(r(\delta))$ measurements (for small values of $\delta$) to obtain an accurate estimate. Hence, according to this theorem, even in the noiseless setting we should not let $\delta \to 0$: otherwise, for many signal classes $r(\delta) \to \infty$, and C-GD would require more measurements than the ambient dimension of $\mathbf{x}$. As we will demonstrate in several examples in Section 5.3, one can set $\delta$ to a small dimension-dependent value, e.g., $\delta = 1/n$, to ensure that C-GD obtains a very good estimate from few observations. Section 5.3 studies how $\delta$ is set and connects Theorem 6 to some classical results in compressed sensing.

Remark 11. According to Theorem 5, CSP requires $\Omega\!\left( \frac{r(\delta)}{\log(1/\delta)} \right)$ measurements. This implies that, in the noiseless setting, for fixed $n$ and $m$, the estimate of CSP improves as $\delta$ decreases. However, this seems not to be the case for C-GD: according to Theorem 6, C-GD requires more than $\Omega(r(\delta))$ measurements, so as $\delta$ decreases, C-GD requires more measurements. This is not an issue in almost all applications of compressed sensing, where $n$ is very large and there is little difference between setting $\delta = 1/n$ and $\delta = 0$. However, from a theoretical perspective, it is interesting to discover whether this mismatch is an artifact of our proof technique or a fundamental loss incurred by the reduction in computational complexity. This question is left for future research.

We postpone the discussion of the relationship between the convergence rate and the reconstruction error to Section 5.3, where we discuss some classical examples and compare the conclusions of this theorem with some classical results in compressed sensing.
In the setup considered in Theorem 6, as $n$ increases, the per-measurement signal-to-noise ratio (SNR) approaches infinity. Note that by scaling the measurement matrix, we can also obtain results for fixed SNR. The next corollary clarifies our claim.

Corollary 2. Consider the setup of Theorem 6, where now $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a random Gaussian measurement matrix with i.i.d. entries $A_{i,j} \sim \mathcal{N}\!\left(0, \frac{\sigma_a^2}{n}\right)$ and $\eta = \frac{n}{\sigma_a^2 m}$. If Assumption 1 holds, then, given $\epsilon > 0$, for $m \ge 80r(1+\epsilon)$, with probability larger than $1 - 2^{-2r+1}$, for $k = 0, 1, 2, \ldots$, we have
$$ \frac{1}{\sqrt{n}}\|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 \le \frac{0.9}{\sqrt{n}}\|\mathbf{x}^k - \tilde{\mathbf{x}}\|_2 + 2\left( 2 + \sqrt{\frac{n}{m}} \right)^2\frac{\delta}{\sqrt{n}} + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)r}{m}}. \quad (5.9) $$

Finally, Assumption 1 seems to play a critical role in Theorem 6 and Corollary 2. However, thanks to the linear convergence of C-GD, one can relax Assumption 1 in several ways and still obtain recovery guarantees for C-GD. For the sake of brevity, we mention only one such result in this paper.

Theorem 7. Consider the setup of Theorem 6, where now $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a random Gaussian measurement matrix with i.i.d. entries $A_{i,j} \sim \mathcal{N}\!\left(0, \frac{\sigma_a^2}{n}\right)$ and $\eta = \frac{n}{\sigma_a^2 m}$. Suppose that $\sup_{\mathbf{x}} \|g_r(f_r(\mathbf{x})) - \mathcal{P}_{\mathcal{C}_r}(\mathbf{x})\|_2 \le \xi$. Then, given $\epsilon > 0$, for $m \ge 80r(1+\epsilon)$, with probability larger than $1 - 2^{-2r+1}$, for $k = 0, 1, 2, \ldots$, we have
$$ \frac{1}{\sqrt{n}}\|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 \le \frac{0.9}{\sqrt{n}}\|\mathbf{x}^k - \tilde{\mathbf{x}}\|_2 + 2\left( 2 + \sqrt{\frac{n}{m}} \right)^2\frac{\delta}{\sqrt{n}} + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)r}{m}} + \frac{\xi}{\sqrt{n}}. \quad (5.10) $$

The proof can be found in Section 5.6.2. Note that at every iteration the imperfect projection introduces an error, and these errors accumulate as the algorithm proceeds. However, thanks to the linear convergence of the algorithm, the overall error caused by the imperfect projection remains of order $O(\xi/\sqrt{n})$.

All our results so far have been stated for Gaussian measurement matrices; however, they can be generalized to sub-Gaussian matrices as well. To prove this claim, we extend one of our results, Theorem 6, to sub-Gaussian matrices below.

Theorem 8. Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ be a zero-mean random sub-Gaussian measurement matrix with i.i.d. entries such that $\|A_{i,j}\|_{\psi_2} \le K$ and $\mathbb{E}[A_{i,j}^2] = \sigma_a^2$. The noise vector $\mathbf{z}$ is distributed as $\mathcal{N}(0, \sigma_z^2\mathbf{I}_{m \times m})$. Set $\eta = \frac{1}{m\sigma_a^2}$. Then, given $\epsilon > 0$ and $\mu_0 \in (0,1)$ such that $\mu_0\sigma_a^2 \le 2K^2$, for
$$ m > \frac{16K^4(1+\epsilon)}{\mu_0^2\sigma_a^4}(\log e)\, r, $$
with probability at least $1 - 2^{-4r} - e^{-\frac{m}{4}} - 2^{-2r}$, for $k = 0, 1, 2, \ldots$,
$$ \|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 \le \mu_0\|\mathbf{x}^k - \tilde{\mathbf{x}}\|_2 + 8\left( 1 + \frac{3Kn}{\sigma_a^2 m} \right)\delta + \frac{9K\sigma_z}{\sigma_a^2}\sqrt{\frac{r(1+\epsilon)}{m}}. \quad (5.11) $$

The proof is given in Section 5.6.3. Since all the terms in Theorem 8 are similar to the corresponding terms in Theorem 6, we do not discuss them here; we only discuss the convergence rate. Note that the convergence rate $\mu_0$ has a direct impact on the number of measurements: if we want fast convergence ($\mu_0$ small), we should either increase the number of measurements or decrease the rate $r$. If we decrease the rate, the two error terms $8\left( 1 + \frac{3Kn}{\sigma_a^2 m} \right)\delta + \frac{9K\sigma_z}{\sigma_a^2}\sqrt{\frac{r(1+\epsilon)}{m}}$ grow.

5.3 Standard signal classes

In this section, we discuss corollaries of our main theorem for the following two standard signal classes that have been studied extensively in the literature: (i) sparse signals, and (ii) piecewise polynomials. For each class, we first construct a simple compression algorithm that can be efficiently implemented in practice, and then explain the implications of C-GD and its analysis for that class.
These examples enable us to shed light on different aspects of C-GD, namely (i) the convergence rate, (ii) the number of measurements, (iii) the reconstruction error in the noiseless setting, and (iv) the reconstruction error in the presence of noise.

5.3.1 Sparse signals

Let $\mathcal{B}_p^n(\rho) \triangleq \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\|_p \le \rho\}$ represent a ball of radius $\rho$ in $\mathbb{R}^n$. Also, let $\Gamma_k^n$ denote the set of all $k$-sparse signals in $\mathcal{B}_p^n(1)$, i.e.,
$$ \Gamma_k^n \triangleq \{\mathbf{x} \in \mathcal{B}_p^n(1) : \|\mathbf{x}\|_0 \le k\}. \quad (5.12) $$
In order to apply C-GD to this class of signals, we first need to construct a family of compression codes for such sparse bounded signals. Consider the following family of compression codes for the set $\Gamma_k^n$: for $\mathbf{x} \in \Gamma_k^n$, (i) encode the locations of its at most $k$ non-zero entries ($\approx \log\binom{n}{k}$ bits), and (ii) apply a uniform quantizer to the magnitudes of the non-zero entries (using $b$ bits for the magnitude of each entry and one bit for its sign, this step spends $(b+1)k$ bits). Using this specific compression algorithm in the C-GD framework yields an algorithm very similar to iterative hard thresholding (IHT) [17]. At every iteration, after moving in the direction opposite to the gradient, the standard IHT algorithm keeps the $k$ largest elements and sets the rest to zero. The C-GD algorithm, on the other hand, while having the same first step, performs the projection step slightly differently. For the projection onto the codewords, similar to IHT, it first finds the $k$ largest entries. Then, for each such entry $x_i$, it first limits it to $[-1,1]$ by computing $x_i\mathbb{1}_{x_i \in (-1,1)} + \mathbb{1}_{x_i \ge 1} - \mathbb{1}_{x_i \le -1}$, and then quantizes the result with $b+1$ bits. (A sketch of this projection step appears later in this subsection.) The following corollary enables us to compare our results with those of hard thresholding.

Consider $\mathbf{x} \in \Gamma_k^n$ and let $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}$, where $A_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_a^2/n)$ and $z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_z^2)$. Let $\tilde{\mathbf{x}}$ denote the projection of $\mathbf{x}$ onto the codebook of the above-described code. The following corollary of Theorem 6 characterizes the convergence performance of C-GD applied to $\mathbf{y}$ when using this code.

Corollary 3. Given $\gamma > 0$, set the quantization level of the compression code to $b+1 = \lceil \gamma\log n + \frac{1}{2}\log k \rceil + 1$ bits. Also, set $\eta = \frac{n}{\sigma_a^2 m}$. Then, given $\epsilon > 0$, for $m \ge 80\tilde{r}(1+\epsilon)$, where $\tilde{r} = (1+\gamma)k\log n + \frac{k}{2}\log k + 2k$,
$$ \frac{1}{\sqrt{n}}\|\mathbf{x}^{t+1} - \tilde{\mathbf{x}}\|_2 \le \frac{0.9}{\sqrt{n}}\|\mathbf{x}^t - \tilde{\mathbf{x}}\|_2 + 2\left( 2 + \sqrt{\frac{n}{m}} \right)^2 n^{-1/2-\gamma} + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)\tilde{r}}{m}}, \quad (5.13) $$
for $t = 1, 2, \ldots$, with probability larger than $1 - 2^{-2\tilde{r}}$.

Proof. Consider $u \in [-1,1]$. Quantizing $u$ with a uniform quantizer that uses $b+1$ bits yields $\hat{u}$ satisfying $|u - \hat{u}| < 2^{-b}$. Therefore, using $b+1$ bits to quantize each non-zero element of $\mathbf{x} \in \Gamma_k^n$ yields a code which achieves distortion $\delta \le 2^{-b}\sqrt{k}$. Hence, for $b+1 = \lceil \gamma\log n + \frac{1}{2}\log k \rceil + 1$, $\delta \le n^{-\gamma}$. On the other hand, the code rate $r$ can be upper-bounded as
$$ r \le \sum_{i=0}^{k}\log\binom{n}{i} + k(b+1) \le \log n^{k+1} + k(b+1) = (k+1)\log n + k(b+1), $$
where the last inequality holds for all $n$ large enough. The rest of the proof follows directly from inserting these numbers into the statement of Theorem 6.

This corollary provides further intuition about the performance of C-GD. We start with noiseless observations and, for the moment, study only the required number of measurements and the reconstruction error in the absence of noise. The number of measurements required by the C-GD algorithm is $m = \Omega(k\log n)$. For the final reconstruction error, we can use (5.13) to obtain
$$ \lim_{t\to\infty} \frac{1}{\sqrt{n}}\|\mathbf{x}^{t+1} - \tilde{\mathbf{x}}\|_2 = O\left( \left( 2 + \sqrt{\frac{n}{m}} \right)^2 n^{-1/2-\gamma} \right). $$
This implies that the recovery error satisfies
$$ \lim_{t\to\infty} \frac{1}{\sqrt{n}}\|\mathbf{x}^{t+1} - \tilde{\mathbf{x}}\|_2 = O\left( \frac{n^{\frac{1}{2}-\gamma}}{m} \right). $$
Hence, if $\gamma > 0.5$, the error vanishes as the dimension grows.
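The projection step described above (top-$k$ selection, clipping to $[-1,1]$, and $(b+1)$-bit quantization of each surviving entry) can be sketched as follows; the function name is illustrative.

```python
import numpy as np

def project_sparse_code(x, k, b):
    """Projection onto the codebook of the sparse compression code:
    keep the k largest-magnitude entries (as in IHT), clip to [-1, 1],
    and quantize each with b magnitude bits plus a sign bit."""
    z = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]        # indices of the k largest entries
    vals = np.clip(x[keep], -1.0, 1.0)       # limit each entry to [-1, 1]
    step = 2.0 ** (-b)                       # quantizer resolution
    z[keep] = np.sign(vals) * np.round(np.abs(vals) / step) * step
    return z
```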
Regarding the number of measurements, there are two points that we would like to emphasize here:

1. The $\alpha$-dimension of the above-described code is $k$. Hence, CSP is able to accurately recover the signal from only $k$ measurements; however, solving CSP requires an exhaustive search over the codebook. C-GD, on the other hand, requires $k\log n$ measurements. The extra $\log n$ factor therefore seems to be the price of having an efficient recovery algorithm.

2. For large values of $n$, $n^{-\gamma}$ is very small, and hence C-GD becomes very similar to IHT. The results we have obtained for C-GD in this case are slightly weaker than those provided for IHT. First, our reconstruction is not exact, even in the noiseless setting. Second, the number of measurements C-GD requires is $O(k\log n)$, compared to $O(k\log(n/k))$ for IHT. These minor differences seem to be the price of the generality of the bounds derived for C-GD.

So far, we have studied two important quantities in Corollary 3, i.e., (i) the required number of measurements, and (ii) the reconstruction error in the absence of noise. The last important term is the reconstruction error in the presence of noise. From Corollary 3, the distortion caused by the presence of Gaussian noise is $O\left( \frac{\sigma_z}{\sigma_a}\sqrt{\frac{\tilde{r}}{m}} \right)$, or $O\left( \frac{\sigma_z}{\sigma_a}\sqrt{\frac{k\log n}{m}} \right)$. Note that there is no result on the performance of IHT in the presence of stochastic measurement noise; however, this noise sensitivity is comparable with the performance of algorithms based on convex optimization, such as LASSO and the Dantzig selector [22, 15].

5.3.2 Piecewise polynomial functions

Let $\mathrm{Poly}_N^Q$ denote the class of piecewise-polynomial functions $p(\cdot) : [0,1] \to [0,1]$ with at most $Q$ singularities (a singularity is a point at which the function is not infinitely differentiable), where each polynomial has a maximum degree of $N$. For $p \in \mathrm{Poly}_N^Q$, let $(x_1, x_2, \ldots, x_n)$ be the samples of $p$ at $0, \frac{1}{n}, \ldots, \frac{n-1}{n}$. For $\ell = 1, \ldots, Q$, let $\{a_i^\ell\}_{i=0}^{N_\ell}$ denote the set of coefficients of the $\ell$-th polynomial in $p$, where $N_\ell \le N$ denotes the degree of the $\ell$-th polynomial. For notational simplicity, assume that the coefficients of each polynomial belong to the interval $[0,1]$ and that $\sum_{i=0}^{N_\ell} a_i^\ell < 1$ for every $\ell$. Define
$$ \mathcal{P} \triangleq \left\{ \mathbf{x} \in \mathbb{R}^n \;\middle|\; x_i = p(i/n),\; p \in \mathrm{Poly}_N^Q \right\}. \quad (5.14) $$
Note that this class of functions is a generalization of the class of piecewise-constant functions that are popular in many applications, including imaging. To apply C-GD to this class of signals, we need to design an efficient compression code for signals in $\mathcal{P}$ and describe how to project signals onto its codewords. For the first task, consider a simple code which, for any signal $\mathbf{x} \in \mathcal{P}$, first describes the locations of its discontinuities and then, using a uniform quantizer that spends $b$ bits per coefficient, describes the quantized coefficients of the polynomials. For the other task, which is projecting a signal $\mathbf{x} \in \mathbb{R}^n$, Appendix 5.8 describes how we can find the closest signal to $\mathbf{x}$ in $\mathcal{P}$, i.e.,
$$ \tilde{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{z} \in \mathcal{P}} \|\mathbf{x} - \mathbf{z}\|_2^2, \quad (5.15) $$
using dynamic programming. Once that signal is found, its quantized version under the described code represents the desired projection. Note that C-GD combined with the described compression code is an extension of IHT to piecewise-polynomial functions: at every iteration, C-GD projects its current estimate of the signal onto the space of piecewise-polynomial functions. (A sketch of such a dynamic program, for the piecewise-constant special case, follows below.)
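Since the dynamic program of Appendix 5.8 is not reproduced here, the following is a hedged sketch for the special case $N = 0$ (piecewise-constant signals with at most $Q$ singularities, i.e., at most $Q+1$ constant pieces), where each segment's cost is the squared error of its best constant fit. The general polynomial case would replace this per-segment cost with a degree-$N$ least-squares fit; the code runs in $O(Qn^2)$ time and all names are illustrative.

```python
import numpy as np

def project_piecewise_constant(x, Q):
    n = len(x)
    s1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums of x
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))   # prefix sums of x^2

    def seg_cost(i, j):  # squared error of the best constant on x[i:j]
        m = (s1[j] - s1[i]) / (j - i)
        return (s2[j] - s2[i]) - (j - i) * m * m

    INF = float("inf")
    cost = np.full((Q + 2, n + 1), INF)  # cost[q][j]: q pieces covering x[:j]
    cut = np.zeros((Q + 2, n + 1), dtype=int)
    cost[0][0] = 0.0
    for q in range(1, Q + 2):
        for j in range(1, n + 1):
            for i in range(q - 1, j):
                c = cost[q - 1][i] + seg_cost(i, j)
                if c < cost[q][j]:
                    cost[q][j], cut[q][j] = c, i
    # trace back the best segmentation with at most Q+1 pieces
    q = int(np.argmin(cost[1:, n])) + 1
    z, j = np.empty(n), n
    while q > 0:
        i = cut[q][j]
        z[i:j] = (s1[j] - s1[i]) / (j - i)   # segment mean
        j, q = i, q - 1
    return z
```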
Consider $\mathbf{x} \in \mathcal{P}$ and $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{z}$, where $A_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_a^2/n)$ and $z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_z^2)$. Similar to Corollary 3, the following corollary characterizes the convergence performance of C-GD combined with the described compression code, when applied to the measurements $\mathbf{y}$.

Corollary 4. Set the step size in C-GD to $\eta = \frac{n}{\sigma_a^2 m}$ and the quantization level in the compression code to $b = \lceil (\gamma+0.5)\log n + \log(N+1) \rceil$, where $\gamma > 0$ is given. Set
$$ \tilde{r} = ((\gamma+0.5)(N+1)(Q+1)+Q)\log n + (N+1)(Q+1)(\log(N+1)+1) + 1. $$
Then, given $\epsilon > 0$, for $m \ge 80\tilde{r}(1+\epsilon)$, for $t = 0, 1, 2, \ldots$, we have
$$ \frac{1}{\sqrt{n}}\|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 \le \frac{0.9}{\sqrt{n}}\|\mathbf{x}^k - \tilde{\mathbf{x}}\|_2 + 2\left( 2 + \sqrt{\frac{n}{m}} \right)^2 n^{-0.5-\gamma} + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)\tilde{r}}{m}}, \quad (5.16) $$
with probability larger than $1 - 2^{-2\tilde{r}+1}$.

Proof. To apply Theorem 6, we need to find the rate-distortion performance of the described compression code. Using $b$ bits per coefficient, the described code, in total, spends at most $r$ bits, where $r \le (N+1)(Q+1)b + Q(\log n + 1)$. For $b = \lceil (\gamma+0.5)\log n + \log(N+1) \rceil$,
$$ r \le ((\gamma+0.5)(N+1)(Q+1)+Q)\log n + (N+1)(Q+1)(\log(N+1)+1) + 1. $$
On the other hand, using $b$ bits per coefficient, the error in approximating each sample can be bounded as
$$ \left| \sum_{i=0}^{N_\ell} a_i^\ell \left( \frac{t}{n} \right)^i - \sum_{i=0}^{N_\ell} [a_i^\ell]_b \left( \frac{t}{n} \right)^i \right| \le \sum_{i=0}^{N_\ell} |a_i^\ell - [a_i^\ell]_b| \le (N_\ell+1)2^{-b} \le (N+1)2^{-b}, \quad (5.17) $$
where $[a_i^\ell]_b$ denotes the $b$-bit quantized version of $a_i^\ell$. Therefore, the overall error is bounded as $\delta \le \sqrt{n}(N+1)2^{-b}$. Choosing $b = \lceil (\gamma+0.5)\log n + \log(N+1) \rceil$, as prescribed by the corollary, ensures that $\delta \le n^{-\gamma}$. Inserting these numbers into Theorem 6 yields the desired result.

The important quantities in the above corollary are explained in the following.

1. Required number of measurements: If we assume that $n$ is much larger than $N$ and $Q$, then the required number of measurements is $\Omega((N+1)(Q+1)\log n)$. Note that, given the degrees of freedom of piecewise-polynomial functions, we do not expect to be able to recover $\mathbf{x} \in \mathcal{P}$ from fewer than $(N+1)(Q+1)$ observations.

2. Reconstruction error in the absence of measurement noise: Similar to the discussion of the previous section, we can argue that $\lim_{k\to\infty} \frac{1}{\sqrt{n}}\|\mathbf{x}^{k+1} - \tilde{\mathbf{x}}\|_2 = O\left( \frac{n^{\frac{1}{2}-\gamma}}{m} \right)$; hence, the error goes to zero for every $\gamma > 0.5$.

3. Reconstruction error in the presence of measurement noise: In this case, the impact of the Gaussian noise on the upper bound is $O\left( \frac{\sigma_z}{\sigma_a}\sqrt{\frac{(Q+1)(N+1)\log n}{m}} \right)$.

5.4 Simulation Results and Discussion

In this section, we assess the performance of the C-GD algorithm in various settings. Furthermore, we compare our results with state-of-the-art recovery algorithms, such as denoising-based approximate message passing (D-AMP) [67] and nonlocal low-rank regularization (NLR-CS) [33]. Throughout this section, when compression algorithm X is used in the platform of C-GD, the resulting algorithm is referred to as X-GD. For instance, the recovery algorithm that employs the JPEG code is called JPEG-GD.

5.4.1 Parameters setting

Running C-GD involves specifying three free parameters: (i) the step size $\eta$, (ii) the compression rate $r$, and (iii) the number of iterations for which we run the algorithm. The success of C-GD relies on proper tuning of the first two parameters, i.e., the step size and the compression rate, and herein we explain how we tune these parameters in practice. In Section 5.2.2, we theoretically showed that the algorithm converges to the optimal solution for $\eta = \frac{1}{m\sigma_a^2}$. However, this choice of step size might yield very slow convergence in practice. Hence, in our simulations we follow an adaptive strategy for setting $\eta$. Let $\eta_k$ denote the step size at the $k$-th iteration.
5.4 Simulation Results and Discussion

In this section, we assess the performance of the C-GD algorithm in various settings. Furthermore, we compare our results with state-of-the-art recovery algorithms, such as denoising-based approximate message passing (D-AMP) [67] and nonlocal low-rank regularization (NLR-CS) [33]. Throughout this section, when compression algorithm X is used in the platform of C-GD, the resulting algorithm is referred to as X-GD. For instance, the recovery algorithm that employs the JPEG code is called JPEG-GD.

5.4.1 Parameter setting

Running C-GD involves specifying three free parameters: (i) the step size η, (ii) the compression rate r, and (iii) the number of iterations for which we run the algorithm. The success of C-GD relies on proper tuning of the first two parameters, i.e., the step size and the compression rate, and herein we explain how we tune these parameters in practice. In Section 5.2.2, we theoretically showed that the algorithm converges to the optimal solution for $\eta = \frac{1}{m\sigma_a^2}$. However, this choice of step size might yield very slow convergence in practice. Hence, in our simulations we follow an adaptive strategy for setting η. Let $\eta_k$ denote the step size at the k-th iteration. Then, we set $\eta_k$ to
\[ \eta_k = \arg\min_{\eta}\; l\Big( P_{\mathcal{C}_r}\big( x^k + \eta A^T(y - Ax^k) \big) \Big), \]  (5.18)
where, for $u \in \mathbb{R}^n$, $l(u) \triangleq \|y - Au\|_2$. In other words, $\eta_k$ is set such that the next estimate is moved as close as possible to the subspace $\mathcal{V} = \{u \,|\, y = Au\}$. Note that, regardless of the value of η, the next estimate will be a codeword. Hence, intuitively speaking, the closer this codeword is to the subspace $\mathcal{V}$, the better an estimate $x^{k+1}$ will be. The optimization problem proposed in (5.18) is a simple scalar optimization problem, and we use derivative-free methods, such as the Nelder-Mead (or downhill simplex) method [74], to solve it. In our simulations, we noticed that this strategy speeds up the convergence rate of C-GD.

Finding the optimal choice of r is an instance of the model selection problem in statistics and machine learning; see Chapter 7 of [39]. Hence, standard techniques such as multi-fold cross-validation can be used. Note that multi-fold cross-validation increases the computational complexity of our recovery algorithm. Reducing the computational complexity of such model selection techniques is left for future research.

To control the number of iterations in the C-GD method, we consider two standard stopping rules. One is to limit the maximum number of iterations, which is defined as the parameter $K_{1,\max}$ in Algorithm 1. The second stopping rule is a predefined threshold on the reduction of the squared error in each iteration, i.e., $\|x^{k+1} - x^k\|_2$. In our numerical simulations we set $K_{1,\max} = 50$ and $\epsilon_T = 0.001$. The specific algorithm that is employed in our simulations is presented below in Algorithm 1.

Algorithm 1 C-GD: Compression-based (projected) gradient descent
1: Inputs: compression code $(f_r, g_r)$, y, A
2: Initialize: $x^0$, $\eta_0$, $K_{1,\max}$, $K_{2,\max}$, $\epsilon_T$
3: for $k \le K_{1,\max}$ do
4:   $x^{k+1} \leftarrow g_r\big(f_r\big(x^k + \eta_k A^T(y - Ax^k)\big)\big)$
5:   $k \leftarrow k + 1$
6:   $\eta_k \leftarrow$ result of the Nelder-Mead method, run for at most $K_{2,\max}$ iterations, applied to (5.18)
7:   if $\frac{1}{\sqrt{n}}\|x^{k+1} - x^k\|_2 < \epsilon_T$ then return $x^{k+1}$
8: Output: $x^{k+1}$

5.4.2 Algorithms and comparison criteria

We explore the performance of our C-GD algorithm in the compressive imaging application. We employ the standard image compression algorithms JPEG and JPEG2000 in our C-GD framework and obtain the JPEG-GD and JP2K-GD recovery schemes. In our numerical simulations, we use the implementation of the JPEG2000 and JPEG codecs in the Matlab-R2016b Image and Video Processing package. We compare the performance of our algorithm on six standard test images that are shown in Fig. 5.1. To quantitatively evaluate the quality of an estimated image, we use the peak signal-to-noise ratio (PSNR), defined as
\[ \mathrm{PSNR} = 20\log\Big(\frac{255}{\sqrt{\mathrm{MSE}}}\Big), \]  (5.19)
where, for a noise-free grayscale image x and its reconstruction $\hat{x}$, the mean squared error (MSE) is defined as $\mathrm{MSE} = \frac{1}{n}\|x - \hat{x}\|_2^2$.

[Figure 5.1: Test images used in our simulations.]

Furthermore, we define the signal-to-noise ratio (SNR) of the measurement vector y as
\[ \mathrm{SNR} = 20\log\Big(\frac{\|Ax\|_2}{\|z\|_2}\Big). \]  (5.20)
Even though our theoretical results consider i.i.d. Gaussian matrices, we evaluate the performance of our algorithm with both i.i.d. Gaussian and partial-Fourier measurement matrices; the latter are closer to the matrices used in radar and magnetic resonance imaging applications. We summarize the results of our simulations for Gaussian matrices and partial-Fourier matrices in Sections 5.4.3 and 5.4.4, respectively.
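Before turning to the results, here is a minimal Python sketch of Algorithm 1 with the adaptive step-size rule (5.18). The `project` callable stands for the composed encoder/decoder $g_r(f_r(\cdot))$ of whichever compression code is plugged in; its name, the SciPy Nelder-Mead call, and all default values are our illustrative choices rather than part of the formal algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def c_gd(y, A, project, x0, K1_max=50, K2_max=25, eps_T=1e-3):
    """Compression-based projected gradient descent (sketch of Algorithm 1)."""
    n = A.shape[1]
    x, eta = x0, 1.0
    for _ in range(K1_max):
        grad_step = lambda e: x + e * (A.T @ (y - A @ x))
        # Adaptive step size (5.18): minimize ||y - A P(gradient step)||_2
        # with a derivative-free one-dimensional Nelder-Mead search.
        loss = lambda e: np.linalg.norm(y - A @ project(grad_step(e[0])))
        eta = minimize(loss, [eta], method='Nelder-Mead',
                       options={'maxiter': K2_max}).x[0]
        x_next = project(grad_step(eta))          # x^{k+1} = g_r(f_r(s^{k+1}))
        if np.linalg.norm(x_next - x) / np.sqrt(n) < eps_T:
            return x_next                         # stopping threshold eps_T
        x = x_next
    return x
```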
5.4.3 Compressive imaging with i.i.d. measurement matrices

Noiseless. In Table 5.1, we compare the results of JPEG-GD and JP2K-GD with those of the state-of-the-art BM3D-AMP method in reconstructing several test images from their compressive measurements. In these simulations, the measurement matrices are i.i.d. Gaussian. Furthermore, the test images are resized to 128×128 pixels. We consider two sampling ratios, m/n = 30% and 50%. At each iteration, the step-size parameter η is set by solving (5.18); to find the solution of this optimization problem we used $K_{2,\max} = 25$ iterations of the Nelder-Mead algorithm. Furthermore, we used the stopping criteria discussed in Section 5.4.1. For BM3D-AMP we used the default settings proposed in [67].

Table 5.1: PSNR (dB) of 128×128 reconstructions with no measurement noise, sampled by a random Gaussian measurement matrix.

Method     m/n   Boat    House   Barbara  Dog     Panda   Snake
BM3D-AMP   30%   29.66   39.71   31.30    21.30   23.90   20.87
           50%   34.19   43.70   33.70    24.35   26.76   22.76
JPEG-GD    30%   23.77   30.61   24.34    18.01   19.35   18.23
           50%   26.46   33.25   27.01    22.78   23.11   19.98
JP2K-GD    30%   30.68   35.22   29.96    22.45   24.00   20.44
           50%   35.28   40.18   34.67    26.35   27.13   23.03

Interestingly, the results in Table 5.1 indicate that JP2K-GD considerably outperforms JPEG-GD. The main reason is that the JP2K codec exploits more complex structures of natural images than the JPEG codec does; intuitively, for the same number of measurements, JP2K-GD can therefore perform better. The performance of JP2K-GD is also comparable with that of BM3D-AMP. When an image has more geometry (as in House), BM3D-AMP outperforms JP2K-GD. However, when an image has more irregular structures and texture, such as the Dog image or the Panda image, JP2K-GD seems to often outperform BM3D-AMP.

Noisy. In Table 5.2, we present the performance results of our proposed JPEG-GD and JP2K-GD and compare them with the performance of the BM3D-AMP method for image reconstruction from noisy compressive measurements. Similar to the previous section, the measurement matrix is i.i.d. Gaussian, the images are resized to 128×128, and two sampling ratios of 30% and 50% are considered. Unlike before, the measurements are corrupted by i.i.d. Gaussian noise. We consider two different values, SNR = 10 dB and SNR = 30 dB.

Table 5.2: PSNR (dB) of 128×128 reconstructions with Gaussian measurement noise at various SNR values, sampled by a random Gaussian measurement matrix.

                      Barbara           Boat              Panda
Method     m/n   SNR=10   SNR=30   SNR=10   SNR=30   SNR=10   SNR=30
BM3D-AMP   30%   19.15    28.20    20.12    28.50    13.67    18.82
           50%   21.38    30.16    22.72    33.65    18.44    21.17
JPEG-GD    30%   14.87    22.71    13.50    22.01    10.50    18.79
           50%   18.44    24.60    19.21    25.36    15.83    22.01
JP2K-GD    30%   16.82    26.23    21.79    28.43    15.63    22.93
           50%   20.78    30.93    24.82    34.13    20.40    25.85

As is again clear from this table, our results are comparable with, and in most cases better than, those of the state-of-the-art BM3D-AMP.
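For reference, the measurement model used in this subsection can be set up as in the following sketch (our own illustrative code, with illustrative names): an i.i.d. Gaussian matrix with variance-1/n entries at a given sampling ratio, and noise rescaled so that the SNR of (5.20) hits a prescribed value.

```python
import numpy as np

def gaussian_measurements(x, sampling_ratio, snr_db, seed=0):
    """Return y = A x + z with A i.i.d. N(0, 1/n) and z scaled so that
    20*log10(||Ax|| / ||z||) equals snr_db, as in (5.20)."""
    rng = np.random.default_rng(seed)
    n = x.size
    m = int(sampling_ratio * n)
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(m, n))
    clean = A @ x
    z = rng.standard_normal(m)
    z *= np.linalg.norm(clean) / (np.linalg.norm(z) * 10 ** (snr_db / 20))
    return clean + z, A
```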
5.4.4 Compressive imaging with partial-Fourier matrices

In many application areas, measurement matrices such as partial-Fourier matrices are employed. In this section, we evaluate the performance of our algorithm on partial-Fourier matrices. We note that, in our numerical results, even though BM3D-AMP performs well for i.i.d. Gaussian measurements, its performance degrades dramatically for partial-Fourier matrices. Hence, we compare the performance of our algorithm with the state-of-the-art algorithm for partial-Fourier matrices, i.e., NLR-CS [33]. As in the previous section, we consider both noiseless and noisy measurements.

Noiseless. In Table 5.3, we present the performance results of our proposed JPEG-GD and JP2K-GD and compare them with the performance of the NLR-CS method in image reconstruction from compressive samples taken with a random partial-Fourier measurement matrix. Images in this numerical comparison are resized to 512×512, and we consider two sampling ratios, 10% and 30%.

Table 5.3: PSNR (dB) of 512×512 reconstructions with no noise, sampled by a random partial-Fourier measurement matrix.

Method    m/n   Boat    House   Barbara  Dog     Panda   Snake
NLR-CS    10%   23.06   27.26   20.34    19.53   21.61   18.20
          30%   26.38   30.74   23.67    23.04   25.60   21.80
JPEG-GD   10%   18.38   24.11   16.36    16.30   17.00   15.10
          30%   24.70   30.51   20.37    21.10   22.01   21.63
JP2K-GD   10%   20.75   26.30   18.64    19.74   18.24   18.36
          30%   27.73   38.07   24.89    24.82   25.70   24.37

For a sampling ratio of m/n = 10%, JP2K-GD performs comparably to, and in some cases (e.g., Dog and Snake) even better than, the state-of-the-art NLR-CS. Increasing the sampling ratio to m/n = 30%, JP2K-GD outperforms both the JPEG-GD and NLR-CS methods.

Note that the NLR-CS method has two main steps. In the first step, it estimates an initial image $\hat{x}$ using a standard compressed sensing (CS) recovery method based on the sparsity of the image coefficients in the DCT/wavelet domain. Then, in the second step, it enforces a low-rank and group-sparsity constraint on the groups of similar patches detected in the estimated image [33]. This step involves the singular value decomposition of a matrix and hence is computationally expensive for large images. Furthermore, since in the second step the detection of similar patches is performed on an estimated image, exploiting structures in the second step heavily relies on the performance of the first step. For this particular reason, as observed in the noisy experiments below, the performance of the NLR-CS method degrades significantly once noise is added to the observations. On the other hand, the results below show that both JPEG-GD and JP2K-GD are robust to measurement noise.

Noisy. In Table 5.4, we present the performance results of the JPEG-GD, JP2K-GD, and NLR-CS methods for image reconstruction from noisy compressive measurements. Similar to the previous section, the measurement matrix is a random partial-Fourier matrix. Images in this numerical comparison are resized to 512×512, and we consider two sampling ratios, 10% and 30%. The measurements are corrupted by i.i.d. Gaussian noise, and we consider two different values, SNR = 10 dB and SNR = 30 dB.

Table 5.4: PSNR (dB) of 512×512 reconstructions with Gaussian measurement noise at various SNR values, sampled by a random partial-Fourier measurement matrix.

                      Dog               Barbara           Snake
Method    m/n   SNR=10   SNR=30   SNR=10   SNR=30   SNR=10   SNR=30
NLR-CS    10%   11.66    24.14    12.10    19.83    10.50    18.75
          30%   12.60    26.84    13.32    24.05    11.98    24.82
JPEG-GD   10%   14.34    20.50    15.60    18.60    12.33    15.67
          30%   19.20    24.70    18.17    22.89    14.40    22.37
JP2K-GD   10%   17.33    25.40    16.53    21.65    18.00    23.12
          30%   21.56    35.38    21.82    28.19    21.06    29.30

As is again clear from this table, the JP2K-GD method outperforms both JPEG-GD and NLR-CS for all SNRs and sampling ratios m/n. Interestingly, we observe that for low SNRs, e.g., SNR = 10 dB, even the JPEG-GD algorithm performs much better than NLR-CS.

Our next goal is to visually compare the reconstructions of JP2K-GD with those of NLR-CS. Fig. 5.2 shows the reconstructed images for three different sampling ratios m/n. The size of the test image in all scenarios is 512×512 and the measurement SNR is set to 30 dB. As is clear from Fig. 5.2, in all cases the reconstruction from JP2K-GD looks more appealing than the reconstruction from NLR-CS.

[Figure 5.2: Image reconstruction using partial-Fourier matrices from m/n = 10%, 30%, and 50% noisy measurements with SNR = 30 dB. The first row shows the images reconstructed by the NLR-CS method and the second row shows the images reconstructed by the JP2K-GD method. The test image Barbara is resized to 512×512.]
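A partial-Fourier sampling operator of the kind used above can be sketched as follows (our own construction for illustration; the exact sampling pattern used in the experiments may differ). It keeps a random subset of m rows of the normalized n-point DFT matrix, so the measurements are complex-valued.

```python
import numpy as np

def partial_fourier_measurements(x, m, seed=0):
    """Sample m randomly chosen rows of the normalized DFT of x."""
    rng = np.random.default_rng(seed)
    n = x.size
    rows = rng.choice(n, size=m, replace=False)   # random frequency subset
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)        # normalized DFT matrix
    A = F[rows]                                   # m x n partial-Fourier matrix
    return A @ x, A
```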
5.4.5 Convergence rate evaluation

As proved in our main theorems, we expect the convergence to be linear when the measurement matrix is Gaussian. It turns out that the convergence is also fast for partial-Fourier matrices. Fig. 5.3 depicts the normalized mean squared error (MSE) of the image reconstruction using the JP2K-GD method. We consider two different sampling ratios, m/n = 5% and 10%, in this test. The results in Figs. 5.3 and 5.4 show that (i) the algorithm converges very fast (often in fewer than 50 iterations), (ii) increasing the number of measurements improves the convergence of JP2K-GD, and (iii) the final reconstructed image has a better PSNR. Note that all these conclusions are consistent with the results we proved for sub-Gaussian matrices.

[Figure 5.3: Normalized reconstruction error $\|x - x^k\|_2/\|x\|_2$ at each iteration of the JP2K-GD method on compressive measurements sampled by a random partial-Fourier measurement matrix, for m/n = 5% and 10%. House 512×512 test image.]

[Figure 5.4: Reconstructed image at different iterations of the JP2K-GD method from compressive measurements sampled by a random partial-Fourier measurement matrix. The first row of images is associated with the m/n = 5% scenario and the second row with m/n = 10%. Numbers between the figures indicate the corresponding iteration number.]
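For completeness, the linear rate claimed above follows by unrolling the per-iteration contraction of the form (5.16): writing $e_k \triangleq \frac{1}{\sqrt{n}}\|x^k - \tilde{x}\|_2$ and letting c denote the sum of the two k-independent terms on the right-hand side,
\[ e_k \le 0.9^k e_0 + c\sum_{j=0}^{k-1} 0.9^j \le 0.9^k e_0 + 10c, \]
so the error contracts geometrically until it reaches the O(c) floor set by the code distortion and the measurement noise.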
5.5 Conclusions

In this paper, we have studied the problem of designing efficient compression-based compressed sensing recovery algorithms. Specifically, we have proposed C-GD, an iterative, robust-to-noise, compression-based compressed sensing algorithm that is able to find the optimal solution of the CSP optimization. Given measurements $y = Ax + z$ and a compression code with codebook $\mathcal{C}$, at iteration k, C-GD updates its current estimate of x, $x^k$, by moving in the direction of the negative gradient of the cost function $f(u) = \|y - Au\|_2$ and then projecting the result onto the set of codewords $\mathcal{C}$. For a given compression code, the projection step can typically be implemented by applying the compression code's encoder and decoder. We have proved that, given enough measurements, with high probability, C-GD has a linear convergence rate and is robust to additive white Gaussian noise. In summary, C-GD provides a platform for using commercial compression codes, such as JPEG2000 and MPEG-4, for compressed sensing of images and videos, respectively. Since well-known compression algorithms exploit less obvious structures during their encoding/decoding process, the C-GD algorithm performs significantly better than classical sparsity-based compressed sensing methods on practical signals such as images. Furthermore, in contrast to the mostly heuristic methods in compressed sensing for imaging applications, we have provided a theoretical guarantee for the performance of the C-GD method. In our simulation results, we have focused on compressed sensing of images and have shown that C-GD combined with state-of-the-art compression codes yields state-of-the-art compressed sensing performance, both for i.i.d. Gaussian and partial-Fourier measurement matrices.

5.6 Appendix A: Proofs of Theorems

In this section we present the proofs of the main results of the paper. Appendix 5.7 reviews some required background information and also derives some necessary tools.

5.6.1 Proof of Theorem 6

Define
\[ s^{k+1} = x^k + \eta A^T(y - Ax^k). \]  (5.21)
Using this notation, we have $x^{k+1} = P_{\mathcal{C}}(s^{k+1})$. But since $\tilde{x} = P_{\mathcal{C}}(x)$, $\tilde{x}$ is also in $\mathcal{C}$. Hence,
\[ \|x^{k+1} - s^{k+1}\|_2^2 \le \|\tilde{x} - s^{k+1}\|_2^2, \]
or, equivalently,
\[ \|(x^{k+1} - \tilde{x}) - (s^{k+1} - \tilde{x})\|_2^2 \le \|\tilde{x} - s^{k+1}\|_2^2. \]
By removing the common terms from both sides, we have
\[ \|x^{k+1} - \tilde{x}\|_2^2 \le 2\langle x^{k+1} - \tilde{x},\, s^{k+1} - \tilde{x} \rangle. \]  (5.22)
For k = 0, 1, ..., define the error vector and its normalized version as $\theta^k \triangleq x^k - \tilde{x}$ and $\bar{\theta}^k \triangleq \theta^k/\|\theta^k\|_2$, respectively. Also, given $\bar{\theta}^k \in \mathbb{R}^n$, $\bar{\theta}^{k+1} \in \mathbb{R}^n$, $\eta \in \mathbb{R}^+$, and $A \in \mathbb{R}^{m\times n}$, define the coefficient μ as
\[ \mu(\bar{\theta}^{k+1}, \bar{\theta}^k, \eta) \triangleq \langle \bar{\theta}^{k+1}, \bar{\theta}^k \rangle - \eta \langle A\bar{\theta}^{k+1}, A\bar{\theta}^k \rangle. \]
Using this definition, substituting for $s^{k+1}$ from (5.21), and noting that $y = Ax + z$, it follows from (5.22) that
\begin{align*}
\|\theta^{k+1}\|_2 &\le 2\big\langle \bar{\theta}^{k+1},\, x^k + \eta A^T(Ax + z - Ax^k) - \tilde{x} \big\rangle \\
&= 2\langle \bar{\theta}^{k+1}, \theta^k \rangle + 2\eta \langle \bar{\theta}^{k+1}, A^T A(x - x^k) \rangle + 2\eta \langle \bar{\theta}^{k+1}, A^T z \rangle \\
&= 2\langle \bar{\theta}^{k+1}, \theta^k \rangle - 2\eta \langle A\bar{\theta}^{k+1}, A\theta^k \rangle + 2\eta \langle A\bar{\theta}^{k+1}, A(x - \tilde{x}) \rangle + 2\eta \langle \bar{\theta}^{k+1}, A^T z \rangle \\
&\le 2\mu(\bar{\theta}^{k+1}, \bar{\theta}^k, \eta)\|\theta^k\|_2 + 2\eta \|A\|_{S^{n-1}}^2 \|x - \tilde{x}\|_2 + 2\eta \langle \bar{\theta}^{k+1}, A^T z \rangle,
\end{align*}  (5.23)
where $\|A\|_{S^{n-1}} = \sigma_{\max}(A)$. We next find upper bounds for the three terms on the right-hand side of (5.23).

(i) Bounding $\mu(\bar{\theta}^{k+1}, \bar{\theta}^k, \eta)$: We show that, given the parameter setting of the theorem, with high probability,
\[ \mu(u, v, \eta) \le 0.45, \quad \forall u, v \in \mathcal{C}_0, \]  (5.24)
where
\[ \mathcal{C}_0 \triangleq \Big\{ \frac{\hat{x}_1 - \hat{x}_2}{\|\hat{x}_1 - \hat{x}_2\|_2} : \forall \hat{x}_1, \hat{x}_2 \in \mathcal{C} \Big\}. \]  (5.25)
To achieve this goal, we define the event $\mathcal{E}_1$ as
\[ \mathcal{E}_1 \triangleq \Big\{ \mu\Big(u, v, \frac{1}{m\sigma_a^2}\Big) < 0.45 : \forall u, v \in \mathcal{C}_0 \Big\}. \]  (5.26)
From Corollary 7 (or Lemma 16), given $u, v \in \mathcal{C}_0$, we have
\[ P\Big( \mu\Big(u, v, \frac{1}{m\sigma_a^2}\Big) \ge 0.45 \Big) \le 2^{-\frac{m}{20}}. \]  (5.27)
Therefore, by the union bound,
\[ P(\mathcal{E}_1^c) \le |\mathcal{C}_0|^2\, 2^{-\frac{m}{20}}. \]  (5.28)
Note that $|\mathcal{C}_0| \le |\mathcal{C}|^2 \le 2^{2r}$. Therefore,
\[ P(\mathcal{E}_1) \ge 1 - |\mathcal{C}_0|^2 2^{-\frac{m}{20}} \ge 1 - 2^{4r - 0.05m}. \]
Therefore, for $m \ge 80r(1+\epsilon)$, where ε > 0, event $\mathcal{E}_1$ happens with probability at least $1 - 2^{-4r\epsilon}$.

(ii) Bounding $\|A\|_{S^{n-1}}^2 \|x - \tilde{x}\|_2$: Define the event $\mathcal{E}_2$ as
\[ \mathcal{E}_2 \triangleq \big\{ \sigma_{\max}(A) \le 2\sqrt{m} + \sqrt{n} \big\}. \]
From Corollary 5, for t = 1 we have $P(\mathcal{E}_2^c) \le e^{-\frac{m}{2}}$. Also, since the compression code has supremum distortion δ, we have $\|x - \tilde{x}\|_2 \le \delta$. Therefore, conditioned on $\mathcal{E}_2$,
\[ \frac{2}{m}(\sigma_{\max}(A))^2 \|x - \tilde{x}\|_2 \le \frac{2}{m}\big(2\sqrt{m} + \sqrt{n}\big)^2 \delta = 2\Big(2 + \sqrt{\frac{n}{m}}\Big)^2 \delta. \]  (5.29)

(iii) Bounding $2\eta\langle \bar{\theta}^{k+1}, A^T z \rangle$: Note that $2\eta\langle \bar{\theta}^{k+1}, A^T z \rangle = \frac{2}{m\sigma_a^2}\langle A\bar{\theta}^{k+1}, z \rangle$. Let $A_i \in \mathbb{R}^n$ be the i-th row of the matrix A.
Then $A\bar{\theta}^{k+1} = \big[\langle A_1, \bar{\theta}^{k+1}\rangle, \langle A_2, \bar{\theta}^{k+1}\rangle, \cdots, \langle A_m, \bar{\theta}^{k+1}\rangle\big]^T$. For any fixed $\bar{\theta}^{k+1}$, the $\{\langle A_i, \bar{\theta}^{k+1}\rangle\}_{i=1}^{m}$ are i.i.d. $\mathcal{N}(0, \sigma_a^2)$ random variables. Hence, from Lemma 14, the distribution of $\langle \bar{\theta}^{k+1}, A^T z \rangle$ is the same as that of $\sigma_a\|z\|_2\langle \bar{\theta}^{k+1}, g \rangle$, where $g = [g_1, \cdots, g_n]^T$ is independent of $\|z\|_2$ and $g_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. To bound $\sigma_a\|z\|_2\langle \bar{\theta}^{k+1}, g\rangle$, we bound $\frac{1}{\sigma_z^2}\|z\|_2^2$ and $|\langle \theta, g\rangle|^2$ separately. Given $\tau_1' > 0$ and $\tau_2' > 0$, define the events $\mathcal{E}_3$ and $\mathcal{E}_4$ as
\[ \mathcal{E}_3 \triangleq \Big\{ \frac{1}{\sigma_z^2}\|z\|_2^2 \le (1+\tau_1')m \Big\} \quad \text{and} \quad \mathcal{E}_4 \triangleq \big\{ |\langle \theta, g \rangle|^2 \le 1 + \tau_2',\; \forall \theta \in \mathcal{C}_0 \big\}. \]
Following Lemma 15, we have
\[ P(\mathcal{E}_3^c) \le e^{-\frac{m}{2}(\tau_1' - \ln(1+\tau_1'))}, \]  (5.30)
and, letting m = 1 in Lemma 15, for fixed $\bar{\theta}^{k+1}$ it follows that
\[ P\big( \langle \bar{\theta}^{k+1}, g \rangle^2 \ge 1 + \tau_2' \big) \le e^{-\frac{1}{2}(\tau_2' - \ln(1+\tau_2'))}. \]  (5.31)
Hence, by the union bound,
\[ P(\mathcal{E}_4^c) \le |\mathcal{C}_0|\, e^{-\frac{1}{2}(\tau_2' - \ln(1+\tau_2'))} \le 2^{2r} e^{-\frac{1}{2}(\tau_2' - \ln(1+\tau_2'))} \le 2^{2r - \frac{\tau_2'}{2}}, \]  (5.32)
where the last inequality holds for $\tau_2' > 7$. Setting $\tau_2' = 4(1+\epsilon)r - 1$, where ε > 0, ensures that $P(\mathcal{E}_4^c) \le 2^{-2r+0.5}$. Setting $\tau_1' = 1$ gives $P(\mathcal{E}_3^c) \le e^{-0.15m}$, and, conditioned on $\mathcal{E}_3 \cap \mathcal{E}_4$, we have
\[ 2\eta\langle \bar{\theta}^{k+1}, A^T z \rangle \le \frac{2}{m\sigma_a}\sqrt{\sigma_z^2 (1+\tau_1')\, m\, (1+\tau_2')} = \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)r}{m}}. \]  (5.33)
Combining (5.33), (5.29), and (5.27) with (5.23) yields the desired bound on the reduction of the error. Finally, note that, by the union bound,
\[ P(\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3 \cap \mathcal{E}_4) \ge 1 - \sum_{i=1}^{4} P(\mathcal{E}_i^c) \ge 1 - e^{-\frac{m}{2}} - 2^{-4r\epsilon} - 2^{-2r+0.5} - e^{-0.15m} \ge 1 - 2^{-2r+1}. \]  (5.34)

5.6.2 Proof of Theorem 7

The proof of this result is a simple extension of the proof of Theorem 6 presented in Section 5.6.1. Define $s^{k+1} = x^k + \eta A^T(y - Ax^k)$. Note that $x^{k+1} = g_r(f_r(s^{k+1}))$. Hence,
\[ \|x^{k+1} - \tilde{x}\|_2 \le \|x^{k+1} - P_{\mathcal{S}}(s^{k+1})\|_2 + \|P_{\mathcal{S}}(s^{k+1}) - \tilde{x}\|_2 \le \|P_{\mathcal{S}}(s^{k+1}) - \tilde{x}\|_2 + \xi. \]  (5.35)
It is now straightforward to follow exactly the same steps as those discussed in the proof of Theorem 6 and show that, with probability at least $1 - 2^{-4r} - e^{-\frac{m}{4}} - 2^{-2r}$,
\[ \frac{1}{\sqrt{n}}\|P_{\mathcal{S}}(s^{k+1}) - \tilde{x}\|_2 \le \frac{0.9}{\sqrt{n}}\|x^k - \tilde{x}\|_2 + 2\Big(2+\sqrt{\frac{n}{m}}\Big)^2 \frac{\delta}{\sqrt{n}} + \frac{\sigma_z}{\sigma_a}\sqrt{\frac{8(1+\epsilon)r}{m}}. \]  (5.36)
Combining (5.35) and (5.36) completes the proof.

5.6.3 Proof of Theorem 8

Following the proof of Theorem 6, and defining $\theta^{k+1} = x^{k+1} - \tilde{x}$ for k = 0, 1, 2, ..., it follows from (5.23) that
\[ \|\theta^{k+1}\|_2 \le 2\mu(\bar{\theta}^{k+1}, \bar{\theta}^k, \eta)\|\theta^k\|_2 + 2\eta\,\sigma_{\max}^2(A)\|x_o - \tilde{x}\|_2 + 2\eta\langle \bar{\theta}^{k+1}, A^T z \rangle, \]  (5.37)
where $\mu(\bar{\theta}^{k+1}, \bar{\theta}^k, \eta) \triangleq \langle \bar{\theta}^{k+1}, \bar{\theta}^k \rangle - \eta\langle A\bar{\theta}^{k+1}, A\bar{\theta}^k \rangle$. Let $\mathcal{C}_0$ denote the set of normalized distance vectors of the codewords in $\mathcal{C}$, defined in (5.25). Define the event $\mathcal{E}_1$ as
\[ \mathcal{E}_1 \triangleq \Big\{ \mu\Big(u, v, \frac{1}{m\sigma_a^2}\Big) \le \mu_0 : \forall u, v \in \mathcal{C}_0 \Big\}. \]
Similar to the proof of Theorem 6, we show that the probability of occurrence of $\mathcal{E}_1^c$ approaches 0. Given $u, v \in \mathcal{C}_0$, from Lemma 17 we have
\[ P\Big( \mu\Big(u, v, \frac{1}{m\sigma_a^2}\Big) \ge \mu_0 \Big) \le \exp\Big\{ -\frac{m\mu_0\sigma_a^2}{2K^2}\min\Big(1, \frac{\mu_0\sigma_a^2}{2K^2}\Big) \Big\} = 2^{-(\log e)\frac{m\mu_0\sigma_a^2}{2K^2}\min\big(1, \frac{\mu_0\sigma_a^2}{2K^2}\big)}. \]  (5.38)
Therefore, by the union bound, since $|\mathcal{C}_0| \le |\mathcal{C}|^2 = 2^{2r}$, we have
\[ P(\mathcal{E}_1) \ge 1 - |\mathcal{C}_0|^2\, 2^{-(\log e)\frac{m\mu_0\sigma_a^2}{2K^2}\min\big(1, \frac{\mu_0\sigma_a^2}{2K^2}\big)} \ge 1 - 2^{4r - (\log e)\frac{m\mu_0\sigma_a^2}{2K^2}\min\big(1, \frac{\mu_0\sigma_a^2}{2K^2}\big)}. \]
Therefore, for
\[ m > \frac{8K^2 r(1+\epsilon)}{\mu_0\sigma_a^2 \min\big(1, \frac{\mu_0\sigma_a^2}{2K^2}\big)\log e}, \]
we have $P(\mathcal{E}_1^c) \le 2^{-4r}$. But since, by assumption, $\mu_0\sigma_a^2 \le 2K^2$, we have $\min\big(1, \frac{\mu_0\sigma_a^2}{2K^2}\big) = \frac{\mu_0\sigma_a^2}{2K^2}$. Define the event $\mathcal{E}_2$ as $\mathcal{E}_2 \triangleq \big\{ \sigma_{\max}(A) \le 2\sigma_a\sqrt{m + \frac{3K}{\sigma_a^2}n} \big\}$. From Corollary 6, we have $P(\mathcal{E}_2) \ge 1 - e^{-\frac{m\sigma_a^2}{2K}}$.
Since the compression code is such that $\|x - \tilde{x}\|_2 \le \delta$, conditioned on $\mathcal{E}_2$ we have
\[ \frac{2}{m\sigma_a^2}(\sigma_{\max}(A))^2 \|x - \tilde{x}\|_2 \le \frac{8}{m}\Big(m + \frac{3K}{\sigma_a^2}n\Big)\delta = 8\Big(1 + \frac{3Kn}{\sigma_a^2 m}\Big)\delta. \]  (5.39)
To complete the proof, we need to bound $2\eta\langle \bar{\theta}^{k+1}, A^T z \rangle = \frac{2}{m\sigma_a^2}\langle A\bar{\theta}^{k+1}, z \rangle$, which is the term related to the noise z. Again, let $A_i \in \mathbb{R}^{1\times n}$ denote the i-th row of the matrix A and, for a given $u \in \mathbb{R}^n$, let $y_u = Au$. Hence, for i = 1, ..., m, $y_u(i) = \langle A_i, u \rangle$. To upper bound the term corresponding to the noise, given τ > 0, define the event $\mathcal{E}_3$ as
\[ \mathcal{E}_3 = \Big\{ \frac{2}{m\sigma_a^2}\langle y_u, z \rangle \le \tau : \forall u \in \mathcal{C}_0 \Big\}. \]
From Lemma 13, for $u \in \mathcal{C}_0$, we know that $\{y_u(i)\}_{i=1}^m$ are independent sub-Gaussian random variables. Also, for $u \in \mathcal{C}_0$, Lemma 13 states that
\[ \|y_u(i)\|_{\psi_2} \le \|u\|_2 \max_{1\le j\le n} \|A_i(j)\|_{\psi_2} \le K, \]
where the last inequality follows because, for $u \in \mathcal{C}_0$, $\|u\|_2 = 1$. Since every Gaussian random variable is also a sub-Gaussian random variable, z(i) is a sub-Gaussian random variable with $\|z(i)\|_{\psi_2} = \sigma_n\sqrt{8/3}$. As a result, $\|y_u(i)z(i)\|_{\psi_1} \le \|y_u(i)\|_{\psi_2}\|z(i)\|_{\psi_2} \le K\sqrt{8/3}\,\sigma_n$. Using Theorem 9, for $u \in \mathcal{C}_0$ we have
\begin{align*}
P\Big( \frac{2}{m\sigma_a^2}\langle y_u, z \rangle \ge \tau \Big) &= P\Big( \sum_{i=1}^m y_u(i)z(i) \ge \frac{m\sigma_a^2\tau}{2} \Big) \\
&\le \exp\Big\{ -\min\Big( \frac{3m\sigma_a^4\tau^2}{16\times 8K^2\sigma_n^2}, \frac{\sqrt{3}\,m\sigma_a^2\tau}{4\sqrt{8}\,K\sigma_n} \Big) \Big\} \\
&\le \exp\Big\{ -\min\Big( \frac{m\sigma_a^4\tau^2}{16\times 3K^2\sigma_n^2}, \frac{m\sigma_a^2\tau}{4\sqrt{3}\,K\sigma_n} \Big) \Big\},
\end{align*}
where the last line follows because $3/8 > 1/3$. Therefore, by the union bound, since $|\mathcal{C}_0| \le 2^{2r}$,
\[ P(\mathcal{E}_3^c) \le 2^{2r}\exp\Big\{ -\frac{m\sigma_a^2\tau}{4\sqrt{3}\,K\sigma_n}\min\Big( \frac{\sigma_a^2\tau}{4\sqrt{3}\,K\sigma_n}, 1 \Big) \Big\}. \]
Choosing
\[ \tau = \frac{\sigma_n}{\sigma_a^2}\sqrt{\frac{96K^2 r(1+\epsilon)}{m\log e}}, \]
and given our choice of m, it follows that $P(\mathcal{E}_3^c) \le 2^{-2r}$. But since $\sqrt{96/\log e} \le 9$,
\[ P\Big( \exists u \in \mathcal{C}_0 \text{ s.t. } \frac{2}{m\sigma_a^2}\langle y_u, z \rangle > \frac{9K\sigma_n}{\sigma_a^2}\sqrt{\frac{r(1+\epsilon)}{m}} \Big) \le P(\mathcal{E}_3^c) \le 2^{-2r}. \]
Finally, combining (5.37) with the bounds derived on the three terms on its right-hand side yields the desired result.

5.7 Appendix B: Concentration of Measure Background

In this section we briefly review some useful results from the literature and derive some new lemmas that are used in the proofs.

Lemma 13 (see Lemma 5.9 in [100]). Let $\{X_i\}_{i=1}^n$ be independent, mean-zero, sub-Gaussian random variables and let $\{a_i\}_{i=1}^n$ be real numbers. Then $\sum_{i=1}^n a_i X_i$ is also a sub-Gaussian random variable, and
\[ \Big\| \sum_{i=1}^n a_i X_i \Big\|_{\psi_2} \le \sqrt{\sum_{i=1}^n a_i^2 \|X_i\|_{\psi_2}^2}. \]  (5.40)

Theorem 9 (Bernstein-type inequality; see, e.g., [100]). Suppose that $\{X_i\}_{i=1}^n$ are independent and that, for i = 1, ..., n, $X_i$ is a sub-exponential random variable. Let $\max_i \|X_i\|_{\psi_1} \le K$ for some K > 0. Then, for every t ≥ 0 and every $w = [w_1, \cdots, w_n]^T \in \mathbb{R}^{n\times 1}$, we have
\[ P\Big( \sum_{i=1}^n w_i(X_i - E[X_i]) \ge t \Big) \le \exp\Big\{ -\min\Big( \frac{t^2}{4K^2\|w\|_2^2}, \frac{t}{2K\|w\|_\infty} \Big) \Big\}. \]  (5.41)

Lemma 14 (Lemma 3 from [52]). Consider two independent random vectors $X = [X_1, \cdots, X_n]^T \in \mathbb{R}^n$ and $Y = [Y_1, \cdots, Y_n]^T \in \mathbb{R}^n$. Assume that $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ and $Y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Then $\langle X, Y \rangle$ and $G\|X\|_2$ have the same distribution, where $G \sim \mathcal{N}(0,1)$ is independent of $\|X\|_2$.

Lemma 15 (Lemma 2 from [52]). Let $G_i$, i = 1, 2, ..., m, be i.i.d. $\mathcal{N}(0,1)$. Then, for $\tau \in (0,1)$,
\[ P\Big( \sum_{i=1}^m G_i^2 \le m(1-\tau) \Big) \le \exp\Big( \frac{m}{2}\big(\tau + \ln(1-\tau)\big) \Big), \]
and, for τ > 0,
\[ P\Big( \sum_{i=1}^m G_i^2 > m(1+\tau) \Big) \le \exp\Big( -\frac{m}{2}\big(\tau - \ln(1+\tau)\big) \Big). \]

Theorem 10 (see, e.g., [87]). Let $A \in \mathbb{R}^{m\times n}$ be a dense random matrix whose entries are i.i.d. zero-mean Gaussian random variables with unit variance. Then, for every t > 0,
\[ P\big( \sigma_{\max}(A) \ge \sqrt{m} + \sqrt{n} + t \big) \le e^{-\frac{t^2}{2}}. \]  (5.42)
Substituting $t \leftarrow t\sqrt{m}$ in Theorem 10 results in Corollary 5.
Corollary 5. Let $A \in \mathbb{R}^{m\times n}$ be a random matrix whose entries are independent, zero-mean Gaussian random variables with unit variance. Then, for any t > 0,
\[ P\big( \sigma_{\max}(A) \ge (1+t)\sqrt{m} + \sqrt{n} \big) \le e^{-\frac{mt^2}{2}}. \]  (5.43)

Theorem 11. Let $A \in \mathbb{R}^{m\times n}$ be an i.i.d. matrix such that $A_{i,j}$ is a zero-mean sub-Gaussian random variable with $\|A_{i,j}\|_{\psi_2} \le K$ and $E[A_{i,j}^2] = \sigma_a^2$. Then, if m < n, for any t > 0,
\[ P\Big( \sigma_{\max}(A) \ge \sqrt{2m\sigma_a^2 + 12nK(1+t)} \Big) \le e^{-3nt}. \]  (5.44)

Proof. Let $\mathcal{N}_\epsilon$ denote a maximal ε-separated subset of $S^{n-1}$. It is straightforward to show (Lemma 5.2 from [87]) that
\[ |\mathcal{N}_\epsilon| \le \Big(1 + \frac{2}{\epsilon}\Big)^n. \]  (5.45)
Consider a vector $u \in S^{n-1}$ that satisfies $\sigma_{\max}(A) = \|Au\|_2$, and let $u' \in \mathcal{N}_\epsilon$ be such that $\|u - u'\|_2 \le \epsilon$. Then, by the triangle inequality, we have
\begin{align*}
|\langle Au, Au \rangle - \langle Au', Au' \rangle| &= |\langle Au, A(u-u') \rangle - \langle Au', A(u'-u) \rangle| \\
&\le |\langle Au, A(u-u') \rangle| + |\langle Au', A(u'-u) \rangle| \\
&\le 2(\sigma_{\max}(A))^2 \|u - u'\|_2 \le 2\epsilon(\sigma_{\max}(A))^2.
\end{align*}  (5.46)
On the other hand, again by the triangle inequality,
\begin{align*}
|\langle Au, Au \rangle - \langle Au', Au' \rangle| &\ge |\langle Au, Au \rangle| - |\langle Au', Au' \rangle| \\
&= (\sigma_{\max}(A))^2 - |\langle Au', Au' \rangle| \ge (\sigma_{\max}(A))^2 - \max_{x\in\mathcal{N}_\epsilon} |\langle Ax, Ax \rangle|.
\end{align*}  (5.47)
Combining (5.46) and (5.47) yields
\[ (\sigma_{\max}(A))^2 \le (1 - 2\epsilon)^{-1}\max_{x\in\mathcal{N}_\epsilon} |\langle Ax, Ax \rangle|. \]  (5.48)
To finish the proof, we need to upper bound $\max_{x\in\mathcal{N}_\epsilon}\langle Ax, Ax \rangle$. For a fixed $x \in \mathcal{N}_\epsilon$, by Theorem 9,
\[ P\Big( \frac{1}{m}\|Ax\|_2^2 - \sigma_a^2\|x\|_2^2 \ge t \Big) \le \exp\Big\{ -\min\Big( \frac{mt^2}{4K^2}, \frac{mt}{2K} \Big) \Big\}. \]  (5.49)
Therefore,
\[ P\Big( \max_{x\in\mathcal{N}_\epsilon} \frac{1}{m}\|Ax\|_2^2 - \sigma_a^2\|x\|_2^2 \ge t \Big) \le |\mathcal{N}_\epsilon|\exp\Big\{ -\min\Big( \frac{mt^2}{4K^2}, \frac{mt}{2K} \Big) \Big\}. \]
Let ε = 1/4. Then, from (5.45),
\[ |\mathcal{N}_\epsilon|\exp\Big\{ -\min\Big( \frac{mt^2}{4K^2}, \frac{mt}{2K} \Big) \Big\} \le 9^n \exp\Big\{ -\min\Big( \frac{mt^2}{4K^2}, \frac{mt}{2K} \Big) \Big\} = \exp\Big\{ -\frac{mt}{2K}\min\Big(\frac{t}{2K}, 1\Big) + n\ln 9 \Big\} \le \exp\Big\{ -\frac{mt}{2K}\min\Big(\frac{t}{2K}, 1\Big) + 3n \Big\}. \]  (5.50)
Substituting t by $6n(1+t)K/m$, and noting that for this value of t, if m < 3n, $\frac{t}{2K}$ is always larger than 1, it follows that
\[ P\Big( \max_{x\in\mathcal{N}_\epsilon} \frac{1}{m}\|Ax\|_2^2 - \sigma_a^2\|x\|_2^2 \ge \frac{6n(1+t)K}{m} \Big) \le \exp\{-3nt\}. \]
Therefore, in summary, from (5.48), with probability larger than $1 - e^{-3nt}$,
\[ \sigma_{\max}(A) \le \sqrt{2m\sigma_a^2 + 12nK(1+t)}. \]

Substituting t by $\frac{m\sigma_a^2}{6nK}$ in Theorem 11 results in the following corollary.

Corollary 6. Let $A \in \mathbb{R}^{m\times n}$ be an i.i.d. matrix such that $A_{i,j}$ is a zero-mean sub-Gaussian random variable with $\|A_{i,j}\|_{\psi_2} \le K$ and $E[A_{i,j}^2] = \sigma_a^2$. Then, for m < n,
\[ P\Big( \sigma_{\max}(A) \ge 2\sigma_a\sqrt{m + \frac{3K}{\sigma_a^2}n} \Big) \le e^{-\frac{\sigma_a^2}{2K}m}. \]  (5.51)

Lemma 16 (Lemma 5 in [53]). Consider $u, v \in S^{n-1}$ and a dense random Gaussian matrix $A \in \mathbb{R}^{m\times n}$ with i.i.d. zero-mean Gaussian entries distributed as $\mathcal{N}(0, \sigma_a^2)$. Then, for any t > 0,
\[ P\Big( \langle u, v \rangle - \frac{1}{m\sigma_a^2}\langle Au, Av \rangle \ge t \Big) \le e^{-mf^*(t)}, \]  (5.52)
where
\[ f^*(t) = \min_{u\in[-1,1]}\; \max_{s\in(0,\frac{1}{1-u})} \Big\{ s(t-u) + \frac{1}{2}\ln\big[(1+su)^2 - s^2\big] \Big\}. \]

Corollary 7. Consider $u, v \in S^{n-1}$ and a dense random Gaussian matrix $A \in \mathbb{R}^{m\times n}$ with i.i.d. zero-mean Gaussian entries distributed as $\mathcal{N}(0, \sigma_a^2)$. Then,
\[ P\Big( \langle u, v \rangle - \frac{1}{m\sigma_a^2}\langle Au, Av \rangle \ge 0.45 \Big) \le 2^{-\frac{m}{20}}. \]  (5.53)

Lemma 17. Consider $u, v \in S^{n-1}$ and a dense matrix $A \in \mathbb{R}^{m\times n}$ with i.i.d. zero-mean sub-Gaussian entries with $\|A_{i,j}\|_{\psi_2} \le K$ and $E[A_{i,j}^2] = \sigma_a^2$. Then, for any t > 0,
\[ P\Big( \langle u, v \rangle - \frac{1}{m\sigma_a^2}\langle Au, Av \rangle \ge t \Big) \le \exp\Big\{ -\frac{mt\sigma_a^2}{2K^2}\min\Big(1, \frac{t\sigma_a^2}{2K^2}\Big) \Big\}. \]  (5.54)

Proof. Define $y(u) = Au$ and $y(v) = Av$. Using these definitions, $\langle Au, Av \rangle = \langle y(u), y(v) \rangle$. Let $A_i \in \mathbb{R}^{1\times n}$ denote the i-th row of the matrix A. Thus, $y_i(u) = \langle A_i, u \rangle$ and $y_i(v) = \langle A_i, v \rangle$ are both sub-Gaussian random variables. Using Lemma 13, we have $\|y_i(u)\|_{\psi_2} \le K\|u\|_2 = K$ and $\|y_i(v)\|_{\psi_2} \le K\|v\|_2 = K$.
Furthermore, $E[y_i(u)y_i(v)] = u^T E[A_i^T A_i] v = \sigma_a^2\langle u, v \rangle$. Note that
\[ P\Big( \langle u, v \rangle - \frac{1}{m\sigma_a^2}\langle Au, Av \rangle \ge t \Big) = P\Big( \sum_{i=1}^m \big(\sigma_a^2\langle u, v \rangle - y_i(u)y_i(v)\big) \ge mt\sigma_a^2 \Big). \]  (5.55)
By Lemma 12, $y_i(u)y_i(v)$ is a sub-exponential random variable with
\[ \|y_i(u)y_i(v)\|_{\psi_1} \le \|y_i(u)\|_{\psi_2}\|y_i(v)\|_{\psi_2} \le K^2. \]
Therefore, applying Theorem 9 to the sub-exponential random variables $y_i(u)y_i(v)$, with all weights set equal to −1, we derive
\[ P\Big( \langle u, v \rangle - \frac{1}{m\sigma_a^2}\langle Au, Av \rangle \ge t \Big) \le \exp\Big\{ -\min\Big( \frac{mt^2\sigma_a^4}{4K^4}, \frac{mt\sigma_a^2}{2K^2} \Big) \Big\}. \]

5.8 Appendix C: Finding the best piecewise polynomial approximation

Consider the following problem: given $x \in \mathbb{R}^n$, find $\tilde{x} \in \mathcal{P}$, with $\mathcal{P}$ defined in (5.14), such that
\[ \tilde{x} = \arg\min_{z\in\mathcal{P}} \|x - z\|_2^2. \]
In this section, we briefly describe how $\tilde{x}$ can be found using dynamic programming. Note that, given the singularity points $s_1, s_2, \ldots, s_Q$, one can easily find the best polynomial fit in each piece. Hence, the challenge is to find the optimal singularity points. Each singularity point $s_i$ is a point in the set $\{\frac{1}{n}, \ldots, \frac{n-1}{n}\}$. Given $i_1, i_2 \in \{0, \ldots, n\}$ with $i_1 \le i_2$, let $e(i_1, i_2)$ denote the minimum error achievable in approximating $(x_{i_1}, \ldots, x_{i_2})$ by the samples of $\sum_{j=0}^N a_j y^j$ at $\frac{i_1}{n}, \ldots, \frac{i_2}{n}$, where $\sum_{j=0}^N a_j \le 1$ and $a_j \in (0,1)$, j = 0, ..., N. That is,
\[ e(i_1, i_2) = \min_{a_0,\ldots,a_N:\, a_j\in(0,1),\, \sum_{j=0}^N a_j \le 1}\; \sum_{k=i_1}^{i_2} \Big( x_k - \sum_{j=0}^N a_j\Big(\frac{k}{n}\Big)^j \Big)^2. \]  (5.56)
Using this definition, given singularity points $s_0 = 0, s_1, \ldots, s_Q, s_{Q+1} = 1 \in \{0, \frac{1}{n}, \ldots, 1\}$, the minimum achievable error in approximating x by signals in $\mathcal{P}$ whose singularities happen at $s_1, \ldots, s_Q$ can be written as
\[ \sum_{i=1}^{Q+1} e(ns_{i-1}, ns_i). \]
This representation suggests that the minimizer $\tilde{x}$ can be found using the Viterbi algorithm. In summary, the Viterbi algorithm operates on a trellis diagram with Q full stages, corresponding to the Q possible singularities, and two single-state stages, corresponding to the start and the end of the interval. Each intermediate stage has n−1 states, which correspond to the n−1 possible singularity points. State s at stage $t \in \{1, \ldots, Q\}$ is connected to state s' at stage t+1 if $s \le s'$, and the weight of this edge is set to $e(ns, ns')$, defined in (5.56). Otherwise, if $s' < s$, there is no edge between the two states. Let $E_i(s)$ denote the minimum cost associated with state s at stage i, and let $E_0(s_0) = 0$. The goal is to find the path on the trellis diagram that achieves $E_{Q+1}(s_{Q+1}) = E_{Q+1}(1)$. It is straightforward to show that, for t = 1, ..., Q,
\[ E_{t+1}(s) = \min_{s'} \big( E_t(s') + e(ns', ns) \big), \]
where the minimum is taken over all states s' that are connected to s, i.e., $s' \le s$. This breakdown of the cost function describes the essence of how the Viterbi algorithm operates. At stage t, among its incoming edges, each state s keeps only the edge that achieves $E_{t+1}(s)$. At the end, backtracking from the final state $s_{Q+1} = 1$ at stage Q+1 reveals the optimal singularities.
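An equivalent piece-indexed dynamic program is sketched below in Python (our own illustrative implementation). For brevity, the piece cost uses an unconstrained least-squares polynomial fit in place of the constrained minimization (5.56); the search over breakpoints, which is the point of this appendix, is unchanged and costs $O(Qn^2)$ piece fits.

```python
import numpy as np

def piece_error(x, i1, i2, N):
    """Squared error of the best degree-N least-squares polynomial fit to
    x[i1:i2] at abscissas i/n (unconstrained stand-in for e(i1, i2))."""
    t = np.arange(i1, i2) / x.size
    deg = max(min(N, i2 - i1 - 1), 0)   # avoid an underdetermined fit
    c = np.polyfit(t, x[i1:i2], deg)
    return float(np.sum((np.polyval(c, t) - x[i1:i2]) ** 2))

def best_breakpoints(x, Q, N):
    """E[p, j] = minimum error covering the first j samples with p pieces;
    backtracking recovers the Q optimal interior breakpoint indices."""
    n = x.size
    E = np.full((Q + 2, n + 1), np.inf)
    back = np.zeros((Q + 2, n + 1), dtype=int)
    E[0, 0] = 0.0
    for p in range(1, Q + 2):            # at most Q + 1 pieces
        for j in range(p, n + 1):        # piece p ends just before sample j
            for i in range(p - 1, j):    # piece p starts at sample i
                cost = E[p - 1, i] + piece_error(x, i, j, N)
                if cost < E[p, j]:
                    E[p, j], back[p, j] = cost, i
    cuts, j = [], n                      # backtrack from the final state
    for p in range(Q + 1, 0, -1):
        j = back[p, j]
        cuts.append(j)
    return sorted(cuts[:-1])             # drop the leading 0
```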
Reference List

[1] Taimoor Abbas. Measurement Based Channel Characterization and Modeling for Vehicle-to-Vehicle Communications. PhD thesis, Lund University, 2014.
[2] Oliver Aberth. Iteration methods for finding all zeros of a polynomial simultaneously. Mathematics of Computation, 27(122):339–344, 1973.
[3] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
[4] Waheed U Bajwa, Jarvis Haupt, Akbar M Sayeed, and Robert Nowak. Compressed channel sensing: A new approach to estimating sparse multipath channels. Proceedings of the IEEE, 98(6):1058–1076, 2010.
[5] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, Apr. 2010.
[6] Philip Bello. Characterization of randomly time-variant linear channels. IEEE Trans. Commun. Systems, 11(4):360–393, 1963.
[7] Christian R Berger, Shengli Zhou, James C Preisig, and Peter Willett. Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing. IEEE Transactions on Signal Processing, 58(3):1708–1721, 2010.
[8] Laura Bernadó, Thomas Zemen, Fredrik Tufvesson, Andreas F Molisch, and Christoph F Mecklenbräuker. Delay and Doppler spreads of non-stationary vehicular channels for safety relevant scenarios. IEEE Trans. Veh. Technol., 63:82–93, 2014.
[9] Dennis S Bernstein. Matrix Mathematics: Theory, Facts, and Formulas. Princeton University Press, 2009.
[10] Sajjad Beygi and Urbashi Mitra. Optimal Bayesian resampling for OFDM signaling over multi-scale multi-lag channels. IEEE Signal Processing Letters, 20:1118–1121, 2013.
[11] Sajjad Beygi and Urbashi Mitra. Multi-scale multi-lag channel estimation using low rank approximation for OFDM. IEEE Transactions on Signal Processing, 63(18):4744–4755, 2015.
[12] Sajjad Beygi, Urbashi Mitra, and Erik G Ström. Nested sparse approximation: structured estimation of V2V channels using a geometry-based stochastic channel model. IEEE Transactions on Signal Processing, 63(18):4940–4955, 2015.
[13] Sajjad Beygi, Erik G Ström, and Urbashi Mitra. Geometry-based stochastic modeling and estimation of vehicle to vehicle channels. In Proceedings IEEE Int. Conf. Acoustics, Speech and Signal Process., 2014.
[14] Sajjad Beygi, Erik G Ström, and Urbashi Mitra. Structured sparse approximation via generalized regularizers: with application to V2V channel estimation. In IEEE Global Communication Conference (Globecom), pages 1–6, 2014.
[15] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[16] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. IEEE Transactions on Information Theory, 57(7):4660–4671, 2011.
[17] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[18] Alireza Borhani and Matthias Patzold. Correlation and spectral properties of vehicle-to-vehicle channels in the presence of moving scatterers. IEEE Trans. Veh. Technol., 62:4228–4239, 2013.
[19] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[20] E. J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
[21] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, Feb. 2006.
[22] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.
[23] Emmanuel J Candès and Carlos Fernandez-Granda. Towards a mathematical theory of super-resolution.
Communications on Pure and Applied Mathematics, 67(6):906–956, 2014.
[24] Cecilia Carbonelli, Satish Vedantam, and Urbashi Mitra. Sparse channel estimation with zero tap detection. IEEE Trans. Wireless Commun., 6(5):1743–1763, 2007.
[25] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Found. of Comp. Math., 12(6):805–849, 2012.
[26] Rick Chartrand and Brendt Wohlberg. A nonconvex ADMM algorithm for group sparsity with sparse groups. In Proceedings IEEE Int. Conf. Acoustics, Speech and Signal Process., pages 6009–6013. IEEE, 2013.
[27] Yuxin Chen and Yuejie Chi. Robust spectral compressed sensing via structured matrix completion. IEEE Transactions on Information Theory, 60(10):6576–6601, 2014.
[28] Yuejie Chi. Guaranteed blind sparse spikes deconvolution via lifting and convex optimization. IEEE Journal of Selected Topics in Signal Processing, 10(4):782–794, 2016.
[29] Yuejie Chi, Louis L Scharf, Ali Pezeshki, et al. Sensitivity to basis mismatch in compressed sensing. IEEE Transactions on Signal Processing, 59(5):2182–2195, 2011.
[30] Sunav Choudhary, Sajjad Beygi, and Urbashi Mitra. Delay-Doppler estimation via structured low-rank matrix recovery. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016. To appear.
[31] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[32] G. R. de Prony. Essai experimental et analytique: Sur les lois de la dilatabilite de fluides elastique et sur celles de la force expansive de la vapeur de l'alkool. J. de l'Ecole Polytechnique, 1(22):24–76, 1795.
[33] W. Dong, G. Shi, X. Li, Y. Ma, and F. Huang. Compressive sensing via nonlocal low-rank regularization. IEEE Transactions on Image Processing, 2014.
[34] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[35] F. Ebrahim Rezagah, S. Jalali, E. Erkip, and H. V. Poor. Using compression codes in compressed sensing. In IEEE Inf. Theory Work. (ITW), pages 444–448, 2016.
[36] J. Eckstein. Augmented Lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32, 2012.
[37] Yonina C Eldar and Moshe Mishali. Robust recovery of signals from a structured union of subspaces. IEEE Trans. Info. Theory, 55(11):5302–5316, 2009.
[38] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statistical Assoc., 96(456):1348–1360, 2001.
[39] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2009.
[40] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. Technical report, Department of Statistics, Stanford University, 2010.
[41] Walter Gander. Algorithms for the polar decomposition. SIAM Journal on Scientific and Statistical Computing, 11(6):1102–1115, 1990.
[42] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
[43] Rémi Gribonval. Should penalized least squares regression be interpreted as maximum a posteriori estimation? IEEE Trans. Signal Process., 59(5):2405–2410, 2011.
[44] Huaihai Guo, Ali Abdi, Aijun Song, and Mohsen Badiey.
Delay and Doppler spreads in underwater acoustic particle velocity channels. The Journal of the Acoustical Society of America, 129(4):2015–2025, 2011.
[45] Justin P Haldar and Diego Hernando. Rank-constrained solutions to linear matrix equations using PowerFactorization. IEEE Signal Processing Letters, 16(7):584–587, 2009.
[46] Reinhard Heckel and Mahdi Soltanolkotabi. Generalized line spectral estimation via convex optimization. arXiv preprint arXiv:1609.08198, 2016.
[47] Franz Hlawatsch and Gerald Matz. Wireless Communications over Rapidly Time-Varying Channels. Academic Press, 2011.
[48] Geoffrey A Hollinger, Sunav Choudhary, Parastoo Qarabaqi, Christopher Murphy, Urbashi Mitra, Gaurav S Sukhatme, Milica Stojanovic, Hanumant Singh, and Franz Hover. Underwater data collection using robotic sensor networks. IEEE Journal on Selected Areas in Communications, 30(5):899–911, 2012.
[49] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.
[50] Yingbo Hua and Tapan K Sarkar. Matrix pencil method for estimating parameters of exponentially damped/undamped sinusoids in noise. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(5):814–824, 1990.
[51] S. Jalali and A. Maleki. From compression to compressed sensing. 40(2):352–385, 2016.
[52] S. Jalali, A. Maleki, and R. G. Baraniuk. Minimum complexity pursuit for universal compressed sensing. IEEE Transactions on Information Theory, 60(4):2253–2268, Apr. 2014.
[53] Shirin Jalali and Arian Maleki. New approach to Bayesian high-dimensional linear regression. arXiv preprint arXiv:1607.02613, 2016.
[54] Ye Jiang and Antonia Papandreou-Suppappola. Discrete time-scale characterization of wideband time-varying systems. IEEE Transactions on Signal Processing, 54(4):1364–1375, 2006.
[55] Shashwat Jnawali, Sajjad Beygi, and Hamid-Reza Bahrami. RF impairments compensation and channel estimation in MIMO-OFDM systems. In 2011 IEEE Vehicular Technology Conference (VTC Fall), pages 1–5, 2011.
[56] Nicolas F Josso, Jun Jason Zhang, Antonia Papandreou-Suppappola, Cornel Ioana, Jerome Mars, Cédric Gervaise, Yann Stéphan, et al. On the characterization of time-scale underwater acoustic signals using matching pursuit decomposition. In OCEANS 2009, MTS/IEEE Biloxi - Marine Technology for Our Future: Global and Local Challenges, pages 1–6. IEEE, 2009.
[57] Johan Karedal, Fredrik Tufvesson, Nicolai Czink, Alexander Paier, Charlotte Dumard, Thomas Zemen, Christoph F Mecklenbrauker, and Andreas F Molisch. A geometry-based stochastic MIMO model for vehicle-to-vehicle communications. IEEE Trans. Wireless Commun., 8(7):3646–3657, 2009.
[58] T. Kawabata and A. Dembo. The rate-distortion dimension of sets and measures. IEEE Transactions on Information Theory, 40(5):1564–1572, 1994.
[59] Daniel B Kilfoyle and Arthur B Baggeroer. The state of the art in underwater acoustic telemetry. IEEE Journal of Oceanic Engineering, 25(1):4–27, 2000.
[60] Baosheng Li, Shengli Zhou, Milica Stojanovic, Lee Freitag, and Peter Willett. Multicarrier communication over underwater acoustic channels with nonuniform Doppler shifts. IEEE Journal of Oceanic Engineering, 33(2):198–209, 2008.
[61] Weichang Li and James C Preisig. Estimation of rapidly time-varying sparse channels. IEEE Journal of Oceanic Engineering, 32(4):927–939, 2007.
[62] Zhang Liu, Anders Hansson, and Lieven Vandenberghe. Nuclear norm system identification with missing inputs and outputs. Systems & Control Letters, 62(8):605–612, 2013.
[63] Xiaoli Ma and Georgios B Giannakis.
Maximum-diversity transmissions over doubly selective wireless channels. IEEE Transactions on Information Theory, 49(7):1832–1840, 2003.
[64] Adam R Margetts, Philip Schniter, and Ananthram Swami. Joint scale-lag diversity in wideband mobile direct sequence spread spectrum systems. IEEE Transactions on Wireless Communications, 6(12):4308–4319, 2007.
[65] D. W. Matolak. V2V communication channels: State of knowledge, new results, and what's next. In Communication Technologies for Vehicles, Springer, pages 1–21, 2013.
[66] Gerald Matz, H. Bolcskei, and Franz Hlawatsch. Time-frequency foundations of communications: concepts and tools. IEEE Signal Processing Magazine, 30(6):87–96, 2013.
[67] C. A. Metzler, A. Maleki, and R. G. Baraniuk. From denoising to compressed sensing. IEEE Transactions on Information Theory, 62(9):5117–5144, Sep. 2016.
[68] Nicolò Michelusi, Urbashi Mitra, Andreas F Molisch, and Michele Zorzi. UWB sparse/diffuse channels, part I: Channel models and Bayesian estimators. IEEE Trans. Signal Process., 60(10):5307–5319, 2012.
[69] Nicolò Michelusi, Urbashi Mitra, Andreas F Molisch, and Michele Zorzi. UWB sparse/diffuse channels, part II: Estimator analysis and practical channels. IEEE Trans. Signal Process., 60(10):5320–5333, 2012.
[70] Webb Miller. Computational complexity and numerical stability. SIAM Journal on Computing, 4(2):97–107, 1975.
[71] Andreas F Molisch. Wireless Communications. John Wiley & Sons, 2007.
[72] Andreas F Molisch, Aarne Mammela, and Desmond P Taylor. Wideband Wireless Digital Communication. Prentice Hall PTR, 2000.
[73] Andreas F Molisch, Fredrik Tufvesson, Johan Karedal, and Christoph F Mecklenbrauker. A survey on vehicle-to-vehicle propagation channels. IEEE Wireless Communications, 16(6):12–22, 2009.
[74] John A Nelder and Roger Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
[75] J. Nuckelt, M. Schack, and T. Kürner. Deterministic and stochastic channel models implemented in a physical layer simulator for car-to-X communications. J. Adv. in Radio Science, 9(12):165–171, 2011.
[76] Alexander Paier, Johan Karedal, Nicolai Czink, Charlotte Dumard, Thomas Zemen, Fredrik Tufvesson, Andreas F Molisch, and Christoph F Mecklenbräuker. Characterization of vehicle-to-vehicle radio channels from measurements at 5.2 GHz. Wireless Personal Commun., 50(1):19–32, 2009.
[77] Richard R Picard and R Dennis Cook. Cross-validation of regression models. Journal of the American Statistical Association, 79(387):575–583, 1984.
[78] William H Press, Brian P Flannery, Saul A Teukolsky, William T Vetterling, and Peter B Kramer. Numerical Recipes: The Art of Scientific Computing. AIP, 1987.
[79] Parastoo Qarabaqi and Milica Stojanovic. Statistical characterization and computationally efficient modeling of a class of underwater acoustic communication channels. IEEE Journal of Oceanic Engineering, 38(4):701–717, 2013.
[80] Theodore S Rappaport et al. Wireless Communications: Principles and Practice, volume 2. Prentice Hall PTR, New Jersey, 1996.
[81] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, Apr. 2010.
[82] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[83] Farideh Ebrahim Rezagah, Shirin Jalali, Elza Erkip, and H Vincent Poor. Compression-based compressed sensing. arXiv preprint arXiv:1601.01654, 2016.
[84] Scott T Rickard, Radu V Balan, H Vincent Poor, Sergio Verdú, et al. Canonical time-frequency, time-scale, and frequency-scale representations of time-varying channels. Communications in Information & Systems, 5(2):197–226, 2005.
[85] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[86] R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Berlin: Springer-Verlag, 2004.
[87] Mark Rudelson. Recent developments in non-asymptotic theory of random matrices. Modern Aspects of Random Matrix Theory, 72:83, 2014.
[88] Tapan K Sarkar and Odilon Pereira. Using the matrix pencil method to estimate the parameters of a sum of complex exponentials. IEEE Antennas and Propagation Magazine, 37(1):48–55, 1995.
[89] Akbar M Sayeed and Behnaam Aazhang. Joint multipath-Doppler diversity in mobile wireless communications. IEEE Transactions on Communications, 47(1):123–132, 1999.
[90] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar. C-HiLasso: A collaborative hierarchical sparse modeling framework. IEEE Trans. Signal Process., 59(9):4183–4198, Sept. 2011.
[91] William M Steedly, Ching-Hui J Ying, and Randolph L Moses. Statistical analysis of TLS-based Prony techniques. Automatica, 30(1):115–129, 1994.
[92] Mihailo Stojnic, Farzad Parvaresh, and Babak Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process., 57(8):3075–3085, 2009.
[93] Gongguo Tang, Badri Narayan Bhaskar, and Benjamin Recht. Sparse recovery over continuous dictionaries - just discretize. In Signals, Systems and Computers, 2013 Asilomar Conference on, pages 1043–1047. IEEE, 2013.
[94] Gongguo Tang, Badri Narayan Bhaskar, Parikshit Shah, and Benjamin Recht. Compressed sensing off the grid. IEEE Transactions on Information Theory, 59(11):7465–7490, 2013.
[95] Georg Taubock, Franz Hlawatsch, Daniel Eiwen, and Holger Rauhut. Compressive estimation of doubly selective channels in multicarrier systems: Leakage effects and sparsity-enhancing processing. IEEE J. Sel. Topics Signal Process., 4(2):255–271, 2010.
[96] A. Nikolayevich Tikhonov and V. Yakovlevich Arsenin. Solutions of Ill-Posed Problems, volume 14. Winston, Washington, DC, 1977.
[97] David Tse and Pramod Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[98] Jan-Jaap Van de Beek, Magnus Sandell, Per Ola Borjesson, et al. ML estimation of time and frequency offset in OFDM systems. IEEE Transactions on Signal Processing, 45(7):1800–1805, 1997.
[99] Paul van Walree, Roald Otnes, et al. Ultrawideband underwater acoustic communication channels. IEEE Journal of Oceanic Engineering, 38(4):678–688, 2013.
[100] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[101] Guanghan Xu, Hui Liu, Lang Tong, and T. Kailath. A least-squares approach to blind channel identification. IEEE Transactions on Signal Processing, 43(12):2982–2993, 1995.
[102] Tao Xu, Zijian Tang, Geert Leus, and Urbashi Mitra. Multi-rate block transmission over wideband multi-scale multi-lag channels. IEEE Transactions on Signal Processing, 61(4):964–979, 2013.
[103] Srinivas Yerramalli and Urbashi Mitra. Optimal resampling of OFDM signals for multiscale-multilag underwater acoustic channels. IEEE Journal of Oceanic Engineering, 36(1):126–138, 2011.
[104] Srinivas Yerramalli, Urbashi Mitra, Zijian Tang, and Geert Leus.
Channel estimation for multi-layer block transmissions over underwater acoustic channels. In Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pages 1530–1535. IEEE, 2012.
[105] Srinivas Yerramalli, Milica Stojanovic, and Urbashi Mitra. Partial FFT demodulation: a detection method for highly Doppler distorted OFDM systems. IEEE Transactions on Signal Processing, 60(11):5906–5918, 2012.
[106] Yao-Liang Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99, 2013.
[107] Thomas Zemen and Andreas F Molisch. Adaptive reduced-rank estimation of nonstationary time-variant channels using subspace selection. IEEE Trans. Veh. Technol., 61(9):4042–4056, 2012.
[108] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.
Abstract
In this thesis, we investigate and design new optimization techniques and algorithms that promote target signal structures in the process of estimating/recovering those signals from an ill-posed linear system of equations. In particular, we consider the estimation of time-varying wireless channels and the acquisition of image/video signals from a small set of measurements.

We present our proposed approaches for estimating two-dimensional (2D), time-varying wireless communication channels by promoting their prior physical structures. We show that geometric information and intrinsic sparse structures of wireless communication channels can be exploited, via proper regularization functions in the estimation problem, to improve the accuracy of channel estimation with modest computational complexity. In particular, we study the channel estimation problem for vehicle-to-vehicle (V2V) channels, underwater acoustic (UWA) communication channels, and leaked time-varying narrowband communication channels.

We show that V2V channels have joint element- and group-wise sparsity structures. To exploit these structures, we propose a nested joint sparse recovery method. Our method solves the joint element/group sparse channel (signal) estimation problem using the proximity operators of a broad class of regularizers, based on the alternating direction method of multipliers. Furthermore, key properties of the proposed objective functions are proven, which ensure that the optimal solution is found by the new algorithm. We also investigate the underwater channel estimation problem. The underwater channel can be represented by a multi-scale multi-lag (MSML) channel model. We show that the data matrix of the transmitted signal, after passing through the MSML channel, exhibits a low-rank representation. In addition, we show that the MSML channel estimation problem can be represented as a spectral estimation problem. By exploiting the intrinsic low-rank structure of the received signal, the Prony algorithm is adapted to estimate the Doppler scales (close frequencies), delays, and channel gains. Two strategies using convex and non-convex regularizers to remove noise from the corrupted signal are proposed, and a bound on the reconstruction of the noiseless received signal provides guidance on the selection of the relaxation parameter in the convex optimizations. We also investigate the estimation of a narrowband time-varying channel under the practical assumptions of finite block length and finite transmission bandwidth. We show that the signal, after passing through a time-varying narrowband channel, reveals a particular parametric low-rank structure that can be represented as a bilinear form. To estimate the channel, we propose two structured methods. The first method exploits the low-rank bilinear structure of the channel via a non-convex strategy based on alternating direction optimization between the delay and Doppler directions. Due to the non-convex nature of this approach, it is sensitive to local minima. Motivated by this issue, we propose a novel convex approach based on minimization of the atomic norm using time-domain measurements of the signal. Furthermore, for the convex approach, we characterize the optimality and uniqueness conditions and provide a theoretical guarantee for the noiseless channel estimation problem with a small number of measurements.

In the next part of this thesis, we consider employing a compression code to build an efficient (polynomial-time) compressed sensing recovery algorithm. Modern image and video compression codes employ elaborate structures in an effort to encode signals using a small number of bits. Compressed sensing recovery algorithms, on the other hand, use such structures to recover signals from a few linear observations. Despite the steady progress in the field of compressed sensing, the structures that are often used for signal recovery are still much simpler than those employed by state-of-the-art compression codes. The main goal of our study is to bridge this gap by answering the following question: can one employ a compression code to build an efficient (polynomial-time) compressed sensing recovery algorithm? In response to this question, the compression-based gradient descent (C-GD) algorithm is proposed. C-GD, which is a low-complexity iterative algorithm, is able to employ a generic compression code for compressed sensing and therefore enlarges the set of structures used in compressed sensing to those used by compression codes. We provide a convergence analysis of C-GD and a characterization of the required number of samples as a function of the rate-distortion function of the compression code.